html_url,issue_url,id,node_id,user,created_at,updated_at,author_association,body,reactions,performed_via_github_app,issue https://github.com/pydata/xarray/issues/2064#issuecomment-531818131,https://api.github.com/repos/pydata/xarray/issues/2064,531818131,MDEyOklzc3VlQ29tbWVudDUzMTgxODEzMQ==,2448579,2019-09-16T15:03:12Z,2019-09-16T15:03:12Z,MEMBER," #3239 has been merged. Now `minimal` is more useful since you can specify `compat=""override""` to skip compatibility checking. What's left is to change defaults to implement @shoyer's comment > So I'm thinking that we probably want to combine ""all"" and ""minimal"" into a single mode to use as the default, and remove the other behavior, which is either useless or broken. Maybe it would make sense to come up with a new name for this mode, and to make both ""all"" and ""minimal"" deprecated aliases for it? In the long term, this leaves only two ""automatic"" modes for xarray.concat, which should make things simpler for users trying to figure this out.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,314764258 https://github.com/pydata/xarray/issues/2064#issuecomment-524021001,https://api.github.com/repos/pydata/xarray/issues/2064,524021001,MDEyOklzc3VlQ29tbWVudDUyNDAyMTAwMQ==,2448579,2019-08-22T18:22:37Z,2019-08-22T18:22:37Z,MEMBER,"Thanks for your input @bonnland. > The pandas concat() function uses the option join = {'inner', 'outer', 'left', 'right'} in order to mimic logical database join operations. If there is a reason that xarray cannot do the same, it is not obvious to me. I think the pandas options have the advantage of logical simplicity and traditional usage within database systems. We do have a `join` argument that takes these arguments + 'override' which was added recently to skip expensive comparisons. This works for ""indexes"" or ""dimension coordinates"". 
An example: if you have 2 dataarrays, one on a coordinate `x=[1, 2, 3]` and the other on `x=[2,3,4]`, `join` lets you control the `x` coordinate of the output. This is done by `xr.align`. What's under discussion here is what to do about variables duplicated across datasets, or indeed how we even know that these variables are duplicated across datasets when concatenating other variables. ","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,314764258 https://github.com/pydata/xarray/issues/2064#issuecomment-523960862,https://api.github.com/repos/pydata/xarray/issues/2064,523960862,MDEyOklzc3VlQ29tbWVudDUyMzk2MDg2Mg==,2448579,2019-08-22T15:42:10Z,2019-08-22T15:42:10Z,MEMBER,"I have a draft solution in #3239. It adds a new mode called ""sensible"" that acts like ""all"" when the concat dimension doesn't exist in the dataset and acts like ""minimal"" when the dimension is present. We can decide whether this is the right way, i.e. add a new mode, but the more fundamental problem is below. The issue is dealing with variables that should not be concatenated in ""minimal"" mode (e.g. time-invariant non-dim coords when concatenating in time). In this case, we want to skip the equality checks in `_calc_concat_over`. This is a common reason for poor `open_mfdataset` performance. I thought the clean way to do this would be to add the `compat` kwarg to `concat` and then add `compat='override'` since the current behaviour is effectively `compat='equals'`. However, `merge` takes `compat` too, and `concat` and `merge` support different `compat` arguments at present. This makes it complicated to easily thread `compat` down from `combine` or `open_mfdataset` without adding `concat_compat` and `merge_compat`, which is silly. So do we want to support all the other `compat` modes in `concat`? 
Things like `broadcast_equals` or `no_conflicts` are funny because they're basically `merge` operations, and it means `concat` acts like `stack`, `concat`, and `merge` all at once. OTOH if you have a set of variables with the same name from different datasets and you want to pick one of those (i.e. no concatenation), then you're basically doing `merge` anyway. This would require some refactoring since `concat` assumes the first dataset is a template for the rest. @shoyer What do you think? ","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,314764258 https://github.com/pydata/xarray/issues/2064#issuecomment-519149757,https://api.github.com/repos/pydata/xarray/issues/2064,519149757,MDEyOklzc3VlQ29tbWVudDUxOTE0OTc1Nw==,2448579,2019-08-07T15:32:16Z,2019-08-07T15:32:16Z,MEMBER,"> Maybe it would make sense to come up with a new name for this mode, and to make both ""all"" and ""minimal"" deprecated aliases for it? I'm in favour of this. What should we name this mode? One comment on ""existing dimensions"" mode: > - ""minimal"" does the right thing, concatenating only variables with the dimension. For variables without the dimension, this will still raise a `ValueError` because `compat` can only be `'equals'` or `'identical'`. It seems to me like we need `compat='override'` and/or `compat='tolerance', tolerance=...` that would use numpy's approximate equality testing. This checking of non-dimensional coordinates is a common source of `mfdataset` issues. What do you think? 
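For concreteness, a sketch of what such a `compat='override'` could look like when skipping the non-dimensional coordinate check (the `temp` and `depth` names here are made up for illustration; recent xarray releases do accept this value in `xr.concat`):

```python
import numpy as np
import xarray as xr

# Two 'files' that share a time-invariant, non-dimension coordinate 'depth'.
ds1 = xr.Dataset(
    {'temp': (('time', 'z'), np.zeros((1, 3)))},
    coords={'time': [0], 'depth': ('z', [10.0, 20.0, 30.0])},
)
ds2 = xr.Dataset(
    {'temp': (('time', 'z'), np.ones((1, 3)))},
    coords={'time': [1], 'depth': ('z', [10.0, 20.0, 30.0])},
)

# compat='override' copies 'depth' from the first dataset without loading
# and comparing values, avoiding the expensive equality check.
combined = xr.concat([ds1, ds2], dim='time', compat='override', coords='minimal')
```

Here `depth` keeps its original `('z',)` dimensions instead of being compared or concatenated. 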
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,314764258 https://github.com/pydata/xarray/issues/2064#issuecomment-512036050,https://api.github.com/repos/pydata/xarray/issues/2064,512036050,MDEyOklzc3VlQ29tbWVudDUxMjAzNjA1MA==,1217238,2019-07-16T23:09:24Z,2019-07-16T23:09:24Z,MEMBER,"> UPDATE: @shoyer it could be that unit tests are failing because, as your final example shows, you get an error for data_vars='minimal' if _any_ variables have different values across datasets, when adding a new concatentation dimension. If this is the reason so many unit tests are failing, then the failures are a red herring and should probably be ignored/rewritten. This seems very likely to me. The existing behavior of `data_vars='minimal'` is only useful in ""existing dimensions mode"". Xarray's unit test suite is definitely a good ""smoke test"" for understanding the impact of changes to `concat` on our users. What it tells us is that we can't change the default value from `""all""` to `""minimal""` without breaking existing code. Instead, we need to change how ""all"" or ""minimal"" works, or switch to yet another mode for the new behavior. The tests we should feel free to rewrite are cases where we set `data_vars=""all""` or `data_vars=""minimal""` explicitly for verifying the weird edge behaviors that I noted in my earlier comments. 
There shouldn't be too many of these tests.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,314764258 https://github.com/pydata/xarray/issues/2064#issuecomment-512000102,https://api.github.com/repos/pydata/xarray/issues/2064,512000102,MDEyOklzc3VlQ29tbWVudDUxMjAwMDEwMg==,1217238,2019-07-16T21:44:52Z,2019-07-16T21:44:52Z,MEMBER,"> Specifically, what should the default behavior of concat() be, when both datasets include a variable that does not include the concatenation dimension? Currently, the concat dimension is added, and the result is a ""stacked"" version of the variable. Others have argued that this variable should not be included in the concat() result by default, but this appears to break compatibility with Pandas concat(). Can you give a specific example of the behavior in question?","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,314764258 https://github.com/pydata/xarray/issues/2064#issuecomment-511611430,https://api.github.com/repos/pydata/xarray/issues/2064,511611430,MDEyOklzc3VlQ29tbWVudDUxMTYxMTQzMA==,1217238,2019-07-15T23:54:47Z,2019-07-15T23:54:47Z,MEMBER,"The logic for determining which variables to concatenate is in the `_calc_concat_over` helper function: https://github.com/pydata/xarray/blob/539fb4a98d0961c281daa5474a8e492a0ae1d8a2/xarray/core/concat.py#L146 Only `""different""` is supposed to load variables into memory to determine which ones to concatenate. Right now we also have `""all""` and `""minimal""` options: - `""all""` attempts to concatenate *every* variable that can be broadcast to a matching shape: https://github.com/pydata/xarray/blob/539fb4a98d0961c281daa5474a8e492a0ae1d8a2/xarray/core/concat.py#L188-L190 - `""minimal""` only concatenates variables that already have the matching dimension. 
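As a quick illustration of the ""all"" behaviour along a new dimension (a sketch; the `a` and `b` variables are made up):

```python
import numpy as np
import xarray as xr

ds1 = xr.Dataset({'a': ('x', [1, 2]), 'b': ((), 3.0)})
ds2 = xr.Dataset({'a': ('x', [1, 2]), 'b': ((), 4.0)})

# data_vars='all' broadcasts every variable along the new 't' dimension,
# so even the scalar variable 'b' gains it.
stacked = xr.concat([ds1, ds2], dim='t', data_vars='all')
```

With `data_vars='minimal'` the same call raises instead, because `b` differs between the two datasets. 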
Recall that `concat` handles two types of concatenation: existing dimensions (corresponding to `np.concatenate`) and new dimensions (corresponding to `np.stack`). Currently, this is all done together in one messy codebase, but logically it would be cleaner to separate these modes into two separate functions: - In ""existing dimensions"" mode: - `""all""` is currently broken, because it will also concatenate variables that don't have the dimension. - `""minimal""` does the right thing, concatenating only variables with the dimension. - In ""new dimensions"" mode: - `""all""` will add the dimension to all variables. - `""minimal""` raises an error if *any* data variables have different values at all, which makes it pretty much useless. Here's my notebook testing this out: https://gist.github.com/shoyer/f44300eddda4f7c476c61f76d1df938b So I'm thinking that we probably want to combine ""all"" and ""minimal"" into a single mode to use as the default, and remove the other behavior, which is either useless or broken. Maybe it would make sense to come up with a new name for this mode, and to make both `""all""` and `""minimal""` deprecated aliases for it? 
In the long term, this leaves only two ""automatic"" modes for `xarray.concat`, which should make things simpler for users trying to figure this out.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,314764258 https://github.com/pydata/xarray/issues/2064#issuecomment-511468454,https://api.github.com/repos/pydata/xarray/issues/2064,511468454,MDEyOklzc3VlQ29tbWVudDUxMTQ2ODQ1NA==,2448579,2019-07-15T16:15:51Z,2019-07-15T16:15:51Z,MEMBER,"@bonnland I don't think you want to change the default `data_vars` but instead update the heuristics as in this comment > we shouldn't implicitly add a new dimensions to variables in the case where the dimension already exists in the dataset. We only need the heuristics/comparisons when an entirely new dimension is being added.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,314764258 https://github.com/pydata/xarray/issues/2064#issuecomment-381975937,https://api.github.com/repos/pydata/xarray/issues/2064,381975937,MDEyOklzc3VlQ29tbWVudDM4MTk3NTkzNw==,1197350,2018-04-17T12:34:15Z,2018-04-17T12:34:15Z,MEMBER,"I'm glad! FWIW, I think this is a relatively simple fix within xarray. @xylar, if you are game, we would love to see a PR from you. Could be a good opportunity to learn more about xarray internals.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,314764258 https://github.com/pydata/xarray/issues/2064#issuecomment-381728814,https://api.github.com/repos/pydata/xarray/issues/2064,381728814,MDEyOklzc3VlQ29tbWVudDM4MTcyODgxNA==,1217238,2018-04-16T19:55:24Z,2018-04-16T19:55:24Z,MEMBER,"> I stand corrected. in 0.10.1, I also see the Time variable getting added to refBottomDepth when I open multiple files. 
So maybe this is not in fact a new problem but an existing issue that happened to behave as I expected only when opening a single file in previous versions. Sorry for not noticing that sooner. OK, in that case I think #2048 was still the right change/bug-fix, making multi-file and single-file behavior consistent. But you certainly have exposed a real issue here. > But this issue raises an important basic point: we might want different behavior for variables in which concat_dim is already a dimension vs. variables for which it is not. Yes, we shouldn't implicitly add a new dimension to variables in the case where the dimension already exists in the dataset. We only need the heuristics/comparisons when an entirely *new* dimension is being added.","{""total_count"": 1, ""+1"": 1, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,314764258 https://github.com/pydata/xarray/issues/2064#issuecomment-381725478,https://api.github.com/repos/pydata/xarray/issues/2064,381725478,MDEyOklzc3VlQ29tbWVudDM4MTcyNTQ3OA==,1197350,2018-04-16T19:44:00Z,2018-04-16T19:44:00Z,MEMBER,But this issue raises an important basic point: we might want different behavior for variables in which `concat_dim` is already a dimension vs. variables for which it is not.,"{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,314764258 https://github.com/pydata/xarray/issues/2064#issuecomment-381722944,https://api.github.com/repos/pydata/xarray/issues/2064,381722944,MDEyOklzc3VlQ29tbWVudDM4MTcyMjk0NA==,1197350,2018-04-16T19:35:12Z,2018-04-16T19:35:12Z,MEMBER,"> so you're fooling xarray into not including the time dimension in your non-time variables by making them coordinates in the above example? Exactly. They *are* coordinates. Those variables are usually related to grid geometry or constants, as I presume is `refBottomDepth` in your example. 
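A sketch of that promotion, assuming a hypothetical dataset where `refBottomDepth` arrived as a data variable:

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {'temp': (('time', 'z'), np.zeros((2, 3))),
     'refBottomDepth': ('z', [10.0, 20.0, 30.0])},
    coords={'time': [0, 1]},
)

# Promote the grid variable to a coordinate so that concatenation along
# 'time' leaves it alone instead of broadcasting it.
ds = ds.set_coords('refBottomDepth')
```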
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,314764258 https://github.com/pydata/xarray/issues/2064#issuecomment-381717472,https://api.github.com/repos/pydata/xarray/issues/2064,381717472,MDEyOklzc3VlQ29tbWVudDM4MTcxNzQ3Mg==,1197350,2018-04-16T19:15:19Z,2018-04-16T19:15:19Z,MEMBER,"👍 This is a persistent problem for me as well. I often find myself writing a preprocessor function like this ```python def process_coords(ds, concat_dim='time', drop=True): coord_vars = [v for v in ds.data_vars if concat_dim not in ds[v].dims] if drop: return ds.drop(coord_vars) else: return ds.set_coords(coord_vars) ds = xr.open_mfdataset('*.nc', preprocess=process_coords) ``` The reason to drop the coordinates is to avoid the comparison that happens when you concatenate coords.","{""total_count"": 1, ""+1"": 1, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,314764258 https://github.com/pydata/xarray/issues/2064#issuecomment-381707540,https://api.github.com/repos/pydata/xarray/issues/2064,381707540,MDEyOklzc3VlQ29tbWVudDM4MTcwNzU0MA==,1217238,2018-04-16T18:42:06Z,2018-04-16T18:42:06Z,MEMBER,"What happens if you open multiple files with `open_mfdataset()`, e.g., for both January and February. Does it result in a dataset with the right dimensions on each variable?","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,314764258