html_url,issue_url,id,node_id,user,created_at,updated_at,author_association,body,reactions,performed_via_github_app,issue https://github.com/pydata/xarray/pull/3545#issuecomment-1362936825,https://api.github.com/repos/pydata/xarray/issues/3545,1362936825,IC_kwDOAMm_X85RPMP5,5821660,2022-12-22T14:56:44Z,2022-12-22T14:56:44Z,MEMBER,"@scottcha @keewis I've tried hard, but finally decided to start from scratch, see #7400. ","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,524043729 https://github.com/pydata/xarray/pull/3545#issuecomment-1306187168,https://api.github.com/repos/pydata/xarray/issues/3545,1306187168,IC_kwDOAMm_X85N2tWg,14808389,2022-11-07T21:02:35Z,2022-11-09T14:13:24Z,MEMBER,"> I can try to rebase the PR onto latest main I did try that a few months ago, but a lot has changed since the PR was opened so it might actually be easier to reimplement the PR?","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,524043729 https://github.com/pydata/xarray/pull/3545#issuecomment-1306229983,https://api.github.com/repos/pydata/xarray/issues/3545,1306229983,IC_kwDOAMm_X85N23zf,5821660,2022-11-07T21:37:11Z,2022-11-07T21:37:11Z,MEMBER,Thanks @keewis for the heads up. I'll have a look and if things get too complicated a reimplementation might be our best option.,"{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,524043729 https://github.com/pydata/xarray/pull/3545#issuecomment-1306157550,https://api.github.com/repos/pydata/xarray/issues/3545,1306157550,IC_kwDOAMm_X85N2mHu,5821660,2022-11-07T20:38:21Z,2022-11-07T20:38:21Z,MEMBER,"Great @scottcha, I was coming back here too every once in an while to just refresh my mind with the ideas pursued here. I can try to rebase the PR onto latest main, if I can free some cycles in the following days for starters.","{""total_count"": 1, ""+1"": 1, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,524043729 https://github.com/pydata/xarray/pull/3545#issuecomment-1306147080,https://api.github.com/repos/pydata/xarray/issues/3545,1306147080,IC_kwDOAMm_X85N2jkI,5821660,2022-11-07T20:26:57Z,2022-11-07T20:26:57Z,MEMBER,@scottcha Are you still around and interested to bring this along? If not I could try to dive again into this.,"{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,524043729 https://github.com/pydata/xarray/pull/3545#issuecomment-586346637,https://api.github.com/repos/pydata/xarray/issues/3545,586346637,MDEyOklzc3VlQ29tbWVudDU4NjM0NjYzNw==,2448579,2020-02-14T15:52:21Z,2020-02-14T15:52:33Z,MEMBER,"I am now wondering if we can use `align` or `reindex` to do the filling for us. Example: goal is concat along 'x' with result dataset having `x=[1,2,3,4]` 1. Loop through datasets and assign coordinate values as appropriate. 2. Break datasets up into mappings `collected = {""variable"": [var1_at_x=1, var2_at_x=2, var4_at_x=4]}` -> there's some stuff in `merge.py` that could be reused for this 3. concatenate these lists to get a new mapping `concatenated = {""variable"": [var_at_x=[1,2,4]]}` 4. apply `reindexed = {concatenated[var].reindex(x=[1,2,3,4], fill_value=...) for var in concatenated}` 5. create dataset `Dataset(reindexed)` Step 1 would be where we deal with all the edge cases mentioned in @shoyer's comment viz > For example: > > Pre-existing vs non-pre-existing dimension > Pre-existing dimensions of different sizes > Missing data variables vs coordinates vs indexed coordinates ","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,524043729 https://github.com/pydata/xarray/pull/3545#issuecomment-579203497,https://api.github.com/repos/pydata/xarray/issues/3545,579203497,MDEyOklzc3VlQ29tbWVudDU3OTIwMzQ5Nw==,5821660,2020-01-28T11:33:41Z,2020-01-28T11:52:02Z,MEMBER,"@scottcha @shoyer below is a minimal example where one variable is missing in each file. ```python import random random.seed(123) random.randint(0, 10) # create var names list with one missing value orig = [f'd{i:02}' for i in range(10)] datasets = [] for i in range(1, 9): l1 = orig.copy() l1.remove(f'd{i:02}') datasets.append(l1) # create files for i, dsl in enumerate(datasets): foo_data = np.arange(24).reshape(2, 3, 4) with nc.Dataset(f'test{i:02}.nc', 'w') as ds: ds.createDimension('x', size=2) ds.createDimension('y', size=3) ds.createDimension('z', size=4) for k in dsl: ds.createVariable(k, int, ('x', 'y', 'z')) ds.variables[k][:] = foo_data flist = glob.glob('test*.nc') dslist = [] for f in flist: dslist.append(xr.open_dataset(f)) ds2 = xr.concat(dslist, dim='time') ds2 ``` Output: ``` Dimensions: (time: 8, x: 2, y: 3, z: 4) Dimensions without coordinates: time, x, y, z Data variables: d01 (x, y, z) int64 0 1 2 3 4 5 6 7 8 9 ... 15 16 17 18 19 20 21 22 23 d00 (time, x, y, z) int64 0 1 2 3 4 5 6 7 8 ... 16 17 18 19 20 21 22 23 d02 (time, x, y, z) float64 0.0 1.0 2.0 3.0 4.0 ... 20.0 21.0 22.0 23.0 d03 (time, x, y, z) float64 0.0 1.0 2.0 3.0 4.0 ... 20.0 21.0 22.0 23.0 d04 (time, x, y, z) float64 0.0 1.0 2.0 3.0 4.0 ... 20.0 21.0 22.0 23.0 d05 (time, x, y, z) float64 0.0 1.0 2.0 3.0 4.0 ... 20.0 21.0 22.0 23.0 d06 (time, x, y, z) float64 0.0 1.0 2.0 3.0 4.0 ... 20.0 21.0 22.0 23.0 d07 (time, x, y, z) float64 0.0 1.0 2.0 3.0 4.0 ... 20.0 21.0 22.0 23.0 d08 (time, x, y, z) float64 0.0 1.0 2.0 3.0 4.0 ... nan nan nan nan nan d09 (time, x, y, z) int64 0 1 2 3 4 5 6 7 8 ... 16 17 18 19 20 21 22 23 ``` Three cases here: - `d00` and `d09` are available in all datasets, and they are concatenated correctly (keeping dtype) - `d02` to `d08` are missing in one dataset and are filled with the created dummy variable, but the dtype is converted to float64 - `d01` is not handled properly, because it is missing in the first dataset, this is due to checking only variables of first dataset in [`_calc_concat_over`](https://github.com/scottcha/xarray/blob/cf5b8bdfb0fdaf626ecf7b83590f15aa9aef1d6b/xarray/core/concat.py#L235) ```python elif opt == ""all"": concat_over.update( set(getattr(datasets[0], subset)) - set(datasets[0].dims) ) ``` and from putting `d01` in [`result_vars`](https://github.com/scottcha/xarray/blob/cf5b8bdfb0fdaf626ecf7b83590f15aa9aef1d6b/xarray/core/concat.py#L329) before iterating to find missing variables. ","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,524043729 https://github.com/pydata/xarray/pull/3545#issuecomment-577778006,https://api.github.com/repos/pydata/xarray/issues/3545,577778006,MDEyOklzc3VlQ29tbWVudDU3Nzc3ODAwNg==,1217238,2020-01-23T17:10:48Z,2020-01-23T17:10:48Z,MEMBER,Can you explain why you think you need the nested iteration over dataset variables? What ordering are you trying to achieve?,"{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,524043729 https://github.com/pydata/xarray/pull/3545#issuecomment-577745924,https://api.github.com/repos/pydata/xarray/issues/3545,577745924,MDEyOklzc3VlQ29tbWVudDU3Nzc0NTkyNA==,5821660,2020-01-23T15:59:47Z,2020-01-23T15:59:47Z,MEMBER,"@dcherian Just to clarify, the concatenation is done along a new dimension (which has to be created by expand_dims). What do you mean by short-clrcuit in this context? ","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,524043729 https://github.com/pydata/xarray/pull/3545#issuecomment-577737475,https://api.github.com/repos/pydata/xarray/issues/3545,577737475,MDEyOklzc3VlQ29tbWVudDU3NzczNzQ3NQ==,2448579,2020-01-23T15:41:33Z,2020-01-23T15:41:33Z,MEMBER,"> the most time consuming part is the expand_dims for every dataset, which accounts for roughly 80% overall concat runtime. Hmmm... maybe we need a short-circuit version of `expand_dims`?","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,524043729 https://github.com/pydata/xarray/pull/3545#issuecomment-577696741,https://api.github.com/repos/pydata/xarray/issues/3545,577696741,MDEyOklzc3VlQ29tbWVudDU3NzY5Njc0MQ==,5821660,2020-01-23T14:08:57Z,2020-01-23T14:08:57Z,MEMBER,"@scottcha @shoyer For one of my use cases (240 datasets, 1 with missing variables) I do not see any performance penalties using this implementation compared to the current. But this might be due to the fact, that the most time consuming part is the `expand_dims` for every dataset, which accounts for roughly 80% overall concat runtime. If I can be of any help to push this over the line, please ping me.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,524043729 https://github.com/pydata/xarray/pull/3545#issuecomment-577530863,https://api.github.com/repos/pydata/xarray/issues/3545,577530863,MDEyOklzc3VlQ29tbWVudDU3NzUzMDg2Mw==,5821660,2020-01-23T06:52:06Z,2020-01-23T07:29:37Z,MEMBER,"@scottcha If found this while searching. Have the same requirements, means missing DataArrays in some Datasets of a timeseries to be concatenated. I've already some hacks and workarounds in place for my specific use cases, but it would be really great if this could be handled by xarray. I'll try to test your current implementation against my source data and will report my findings here. Update: I've rebased locally on latest master and this works smoothly with my data (which uses packed data). I'll now look into performance.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,524043729 https://github.com/pydata/xarray/pull/3545#issuecomment-569845354,https://api.github.com/repos/pydata/xarray/issues/3545,569845354,MDEyOklzc3VlQ29tbWVudDU2OTg0NTM1NA==,1217238,2019-12-31T01:40:02Z,2019-12-31T01:40:02Z,MEMBER,"I'll take a look at this more carefully soon. But I do think it is a hard requirement that concat runs in linear time (with respect to the total number of variables across all datasets). On Mon, Dec 30, 2019 at 5:18 PM Scott Chamberlin wrote: > Hi, I've provided a new update to this PR (sorry it took me awhile both to > get more familiar with the code and find the time to update the PR). I > improved the logic to be a bit more performant and handle more edge cases > as well as updated the test suite. I have a few questions: > > 1. The tests I wrote are a bit more verbose than the tests previously. > I can tighten them down but I found it was easier for me to read the logic > in this form. Please let me know what you prefer. > 2. I'm still not quite sure I've captured all the scenarios as I'm a > pretty basic xarray user so please let me know if there is still something > I'm missing. > > — > You are receiving this because you were mentioned. > Reply to this email directly, view it on GitHub > , > or unsubscribe > > . > ","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,524043729 https://github.com/pydata/xarray/pull/3545#issuecomment-555299451,https://api.github.com/repos/pydata/xarray/issues/3545,555299451,MDEyOklzc3VlQ29tbWVudDU1NTI5OTQ1MQ==,1217238,2019-11-19T02:12:45Z,2019-11-19T02:12:45Z,MEMBER,"Thanks for working on this important issue! There are a lot of edge cases that can come up in `concat`, so I think it would be very helpful to try to enumerate a broader set of unit tests for thoroughly testing this. For example: - Pre-existing vs non-pre-existing dimension - Pre-existing dimensions of different sizes - Missing data variables vs coordinates vs indexed coordinates","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,524043729