html_url,issue_url,id,node_id,user,created_at,updated_at,author_association,body,reactions,performed_via_github_app,issue
https://github.com/pydata/xarray/issues/1823#issuecomment-768627652,https://api.github.com/repos/pydata/xarray/issues/1823,768627652,MDEyOklzc3VlQ29tbWVudDc2ODYyNzY1Mg==,2448579,2021-01-27T22:43:59Z,2021-01-27T22:43:59Z,MEMBER,"That's 34k 3MB files! I suggest combining to 1k 100MB files; that would work a lot better.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,288184220
https://github.com/pydata/xarray/issues/1823#issuecomment-768460310,https://api.github.com/repos/pydata/xarray/issues/1823,768460310,MDEyOklzc3VlQ29tbWVudDc2ODQ2MDMxMA==,2448579,2021-01-27T17:50:09Z,2021-01-27T17:50:09Z,MEMBER,"Let's close this since there is an opt-in mostly-fast path. I've added an item to #4648 to cover adding an asv benchmark for mfdataset.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,288184220
https://github.com/pydata/xarray/issues/1823#issuecomment-531913598,https://api.github.com/repos/pydata/xarray/issues/1823,531913598,MDEyOklzc3VlQ29tbWVudDUzMTkxMzU5OA==,2448579,2019-09-16T19:03:47Z,2019-09-16T19:03:47Z,MEMBER,"PS @rabernat

```
%%time
ds = xr.open_mfdataset(""/glade/p/cesm/community/ASD-HIGH-RES-CESM1/hybrid_v5_rel04_BC5_ne120_t12_pop62/ocn/proc/tseries/monthly/*.nc"", parallel=True, coords=""minimal"", data_vars=""minimal"", compat='override')
```

This completes in 40 seconds with 10 workers on cheyenne.","{""total_count"": 3, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 1, ""confused"": 0, ""heart"": 0, ""rocket"": 2, ""eyes"": 0}",,288184220
https://github.com/pydata/xarray/issues/1823#issuecomment-531912893,https://api.github.com/repos/pydata/xarray/issues/1823,531912893,MDEyOklzc3VlQ29tbWVudDUzMTkxMjg5Mw==,2448579,2019-09-16T19:01:57Z,2019-09-16T19:01:57Z,MEMBER,"=)

@TomNicholas PRs welcome!","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,288184220
https://github.com/pydata/xarray/issues/1823#issuecomment-531816800,https://api.github.com/repos/pydata/xarray/issues/1823,531816800,MDEyOklzc3VlQ29tbWVudDUzMTgxNjgwMA==,2448579,2019-09-16T15:00:16Z,2019-09-16T15:00:16Z,MEMBER,"YES! (well almost)

The PR lets you skip compatibility checks. The magic spell is `xr.open_mfdataset(..., data_vars=""minimal"", coords=""minimal"", compat=""override"")`. You can skip index comparison by adding `join=""override""`.

What's left is extremely large indexes and lazy index / coordinate loading, but we have #2039 open for that. I will rename that issue.

If you have time, can you test it out?","{""total_count"": 2, ""+1"": 1, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 1, ""rocket"": 0, ""eyes"": 0}",,288184220
https://github.com/pydata/xarray/issues/1823#issuecomment-489135792,https://api.github.com/repos/pydata/xarray/issues/1823,489135792,MDEyOklzc3VlQ29tbWVudDQ4OTEzNTc5Mg==,2448579,2019-05-03T15:29:14Z,2019-05-03T15:40:27Z,MEMBER,"One common use-case is files with large numbers of `concat_dim`-invariant non-dimensional coordinates. This is easy to speed up by dropping those variables from all but the first file, e.g.

https://github.com/pangeo-data/esgf2xarray/blob/6a5e4df0d329c2f23b403cbfbb65f0f1dfa98d52/esgf2zarr/aggregate.py#L107-L110

``` python
# keep only coordinates from first ensemble member to simplify merge
first = member_dsets_aligned[0]
rest = [mds.reset_coords(drop=True) for mds in member_dsets_aligned[1:]]
objs_to_concat = [first] + rest
```

Similarly https://github.com/NCAR/intake-esm/blob/e86a8e8a80ce0fd4198665dbef3ba46af264b5ea/intake_esm/aggregate.py#L53-L57

``` python
def merge_vars_two_datasets(ds1, ds2):
    """"""
    Merge two datasets, dropping all variables from second dataset
    that already exist in the first dataset's coordinates.
    """"""
```

See also #2039 (second code block).

One way to do this might be to add a `master_file` kwarg to `open_mfdataset`. This would imply `coords='minimal', join='exact'` (I think; `prealigned=True` in some other proposals) and would drop non-dimensional coordinates from all but the first file and then call concat. As a bonus, it would assign attributes from the `master_file` to the merged dataset (for which I think there are open issues): this functionality exists in `netCDF4.MFDataset`, so that's a plus.

EDIT: #2039 (third code block) is also a possibility. This might look like

``` python
xr.open_mfdataset('files*.nc', master_file='first', concat_dim='time')
```

in which case the first file is read; all coords that are not `concat_dim` become `drop_variables` for an `open_dataset` call that reads the remaining files. We then merge with the first dataset and assign attrs.

EDIT2: `master_file` combines two different functionalities here: specifying a ""template file"" and a file to choose attributes from. So maybe we need two kwargs: `template_file` and `attrs_from`?","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,288184220
https://github.com/pydata/xarray/issues/1823#issuecomment-488440840,https://api.github.com/repos/pydata/xarray/issues/1823,488440840,MDEyOklzc3VlQ29tbWVudDQ4ODQ0MDg0MA==,2448579,2019-05-01T21:42:01Z,2019-05-01T21:45:38Z,MEMBER,"I am currently motivated to fix this.

1. Over in https://github.com/pydata/xarray/pull/1413#issuecomment-302843502 @rabernat mentioned

   > allowing the user to pass join='exact' via open_mfdataset. A related optimization would be to allow the user to pass coords='minimal' (or other concat coords options) via open_mfdataset.

2. @shoyer suggested calling decode_cf later here, though perhaps this won't help too much: https://github.com/pydata/xarray/issues/1385#issuecomment-439263419

Is this all that we can do on the xarray side?","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,288184220