html_url,issue_url,id,node_id,user,created_at,updated_at,author_association,body,reactions,performed_via_github_app,issue
https://github.com/pydata/xarray/issues/1823#issuecomment-768627652,https://api.github.com/repos/pydata/xarray/issues/1823,768627652,MDEyOklzc3VlQ29tbWVudDc2ODYyNzY1Mg==,2448579,2021-01-27T22:43:59Z,2021-01-27T22:43:59Z,MEMBER,"That's 34k 3MB files! I suggest combining them into 1k 100MB files, which would work a lot better.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,288184220
https://github.com/pydata/xarray/issues/1823#issuecomment-768460310,https://api.github.com/repos/pydata/xarray/issues/1823,768460310,MDEyOklzc3VlQ29tbWVudDc2ODQ2MDMxMA==,2448579,2021-01-27T17:50:09Z,2021-01-27T17:50:09Z,MEMBER,Let's close this since there is an opt-in mostly-fast path. I've added an item to #4648 to cover adding an asv benchmark for mfdataset.,"{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,288184220
https://github.com/pydata/xarray/issues/1823#issuecomment-531913598,https://api.github.com/repos/pydata/xarray/issues/1823,531913598,MDEyOklzc3VlQ29tbWVudDUzMTkxMzU5OA==,2448579,2019-09-16T19:03:47Z,2019-09-16T19:03:47Z,MEMBER,"PS @rabernat
```
%%time
ds = xr.open_mfdataset(""/glade/p/cesm/community/ASD-HIGH-RES-CESM1/hybrid_v5_rel04_BC5_ne120_t12_pop62/ocn/proc/tseries/monthly/*.nc"",
parallel=True, coords=""minimal"", data_vars=""minimal"", compat='override')
```
This completes in 40 seconds with 10 workers on cheyenne.","{""total_count"": 3, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 1, ""confused"": 0, ""heart"": 0, ""rocket"": 2, ""eyes"": 0}",,288184220
https://github.com/pydata/xarray/issues/1823#issuecomment-531912893,https://api.github.com/repos/pydata/xarray/issues/1823,531912893,MDEyOklzc3VlQ29tbWVudDUzMTkxMjg5Mw==,2448579,2019-09-16T19:01:57Z,2019-09-16T19:01:57Z,MEMBER,=) @TomNicholas PRs welcome!,"{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,288184220
https://github.com/pydata/xarray/issues/1823#issuecomment-531905844,https://api.github.com/repos/pydata/xarray/issues/1823,531905844,MDEyOklzc3VlQ29tbWVudDUzMTkwNTg0NA==,35968931,2019-09-16T18:43:52Z,2019-09-16T18:43:52Z,MEMBER,"This is big if true!
But surely to close an issue raised by complaints about speed, we should really have some new asv speed tests?","{""total_count"": 1, ""+1"": 1, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,288184220
https://github.com/pydata/xarray/issues/1823#issuecomment-531816800,https://api.github.com/repos/pydata/xarray/issues/1823,531816800,MDEyOklzc3VlQ29tbWVudDUzMTgxNjgwMA==,2448579,2019-09-16T15:00:16Z,2019-09-16T15:00:16Z,MEMBER,"YES!
(well almost)
The PR lets you skip compatibility checks.
The magic spell is `xr.open_mfdataset(..., data_vars=""minimal"", coords=""minimal"", compat=""override"")`
You can skip index comparison by adding `join=""override""`.
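A minimal sketch of the full call with all of these options together (the file pattern is a placeholder):
``` python
import xarray as xr

# Sketch only: every check that compares variables or indexes across files is
# skipped, so this is only safe when the files are known to be compatible.
ds = xr.open_mfdataset(
    'files*.nc',           # placeholder pattern
    parallel=True,         # open files in parallel with dask
    data_vars='minimal',   # only concatenate data variables that have the concat dim
    coords='minimal',      # likewise for coordinates
    compat='override',     # skip equality checks; take values from the first file
    join='override',       # skip index comparison across files
)
```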
What's left is extremely large indexes and lazy index/coordinate loading, but we have #2039 open for that. I will rename that issue.
If you have time, can you test it out?","{""total_count"": 2, ""+1"": 1, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 1, ""rocket"": 0, ""eyes"": 0}",,288184220
https://github.com/pydata/xarray/issues/1823#issuecomment-531813935,https://api.github.com/repos/pydata/xarray/issues/1823,531813935,MDEyOklzc3VlQ29tbWVudDUzMTgxMzkzNQ==,1197350,2019-09-16T14:53:57Z,2019-09-16T14:53:57Z,MEMBER,"Is this issue really closed?!?
🎉🎂🏆🥇","{""total_count"": 1, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 1, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,288184220
https://github.com/pydata/xarray/issues/1823#issuecomment-489135792,https://api.github.com/repos/pydata/xarray/issues/1823,489135792,MDEyOklzc3VlQ29tbWVudDQ4OTEzNTc5Mg==,2448579,2019-05-03T15:29:14Z,2019-05-03T15:40:27Z,MEMBER,"One common use case is files with large numbers of `concat_dim`-invariant non-dimensional coordinates. This is easy to speed up by dropping those variables from all but the first file.
e.g.
https://github.com/pangeo-data/esgf2xarray/blob/6a5e4df0d329c2f23b403cbfbb65f0f1dfa98d52/esgf2zarr/aggregate.py#L107-L110
``` python
# keep only coordinates from first ensemble member to simplify merge
first = member_dsets_aligned[0]
rest = [mds.reset_coords(drop=True) for mds in member_dsets_aligned[1:]]
objs_to_concat = [first] + rest
```
Similarly https://github.com/NCAR/intake-esm/blob/e86a8e8a80ce0fd4198665dbef3ba46af264b5ea/intake_esm/aggregate.py#L53-L57
``` python
def merge_vars_two_datasets(ds1, ds2):
""""""
Merge two datasets, dropping all variables from
second dataset that already exist in the first dataset's coordinates.
""""""
```
See also #2039 (second code block)
One way to do this might be to add a `master_file` kwarg to `open_mfdataset`. This would imply `coords='minimal', join='exact'` (I think; `prealigned=True` in some other proposals), would drop non-dimensional coordinates from all but the first file, and would then call `concat`.
As a bonus, it would assign attributes from the `master_file` to the merged dataset (for which I think there are open issues): this functionality exists in `netCDF4.MFDataset`, so that's a plus.
EDIT: #2039 (third code block) is also a possibility. This might look like
``` python
xr.open_mfdataset('files*.nc', master_file='first', concat_dim='time')
```
in which case the first file is read; all coords that are not `concat_dim` become `drop_variables` for an `open_dataset` call that reads the remaining files. We then merge with the first dataset and assign attrs.
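A rough sketch of that flow using only the existing API (`master_file` itself is hypothetical here, and the concat dimension is assumed to be `time` as in the example above):
``` python
import glob
import xarray as xr

files = sorted(glob.glob('files*.nc'))

# read the first (master) file normally
first = xr.open_dataset(files[0])

# every coordinate that does not depend on the concat dim is dropped from the
# remaining files instead of being read and compared again
drop = [name for name in first.coords if 'time' not in first[name].dims]
rest = [xr.open_dataset(f, drop_variables=drop) for f in files[1:]]

combined = xr.concat([first] + rest, dim='time')
combined.attrs = first.attrs  # attrs come from the master file
```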
EDIT2: `master_file` combines two different functionalities here: specifying a ""template file"" and a file to choose attributes from. So maybe we need two kwargs: `template_file` and `attrs_from`?","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,288184220
https://github.com/pydata/xarray/issues/1823#issuecomment-489101053,https://api.github.com/repos/pydata/xarray/issues/1823,489101053,MDEyOklzc3VlQ29tbWVudDQ4OTEwMTA1Mw==,1197350,2019-05-03T13:47:12Z,2019-05-03T13:47:12Z,MEMBER,"So I think it is quite important to consider this issue together with #2697. An XML specification called [NCML](https://www.unidata.ucar.edu/software/thredds/current/netcdf-java/ncml/) already exists that tells software how to put together multiple netCDF files into a single virtual netCDF. We should leverage this existing spec as much as possible.
A realistic use case for me is that I have, say, 1000 files of high-res model output, each with large coordinate variables, all generated from the same model run. If we *know a priori* that certain coordinates (dimension coordinates or otherwise) are identical across these files, we could save a lot of disk reads (the slow part of `open_mfdataset`) by never reading those coordinates at all. Enabling this would require a pretty low-level change in xarray. For example, we couldn't even rely on `open_dataset` in its current form to open files, because `open_dataset` eagerly loads all dimension coordinates into indexes. One way forward might be to create a new Store class.
For a catalog of tricks I use to optimize opening these sorts of big, complex, multi-file datasets (e.g. CMIP), check out
https://github.com/pangeo-data/esgf2xarray/blob/master/esgf2zarr/aggregate.py
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,288184220
https://github.com/pydata/xarray/issues/1823#issuecomment-489027263,https://api.github.com/repos/pydata/xarray/issues/1823,489027263,MDEyOklzc3VlQ29tbWVudDQ4OTAyNzI2Mw==,35968931,2019-05-03T09:25:00Z,2019-05-03T09:25:00Z,MEMBER,"@dcherian I'm sorry, I'm very interested in this but after reading the issues I'm still not clear on what's being proposed:
What exactly is the bottleneck?
- Is it reading the coords from all the files?
- Is it loading the coord values into memory?
- Is it performing the alignment checks on those coords once they're in memory?
- Is it performing alignment checks on the dimensions?
- Is this suggestion relevant to datasets that don't have any coords?
Which of these steps would a `join='exact'` option omit?
> A related optimization would be to allow the user to pass coords='minimal' (or other concat coords options) via open_mfdataset.
But this is already an option to `open_mfdataset`?","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,288184220
https://github.com/pydata/xarray/issues/1823#issuecomment-488440840,https://api.github.com/repos/pydata/xarray/issues/1823,488440840,MDEyOklzc3VlQ29tbWVudDQ4ODQ0MDg0MA==,2448579,2019-05-01T21:42:01Z,2019-05-01T21:45:38Z,MEMBER,"I am currently motivated to fix this.
1. Over in https://github.com/pydata/xarray/pull/1413#issuecomment-302843502 @rabernat mentioned
> allowing the user to pass join='exact' via open_mfdataset. A related optimization would be to allow the user to pass coords='minimal' (or other concat coords options) via open_mfdataset.
2. @shoyer suggested calling `decode_cf` later here, though perhaps this won't help too much: https://github.com/pydata/xarray/issues/1385#issuecomment-439263419
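For reference, a minimal sketch of what deferring the decode could look like from user code (the file pattern is a placeholder):
``` python
import xarray as xr

# open without CF decoding, then decode the combined dataset once at the end
ds = xr.open_mfdataset('files*.nc', decode_cf=False)
ds = xr.decode_cf(ds)
```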
Is this all that we can do on the xarray side?","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,288184220
https://github.com/pydata/xarray/issues/1823#issuecomment-372862174,https://api.github.com/repos/pydata/xarray/issues/1823,372862174,MDEyOklzc3VlQ29tbWVudDM3Mjg2MjE3NA==,2443309,2018-03-14T00:13:34Z,2018-03-14T00:13:34Z,MEMBER,"@jbusecke - No. These options are not mutually exclusive. The parallel open is, in my opinion, the lowest-hanging fruit, so that's why I started there. There are other improvements that we can tackle incrementally.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,288184220
https://github.com/pydata/xarray/issues/1823#issuecomment-357336022,https://api.github.com/repos/pydata/xarray/issues/1823,357336022,MDEyOklzc3VlQ29tbWVudDM1NzMzNjAyMg==,2443309,2018-01-12T19:46:12Z,2018-01-12T19:46:12Z,MEMBER,"@rabernat - Depending on the structure of the dataset, another possibility that would speed up some `open_mfdataset` tasks substantially is to implement the step of opening each file and getting its metadata in some parallel way (dask/joblib/etc.), either returning just the dataset schema or a picklable version of the dataset itself. I think this will only be able to work with `autoclose=True`, but it could be quite useful when working with many files.","{""total_count"": 3, ""+1"": 3, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,288184220