html_url,issue_url,id,node_id,user,created_at,updated_at,author_association,body,reactions,performed_via_github_app,issue
https://github.com/pydata/xarray/issues/1385#issuecomment-1043038150,https://api.github.com/repos/pydata/xarray/issues/1385,1043038150,IC_kwDOAMm_X84-K3_G,1197350,2022-02-17T14:57:03Z,2022-02-17T14:57:03Z,MEMBER,See deeper dive in https://github.com/pydata/xarray/discussions/6284,"{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,224553135
https://github.com/pydata/xarray/issues/1385#issuecomment-1043016100,https://api.github.com/repos/pydata/xarray/issues/1385,1043016100,IC_kwDOAMm_X84-Kymk,1197350,2022-02-17T14:36:23Z,2022-02-17T14:36:23Z,MEMBER,"Ah ok so if that is your goal, `decode_times=False` should be enough to solve it.
There is a problem with the time encoding in this file. The units (`days since 1950-01-01T00:00:00Z`) are not compatible with the values (738457.04166667, etc.). That would place your measurements sometime in the year 3971. This is part of the problem, but not the whole story.
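A quick way to see this (a small sketch using cftime; the raw value is copied from your file):
```python
import cftime

raw = 738457.04166667  # one of the raw time values in the file
# with the units stored in the file, this decodes to a date in the year 3971:
print(cftime.num2date(raw, 'days since 1950-01-01'))
```
Once you know what the values are really counted from, you can overwrite `ds.time.attrs['units']` before calling `xr.decode_cf`.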
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,224553135
https://github.com/pydata/xarray/issues/1385#issuecomment-1043001146,https://api.github.com/repos/pydata/xarray/issues/1385,1043001146,IC_kwDOAMm_X84-Ku86,1197350,2022-02-17T14:21:45Z,2022-02-17T14:22:23Z,MEMBER,"> (I could post to a web server if there's any reason to prefer that.)
In general that would be a little more convenient than Google Drive, because then we could download the file from Python (rather than having a manual step). This would allow us to share a fully copy-pasteable code snippet to reproduce the issue. But don't worry about that for now.
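For example, something like this (just a sketch; the URL is a placeholder for wherever the file ends up hosted) would make the report fully reproducible:
```python
import urllib.request
import xarray as xr

url = 'https://example.com/profiles.nc'  # placeholder URL
fname, _ = urllib.request.urlretrieve(url)  # download to a temporary file
ds = xr.open_dataset(fname, decode_times=False)
print(ds)
```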
First, I'd note that your issue is not really related to `open_mfdataset` at all, since it is reproduced just using `open_dataset`. The core problem is that you have ~15M timesteps, and it is taking forever to decode the times out of them. It's fast when you do `decode_times=False` because the data aren't actually being read. I'm going to make a post over in [discussions](https://github.com/pydata/xarray/discussions) to dig a bit deeper into this. StackOverflow isn't monitored too regularly by this community.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,224553135
https://github.com/pydata/xarray/issues/1385#issuecomment-1042937825,https://api.github.com/repos/pydata/xarray/issues/1385,1042937825,IC_kwDOAMm_X84-Kffh,1197350,2022-02-17T13:14:50Z,2022-02-17T13:14:50Z,MEMBER,"Hi Tom! 👋
So much about xarray has evolved since this original issue was posted. However, we continue to use it as a catch-all for people looking to speed up open_mfdataset. I saw your Stack Overflow post. Any chance you could post a link to the actual file in question?","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,224553135
https://github.com/pydata/xarray/issues/1385#issuecomment-561920115,https://api.github.com/repos/pydata/xarray/issues/1385,561920115,MDEyOklzc3VlQ29tbWVudDU2MTkyMDExNQ==,1197350,2019-12-05T01:09:25Z,2019-12-05T01:09:25Z,MEMBER,"In your [twitter thread](https://twitter.com/tempestchasing/status/1202367523301318656) you said
> Do any of my xarray/dask folks know why open_mfdataset takes such a significant amount of time compared to looping over a list of files? Each file corresponds to a new time, just wanting to open multiple times at once...
The general reason for this is usually that `open_mfdataset` performs coordinate compatibility checks when it concatenates the files. It's useful to actually read the code of open_mfdataset to see how it works.
First, all the files are opened individually
https://github.com/pydata/xarray/blob/577d3a75ea8bb25b99f9d31af8da14210cddff78/xarray/backends/api.py#L900-L903
You can recreate this step outside of xarray yourself by doing something like
```python
import xarray as xr
from glob import glob
datasets = [xr.open_dataset(fname, chunks={}) for fname in glob('*.nc')]
```
Once each dataset is open, xarray calls out to one of its [combine functions](http://xarray.pydata.org/en/latest/combining.html). This logic has gotten more complex over the years as different options have been introduced, but the gist is this:
https://github.com/pydata/xarray/blob/577d3a75ea8bb25b99f9d31af8da14210cddff78/xarray/backends/api.py#L947-L952
You can reproduce this step outside of xarray, e.g.
```python
ds = xr.concat(datasets, dim='time')
```
At that point, various checks will kick in to be sure that the coordinates in the different datasets are compatible. Performing these checks requires the data to be read eagerly, which can be a source of slow performance.
Without seeing more details about your files, it's hard to know exactly where the issue lies. A good place to start is to simply drop all coordinates from your data as a preprocessing step.
```python
def drop_all_coords(ds):
    return ds.reset_coords(drop=True)

xr.open_mfdataset('*.nc', combine='by_coords', preprocess=drop_all_coords)
```
If you observe a big speedup, this points at coordinate compatibility checks as the culprit. From there you can experiment with the various [options](http://xarray.pydata.org/en/latest/generated/xarray.open_mfdataset.html) for `open_mfdataset`, such as `coords='minimal', compat='override'`, etc.
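For example (untested against your files; the glob pattern is a placeholder), something like this skips most of the eager comparisons:
```python
import xarray as xr

ds = xr.open_mfdataset(
    '*.nc',
    combine='by_coords',
    data_vars='minimal',
    coords='minimal',
    compat='override',  # take coords from the first file instead of checking equality
    parallel=True,      # open the files concurrently with dask
)
```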
Once you post your file details, we can provide more concrete suggestions.
","{""total_count"": 6, ""+1"": 6, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,224553135
https://github.com/pydata/xarray/issues/1385#issuecomment-561915767,https://api.github.com/repos/pydata/xarray/issues/1385,561915767,MDEyOklzc3VlQ29tbWVudDU2MTkxNTc2Nw==,1197350,2019-12-05T00:52:06Z,2019-12-05T00:52:06Z,MEMBER,"@keltonhalbert - I'm sorry you're frustrated by this issue. It's hard to provide a general answer to ""why is open_mfdataset slow?"" without seeing the data in question. I'll try to provide some best practices and recommendations here. In the meantime, could you please post the xarray repr of **two of your files**? To be explicit.
```python
ds1 = xr.open_dataset('file1.nc')
print(ds1)
ds2 = xr.open_dataset('file2.nc')
print(ds2)
```
This will help us debug.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,224553135
https://github.com/pydata/xarray/issues/1385#issuecomment-463369751,https://api.github.com/repos/pydata/xarray/issues/1385,463369751,MDEyOklzc3VlQ29tbWVudDQ2MzM2OTc1MQ==,1197350,2019-02-13T21:04:03Z,2019-02-13T21:04:03Z,MEMBER,"What if you do `xr.open_mfdataset(fname, decode_times=False)`?","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,224553135
https://github.com/pydata/xarray/issues/1385#issuecomment-371891466,https://api.github.com/repos/pydata/xarray/issues/1385,371891466,MDEyOklzc3VlQ29tbWVudDM3MTg5MTQ2Ng==,1197350,2018-03-09T17:53:15Z,2018-03-09T17:53:15Z,MEMBER,"Calling `ds = xr.decode_cf(ds, decode_times=False)` on the dataset returns instantly. However, the variable data is wrapped in the adaptors, effectively destroying the chunks
```python
>>> ds.SST.variable._data
LazilyIndexedArray(array=DaskIndexingAdapter(array=dask.array<_apply_mask, shape=(16401, 2400, 3600), dtype=float32, chunksize=(1, 2400, 3600)>), key=BasicIndexer((slice(None, None, None), slice(None, None, None), slice(None, None, None))))
```
Calling getitem on this array triggers the whole dask array to be computed, which would take forever and would completely blow out the notebook memory. This is because of #1372, which would be fixed by #1725.
This has actually become a major showstopper for me. I need to work with this dataset in decoded form.
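The only workaround I can think of (a rough sketch, not something I have fully verified on this dataset; the glob pattern is a placeholder) is to skip `decode_cf` entirely, decode the small time coordinate by hand, and apply the fill-value mask lazily:
```python
import cftime
import xarray as xr

# open without any CF decoding so the dask chunks survive
ds = xr.open_mfdataset('*.nc', decode_cf=False, chunks={'time': 1})

# decode the 1-D time coordinate eagerly; this assumes the units/calendar
# attributes are still present in the raw attrs when decode_cf=False
time = cftime.num2date(ds.time.values, ds.time.attrs['units'],
                       calendar=ds.time.attrs.get('calendar', 'standard'))
ds = ds.assign_coords(time=('time', time))

# mask the fill value lazily; this stays a chunked dask operation
sst = ds.SST.where(ds.SST != ds.SST.attrs['_FillValue'])
```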
Versions
INSTALLED VERSIONS
------------------
commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Linux
OS-release: 3.12.62-60.64.8-default
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
xarray: 0.10.1
pandas: 0.22.0
numpy: 1.13.3
scipy: 1.0.0
netCDF4: 1.3.1
h5netcdf: 0.5.0
h5py: 2.7.1
Nio: None
zarr: 2.2.0a2.dev176
bottleneck: 1.2.1
cyordereddict: None
dask: 0.17.1
distributed: 1.21.3
matplotlib: 2.1.2
cartopy: 0.15.1
seaborn: 0.8.1
setuptools: 38.4.0
pip: 9.0.1
conda: None
pytest: 3.3.2
IPython: 6.2.1
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,224553135
https://github.com/pydata/xarray/issues/1385#issuecomment-370064483,https://api.github.com/repos/pydata/xarray/issues/1385,370064483,MDEyOklzc3VlQ29tbWVudDM3MDA2NDQ4Mw==,1197350,2018-03-02T21:57:26Z,2018-03-02T21:57:26Z,MEMBER,"An update on this long-standing issue.
I have learned that `open_mfdataset` can be blazingly fast if `decode_cf=False` but extremely slow with `decode_cf=True`.
As an example, I am loading a POP dataset on Cheyenne. Anyone with access can try this example.
```python
import os
import xarray as xr

base_dir = '/glade/scratch/rpa/'
prefix = 'BRCP85C5CN_ne120_t12_pop62.c13b17.asdphys.001'
code = 'pop.h.nday1.SST'
glob_pattern = os.path.join(base_dir, prefix, '%s.%s.*.nc' % (prefix, code))

def non_time_coords(ds):
    return [v for v in ds.data_vars
            if 'time' not in ds[v].dims]

def drop_non_essential_vars_pop(ds):
    return ds.drop(non_time_coords(ds))

# this runs almost instantly
ds = xr.open_mfdataset(glob_pattern, decode_times=False, chunks={'time': 1},
                       preprocess=drop_non_essential_vars_pop, decode_cf=False)
```
And returns this
```
Dimensions: (d2: 2, nlat: 2400, nlon: 3600, time: 16401, z_t: 62, z_t_150m: 15, z_w: 62, z_w_bot: 62, z_w_top: 62)
Coordinates:
  * z_w_top (z_w_top) float32 0.0 1000.0 2000.0 3000.0 4000.0 5000.0 ...
  * z_t (z_t) float32 500.0 1500.0 2500.0 3500.0 4500.0 5500.0 ...
  * z_w (z_w) float32 0.0 1000.0 2000.0 3000.0 4000.0 5000.0 6000.0 ...
  * z_t_150m (z_t_150m) float32 500.0 1500.0 2500.0 3500.0 4500.0 5500.0 ...
  * z_w_bot (z_w_bot) float32 1000.0 2000.0 3000.0 4000.0 5000.0 6000.0 ...
  * time (time) float64 7.322e+05 7.322e+05 7.322e+05 7.322e+05 ...
Dimensions without coordinates: d2, nlat, nlon
Data variables:
    time_bound (time, d2) float64 dask.array
    SST (time, nlat, nlon) float32 dask.array
Attributes:
    nsteps_total: 480
    tavg_sum: 64800.0
    title: BRCP85C5CN_ne120_t12_pop62.c13b17.asdphys.001
    start_time: This dataset was created on 2016-03-14 at 05:32:30.3
    Conventions: CF-1.0; http://www.cgd.ucar.edu/cms/eaton/netcdf/CF-curren...
    source: CCSM POP2, the CCSM Ocean Component
    cell_methods: cell_methods = time: mean ==> the variable values are aver...
    calendar: All years have exactly 365 days.
    history: none
    contents: Diagnostic and Prognostic Variables
    revision: $Id: tavg.F90 56176 2013-12-20 18:35:46Z mlevy@ucar.edu $
```
This is roughly 45 years of daily data, one file per year.
Instead, if I just switch to `decode_cf=True` (the default), it takes forever. I can monitor what is happening via the distributed dashboard. It looks like this:
*[screenshot of the distributed dashboard, showing a long stream of `open_dataset` tasks]*
There are many more of these `open_dataset` tasks than there are files (45), so I can only presume there are 16401 individual tasks (one for each timestep), each of which takes about 1 s in serial.
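To put numbers on the difference, a quick timing sketch (reusing `glob_pattern` and `drop_non_essential_vars_pop` from the snippet above):
```python
import time

for label, kwargs in [('decode_cf=False', dict(decode_cf=False, decode_times=False)),
                      ('decode_cf=True', dict())]:
    tic = time.perf_counter()
    ds = xr.open_mfdataset(glob_pattern, chunks={'time': 1},
                           preprocess=drop_non_essential_vars_pop, **kwargs)
    print(label, time.perf_counter() - tic, 's')
```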
This is a real failure of lazy decoding. Maybe it can be fixed by #1725, possibly related to #1372.
cc Pangeo folks: @jhamman, @mrocklin
","{""total_count"": 2, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 2, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,224553135
https://github.com/pydata/xarray/issues/1385#issuecomment-297494539,https://api.github.com/repos/pydata/xarray/issues/1385,297494539,MDEyOklzc3VlQ29tbWVudDI5NzQ5NDUzOQ==,1197350,2017-04-26T18:07:03Z,2017-04-26T18:07:03Z,MEMBER,"cc: @geosciz, who is helping with this project.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,224553135