html_url,issue_url,id,node_id,user,created_at,updated_at,author_association,body,reactions,performed_via_github_app,issue https://github.com/pydata/xarray/issues/1385#issuecomment-1043038150,https://api.github.com/repos/pydata/xarray/issues/1385,1043038150,IC_kwDOAMm_X84-K3_G,1197350,2022-02-17T14:57:03Z,2022-02-17T14:57:03Z,MEMBER,See deeper dive in https://github.com/pydata/xarray/discussions/6284,"{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,224553135 https://github.com/pydata/xarray/issues/1385#issuecomment-1043016100,https://api.github.com/repos/pydata/xarray/issues/1385,1043016100,IC_kwDOAMm_X84-Kymk,1197350,2022-02-17T14:36:23Z,2022-02-17T14:36:23Z,MEMBER,"Ah ok so if that is your goal, `decode_times=False` should be enough to solve it. There is a problem with the time encoding in this file. The units (`days since 1950-01-01T00:00:00Z`) are not compatible with the values (738457.04166667, etc.). That would place your measurements sometime in the year 3971. This is part of the problem, but not the whole story. ","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,224553135 https://github.com/pydata/xarray/issues/1385#issuecomment-1043001146,https://api.github.com/repos/pydata/xarray/issues/1385,1043001146,IC_kwDOAMm_X84-Ku86,1197350,2022-02-17T14:21:45Z,2022-02-17T14:22:23Z,MEMBER,"> (I could post to a web server if there's any reason to prefer that.) In general that would be a little more convenient than google drive, because then we could download the file from python (rather than having a manual step). This would allow us to share a fully copy-pasteable code snippet to reproduce the issue. But don't worry about that for now. First, I'd note that your issue is not really related to `open_mfdataset` at all, since it is reproduced just using `open_dataset`. The core problem is that you have ~15M timesteps, and it is taking forever to decode the times out of them. It's fast when you do `decode_times=False` because the data aren't actually being read. I'm going to make a post over in [discussions](https://github.com/pydata/xarray/discussions) to dig a bit deeper into this. StackOverflow isn't monitored too regularly by this community.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,224553135 https://github.com/pydata/xarray/issues/1385#issuecomment-1042937825,https://api.github.com/repos/pydata/xarray/issues/1385,1042937825,IC_kwDOAMm_X84-Kffh,1197350,2022-02-17T13:14:50Z,2022-02-17T13:14:50Z,MEMBER,"Hi Tom! 👋 So much has evolved about xarray since this original issue was posted. However, we continue to use it as a catchall for people looking to speed up open_mfdataset. I saw your stackoverflow post. 
Any chance you could post a link to the actual file in question?","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,224553135 https://github.com/pydata/xarray/issues/1385#issuecomment-561920115,https://api.github.com/repos/pydata/xarray/issues/1385,561920115,MDEyOklzc3VlQ29tbWVudDU2MTkyMDExNQ==,1197350,2019-12-05T01:09:25Z,2019-12-05T01:09:25Z,MEMBER,"In your [twitter thread](https://twitter.com/tempestchasing/status/1202367523301318656) you said

> Do any of my xarray/dask folks know why open_mfdataset takes such a significant amount of time compared to looping over a list of files? Each file corresponds to a new time, just wanting to open multiple times at once...

The general reason for this is usually that `open_mfdataset` performs coordinate compatibility checks when it concatenates the files. It's useful to actually read the code of `open_mfdataset` to see how it works. First, all the files are opened individually:

https://github.com/pydata/xarray/blob/577d3a75ea8bb25b99f9d31af8da14210cddff78/xarray/backends/api.py#L900-L903

You can recreate this step outside of xarray yourself by doing something like

```python
from glob import glob
datasets = [xr.open_dataset(fname, chunks={}) for fname in glob('*.nc')]
```

Once each dataset is open, xarray calls out to one of its [combine functions](http://xarray.pydata.org/en/latest/combining.html). This logic has gotten more complex over the years as different options have been introduced, but the gist is this:

https://github.com/pydata/xarray/blob/577d3a75ea8bb25b99f9d31af8da14210cddff78/xarray/backends/api.py#L947-L952

You can reproduce this step outside of xarray, e.g.

```
ds = xr.concat(datasets, dim='time')
```

At that point, various checks will kick in to be sure that the coordinates in the different datasets are compatible. Performing these checks requires the data to be read eagerly, which can be a source of slow performance. Without seeing more details about your files, it's hard to know exactly where the issue lies. A good place to start is to simply drop all coordinates from your data as a preprocessing step.

```
def drop_all_coords(ds):
    return ds.reset_coords(drop=True)

xr.open_mfdataset('*.nc', combine='by_coords', preprocess=drop_all_coords)
```

If you observe a big speedup, this points at coordinate compatibility checks as the culprit. From there you can experiment with the various [options](http://xarray.pydata.org/en/latest/generated/xarray.open_mfdataset.html) for `open_mfdataset`, such as `coords='minimal', compat='override'`, etc. Once you post your file details, we can provide more concrete suggestions. ","{""total_count"": 6, ""+1"": 6, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,224553135 https://github.com/pydata/xarray/issues/1385#issuecomment-561915767,https://api.github.com/repos/pydata/xarray/issues/1385,561915767,MDEyOklzc3VlQ29tbWVudDU2MTkxNTc2Nw==,1197350,2019-12-05T00:52:06Z,2019-12-05T00:52:06Z,MEMBER,"@keltonhalbert - I'm sorry you're frustrated by this issue. It's hard to provide a general answer to ""why is open_mfdataset slow?"" without seeing the data in question. I'll try to provide some best practices and recommendations here. In the meantime, could you please post the xarray repr of **two of your files**? To be explicit.
```python
ds1 = xr.open_dataset('file1.nc')
print(ds1)
ds2 = xr.open_dataset('file2.nc')
print(ds2)
```

This will help us debug.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,224553135 https://github.com/pydata/xarray/issues/1385#issuecomment-463369751,https://api.github.com/repos/pydata/xarray/issues/1385,463369751,MDEyOklzc3VlQ29tbWVudDQ2MzM2OTc1MQ==,1197350,2019-02-13T21:04:03Z,2019-02-13T21:04:03Z,MEMBER,"What if you do `xr.open_mfdataset(fname, decode_times=False)`?","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,224553135 https://github.com/pydata/xarray/issues/1385#issuecomment-371891466,https://api.github.com/repos/pydata/xarray/issues/1385,371891466,MDEyOklzc3VlQ29tbWVudDM3MTg5MTQ2Ng==,1197350,2018-03-09T17:53:15Z,2018-03-09T17:53:15Z,MEMBER,"Calling `ds = xr.decode_cf(ds, decode_times=False)` on the dataset returns instantly. However, the variable data is wrapped in the adaptors, effectively destroying the chunks:

```python
>>> ds.SST.variable._data
LazilyIndexedArray(array=DaskIndexingAdapter(array=dask.array<_apply_mask, shape=(16401, 2400, 3600), dtype=float32, chunksize=(1, 2400, 3600)>), key=BasicIndexer((slice(None, None, None), slice(None, None, None), slice(None, None, None))))
```

Calling getitem on this array triggers the whole dask array to be computed, which would take forever and would completely blow out the notebook memory. This is because of #1372, which would be fixed by #1725. This has actually become a major showstopper for me. I need to work with this dataset in decoded form. Versions
INSTALLED VERSIONS
------------------
commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Linux
OS-release: 3.12.62-60.64.8-default
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
xarray: 0.10.1
pandas: 0.22.0
numpy: 1.13.3
scipy: 1.0.0
netCDF4: 1.3.1
h5netcdf: 0.5.0
h5py: 2.7.1
Nio: None
zarr: 2.2.0a2.dev176
bottleneck: 1.2.1
cyordereddict: None
dask: 0.17.1
distributed: 1.21.3
matplotlib: 2.1.2
cartopy: 0.15.1
seaborn: 0.8.1
setuptools: 38.4.0
pip: 9.0.1
conda: None
pytest: 3.3.2
IPython: 6.2.1
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,224553135 https://github.com/pydata/xarray/issues/1385#issuecomment-370064483,https://api.github.com/repos/pydata/xarray/issues/1385,370064483,MDEyOklzc3VlQ29tbWVudDM3MDA2NDQ4Mw==,1197350,2018-03-02T21:57:26Z,2018-03-02T21:57:26Z,MEMBER,"An update on this long-standing issue. I have learned that `open_mfdataset` can be blazingly fast if `decode_cf=False` but extremely slow with `decode_cf=True`. As an example, I am loading a POP dataset on cheyenne. Anyone with access can try this example.

```python
import os
import xarray as xr

base_dir = '/glade/scratch/rpa/'
prefix = 'BRCP85C5CN_ne120_t12_pop62.c13b17.asdphys.001'
code = 'pop.h.nday1.SST'
glob_pattern = os.path.join(base_dir, prefix, '%s.%s.*.nc' % (prefix, code))

def non_time_coords(ds):
    return [v for v in ds.data_vars if 'time' not in ds[v].dims]

def drop_non_essential_vars_pop(ds):
    return ds.drop(non_time_coords(ds))

# this runs almost instantly
ds = xr.open_mfdataset(glob_pattern, decode_times=False, chunks={'time': 1},
                       preprocess=drop_non_essential_vars_pop, decode_cf=False)
```

And returns this:

```
Dimensions:     (d2: 2, nlat: 2400, nlon: 3600, time: 16401, z_t: 62, z_t_150m: 15, z_w: 62, z_w_bot: 62, z_w_top: 62)
Coordinates:
  * z_w_top     (z_w_top) float32 0.0 1000.0 2000.0 3000.0 4000.0 5000.0 ...
  * z_t         (z_t) float32 500.0 1500.0 2500.0 3500.0 4500.0 5500.0 ...
  * z_w         (z_w) float32 0.0 1000.0 2000.0 3000.0 4000.0 5000.0 6000.0 ...
  * z_t_150m    (z_t_150m) float32 500.0 1500.0 2500.0 3500.0 4500.0 5500.0 ...
  * z_w_bot     (z_w_bot) float32 1000.0 2000.0 3000.0 4000.0 5000.0 6000.0 ...
  * time        (time) float64 7.322e+05 7.322e+05 7.322e+05 7.322e+05 ...
Dimensions without coordinates: d2, nlat, nlon
Data variables:
    time_bound  (time, d2) float64 dask.array
    SST         (time, nlat, nlon) float32 dask.array
Attributes:
    nsteps_total:   480
    tavg_sum:       64800.0
    title:          BRCP85C5CN_ne120_t12_pop62.c13b17.asdphys.001
    start_time:     This dataset was created on 2016-03-14 at 05:32:30.3
    Conventions:    CF-1.0; http://www.cgd.ucar.edu/cms/eaton/netcdf/CF-curren...
    source:         CCSM POP2, the CCSM Ocean Component
    cell_methods:   cell_methods = time: mean ==> the variable values are aver...
    calendar:       All years have exactly 365 days.
    history:        none
    contents:       Diagnostic and Prognostic Variables
    revision:       $Id: tavg.F90 56176 2013-12-20 18:35:46Z mlevy@ucar.edu $
```

This is roughly 45 years of daily data, one file per year. Instead, if I just change `decode_cf=True` (the default), it takes forever. I can monitor what is happening via the distributed dashboard. It looks like this:

![image](https://user-images.githubusercontent.com/1197350/36923479-71c867d6-1e39-11e8-870d-a044af8af4c8.png)

There are more of these `open_dataset` tasks than the number of files (45), so I can only presume there are 16401 individual tasks (one for each timestep), which each takes about 1 s in serial. This is a real failure of lazy decoding. Maybe it can be fixed by #1725, possibly related to #1372.
cc Pangeo folks: @jhamman, @mrocklin ","{""total_count"": 2, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 2, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,224553135 https://github.com/pydata/xarray/issues/1385#issuecomment-297494539,https://api.github.com/repos/pydata/xarray/issues/1385,297494539,MDEyOklzc3VlQ29tbWVudDI5NzQ5NDUzOQ==,1197350,2017-04-26T18:07:03Z,2017-04-26T18:07:03Z,MEMBER,"cc: @geosciz, who is helping with this project.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,224553135