**Issue #6036**: `xarray.open_zarr()` takes too long to lazy load when the data arrays contain a large number of Dask chunks. *(state: closed, 10 comments, created 2021-12-01, closed 2023-10-26)*

**What happened**:

The aim is to lazy load a big Zarr dataset using `xarray.open_zarr()` and then compute only the necessary data. However, lazy loading takes too long, and if the dataset is large and contains a large number of Dask chunks, memory usage grows while the dataset is being lazily loaded until the process eventually crashes because it runs out of memory.

In `xarray.open_zarr()` the parameter `chunks` defaults to `'auto'`, which means it reads the on-disk chunk sizes and creates Dask chunks for each data array in the dataset. The Dask chunks are created in the `slices_from_chunks(chunks)` function in `https://github.com/dask/dask/blob/main/dask/array/core.py`. The most time-consuming part appears to be line 232, where all combinations of the Dask chunks are created (see https://github.com/dask/dask/blob/a5aecac8313fea30c5503f534c71f325b1775b9c/dask/array/core.py#L216-L232). In our use case we have millions of small chunks and several data arrays, which means that opening the Zarr dataset takes too long.

In the example below we create a dataset with 3 dimensions of sizes `{'x': 1000, 'y': 1000, 'z': 1000}`, one data array (`foo`), and a chunk size of `{'x': 5, 'y': 5, 'z': 5}`. This results in a runtime of approx. 20 seconds, which is not acceptable for lazy loading a dataset.

One workaround is to set the parameter `chunks=None` in `xarray.open_zarr()`: the dataset then lazy loads quickly, and we can proceed by computing only the necessary data in memory in a few milliseconds (a short sketch follows the example below). We also tried to use `xarray.Dataset.chunk()` to compute the chunks on the lazily loaded dataset, but it still takes too long.

Additionally, there seem to be two memory issues:

- Memory keeps increasing while lazy loading the dataset and is not freed. This may be related to pydata/xarray#6013.
- If the dataset has an even smaller chunk size, i.e. `{'x': 1, 'y': 1, 'z': 1}`, memory keeps increasing until the process crashes.

My questions are:

- Would it be possible to optimize the code so that the chunks are calculated in less than a second?
- Could all combinations of Dask chunks be saved as metadata when saving the Zarr dataset, so that they could be loaded from disk instead of being recalculated?
- Is there another option to load a Zarr dataset including the information of millions of Dask chunks?

**What you expected to happen**:

I expect to be able to lazy load a large dataset in a few milliseconds, including the Dask chunk information.

**Minimal Complete Verifiable Example**:

The following code is run in a JupyterLab notebook.

```python
import dask.array
import xarray as xr

%load_ext snakeviz

# Create an example dataset: one data array "foo" over dimensions
# (x, y, z) of size 1000 each, written to Zarr with 5x5x5 chunks.
chunks = (5, 5, 5)
ds = xr.Dataset(
    data_vars={
        "foo": (
            ('x', 'y', 'z'),
            dask.array.empty((1000, 1000, 1000), chunks=(1000, 1000, 1000)),
        )
    }
)
ds.to_zarr(store='data', group='ds.zarr', compute=False,
           encoding={'foo': {'chunks': chunks}})

ds_loaded = xr.open_zarr(store='data', group='ds.zarr')
```

Profiling the open in a separate notebook cell:

```python
%%snakeviz
ds_big_loaded = xr.open_zarr(store='data', group='ds.zarr')  # Runtime: 22 seconds!
```
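For comparison with the timing above, here is a minimal sketch of the `chunks=None` workaround described earlier. It assumes the same `'data'` store and `'ds.zarr'` group created in the example; the selected region and variable names are purely illustrative.

```python
import xarray as xr

# Open without creating Dask chunks: the variables are backed by xarray's
# lazy indexing machinery instead of a Dask graph, so no per-chunk slices
# have to be materialized up front.
ds_fast = xr.open_zarr(store='data', group='ds.zarr', chunks=None)  # near-instant

# Lazily select a small region, then load only that region into memory.
subset = ds_fast['foo'].isel(x=slice(0, 10), y=slice(0, 10), z=slice(0, 10))
values = subset.values  # reads only the Zarr chunks covering this region
```

The trade-off is that the loaded selection is a plain NumPy-backed array rather than a chunked Dask array, so any parallelism has to be reintroduced explicitly, for example by calling `.chunk()` on the (now much smaller) subset.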
**Snakeviz output**:

![snakeviz_0](https://user-images.githubusercontent.com/23300143/144213645-0202604a-71a7-474a-b9f3-4848c455bf91.PNG)

...

![snakeviz_1](https://user-images.githubusercontent.com/23300143/144213710-a5f30058-78d4-42cb-a28c-0a97daa627ea.PNG)

**Anything else we need to know?**:

The profiling shows that the following code from Dask is the most time-consuming part when opening a Zarr dataset. The code is from the `slices_from_chunks(chunks)` function in `https://github.com/dask/dask/blob/main/dask/array/core.py` and is run for each data array in the dataset.

```python
from itertools import product

import dask.array
from dask.array.slicing import cached_cumsum

dim_size = (10, 15_000, 15_000)
chunks = dask.array.empty(dim_size, chunks=(10, 10, 10)).chunks

# Cumulative chunk offsets along each dimension, then one slice per chunk
# per dimension, and finally the Cartesian product over all dimensions.
cumdims = [cached_cumsum(bds, initial_zero=True) for bds in chunks]
slices = [
    [slice(s, s + dim) for s, dim in zip(starts, shapes)]
    for starts, shapes in zip(cumdims, chunks)
]
slices = list(product(*slices))  # Runtime: 5.59 seconds for one data array!
```
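To put that runtime in perspective, the following short calculation (an illustration added here, not code from the issue) counts how many slice tuples `slices_from_chunks` has to build for a single data array of the shape used above:

```python
from math import prod

import dask.array

dim_size = (10, 15_000, 15_000)
chunks = dask.array.empty(dim_size, chunks=(10, 10, 10)).chunks

# One slice tuple is produced per chunk; count the chunks per dimension
# and in total.
n_chunks_per_dim = [len(c) for c in chunks]
print(n_chunks_per_dim)        # [1, 1500, 1500]
print(prod(n_chunks_per_dim))  # 2250000 slice tuples for this one array
```

Building and holding millions of these slice tuples (plus the corresponding task-graph entries) for every data array is consistent with both the long open times and the memory growth reported above.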
**Environment**:

Output of `xr.show_versions()`:

```
INSTALLED VERSIONS
------------------
commit: None
python: 3.9.7 (default, Sep 16 2021, 13:09:58) [GCC 7.5.0]
python-bits: 64
OS: Linux
OS-release: 5.4.0-89-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.10.6
libnetcdf: 4.8.1

xarray: 0.19.0
pandas: 1.3.2
numpy: 1.20.3
scipy: 1.7.1
netCDF4: 1.5.7
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: 2.10.1
cftime: 1.5.0
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.2.8
cfgrib: None
iris: None
bottleneck: 1.3.2
dask: 2021.08.1
distributed: 2021.08.1
matplotlib: 3.4.2
cartopy: 0.20.0
seaborn: 0.11.2
numbagg: None
pint: None
setuptools: 58.0.4
pip: 21.2.4
conda: None
pytest: 6.2.4
IPython: 7.27.0
sphinx: None
```