Issue 6036: `xarray.open_zarr()` takes too long to lazy load when the data arrays contain a large number of Dask chunks.

  • id: 1068225524 · node_id: I_kwDOAMm_X84_q9P0
  • user: DonjetaR (23300143)
  • state: closed · locked: 0 · comments: 10
  • created_at: 2021-12-01T10:23:43Z
  • updated_at: 2023-10-26T17:15:43Z
  • closed_at: 2023-10-26T16:42:28Z
  • author_association: NONE

What happened: The aim is to lazily load a big Zarr dataset using xarray.open_zarr() and then compute only the necessary data. However, lazy loading takes too long, and if the dataset is large and contains a large number of Dask chunks, memory usage grows while the dataset is being lazily loaded until the process eventually crashes because it runs out of memory.

In xarray.open_zarr() the parameter chunks defaults to auto, which means the on-disk chunk sizes are read and Dask chunks are created for each data array in the dataset. The Dask chunks are created in the slices_from_chunks(chunks) function in https://github.com/dask/dask/blob/main/dask/array/core.py. The most time-consuming part appears to be line 232, where all combinations of the Dask chunks are created (see https://github.com/dask/dask/blob/a5aecac8313fea30c5503f534c71f325b1775b9c/dask/array/core.py#L216-L232). In our use case we have millions of small chunks and several data arrays, which means that opening the Zarr dataset takes too long. In the example below we create a dataset with 3 dimensions with sizes {'x': 1000, 'y': 1000, 'z': 1000}; the dataset has 1 data array (foo) and the chunk size is {'x': 5, 'y': 5, 'z': 5}. This results in a runtime of approx. 20 seconds, which is not acceptable for lazy loading a dataset.
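
For scale, a rough back-of-the-envelope sketch (using the sizes from the example above) of how many Dask chunk slices have to be enumerated for a single data array:

```python
import math

import dask.array

# Shape and chunk size from the example above: a 1000**3 array with 5x5x5 chunks
shape, chunk = (1000, 1000, 1000), (5, 5, 5)
chunks = dask.array.empty(shape, chunks=chunk).chunks

# One slice combination per Dask chunk: 200 * 200 * 200 = 8,000,000
n_chunks = math.prod(len(c) for c in chunks)
print(n_chunks)  # 8000000
```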

One workaround is to set the parameter chunks=None in xarray.open_zarr(); then we can quickly lazy load the dataset and proceed by computing only the necessary data in memory in a few milliseconds, as sketched below.
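
A minimal sketch of that workaround, reusing the store and group names from the example further below:

```python
import xarray as xr

# Skip Dask chunking entirely when opening; this avoids building the chunk graph
ds = xr.open_zarr(store='data', group='ds.zarr', chunks=None)

# Only the selected region is read from disk once .compute() (or .values) is called
subset = ds['foo'].isel(x=slice(0, 5), y=slice(0, 5), z=slice(0, 5)).compute()
```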

We also tried using xarray.Dataset.chunk() to compute the chunks on the lazily loaded dataset, but it still takes too long (a sketch of this attempt follows below).
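
A sketch of that attempt, again assuming the store and group names from the example further below:

```python
import xarray as xr

# Open quickly without Dask chunks, then rechunk afterwards;
# .chunk() still has to build the full Dask chunk graph, so it is also slow here
ds = xr.open_zarr(store='data', group='ds.zarr', chunks=None)
ds_chunked = ds.chunk({'x': 5, 'y': 5, 'z': 5})
```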

Additionally, there seem to be two memory issues:

  • Memory keeps increasing while lazy loading the dataset and is not freed afterwards. This may be related to pydata/xarray#6013.
  • If the dataset has an even smaller chunk size, e.g. {'x': 1, 'y': 1, 'z': 1}, memory keeps increasing until the process crashes.

My questions are:

  • Would it be possible to optimize the code to calculate the chunks in less than a second?
  • Could all combinations of Dask chunks be saved as metadata when saving the Zarr dataset, so that they could be loaded from disk instead of being recalculated? (See the sketch after this list.)
  • Is there another option to load a Zarr dataset including the information of millions of Dask chunks?
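
Regarding the second question, a purely hypothetical illustration (this is not an existing xarray or Zarr feature, and the cache file name is made up): the slice combinations could be computed once at write time and stored next to the Zarr store, so a later open could read them from disk instead of recomputing them.

```python
import pickle

import dask.array
from dask.array.core import slices_from_chunks  # the function referenced above

# Hypothetical one-time computation at write time (this is the expensive step)
chunks = dask.array.empty((1000, 1000, 1000), chunks=(5, 5, 5)).chunks
slices = slices_from_chunks(chunks)

# Hypothetical cache file alongside the Zarr store
with open('data/foo_slices.pkl', 'wb') as f:
    pickle.dump(slices, f)
```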

What you expected to happen: I expect to be able to lazily load a large dataset in a few milliseconds, including the Dask chunk information.

Minimal Complete Verifiable Example: The following code is run in a JupyterLab notebook.

```python
import dask
import dask.array  # dask.array must be imported explicitly for dask.array.empty
import snakeviz
import xarray as xr

%load_ext snakeviz

# Create example dataset
chunks = (5, 5, 5)
ds = xr.Dataset(
    data_vars={
        "foo": (
            ('x', 'y', 'z'),
            dask.array.empty((1000, 1000, 1000), chunks=(1000, 1000, 1000)),
        )
    }
)
ds.to_zarr(store='data', group='ds.zarr', compute=False,
           encoding={'foo': {'chunks': chunks}})
ds_loaded = xr.open_zarr(group='ds.zarr', store='data')

%%snakeviz
ds_big_loaded = xr.open_zarr(store='open_zarr_test', group='ds')  # Runtime: 22 seconds!
```

Snakeviz output:

...

Anything else we need to know?: Profiling shows that the following Dask code is the most time-consuming part of loading a Zarr dataset. It comes from the slices_from_chunks(chunks) function in https://github.com/dask/dask/blob/main/dask/array/core.py and is run for each data array in the dataset.

```python
from itertools import product

import dask.array
from dask.array.slicing import cached_cumsum

dim_size = (10, 15_000, 15_000)
chunks = dask.array.empty(dim_size, chunks=(10, 10, 10)).chunks

cumdims = [cached_cumsum(bds, initial_zero=True) for bds in chunks]
slices = [
    [slice(s, s + dim) for s, dim in zip(starts, shapes)]
    for starts, shapes in zip(cumdims, chunks)
]
slices = list(product(*slices))  # Runtime: 5.59 seconds for one data array!
```
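
The number of slice combinations is simply the product of the per-dimension chunk counts, which is why it grows so quickly; a standalone check using the same dim_size and chunk size as the snippet above:

```python
import math

import dask.array

# (10, 15_000, 15_000) with 10x10x10 chunks -> 1 * 1500 * 1500 = 2,250,000 slices
chunks = dask.array.empty((10, 15_000, 15_000), chunks=(10, 10, 10)).chunks
print(math.prod(len(c) for c in chunks))  # 2250000
```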

Environment:

Output of `xr.show_versions()`:

```
INSTALLED VERSIONS
------------------
commit: None
python: 3.9.7 (default, Sep 16 2021, 13:09:58) [GCC 7.5.0]
python-bits: 64
OS: Linux
OS-release: 5.4.0-89-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.10.6
libnetcdf: 4.8.1
xarray: 0.19.0
pandas: 1.3.2
numpy: 1.20.3
scipy: 1.7.1
netCDF4: 1.5.7
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: 2.10.1
cftime: 1.5.0
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.2.8
cfgrib: None
iris: None
bottleneck: 1.3.2
dask: 2021.08.1
distributed: 2021.08.1
matplotlib: 3.4.2
cartopy: 0.20.0
seaborn: 0.11.2
numbagg: None
pint: None
setuptools: 58.0.4
pip: 21.2.4
conda: None
pytest: 6.2.4
IPython: 7.27.0
sphinx: None
```
reactions:

```json
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/6036/reactions",
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
```
  • state_reason: completed
  • repo: xarray (13221727)
  • type: issue
