Issue 6036: `xarray.open_zarr()` takes too long to lazy load when the data arrays contain a large number of Dask chunks.

  • id: 1068225524 · node_id: I_kwDOAMm_X84_q9P0
  • user: DonjetaR (23300143)
  • state: closed · locked: 0 · comments: 10
  • created_at: 2021-12-01T10:23:43Z
  • updated_at: 2023-10-26T17:15:43Z
  • closed_at: 2023-10-26T16:42:28Z
  • author_association: NONE

What happened: The aim is to lazily load a big Zarr dataset using xarray.open_zarr() and then compute only the necessary data. However, lazy loading takes too long, and if the dataset is large and contains a large number of Dask chunks, memory usage grows while the dataset is being lazily loaded until the process eventually crashes because it runs out of memory.

In xarray.open_zarr() the parameter chunks defaults to auto, which means the on-disk chunk sizes are read and Dask chunks are created for each data array in the dataset. The Dask chunks are created in the slices_from_chunks(chunks) function in https://github.com/dask/dask/blob/main/dask/array/core.py. The most time-consuming part appears to be line 232, where all combinations of the Dask chunks are created (see https://github.com/dask/dask/blob/a5aecac8313fea30c5503f534c71f325b1775b9c/dask/array/core.py#L216-L232). In our use case we have millions of small chunks and several data arrays, which means that opening the Zarr dataset takes too long. In the example below we create a dataset with 3 dimensions with sizes {'x': 1000, 'y': 1000, 'z': 1000}; the dataset has 1 data array (foo) and the chunk size is {'x': 5, 'y': 5, 'z': 5}. This results in a runtime of approx. 20 seconds, which is not acceptable for lazy loading a dataset.
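
For scale, a rough back-of-the-envelope sketch (using the sizes from the example above) of how many Dask chunk slices have to be enumerated for a single data array:

```python
import math

import dask.array

# Shape and chunk size from the example above: a 1000**3 array with 5x5x5 chunks
shape, chunk = (1000, 1000, 1000), (5, 5, 5)
chunks = dask.array.empty(shape, chunks=chunk).chunks

# One slice combination per Dask chunk: 200 * 200 * 200 = 8,000,000
n_chunks = math.prod(len(c) for c in chunks)
print(n_chunks)  # 8000000
```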

One workaround is to set the parameter chunks=None in xarray.open_zarr(); then we can quickly lazy load the dataset and proceed by computing only the necessary data in memory in a few milliseconds, as sketched below.
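
A minimal sketch of that workaround, reusing the store and group names from the example further below:

```python
import xarray as xr

# Skip Dask chunking entirely when opening; this avoids building the chunk graph
ds = xr.open_zarr(store='data', group='ds.zarr', chunks=None)

# Only the selected region is read from disk once .compute() (or .values) is called
subset = ds['foo'].isel(x=slice(0, 5), y=slice(0, 5), z=slice(0, 5)).compute()
```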

We also tried using xarray.Dataset.chunk() to compute the chunks on the lazily loaded dataset, but it still takes too long (a sketch of this attempt follows below).
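
A sketch of that attempt, again assuming the store and group names from the example further below:

```python
import xarray as xr

# Open quickly without Dask chunks, then rechunk afterwards;
# .chunk() still has to build the full Dask chunk graph, so it is also slow here
ds = xr.open_zarr(store='data', group='ds.zarr', chunks=None)
ds_chunked = ds.chunk({'x': 5, 'y': 5, 'z': 5})
```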

Additionally, there seem to be two memory issues:

  • Memory keeps increasing while lazy loading the dataset and is not freed afterwards. This may be related to pydata/xarray#6013.
  • If the dataset has an even smaller chunk size, e.g. {'x': 1, 'y': 1, 'z': 1}, memory keeps increasing until the process crashes.

My questions are:

  • Would it be possible to optimize the code to calculate the chunks in less than a second?
  • Could all combinations of Dask chunks be saved as metadata when saving the Zarr dataset, so that they could be loaded from disk instead of being recalculated? (See the sketch after this list.)
  • Is there another option to load a Zarr dataset including the information of millions of Dask chunks?
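
Regarding the second question, a purely hypothetical illustration (this is not an existing xarray or Zarr feature, and the cache file name is made up): the slice combinations could be computed once at write time and stored next to the Zarr store, so a later open could read them from disk instead of recomputing them.

```python
import pickle

import dask.array
from dask.array.core import slices_from_chunks  # the function referenced above

# Hypothetical one-time computation at write time (this is the expensive step)
chunks = dask.array.empty((1000, 1000, 1000), chunks=(5, 5, 5)).chunks
slices = slices_from_chunks(chunks)

# Hypothetical cache file alongside the Zarr store
with open('data/foo_slices.pkl', 'wb') as f:
    pickle.dump(slices, f)
```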

What you expected to happen: I expect to be able to lazily load a large dataset in a few milliseconds, including the Dask chunk information.

Minimal Complete Verifiable Example: The following code is run in a JupyterLab notebook.

```python
import dask
import dask.array  # dask.array must be imported explicitly for dask.array.empty
import snakeviz
import xarray as xr

%load_ext snakeviz

# Create example dataset
chunks = (5, 5, 5)
ds = xr.Dataset(
    data_vars={
        "foo": (
            ('x', 'y', 'z'),
            dask.array.empty((1000, 1000, 1000), chunks=(1000, 1000, 1000)),
        )
    }
)
ds.to_zarr(store='data', group='ds.zarr', compute=False,
           encoding={'foo': {'chunks': chunks}})
ds_loaded = xr.open_zarr(group='ds.zarr', store='data')

%%snakeviz
ds_big_loaded = xr.open_zarr(store='open_zarr_test', group='ds')  # Runtime: 22 seconds!
```

Snakeviz output:

...

Anything else we need to know?: Profiling shows that the following Dask code is the most time-consuming part of loading a Zarr dataset. It comes from the slices_from_chunks(chunks) function in https://github.com/dask/dask/blob/main/dask/array/core.py and is run for each data array in the dataset.

```python
from itertools import product

import dask.array
from dask.array.slicing import cached_cumsum

dim_size = (10, 15_000, 15_000)
chunks = dask.array.empty(dim_size, chunks=(10, 10, 10)).chunks

cumdims = [cached_cumsum(bds, initial_zero=True) for bds in chunks]
slices = [
    [slice(s, s + dim) for s, dim in zip(starts, shapes)]
    for starts, shapes in zip(cumdims, chunks)
]
slices = list(product(*slices))  # Runtime: 5.59 seconds for one data array!
```
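
The number of slice combinations is simply the product of the per-dimension chunk counts, which is why it grows so quickly; a standalone check using the same dim_size and chunk size as the snippet above:

```python
import math

import dask.array

# (10, 15_000, 15_000) with 10x10x10 chunks -> 1 * 1500 * 1500 = 2,250,000 slices
chunks = dask.array.empty((10, 15_000, 15_000), chunks=(10, 10, 10)).chunks
print(math.prod(len(c) for c in chunks))  # 2250000
```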

Environment:

Output of `xr.show_versions()`:

```
INSTALLED VERSIONS
------------------
commit: None
python: 3.9.7 (default, Sep 16 2021, 13:09:58) [GCC 7.5.0]
python-bits: 64
OS: Linux
OS-release: 5.4.0-89-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.10.6
libnetcdf: 4.8.1
xarray: 0.19.0
pandas: 1.3.2
numpy: 1.20.3
scipy: 1.7.1
netCDF4: 1.5.7
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: 2.10.1
cftime: 1.5.0
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.2.8
cfgrib: None
iris: None
bottleneck: 1.3.2
dask: 2021.08.1
distributed: 2021.08.1
matplotlib: 3.4.2
cartopy: 0.20.0
seaborn: 0.11.2
numbagg: None
pint: None
setuptools: 58.0.4
pip: 21.2.4
conda: None
pytest: 6.2.4
IPython: 7.27.0
sphinx: None
```
reactions:

```json
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/6036/reactions",
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
```
  • state_reason: completed
  • repo: xarray (13221727)
  • type: issue
