issues
1 row where user = 23300143 sorted by updated_at descending
| field | value |
| --- | --- |
| id | 1068225524 |
| node_id | I_kwDOAMm_X84_q9P0 |
| number | 6036 |
| title | `xarray.open_zarr()` takes too long to lazy load when the data arrays contain a large number of Dask chunks. |
| user | DonjetaR 23300143 |
| state | closed |
| locked | 0 |
| comments | 10 |
| created_at | 2021-12-01T10:23:43Z |
| updated_at | 2023-10-26T17:15:43Z |
| closed_at | 2023-10-26T16:42:28Z |
| author_association | NONE |

body:

What happened:
The aim is to lazy load a big zarr dataset using `xarray.open_zarr()`. In […]. One workaround is to set the parameter […] (see the sketch below). We also tried to use […]. Additionally, there seem to be two memory issues:
- Memory keeps increasing while lazy loading the dataset and it doesn't get freed. This may be related to pydata/xarray#6013.
- If the dataset has an even smaller chunk size, i.e. […], then […].

My questions are: […]
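A minimal sketch of a workaround of the kind mentioned above, assuming the parameter in question is `chunks`: with `chunks=None`, `xr.open_zarr()` returns variables backed by lazily indexed Zarr arrays rather than Dask arrays, so no per-chunk slice metadata is built at open time.

```python
import xarray as xr

# Default open: dask-backed variables. Building slice and graph metadata
# for millions of dask chunks is what makes this call slow.
ds_dask = xr.open_zarr(store="data", group="ds.zarr")

# Assumed workaround: chunks=None skips dask entirely, so variables are
# lazily indexed zarr arrays and the open returns quickly. The trade-off
# is that downstream computation is no longer parallelized by dask.
ds_lazy = xr.open_zarr(store="data", group="ds.zarr", chunks=None)
```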
What you expected to happen:

I expect to be able to lazy load a large dataset in a few milliseconds, including the Dask chunk information.

Minimal Complete Verifiable Example:

The following code is run in a JupyterLab notebook.

```python
import dask.array
import xarray as xr

%load_ext snakeviz

# Create example dataset: a 1000**3 array stored with small (5, 5, 5) zarr chunks.
chunks = (5, 5, 5)
ds = xr.Dataset(
    data_vars={
        "foo": (
            ("x", "y", "z"),
            dask.array.empty((1000, 1000, 1000), chunks=(1000, 1000, 1000)),
        )
    }
)
ds.to_zarr(store="data", group="ds.zarr", compute=False, encoding={"foo": {"chunks": chunks}})

ds_loaded = xr.open_zarr(store="data", group="ds.zarr")
```

```python
%%snakeviz
ds_big_loaded = xr.open_zarr(store="data", group="ds.zarr")  # Runtime: 22 seconds!
```
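To put the 22-second runtime in context, here is a quick count of the chunks involved, using nothing beyond the shapes in the example above: with the default `chunks='auto'`, the Dask chunks follow the on-disk Zarr chunking, so Dask must build metadata for every one of these chunks.

```python
from math import prod

shape = (1000, 1000, 1000)
zarr_chunks = (5, 5, 5)

# Zarr chunks per dimension, and the total count dask has to index.
chunks_per_dim = [s // c for s, c in zip(shape, zarr_chunks)]
print(chunks_per_dim)        # [200, 200, 200]
print(prod(chunks_per_dim))  # 8000000 -> eight million chunks
```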
Snakeviz output

Anything else we need to know?:
The profiling shows that the following code from Dask is the most time-consuming part of loading a Zarr dataset. The code is from the `slices_from_chunks` function in `dask.array.slicing`. With `chunks=(10, 10, 10)` over a `(10, 15_000, 15_000)` array, the chunk grid is 1 × 1500 × 1500 = 2,250,000 chunks, and this code builds one slice tuple per chunk.

```python
from itertools import product

import dask.array
from dask.array.slicing import cached_cumsum

dim_size = (10, 15_000, 15_000)
chunks = dask.array.empty(dim_size, chunks=(10, 10, 10)).chunks

# Cumulative start offsets of the chunks along every dimension.
cumdims = [cached_cumsum(bds, initial_zero=True) for bds in chunks]
# One slice per chunk per dimension...
slices = [
    [slice(s, s + dim) for s, dim in zip(starts, shapes)]
    for starts, shapes in zip(cumdims, chunks)
]
# ...then the cartesian product: one slice tuple per chunk.
slices = list(product(*slices))  # Runtime: 5.59 seconds for one data array!
```

Environment:

Output of `xr.show_versions()`:

INSTALLED VERSIONS
------------------
commit: None
python: 3.9.7 (default, Sep 16 2021, 13:09:58) [GCC 7.5.0]
python-bits: 64
OS: Linux
OS-release: 5.4.0-89-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.10.6
libnetcdf: 4.8.1
xarray: 0.19.0
pandas: 1.3.2
numpy: 1.20.3
scipy: 1.7.1
netCDF4: 1.5.7
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: 2.10.1
cftime: 1.5.0
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.2.8
cfgrib: None
iris: None
bottleneck: 1.3.2
dask: 2021.08.1
distributed: 2021.08.1
matplotlib: 3.4.2
cartopy: 0.20.0
seaborn: 0.11.2
numbagg: None
pint: None
setuptools: 58.0.4
pip: 21.2.4
conda: None
pytest: 6.2.4
IPython: 7.27.0
sphinx: None
{ "url": "https://api.github.com/repos/pydata/xarray/issues/6036/reactions", "total_count": 1, "+1": 1, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
state_reason: completed
repo: xarray 13221727
type: issue
```sql
CREATE TABLE [issues] (
    [id] INTEGER PRIMARY KEY,
    [node_id] TEXT,
    [number] INTEGER,
    [title] TEXT,
    [user] INTEGER REFERENCES [users]([id]),
    [state] TEXT,
    [locked] INTEGER,
    [assignee] INTEGER REFERENCES [users]([id]),
    [milestone] INTEGER REFERENCES [milestones]([id]),
    [comments] INTEGER,
    [created_at] TEXT,
    [updated_at] TEXT,
    [closed_at] TEXT,
    [author_association] TEXT,
    [active_lock_reason] TEXT,
    [draft] INTEGER,
    [pull_request] TEXT,
    [body] TEXT,
    [reactions] TEXT,
    [performed_via_github_app] TEXT,
    [state_reason] TEXT,
    [repo] INTEGER REFERENCES [repos]([id]),
    [type] TEXT
);
CREATE INDEX [idx_issues_repo] ON [issues] ([repo]);
CREATE INDEX [idx_issues_milestone] ON [issues] ([milestone]);
CREATE INDEX [idx_issues_assignee] ON [issues] ([assignee]);
CREATE INDEX [idx_issues_user] ON [issues] ([user]);
```
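As a usage sketch, the filtered view at the top of this page ("1 row where user = 23300143 sorted by updated_at descending") corresponds to a query along the following lines against this schema; the database file name `github.db` is a placeholder assumption.

```python
import sqlite3

# "github.db" is a placeholder name for the SQLite file behind this page.
conn = sqlite3.connect("github.db")
conn.row_factory = sqlite3.Row

# Reproduce the view above: issues by user 23300143, newest update first.
# "user" is double-quoted so it is read as a column name.
rows = conn.execute(
    'SELECT id, number, title, state, updated_at '
    'FROM issues WHERE "user" = ? ORDER BY updated_at DESC',
    (23300143,),
).fetchall()

for row in rows:
    print(row["number"], row["title"], row["updated_at"])
```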