issue_comments


4 rows where issue = 1068225524 sorted by updated_at descending




id: 1010549000
html_url: https://github.com/pydata/xarray/issues/6036#issuecomment-1010549000
issue_url: https://api.github.com/repos/pydata/xarray/issues/6036
node_id: IC_kwDOAMm_X848O8EI
user: rafa-guedes (7799184)
created_at: 2022-01-12T01:49:52Z
updated_at: 2022-01-12T01:49:52Z
author_association: CONTRIBUTOR

Related issue in dask: https://github.com/dask/dask/issues/6363

reactions:
{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
issue: `xarray.open_zarr()` takes too long to lazy load when the data arrays contain a large number of Dask chunks. (1068225524)
id: 1005162696
html_url: https://github.com/pydata/xarray/issues/6036#issuecomment-1005162696
issue_url: https://api.github.com/repos/pydata/xarray/issues/6036
node_id: IC_kwDOAMm_X8476ZDI
user: delgadom (3698640)
created_at: 2022-01-04T20:53:36Z
updated_at: 2022-01-04T20:54:13Z
author_association: CONTRIBUTOR

This isn't a fix for the overhead required to manage an arbitrarily large graph, but note that creating chunks this small (size 1 in this case) is explicitly not recommended. See the dask docs on Array Best Practices: Select a good chunk size - they recommend chunks no smaller than 100 MB. Your chunks are 8 bytes. This creates 1 billion tasks, which does result in an enormous overhead - there's no way around this. Note that storing this on disk would not help - the problem results from the fact that 1 billion tasks will almost certainly overwhelm any dask scheduler. The general dask best practices guide recommends keeping the number of tasks below 1 million if possible.
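To make the arithmetic concrete, here is a small illustrative sketch (not part of the original comment) that counts the chunks, and hence roughly the number of tasks, produced by the two chunking schemes for a float64 array of shape (1000, 1000, 1000):

```python
import math

shape = (1000, 1000, 1000)  # 10**9 float64 elements, ~8 GB of data

def n_chunks(shape, chunks):
    # Number of chunks = product over dimensions of ceil(dim_size / chunk_size).
    return math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))

print(n_chunks(shape, (1, 1, 1)))          # 1_000_000_000 chunks of 8 bytes each
print(n_chunks(shape, (1000, 1000, 100)))  # 10 chunks of ~763 MiB each
```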

Also, I don't think that the issue here is in specifying the universe of the tasks that need to be created, but rather in creating and managing the python task objects themselves. So pre-computing or storing them wouldn't help.

For me, changing to (1000, 1000, 100) chunks (~750MB for a float64 array) reduces the time to a couple of ms:

```python
In [16]: %%timeit
    ...:
    ...: chunks = (1000, 1000, 100)
    ...: ds = xr.Dataset(data_vars={
    ...:     "foo": (('x', 'y', 'z'), dask.array.empty((1000, 1000, 1000), chunks=(1000, 1000, 1000)))})
    ...: ds.to_zarr(store='data', group='ds.zarr', compute=False, encoding={'foo': {'chunks': chunks}}, mode='w')
    ...: ds_loaded = xr.open_zarr(group='ds.zarr', store='data')
    ...:
    ...:
6.36 ms ± 111 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

With this chunking scheme, you could store and work with much, much more data. In fact, scaling the size of your example by 3 orders of magnitude only increases the runtime by ~5x:

```python
In [18]: %%timeit
    ...:
    ...: chunks = (1000, 1000, 100, 1)
    ...: ds = xr.Dataset(data_vars={
    ...:     "foo": (('w', 'x', 'y', 'z'), dask.array.empty((1000, 1000, 1000, 1000), chunks=(1000, 1000, 1000, 1)))})
    ...: ds.to_zarr(store='data', group='ds.zarr', compute=False, encoding={'foo': {'chunks': chunks}}, mode='w')
    ...: ds_loaded = xr.open_zarr(group='ds.zarr', store='data')
    ...:
    ...:
36.9 ms ± 2.23 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

So if re-writing your arrays with larger chunks is an option, I think this could get around the problem you're seeing?
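If the data already sits on disk with the tiny chunks, a minimal sketch of that rewrite (illustrative only, not from the original comment; the 'data-rechunked' path is hypothetical) could look like:

```python
import xarray as xr

# Open the existing store from the example above.
ds = xr.open_zarr(store='data', group='ds.zarr')

# Re-chunk the dask arrays towards the ~100 MB guideline.
ds = ds.chunk({'x': 1000, 'y': 1000, 'z': 100})

# Drop the stale chunk encoding so to_zarr doesn't try to reuse the old on-disk chunks.
for name in ds.variables:
    ds[name].encoding.pop('chunks', None)

# Write a new store with the larger chunks.
ds.to_zarr(store='data-rechunked', group='ds.zarr', mode='w')
```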

reactions:
{
    "total_count": 3,
    "+1": 3,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
issue: `xarray.open_zarr()` takes too long to lazy load when the data arrays contain a large number of Dask chunks. (1068225524)
id: 984014394
html_url: https://github.com/pydata/xarray/issues/6036#issuecomment-984014394
issue_url: https://api.github.com/repos/pydata/xarray/issues/6036
node_id: IC_kwDOAMm_X846pt46
user: DonjetaR (23300143)
created_at: 2021-12-01T20:08:46Z
updated_at: 2021-12-01T20:12:25Z
author_association: NONE

@dcherian thanks for your reply. I know Xarray can't do anything about the Dask computations of the chunks. My question was whether it is possible to save the Dask chunk information in the Zarr metadata so that it is not necessary to calculate it, i.e. to run the getem() function from Dask, which takes too long to run and increases memory use.

The following example runs out of memory on my computer. I have 16 GB of RAM.

```python

import dask.array
import xarray as xr

chunks = (1, 1, 1)
ds = xr.Dataset(data_vars={
    "foo": (('x', 'y', 'z'), dask.array.empty((1000, 1000, 1000), chunks=(1000, 1000, 1000)))})
ds.to_zarr(store='data', group='ds.zarr', compute=False, encoding={'foo': {'chunks': chunks}})
ds_loaded = xr.open_zarr(group='ds.zarr', store='data')

```
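A possible side note (not from the original comment): if out-of-core computation isn't needed, xarray can skip building a dask graph entirely by opening the store with chunks=None, which falls back to xarray's own lazy indexing:

```python
import xarray as xr

# chunks=None skips dask: no task graph is built, and variables are still
# loaded lazily through xarray's backend indexing machinery.
ds_nodask = xr.open_zarr(store='data', group='ds.zarr', chunks=None)
```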

reactions:
{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
issue: `xarray.open_zarr()` takes too long to lazy load when the data arrays contain a large number of Dask chunks. (1068225524)
id: 983925525
html_url: https://github.com/pydata/xarray/issues/6036#issuecomment-983925525
issue_url: https://api.github.com/repos/pydata/xarray/issues/6036
node_id: IC_kwDOAMm_X846pYMV
user: dcherian (2448579)
created_at: 2021-12-01T18:10:07Z
updated_at: 2021-12-01T18:10:07Z
author_association: MEMBER

@DonjetaR Thanks for the very well-written issue, and for confirming #6013.

Could you please add a minimal reproducible example to #6013? I think that would help greatly.

The following runs in 1s for me, which seems OK. Can you open an issue over at dask about this? Xarray can't do anything about it.

```python
import dask.array

dim_size = (10, 15_000, 15_000)
chunks = dask.array.empty(dim_size, chunks=(10, 10, 10)).chunks

%timeit dask.array.core.slices_from_chunks(chunks)
```

On repeated runs it drops down to 200ms (because of caching, I guess), so it was important to restart the kernel to test it out.
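For scale, here is a small illustrative check (not part of the original comment) of how many slices that call has to build for this chunking:

```python
import math
import dask.array

dim_size = (10, 15_000, 15_000)
chunks = dask.array.empty(dim_size, chunks=(10, 10, 10)).chunks

# chunks is a tuple of per-dimension block sizes; the number of slices is the
# product of the block counts: 1 * 1500 * 1500 = 2_250_000.
print(math.prod(len(c) for c in chunks))
```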

reactions:
{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
issue: `xarray.open_zarr()` takes too long to lazy load when the data arrays contain a large number of Dask chunks. (1068225524)

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);