github: issue_comments: 2 rows where author_association = "CONTRIBUTOR" and issue = 1068225524 sorted by updated

2 rows where author_association = "CONTRIBUTOR" and issue = 1068225524 sorted by updated_at descending

id	html_url	issue_url	node_id	user	created_at	updated_at	author_association	body	reactions	performed_via_github_app	issue
1010549000	https://github.com/pydata/xarray/issues/6036#issuecomment-1010549000	https://api.github.com/repos/pydata/xarray/issues/6036	IC_kwDOAMm_X848O8EI	rafa-guedes 7799184	2022-01-12T01:49:52Z	2022-01-12T01:49:52Z	CONTRIBUTOR	Related issue in dask: https://github.com/dask/dask/issues/6363	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }		`xarray.open_zarr()` takes too long to lazy load when the data arrays contain a large number of Dask chunks. 1068225524
1005162696	https://github.com/pydata/xarray/issues/6036#issuecomment-1005162696	https://api.github.com/repos/pydata/xarray/issues/6036	IC_kwDOAMm_X8476ZDI	delgadom 3698640	2022-01-04T20:53:36Z	2022-01-04T20:54:13Z	CONTRIBUTOR	This isn't a fix for the overhead required to manage an arbitrarily large graph, but note that creating chunks this small (size 1 in this case) is explicitly not recommended. See the dask docs on Array Best Practices: Select a good chunk size - they recommend chunks no smaller than 100 MB. Your chunks are 8 bytes. This creates 1 billion tasks, which does result in an enormous overhead - there's no way around this.

Note that storing this on disk would not help - the problem results from the fact that 1 billion tasks will almost certainly overwhelm any dask scheduler. The general dask best practices guide recommends keeping the number of tasks below 1 million if possible.

Also, I don't think that the issue here is in specifying the universe of the tasks that need to be created, but rather in creating and managing the python task objects themselves. So pre-computing or storing them wouldn't help.

For me, changing to (1000, 1000, 100) chunks (~750MB for a float64 array) reduces the time to a couple ms:

```python
In [16]: %%timeit
    ...:
    ...: chunks = (1000, 1000, 100)
    ...: ds = xr.Dataset(data_vars={
    ...:     "foo": (('x', 'y', 'z'), dask.array.empty((1000, 1000, 1000), chunks=(1000, 1000, 1000)))})
    ...: ds.to_zarr(store='data', group='ds.zarr', compute=False, encoding={'foo': {'chunks': chunks}}, mode='w')
    ...: ds_loaded = xr.open_zarr(group='ds.zarr', store='data')
    ...:
    ...:
6.36 ms ± 111 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

With this chunking scheme, you could store and work with much, much more data. In fact, scaling the size of your example by 3 orders of magnitude only increases the runtime by ~5x:

```python
In [18]: %%timeit
    ...:
    ...: chunks = (1000, 1000, 100, 1)
    ...: ds = xr.Dataset(data_vars={
    ...:     "foo": (('w', 'x', 'y', 'z'), dask.array.empty((1000, 1000, 1000, 1000), chunks=(1000, 1000, 1000, 1)))})
    ...: ds.to_zarr(store='data', group='ds.zarr', compute=False, encoding={'foo': {'chunks': chunks}}, mode='w')
    ...: ds_loaded = xr.open_zarr(group='ds.zarr', store='data')
    ...:
    ...:
36.9 ms ± 2.23 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

So if re-writing your arrays with larger chunks is an option I think this could get around the problem you're seeing?	{ "total_count": 3, "+1": 3, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }		`xarray.open_zarr()` takes too long to lazy load when the data arrays contain a large number of Dask chunks. 1068225524
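
The chunk arithmetic quoted in the second comment (8-byte chunks, 1 billion tasks, ~750MB per chunk after rechunking) can be checked with a short sketch. This is only an illustration: the helper `chunk_stats` is hypothetical, the `(1000, 1000, 1000)` float64 shape is taken from the comment's own example, and "one task per chunk" is a simplifying assumption rather than a statement about how dask builds its graph.

```python
import math


def chunk_stats(shape, chunks, itemsize=8):
    """Return (number of chunks, bytes per full chunk) for a regular chunking.

    Hypothetical helper for illustration; itemsize=8 assumes float64 data.
    """
    # Chunks along each axis, rounding up for a ragged final chunk.
    n_chunks = math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))
    # Size in bytes of one full-sized chunk.
    chunk_bytes = math.prod(chunks) * itemsize
    return n_chunks, chunk_bytes


shape = (1000, 1000, 1000)

# Size-1 chunks, as described in the comment.
tiny = chunk_stats(shape, (1, 1, 1))

# The (1000, 1000, 100) chunking suggested in the comment.
large = chunk_stats(shape, (1000, 1000, 100))

print(tiny)   # chunk count and bytes per chunk for size-1 chunks
print(large)  # chunk count and bytes per chunk for the suggested chunks
```

Under these assumptions the size-1 chunking gives 10^9 chunks of 8 bytes each, while the suggested chunking gives 10 chunks of 800,000,000 bytes (about 763 MiB), which matches the figures in the comment.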

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);
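
The filter described at the top of this page (rows where `author_association = "CONTRIBUTOR"` and `issue = 1068225524`, sorted by `updated_at` descending) can be reproduced against this schema with a minimal sketch using Python's built-in `sqlite3` module. The in-memory database, the trimmed column list, and the omission of the `REFERENCES` clauses are assumptions made to keep the example self-contained; the two inserted rows reuse the ids and timestamps shown in the table above, with bodies left out.

```python
import sqlite3

# In-memory database standing in for the real file (an assumption for this sketch).
conn = sqlite3.connect(":memory:")

# Trimmed version of the schema above; foreign-key references are omitted
# because the users and issues tables are not created here.
conn.executescript(
    """
    CREATE TABLE [issue_comments] (
       [id] INTEGER PRIMARY KEY,
       [user] INTEGER,
       [created_at] TEXT,
       [updated_at] TEXT,
       [author_association] TEXT,
       [issue] INTEGER
    );
    CREATE INDEX [idx_issue_comments_issue]
        ON [issue_comments] ([issue]);
    """
)

# The two rows shown in the table above.
rows = [
    (1010549000, 7799184, "2022-01-12T01:49:52Z", "2022-01-12T01:49:52Z",
     "CONTRIBUTOR", 1068225524),
    (1005162696, 3698640, "2022-01-04T20:53:36Z", "2022-01-04T20:54:13Z",
     "CONTRIBUTOR", 1068225524),
]
conn.executemany("INSERT INTO issue_comments VALUES (?, ?, ?, ?, ?, ?)", rows)

# Same filter and ordering as the page title; ISO 8601 timestamps stored as
# TEXT sort correctly as plain strings.
result = conn.execute(
    """
    SELECT id, updated_at
    FROM issue_comments
    WHERE author_association = ? AND issue = ?
    ORDER BY updated_at DESC
    """,
    ("CONTRIBUTOR", 1068225524),
).fetchall()

for row in result:
    print(row)
```

Parameter placeholders (`?`) are used instead of string formatting so the filter values are passed safely to SQLite.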