issue_comments
3 rows where author_association = "NONE" and issue = 355264812 sorted by updated_at descending
Issue: Large pickle overhead in ds.to_netcdf() involving dask.delayed functions (#2389, 3 comments)
id: 417252006 · node_id: MDEyOklzc3VlQ29tbWVudDQxNzI1MjAwNg==
html_url: https://github.com/pydata/xarray/issues/2389#issuecomment-417252006
user: aseyboldt (1882397) · author_association: NONE
created_at: 2018-08-30T09:23:20Z · updated_at: 2018-08-30T09:48:40Z
reactions: none

It seems the xarray object that is sent to the workers contains a reference to the complete graph:

```python
import pickle

import dask
import dask.array as da
import xarray as xr

vals = da.random.random((5, 1), chunks=(1, 1))
ds = xr.Dataset({'vals': (['a', 'b'], vals)})
write = ds.to_netcdf('file2.nc', compute=False)

# The store task in the graph wraps the NetCDF datastore object.
key = [val for val in write.dask.keys()
       if isinstance(val, str) and val.startswith('NetCDF')][0]
wrapper = write.dask[key]

len(pickle.dumps(wrapper))        # 14652
delayed_store = wrapper.datastore.delayed_store
len(pickle.dumps(delayed_store))  # 14652
dask.visualize(delayed_store)
```

The size jumps to 1.3 MB if I use 500 chunks again. The warning about the large object in the graph disappears if we delete that reference before we execute the graph:
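A minimal, hypothetical sketch of deleting that reference, assuming the offending back-reference is the delayed_store attribute located above (this is not the author's original code):

```python
# Hypothetical: drop the datastore's reference back into the graph
# before computing, so each pickled store task stays small.
wrapper.datastore.delayed_store = None
write.compute()
```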
id: 417242425 · node_id: MDEyOklzc3VlQ29tbWVudDQxNzI0MjQyNQ==
html_url: https://github.com/pydata/xarray/issues/2389#issuecomment-417242425
user: aseyboldt (1882397) · author_association: NONE
created_at: 2018-08-30T08:53:21Z · updated_at: 2018-08-30T08:53:21Z
reactions: none

Ah, that seems to do the trick. I get about 4.5 s for both now, and the time spent pickling stuff is down to reasonable levels (0.022 s). Also, the number of function calls dropped from 1e8 to 3e5 :-)

There still seems to be some inefficiency in the pickled graph output; I'm getting a warning about large objects in the graph:

```
/Users/adrianseyboldt/anaconda3/lib/python3.6/site-packages/distributed/worker.py:840: UserWarning: Large object of size 1.31 MB detected in task graph:
  ('store-03165bae-ac28-11e8-b137-56001c88cd01', <xa ... t 0x316112cc0>)
Consider scattering large objects ahead of time
with client.scatter to reduce scheduler burden and keep data on workers
  % (format_bytes(len(b)), s))
```

The size scales linearly with the number of chunks (it is 13 MB if there are 5000 chunks). This doesn't seem to be nearly as problematic as the original issue, though. This is after applying both #2391 and #2261.
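The client.scatter suggestion in the warning refers to dask.distributed's standard API for pre-placing data on workers. A minimal sketch of how it is used in general (illustrative names; whether it helps for the store object embedded in xarray's write graph is a separate question):

```python
from dask.distributed import Client

client = Client()  # connect to (or start) a distributed scheduler

# Pre-place a large object on the workers; tasks then reference a
# small future instead of embedding the object in the task graph.
large_object = list(range(1_000_000))  # stand-in for any big payload
future = client.scatter(large_object, broadcast=True)
print(client.submit(len, future).result())  # 1000000
```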
id: 417060359 · node_id: MDEyOklzc3VlQ29tbWVudDQxNzA2MDM1OQ==
html_url: https://github.com/pydata/xarray/issues/2389#issuecomment-417060359
user: aseyboldt (1882397) · author_association: NONE
created_at: 2018-08-29T18:37:57Z · updated_at: 2018-08-29T18:40:16Z
reactions: none

pangeo-data/pangeo#266 sounds somewhat similar. If you increase the size of the involved arrays here, you also end up with warnings about the size of the graph: https://stackoverflow.com/questions/52039697/how-to-avoid-large-objects-in-task-graph

I haven't tried with #2261 applied, but I can try that tomorrow. If we interpret the time spent in …
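To reproduce the graph-size warnings at larger sizes, a sketch along the lines of the snippet from the first comment, scaled up (assumed chunk count; requires a netCDF backend such as netcdf4 and a running distributed client):

```python
import dask.array as da
import xarray as xr
from dask.distributed import Client

client = Client()  # the warnings come from the distributed code path

# ~5000 single-element chunks make the serialized graph large enough
# to trigger "Large object ... detected in task graph" warnings.
vals = da.random.random((5000, 1), chunks=(1, 1))
ds = xr.Dataset({'vals': (['a', 'b'], vals)})
ds.to_netcdf('file_large.nc')
```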
```sql
CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue] ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user] ON [issue_comments] ([user]);
```
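The rows above correspond to a query along these lines against that schema (reconstructed from the row-count description at the top of the page):

```sql
SELECT *
FROM issue_comments
WHERE author_association = 'NONE'
  AND issue = 355264812
ORDER BY updated_at DESC;
```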