github: issue_comments: 3 rows where author_association = "NONE" and issue = 355264812 sorted by updated

3 rows where author_association = "NONE" and issue = 355264812 sorted by updated_at descending

id	html_url	issue_url	node_id	user	created_at	updated_at ▲	author_association	body	reactions	issue
417252006	https://github.com/pydata/xarray/issues/2389#issuecomment-417252006	https://api.github.com/repos/pydata/xarray/issues/2389	MDEyOklzc3VlQ29tbWVudDQxNzI1MjAwNg==	aseyboldt 1882397	2018-08-30T09:23:20Z	2018-08-30T09:48:40Z	NONE	It seems the xarray object that is sent to the workers contains a reference to the complete graph: ```python vals = da.random.random((5, 1), chunks=(1, 1)) ds = xr.Dataset({'vals': (['a', 'b'], vals)}) write = ds.to_netcdf('file2.nc', compute=False) key = [val for val in write.dask.keys() if isinstance(val, str) and val.startswith('NetCDF')][0] wrapper = write.dask[key] len(pickle.dumps(wrapper)) 14652 delayed_store = wrapper.datastore.delayed_store len(pickle.dumps(delayed_store)) 14652 dask.visualize(delayed_store) ``` The size jumps to 1.3 MB if I use 500 chunks again. The warning about the large object in the graph disappears if we delete that reference before we execute the graph: `key = [val for val in write.dask.keys() if isinstance(val,str) and val.startswith('NetCDF')][0] wrapper = write.dask[key] del wrapper.datastore.delayed_store` It doesn't seem to change the runtime though.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Large pickle overhead in ds.to_netcdf() involving dask.delayed functions 355264812
417242425	https://github.com/pydata/xarray/issues/2389#issuecomment-417242425	https://api.github.com/repos/pydata/xarray/issues/2389	MDEyOklzc3VlQ29tbWVudDQxNzI0MjQyNQ==	aseyboldt 1882397	2018-08-30T08:53:21Z	2018-08-30T08:53:21Z	NONE	Ah, that seems to do the trick. I get about 4.5s for both now, and the time spent pickling stuff is down to reasonable levels (0.022s). Also the number of function calls dropped from 1e8 to 3e5 :-) There still seems to be some inefficiency in the pickled graph output, I'm getting a warning about large objects in the graph: ``` /Users/adrianseyboldt/anaconda3/lib/python3.6/site-packages/distributed/worker.py:840: UserWarning: Large object of size 1.31 MB detected in task graph: ('store-03165bae-ac28-11e8-b137-56001c88cd01', <xa ... t 0x316112cc0>) Consider scattering large objects ahead of time with client.scatter to reduce scheduler burden and keep data on workers `future = client.submit(func, big_data) # bad big_future = client.scatter(big_data) # good future = client.submit(func, big_future) # good` % (format_bytes(len(b)), s)) ``` The size scales linearly with the number of chunks (it is 13MB if there are 5000 chunks). This doesn't seem to be nearly as problematic as the original issue though. This is after applying both #2391 and #2261.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Large pickle overhead in ds.to_netcdf() involving dask.delayed functions 355264812
417060359	https://github.com/pydata/xarray/issues/2389#issuecomment-417060359	https://api.github.com/repos/pydata/xarray/issues/2389	MDEyOklzc3VlQ29tbWVudDQxNzA2MDM1OQ==	aseyboldt 1882397	2018-08-29T18:37:57Z	2018-08-29T18:40:16Z	NONE	pangeo-data/pangeo#266 sounds somewhat similar. If you increase the size of the involved arrays here, you also end up with warnings about the size of the graph: https://stackoverflow.com/questions/52039697/how-to-avoid-large-objects-in-task-graph I haven't tried with #2261 applied, but I can try that tomorrow. If we interpret the time spent in `_thread.lock` as the time the main process is waiting for the workers, then that doesn't seem to be the main problem here. We spend 60s in pickle (almost all the time), and only 7s waiting for locks. I tried looking at the contents of the graph a bit (`write.dask.dicts`) and compared that to the graph of the dataset itself (`ds.vals.data.dask.dicts`). I can't pickle those for some reason (that would be great to see where it is spending all that time), but it looks like those entries are the main difference: `( <function dask.array.core.store_chunk(x, out, index, lock, return_stored)>, ( 'stack-6ab3acdaa825862b99d6dbe1c75f0392', 478 ), <xarray.backends.netCDF4_.NetCDF4ArrayWrapper at 0x32fc365c0>, (slice(478, 479, None), ), CombinedLock([<SerializableLock: 0ccceef3-44cd-41ed-947c-f7041ae280c8>, <distributed.lock.Lock object at 0x32fb058d0>]), False),` I don't really know how they work, but maybe pickling those NetCDF4ArrayWrapper objects is expensive (i.e. they contain a reference to something they shouldn't)?	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Large pickle overhead in ds.to_netcdf() involving dask.delayed functions 355264812
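
The comments above report that the pickled size of the wrapper grows linearly with the number of chunks and shrinks once the reference to the graph is deleted. The effect can be sketched with the standard library alone. This is only a rough illustration, not xarray's or dask's actual code: the `wrapper` dict, its `delayed_store` key, and the toy `graph` are hypothetical stand-ins for the real objects.

```python
import pickle


def pickled_size(n_chunks, keep_reference):
    """Return the pickled size of a toy wrapper for a given chunk count."""
    # Toy "graph": one small picklable entry per chunk (hypothetical layout).
    graph = {
        ("store", i): ("write_chunk", i, slice(i, i + 1))
        for i in range(n_chunks)
    }
    # Hypothetical stand-in for the wrapper object sent to the workers.
    # When keep_reference is True it drags the whole graph along with it.
    wrapper = {
        "name": "vals",
        "delayed_store": graph if keep_reference else None,
    }
    return len(pickle.dumps(wrapper))


small = pickled_size(5, keep_reference=True)
large = pickled_size(500, keep_reference=True)
stripped = pickled_size(500, keep_reference=False)

# The size grows with the chunk count only while the reference is kept.
print(small, large, stripped)
```

Because every task that carries such a wrapper would serialize the whole graph again, dropping the reference before pickling keeps the payload roughly constant regardless of chunk count.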


CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);
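
The filter and sort described at the top of this page can be reproduced against the schema with Python's built-in `sqlite3` module. This is a minimal sketch, assuming an in-memory database and a trimmed copy of the table (only the columns the query touches, with no foreign keys); the ids and `updated_at` values are the three rows listed above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Trimmed version of the issue_comments table: only the columns used here.
conn.execute(
    """
    CREATE TABLE issue_comments (
        id INTEGER PRIMARY KEY,
        updated_at TEXT,
        author_association TEXT,
        issue INTEGER
    )
    """
)
conn.execute("CREATE INDEX idx_issue_comments_issue ON issue_comments (issue)")

# The three rows from the table above, inserted out of order on purpose.
conn.executemany(
    "INSERT INTO issue_comments VALUES (?, ?, ?, ?)",
    [
        (417060359, "2018-08-29T18:40:16Z", "NONE", 355264812),
        (417252006, "2018-08-30T09:48:40Z", "NONE", 355264812),
        (417242425, "2018-08-30T08:53:21Z", "NONE", 355264812),
    ],
)

# ISO 8601 timestamps stored as TEXT sort correctly as plain strings.
rows = conn.execute(
    """
    SELECT id, updated_at
    FROM issue_comments
    WHERE author_association = ? AND issue = ?
    ORDER BY updated_at DESC
    """,
    ("NONE", 355264812),
).fetchall()

ids = [row[0] for row in rows]
print(ids)
```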
