github: issue_comments: 6 rows where issue = 355264812 and user = 1217238 sorted by updated_at descending

id	html_url	issue_url	node_id	user	created_at	updated_at	author_association	body	reactions	issue
419218306	https://github.com/pydata/xarray/issues/2389#issuecomment-419218306	https://api.github.com/repos/pydata/xarray/issues/2389	MDEyOklzc3VlQ29tbWVudDQxOTIxODMwNg==	shoyer 1217238	2018-09-06T19:46:03Z	2018-09-06T19:46:03Z	MEMBER	Removing the self-references to the dask graphs in #2261 seems to resolve the performance issue on its own. I would be interested if https://github.com/pydata/xarray/pull/2391 still improves performance in any real world use cases -- perhaps it helps when working with a real cluster or on large datasets? I can't see any difference in my local benchmarks using dask-distributed.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Large pickle overhead in ds.to_netcdf() involving dask.delayed functions 355264812
417380229	https://github.com/pydata/xarray/issues/2389#issuecomment-417380229	https://api.github.com/repos/pydata/xarray/issues/2389	MDEyOklzc3VlQ29tbWVudDQxNzM4MDIyOQ==	shoyer 1217238	2018-08-30T16:24:07Z	2018-08-30T16:24:07Z	MEMBER	OK, so it seems like the complete solution here should involve refactoring our backend classes to avoid any references to objects storing dask graphs. This is a cleaner solution even regardless of the pickle overhead because it allows us to eliminate all state stored in backend classes. I'll get on that in #2261.	{ "total_count": 1, "+1": 1, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Large pickle overhead in ds.to_netcdf() involving dask.delayed functions 355264812
417176707	https://github.com/pydata/xarray/issues/2389#issuecomment-417176707	https://api.github.com/repos/pydata/xarray/issues/2389	MDEyOklzc3VlQ29tbWVudDQxNzE3NjcwNw==	shoyer 1217238	2018-08-30T03:18:33Z	2018-08-30T03:18:33Z	MEMBER	Give https://github.com/pydata/xarray/pull/2391 a try -- in my testing, it speeds up both examples to only take about 3 seconds each.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Large pickle overhead in ds.to_netcdf() involving dask.delayed functions 355264812
417076301	https://github.com/pydata/xarray/issues/2389#issuecomment-417076301	https://api.github.com/repos/pydata/xarray/issues/2389	MDEyOklzc3VlQ29tbWVudDQxNzA3NjMwMQ==	shoyer 1217238	2018-08-29T19:29:56Z	2018-08-29T19:29:56Z	MEMBER	If I understand the heuristics used by dask's schedulers correctly, a data dependency might actually be a good idea here because it would encourage colocating write tasks on the same machines. We should probably give this a try. On Wed, Aug 29, 2018 at 12:15 PM Matthew Rocklin wrote: It would be nice if dask had a way to consolidate the serialization of these objects, rather than separately serializing them in each task. You can make it a separate task (often done by wrapping with dask.delayed) and then use that key within other objects. This does create a data dependency though, which can make the graph somewhat more complex. In normal use of Pickle these things are cached and reused. Unfortunately we can't do this because we're sending the tasks to different machines, each of which will need to deserialize independently.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Large pickle overhead in ds.to_netcdf() involving dask.delayed functions 355264812
417066100	https://github.com/pydata/xarray/issues/2389#issuecomment-417066100	https://api.github.com/repos/pydata/xarray/issues/2389	MDEyOklzc3VlQ29tbWVudDQxNzA2NjEwMA==	shoyer 1217238	2018-08-29T18:55:39Z	2018-08-29T18:55:39Z	MEMBER	I don't really know how they work, but maybe pickling those NetCDF4ArrayWrapper objects is expensive (i.e. they contain a reference to something they shouldn't)? This seems plausible to me, though the situation is likely improved with #2261. It would be nice if dask had a way to consolidate the serialization of these objects, rather than separately serializing them in each task. It's not obvious to me how to do that in xarray short of manually building task graphs so those `NetCDF4ArrayWrapper` objects are created by dedicated tasks. CC @mrocklin in case he has thoughts here	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Large pickle overhead in ds.to_netcdf() involving dask.delayed functions 355264812
417047186	https://github.com/pydata/xarray/issues/2389#issuecomment-417047186	https://api.github.com/repos/pydata/xarray/issues/2389	MDEyOklzc3VlQ29tbWVudDQxNzA0NzE4Ng==	shoyer 1217238	2018-08-29T17:59:24Z	2018-08-29T17:59:24Z	MEMBER	Offhand, I don't know why `dask.delayed` should be adding this much overhead. One possibility is that when tasks are pickled (as is done by dask-distributed), the tasks are much larger because the delayed function gets serialized into each task. It does seem like pickling can add a significant amount of overhead in some cases when using xarray with dask for serialization: https://github.com/pangeo-data/pangeo/issues/266 I'm not super familiar with profiling dask, but it might be worth looking at dask's diagnostics tools (http://dask.pydata.org/en/latest/understanding-performance.html) to understand what's going on here. The appearance of `_thread.lock` at the top of these profiles is a good indication that we aren't measuring where most of the computation is happening. It would also be interesting to see if this changes with the xarray backend refactor from https://github.com/pydata/xarray/pull/2261.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Large pickle overhead in ds.to_netcdf() involving dask.delayed functions 355264812
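
The comments above suggest that the overhead comes from the same object being serialized separately into every task, whereas a single pickle call would cache and reuse it. The following is a minimal sketch of that effect using only the standard library; it does not use dask or xarray, and the `wrapper` dict is a hypothetical stand-in for a backend object that carries large state, not the real `NetCDF4ArrayWrapper`.

```python
import pickle


def serialized_sizes(n_tasks=10, payload_size=100_000):
    """Compare per-task pickling with pickling all tasks in one call."""
    # Hypothetical stand-in for an object that drags along large state.
    wrapper = {"name": "var", "payload": bytes(payload_size)}

    # Each "task" is pickled on its own, as happens when tasks are sent
    # to different machines: the wrapper is written out once per task.
    per_task = sum(len(pickle.dumps((wrapper, i))) for i in range(n_tasks))

    # One pickle call over all tasks: pickle's memo stores the wrapper
    # once and emits short back-references for the repeats.
    together = len(pickle.dumps([(wrapper, i) for i in range(n_tasks)]))
    return per_task, together


per_task, together = serialized_sizes()
print(f"pickled per task: {per_task} bytes, pickled together: {together} bytes")
```

The per-task total grows roughly in proportion to the number of tasks, while the single-call size stays close to one copy of the payload. This is only an illustration of the pickling behaviour discussed in the thread, not a measurement of the issue itself.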

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);