
issue_comments


11 rows where issue = 355264812 sorted by updated_at descending

id html_url issue_url node_id user created_at updated_at ▲ author_association body reactions performed_via_github_app issue
419218306 https://github.com/pydata/xarray/issues/2389#issuecomment-419218306 https://api.github.com/repos/pydata/xarray/issues/2389 MDEyOklzc3VlQ29tbWVudDQxOTIxODMwNg== shoyer 1217238 2018-09-06T19:46:03Z 2018-09-06T19:46:03Z MEMBER

Removing the self-references to the dask graphs in #2261 seems to resolve the performance issue on its own.

I would be interested if https://github.com/pydata/xarray/pull/2391 still improves performance in any real-world use cases -- perhaps it helps when working with a real cluster or on large datasets? I can't see any difference in my local benchmarks using dask-distributed.
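
A minimal sketch of the kind of local dask-distributed benchmark being described; the dataset shape, chunking, and file name are illustrative assumptions, not from the thread:

```python
import time

import numpy as np
import xarray as xr
from dask.distributed import Client

client = Client()  # local distributed cluster, as in the benchmarks mentioned above

# Many small chunks, to exaggerate any per-task serialization overhead.
ds = xr.Dataset({"vals": (["a", "b"], np.random.random((500, 100)))}).chunk({"a": 1})

start = time.perf_counter()
ds.to_netcdf("benchmark.nc")
print(f"to_netcdf took {time.perf_counter() - start:.1f}s")
```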

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Large pickle overhead in ds.to_netcdf() involving dask.delayed functions 355264812
417380229 https://github.com/pydata/xarray/issues/2389#issuecomment-417380229 https://api.github.com/repos/pydata/xarray/issues/2389 MDEyOklzc3VlQ29tbWVudDQxNzM4MDIyOQ== shoyer 1217238 2018-08-30T16:24:07Z 2018-08-30T16:24:07Z MEMBER

OK, so it seems like the complete solution here should involve refactoring our backend classes to avoid any references to objects storing dask graphs. This is a cleaner solution regardless of the pickle overhead, because it allows us to eliminate all state stored in backend classes. I'll get on that in #2261.
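
A rough, hypothetical sketch of the anti-pattern being removed here; the class and attribute names are illustrative, not xarray's actual backend API:

```python
class DataStoreHoldingGraph:
    """Backend-like object that keeps a reference to a dask.delayed result.

    Every write task references the datastore, and the datastore references
    the delayed store (and thus the whole graph), so pickling any one task
    drags the entire graph along with it.
    """

    def __init__(self, target):
        self.target = target
        self.delayed_store = None  # later set to a dask.delayed object


class StatelessDataStore:
    """After the refactor: no graph-holding state is stored on the backend,
    so pickling the datastore stays cheap."""

    def __init__(self, target):
        self.target = target
```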

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Large pickle overhead in ds.to_netcdf() involving dask.delayed functions 355264812
417252006 https://github.com/pydata/xarray/issues/2389#issuecomment-417252006 https://api.github.com/repos/pydata/xarray/issues/2389 MDEyOklzc3VlQ29tbWVudDQxNzI1MjAwNg== aseyboldt 1882397 2018-08-30T09:23:20Z 2018-08-30T09:48:40Z NONE

It seems the xarray object that is sent to the workers contains a reference to the complete graph:

```python
import pickle

import dask
import dask.array as da
import xarray as xr

vals = da.random.random((5, 1), chunks=(1, 1))
ds = xr.Dataset({'vals': (['a', 'b'], vals)})
write = ds.to_netcdf('file2.nc', compute=False)

key = [val for val in write.dask.keys() if isinstance(val, str) and val.startswith('NetCDF')][0]
wrapper = write.dask[key]
len(pickle.dumps(wrapper))
# 14652

delayed_store = wrapper.datastore.delayed_store
len(pickle.dumps(delayed_store))
# 14652

dask.visualize(delayed_store)
```

The size jumps to 1.3MB if I use 500 chunks again.

The warning about the large object in the graph disappears if we delete that reference before we execute the graph:

    key = [val for val in write.dask.keys() if isinstance(val, str) and val.startswith('NetCDF')][0]
    wrapper = write.dask[key]
    del wrapper.datastore.delayed_store

It doesn't seem to change the runtime, though.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Large pickle overhead in ds.to_netcdf() involving dask.delayed functions 355264812
417242425 https://github.com/pydata/xarray/issues/2389#issuecomment-417242425 https://api.github.com/repos/pydata/xarray/issues/2389 MDEyOklzc3VlQ29tbWVudDQxNzI0MjQyNQ== aseyboldt 1882397 2018-08-30T08:53:21Z 2018-08-30T08:53:21Z NONE

Ah, that seems to do the trick. I get about 4.5s for both now, and the time spent pickling stuff is down to reasonable levels (0.022s). Also, the number of function calls dropped from 1e8 to 3e5 :-)

There still seems to be some inefficiency in the pickled graph output; I'm getting a warning about large objects in the graph:

```
/Users/adrianseyboldt/anaconda3/lib/python3.6/site-packages/distributed/worker.py:840: UserWarning: Large object of size 1.31 MB detected in task graph:
  ('store-03165bae-ac28-11e8-b137-56001c88cd01', <xa ... t 0x316112cc0>)
Consider scattering large objects ahead of time
with client.scatter to reduce scheduler burden and
keep data on workers

    future = client.submit(func, big_data)    # bad

    big_future = client.scatter(big_data)     # good
    future = client.submit(func, big_future)  # good

  % (format_bytes(len(b)), s))
```

The size scales linearly with the number of chunks (it is 13MB if there are 5000 chunks). This doesn't seem to be nearly as problematic as the original issue though.

This is after applying both #2391 and #2261.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Large pickle overhead in ds.to_netcdf() involving dask.delayed functions 355264812
417176707 https://github.com/pydata/xarray/issues/2389#issuecomment-417176707 https://api.github.com/repos/pydata/xarray/issues/2389 MDEyOklzc3VlQ29tbWVudDQxNzE3NjcwNw== shoyer 1217238 2018-08-30T03:18:33Z 2018-08-30T03:18:33Z MEMBER

Give https://github.com/pydata/xarray/pull/2391 a try -- in my testing, it speeds up both examples to only take about 3 seconds each.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Large pickle overhead in ds.to_netcdf() involving dask.delayed functions 355264812
417076999 https://github.com/pydata/xarray/issues/2389#issuecomment-417076999 https://api.github.com/repos/pydata/xarray/issues/2389 MDEyOklzc3VlQ29tbWVudDQxNzA3Njk5OQ== mrocklin 306380 2018-08-29T19:32:17Z 2018-08-29T19:32:17Z MEMBER

I wouldn't expect this to sway things too much, but yes, there is a chance that that would happen.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Large pickle overhead in ds.to_netcdf() involving dask.delayed functions 355264812
417076301 https://github.com/pydata/xarray/issues/2389#issuecomment-417076301 https://api.github.com/repos/pydata/xarray/issues/2389 MDEyOklzc3VlQ29tbWVudDQxNzA3NjMwMQ== shoyer 1217238 2018-08-29T19:29:56Z 2018-08-29T19:29:56Z MEMBER

If I understand the heuristics used by dask's schedulers correctly, a data dependency might actually be a good idea here, because it would encourage colocating write tasks on the same machines. We should probably give this a try.

On Wed, Aug 29, 2018 at 12:15 PM Matthew Rocklin notifications@github.com wrote:

> It would be nice if dask had a way to consolidate the serialization of these objects, rather than separately serializing them in each task.
>
> You can make it a separate task (often done by wrapping with dask.delayed) and then use that key within other objects. This does create a data dependency though, which can make the graph somewhat more complex.
>
> In normal use of Pickle these things are cached and reused. Unfortunately we can't do this because we're sending the tasks to different machines, each of which will need to deserialize independently.


{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Large pickle overhead in ds.to_netcdf() involving dask.delayed functions 355264812
417072024 https://github.com/pydata/xarray/issues/2389#issuecomment-417072024 https://api.github.com/repos/pydata/xarray/issues/2389 MDEyOklzc3VlQ29tbWVudDQxNzA3MjAyNA== mrocklin 306380 2018-08-29T19:15:10Z 2018-08-29T19:15:10Z MEMBER

> It would be nice if dask had a way to consolidate the serialization of these objects, rather than separately serializing them in each task.

You can make it a separate task (often done by wrapping with dask.delayed) and then use that key within other objects. This does create a data dependency though, which can make the graph somewhat more complex.

In normal use of Pickle these things are cached and reused. Unfortunately we can't do this because we're sending the tasks to different machines, each of which will need to deserialize independently.
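
A hedged sketch of the pattern described above: wrapping a shared, expensive-to-pickle object with dask.delayed so it becomes its own task and is serialized once. `ExpensiveWrapper` and `write_chunk` are hypothetical stand-ins, not xarray code:

```python
import dask


class ExpensiveWrapper:
    """Stand-in for an object that is costly to pickle."""
    def __init__(self):
        self.payload = bytes(10_000_000)  # pretend this is expensive state


def write_chunk(wrapper, i):
    """Stand-in for a per-chunk write task that needs the wrapper."""
    return len(wrapper.payload) + i


# Without this, the wrapper would be embedded (and pickled) into every task.
# As a delayed value it becomes a single task, referenced by key from the
# others, at the cost of an extra data dependency in the graph.
shared = dask.delayed(ExpensiveWrapper())

results = [dask.delayed(write_chunk)(shared, i) for i in range(100)]
dask.compute(*results)
```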

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Large pickle overhead in ds.to_netcdf() involving dask.delayed functions 355264812
417066100 https://github.com/pydata/xarray/issues/2389#issuecomment-417066100 https://api.github.com/repos/pydata/xarray/issues/2389 MDEyOklzc3VlQ29tbWVudDQxNzA2NjEwMA== shoyer 1217238 2018-08-29T18:55:39Z 2018-08-29T18:55:39Z MEMBER

> I don't really know how they work, but maybe pickling those NetCDF4ArrayWrapper objects is expensive (i.e. they contain a reference to something they shouldn't)?

This seems plausible to me, though the situation is likely improved with #2261. It would be nice if dask had a way to consolidate the serialization of these objects, rather than separately serializing them in each task. It's not obvious to me how to do that in xarray short of manually building task graphs so those NetCDF4ArrayWrapper objects are created by dedicated tasks.

CC @mrocklin in case he has thoughts here
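
A rough illustration (not from the thread) of what a manually built task graph with a dedicated wrapper task could look like; `make_wrapper` and `store_chunk` are hypothetical stand-ins for the real backend machinery:

```python
import dask


def make_wrapper():
    """Stand-in for constructing the array wrapper once, in its own task."""
    return object()


def store_chunk(wrapper, i):
    """Stand-in for writing one chunk using the shared wrapper."""
    return i


graph = {
    "wrapper": (make_wrapper,),                 # wrapper built by a dedicated task
    ("store", 0): (store_chunk, "wrapper", 0),  # other tasks reference it by key,
    ("store", 1): (store_chunk, "wrapper", 1),  # so it is serialized only once
}

dask.get(graph, [("store", 0), ("store", 1)])
```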

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Large pickle overhead in ds.to_netcdf() involving dask.delayed functions 355264812
417060359 https://github.com/pydata/xarray/issues/2389#issuecomment-417060359 https://api.github.com/repos/pydata/xarray/issues/2389 MDEyOklzc3VlQ29tbWVudDQxNzA2MDM1OQ== aseyboldt 1882397 2018-08-29T18:37:57Z 2018-08-29T18:40:16Z NONE

pangeo-data/pangeo#266 sounds somewhat similar. If you increase the size of the involved arrays here, you also end up with warnings about the size of the graph: https://stackoverflow.com/questions/52039697/how-to-avoid-large-objects-in-task-graph

I haven't tried with #2261 applied, but I can try that tomorrow.

If we interpret the time spent in `_thread.lock` as the time the main process is waiting for the workers, then that doesn't seem to be the main problem here. We spend 60s in pickle (almost all the time), and only 7s waiting for locks.

I tried looking at the contents of the graph a bit (`write.dask.dicts`) and compared that to the graph of the dataset itself (`ds.vals.data.dask.dicts`). I can't pickle those for some reason (that would be great, to see where it is spending all that time), but it looks like those entries are the main difference:

    (<function dask.array.core.store_chunk(x, out, index, lock, return_stored)>,
     ('stack-6ab3acdaa825862b99d6dbe1c75f0392', 478),
     <xarray.backends.netCDF4_.NetCDF4ArrayWrapper at 0x32fc365c0>,
     (slice(478, 479, None),),
     CombinedLock([<SerializableLock: 0ccceef3-44cd-41ed-947c-f7041ae280c8>, <distributed.lock.Lock object at 0x32fb058d0>]),
     False)

I don't really know how they work, but maybe pickling those NetCDF4ArrayWrapper objects is expensive (i.e. they contain a reference to something they shouldn't)?
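
A hedged sketch (not from the thread) of one way to narrow this down: pickle each graph entry individually, skipping entries that fail, and see which keys dominate. The small example dataset mirrors the snippet earlier on this page; sizes and shapes are illustrative:

```python
import pickle
import time

import dask.array as da
import xarray as xr

vals = da.random.random((500, 1), chunks=(1, 1))
ds = xr.Dataset({'vals': (['a', 'b'], vals)})
write = ds.to_netcdf('file2.nc', compute=False)

sizes = {}
for key, task in dict(write.dask).items():
    try:
        start = time.perf_counter()
        data = pickle.dumps(task)
        sizes[key] = (len(data), time.perf_counter() - start)
    except Exception:
        continue  # some entries (e.g. ones holding locks) may refuse to pickle

# Report the five largest entries.
for key, (nbytes, secs) in sorted(sizes.items(), key=lambda kv: -kv[1][0])[:5]:
    print(key, nbytes, f"{secs:.4f}s")
```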

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Large pickle overhead in ds.to_netcdf() involving dask.delayed functions 355264812
417047186 https://github.com/pydata/xarray/issues/2389#issuecomment-417047186 https://api.github.com/repos/pydata/xarray/issues/2389 MDEyOklzc3VlQ29tbWVudDQxNzA0NzE4Ng== shoyer 1217238 2018-08-29T17:59:24Z 2018-08-29T17:59:24Z MEMBER

Offhand, I don't know why dask.delayed should be adding this much overhead. One possibility is that when tasks are pickled (as is done by dask-distributed), the tasks are much larger because the delayed function gets serialized into each task. It does seem like pickling can add a significant amount of overhead in some cases when using xarray with dask for serialization: https://github.com/pangeo-data/pangeo/issues/266

I'm not super familiar with profiling dask, but it might be worth looking at dask's diagnostics tools (http://dask.pydata.org/en/latest/understanding-performance.html) to understand what's going on here. The appearance of _thread.lock at the top of these profiles is a good indication that we aren't measuring where most of the computation is happening.

It would also be interesting to see if this changes with the xarray backend refactor from https://github.com/pydata/xarray/pull/2261.
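
A minimal sketch of using dask's built-in diagnostics as suggested above, assuming the local (non-distributed) scheduler; the dataset and file name are illustrative:

```python
import dask.array as da
import xarray as xr
from dask.diagnostics import Profiler, ResourceProfiler, visualize

vals = da.random.random((500, 1), chunks=(1, 1))
ds = xr.Dataset({'vals': (['a', 'b'], vals)})

# Record per-task timings and resource usage while the file is written.
with Profiler() as prof, ResourceProfiler() as rprof:
    ds.to_netcdf('profiled.nc')

visualize([prof, rprof])  # renders an interactive bokeh timeline of the tasks
```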

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Large pickle overhead in ds.to_netcdf() involving dask.delayed functions 355264812

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);