
issue_comments

6 rows where author_association = "NONE" and user = 1882397 sorted by updated_at descending

id html_url issue_url node_id user created_at updated_at ▲ author_association body reactions performed_via_github_app issue
417252006 https://github.com/pydata/xarray/issues/2389#issuecomment-417252006 https://api.github.com/repos/pydata/xarray/issues/2389 MDEyOklzc3VlQ29tbWVudDQxNzI1MjAwNg== aseyboldt 1882397 2018-08-30T09:23:20Z 2018-08-30T09:48:40Z NONE

It seems the xarray object that is sent to the workers contains a reference to the complete graph:

```python
vals = da.random.random((5, 1), chunks=(1, 1))
ds = xr.Dataset({'vals': (['a', 'b'], vals)})
write = ds.to_netcdf('file2.nc', compute=False)

key = [val for val in write.dask.keys()
       if isinstance(val, str) and val.startswith('NetCDF')][0]
wrapper = write.dask[key]
len(pickle.dumps(wrapper))
# 14652

delayed_store = wrapper.datastore.delayed_store
len(pickle.dumps(delayed_store))
# 14652

dask.visualize(delayed_store)
```

The size jumps to 1.3 MB if I use 500 chunks again.

The warning about the large object in the graph disappears if we delete that reference before we execute the graph:

```python
key = [val for val in write.dask.keys()
       if isinstance(val, str) and val.startswith('NetCDF')][0]
wrapper = write.dask[key]
del wrapper.datastore.delayed_store
```

It doesn't seem to change the runtime though.
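A quick sketch of how the chunk scaling mentioned above could be checked (my addition, not part of the original comment; it assumes the same environment and a writable working directory):

```python
# Rough sketch: measure how the pickled size of the NetCDF store wrapper grows
# with the number of chunks (~14 kB at 5 chunks, ~1.3 MB at 500 per the comment).
import pickle

import dask.array as da
import xarray as xr

for n_chunks in (5, 50, 500):
    vals = da.random.random((n_chunks, 1), chunks=(1, 1))
    ds = xr.Dataset({'vals': (['a', 'b'], vals)})
    write = ds.to_netcdf(f'scaling_{n_chunks}.nc', compute=False)
    key = [k for k in write.dask.keys()
           if isinstance(k, str) and k.startswith('NetCDF')][0]
    # Expect roughly linear growth with n_chunks.
    print(n_chunks, len(pickle.dumps(write.dask[key])))
```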

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Large pickle overhead in ds.to_netcdf() involving dask.delayed functions 355264812
417242425 https://github.com/pydata/xarray/issues/2389#issuecomment-417242425 https://api.github.com/repos/pydata/xarray/issues/2389 MDEyOklzc3VlQ29tbWVudDQxNzI0MjQyNQ== aseyboldt 1882397 2018-08-30T08:53:21Z 2018-08-30T08:53:21Z NONE

Ah, that seems to do the trick. I get about 4.5s for both now, and the time spent pickling stuff is down to reasonable levels (0.022s). Also, the number of function calls dropped from 1e8 to 3e5 :-)

There still seems to be some inefficiency in the pickled graph output; I'm getting a warning about large objects in the graph:

```
/Users/adrianseyboldt/anaconda3/lib/python3.6/site-packages/distributed/worker.py:840: UserWarning: Large object of size 1.31 MB detected in task graph:
  ('store-03165bae-ac28-11e8-b137-56001c88cd01', <xa ... t 0x316112cc0>)
Consider scattering large objects ahead of time
with client.scatter to reduce scheduler burden and
keep data on workers

    future = client.submit(func, big_data)    # bad

    big_future = client.scatter(big_data)     # good
    future = client.submit(func, big_future)  # good

  % (format_bytes(len(b)), s))
```

The size scales linearly with the number of chunks (it is 13MB if there are 5000 chunks). This doesn't seem to be nearly as problematic as the original issue though.

This is after applying both #2391 and #2261.
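For reference, a self-contained version of the scatter pattern that the warning recommends (my own illustration with made-up data; `Client()` here assumes a local distributed cluster):

```python
# Illustration of the client.scatter pattern: ship a large object to the
# workers once and pass the resulting future into tasks, so the task graph
# only carries a small key instead of the data itself.
import numpy as np
from dask.distributed import Client

client = Client()                          # local cluster, for demonstration
big_data = np.random.random(10_000_000)    # ~80 MB array (hypothetical payload)

bad = client.submit(np.mean, big_data)     # bad: the array is embedded in the graph
big_future = client.scatter(big_data)      # good: data is sent to a worker once
good = client.submit(np.mean, big_future)  # good: the task only references the future
print(bad.result(), good.result())
```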

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Large pickle overhead in ds.to_netcdf() involving dask.delayed functions 355264812
417060359 https://github.com/pydata/xarray/issues/2389#issuecomment-417060359 https://api.github.com/repos/pydata/xarray/issues/2389 MDEyOklzc3VlQ29tbWVudDQxNzA2MDM1OQ== aseyboldt 1882397 2018-08-29T18:37:57Z 2018-08-29T18:40:16Z NONE

pangeo-data/pangeo#266 sounds somewhat similar. If you increase the size of the involved arrays here, you also end up with warnings about the size of the graph: https://stackoverflow.com/questions/52039697/how-to-avoid-large-objects-in-task-graph

I haven't tried with #2261 applied, but I can try that tomorrow.

If we interpret the time spent in `_thread.lock` as the time the main process is waiting for the workers, then that doesn't seem to be the main problem here. We spend 60s in pickle (almost all the time), and only 7s waiting for locks.

I tried looking at the contents of the graph a bit (`write.dask.dicts`) and compared that to the graph of the dataset itself (`ds.vals.data.dask.dicts`). I can't pickle those for some reason (that would be great for seeing where it is spending all that time), but it looks like entries like this one are the main difference:

```
(<function dask.array.core.store_chunk(x, out, index, lock, return_stored)>,
 ('stack-6ab3acdaa825862b99d6dbe1c75f0392', 478),
 <xarray.backends.netCDF4_.NetCDF4ArrayWrapper at 0x32fc365c0>,
 (slice(478, 479, None),),
 CombinedLock([<SerializableLock: 0ccceef3-44cd-41ed-947c-f7041ae280c8>,
               <distributed.lock.Lock object at 0x32fb058d0>]),
 False)
```

I don't really know how they work, but maybe pickling those NetCDF4ArrayWrapper objects is expensive (i.e. they contain a reference to something they shouldn't)?
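Since the graph dicts apparently don't pickle as a whole, one way to see where the bytes go (my suggestion, not from the comment; it reuses the `write` object from the earlier snippets) is to pickle each task individually and rank them by serialized size:

```python
# Diagnostic sketch: pickle each task of the write graph separately to find
# which entries dominate the serialized size.
import pickle

sizes = []
for key, task in dict(write.dask).items():
    try:
        sizes.append((len(pickle.dumps(task)), key))
    except Exception:
        pass  # some tasks may not pickle in isolation

# Print the ten largest entries.
for size, key in sorted(sizes, key=lambda t: t[0], reverse=True)[:10]:
    print(size, key)
```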

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Large pickle overhead in ds.to_netcdf() involving dask.delayed functions 355264812
409992634 https://github.com/pydata/xarray/pull/2309#issuecomment-409992634 https://api.github.com/repos/pydata/xarray/issues/2309 MDEyOklzc3VlQ29tbWVudDQwOTk5MjYzNA== aseyboldt 1882397 2018-08-02T16:45:44Z 2018-08-02T16:45:44Z NONE

I noticed and appreciate those plotting additions :-)

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  DOC: add initial draft of a development roadmap for xarray 344093951
409828598 https://github.com/pydata/xarray/pull/2309#issuecomment-409828598 https://api.github.com/repos/pydata/xarray/issues/2309 MDEyOklzc3VlQ29tbWVudDQwOTgyODU5OA== aseyboldt 1882397 2018-08-02T07:08:27Z 2018-08-02T07:08:27Z NONE

This roadmap sounds great! We plan on using xarray to store all traces in pymc4. The biggest item on my wish list for that is better support for plotting. I regularly convert traces to pandas and use the plotting functions there, or I convert them so that I can use seaborn. Better support for hierarchical indexes sounds useful as well; they can be a bit surprising right now, and they can't be serialised to netCDF.

The documentation is quite important for our use case. As it is, we are asking our users to learn Bayesian stats, how to diagnose sampler issues, and possibly some theano or tensorflow. When we switch to xarray, our users will have to learn the basics of that as well. Most won't even have heard of netCDF. One thing in particular that I noticed is that many people seem to get confused about the difference between coordinates and dimensions at some point.
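As a side note, a tiny example (mine, with made-up trace-like data, not from the comment) of the dimensions-versus-coordinates distinction mentioned above: dimensions are just the names of the axes, while coordinates are optional label arrays attached to them.

```python
# Minimal illustration of dimensions vs. coordinates in xarray.
import numpy as np
import xarray as xr

trace = xr.DataArray(
    np.random.randn(4, 100),
    dims=("chain", "draw"),                # dimensions: names of the two axes
    coords={"chain": [0, 1, 2, 3],         # coordinates: labels along those axes
            "draw": np.arange(100)},
)
print(trace.dims)                 # ('chain', 'draw')
print(trace.sel(chain=2).shape)   # select by coordinate label along 'chain' -> (100,)
```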

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  DOC: add initial draft of a development roadmap for xarray 344093951
349771464 https://github.com/pydata/xarray/issues/1761#issuecomment-349771464 https://api.github.com/repos/pydata/xarray/issues/1761 MDEyOklzc3VlQ29tbWVudDM0OTc3MTQ2NA== aseyboldt 1882397 2017-12-06T20:54:18Z 2017-12-06T20:54:18Z NONE

@jhamman Ah, I didn't see that. In hindsight the install docs seem like an obvious place to look for info about this...

@maxim-lian That pandas solution would have saved me a bit of debugging :-) I guess another option would be to only replace functions that are actually present in the installed Bottleneck. As long as new releases only add and never change functions, this would allow xarray to use even very recent additions to bottleneck.
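A minimal sketch of that suggestion (my own; the helper name `_reduce_func` is hypothetical): use a bottleneck function only if the installed version provides it, and fall back to numpy otherwise.

```python
# Hypothetical sketch: prefer bottleneck only for the functions the installed
# version actually provides, falling back to numpy for everything else.
import numpy as np

try:
    import bottleneck as bn
except ImportError:
    bn = None

def _reduce_func(name):
    """Return bottleneck's implementation of `name` if available, else numpy's."""
    if bn is not None and hasattr(bn, name):
        return getattr(bn, name)
    return getattr(np, name)

nanmean = _reduce_func("nanmean")  # works even if the installed bottleneck predates nanmean
```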

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Importing xarray fails if old version of bottleneck is installed 279456192

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);