issue_comments


7 rows where issue = 238284894 (Writing directly to a netCDF file while using distributed), sorted by updated_at descending

jhamman (MEMBER) · 2017-12-07T22:27:08Z · https://github.com/pydata/xarray/issues/1464#issuecomment-350113818

> The place to start is probably to write an integration test for this functionality. I notice now that our current tests only check reading netCDF files with dask-distributed:

We should probably also write some tests for saving datasets with save_mfdataset and distributed.
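
A minimal sketch of the kind of test described here, assuming pytest and its built-in tmp_path fixture; the dataset contents, chunk sizes, and cluster settings are illustrative and this is not the test that was eventually added to xarray:

import numpy as np
import xarray as xr
from dask.distributed import Client


def test_save_mfdataset_with_distributed(tmp_path):
    # processes=True forces the task graph, including the write targets,
    # to be pickled and shipped to worker processes.
    with Client(processes=True, n_workers=2, threads_per_worker=1):
        ds = xr.Dataset(
            {"x": ("t", np.arange(10.0))}, coords={"t": np.arange(10)}
        ).chunk({"t": 5})
        datasets = [ds.isel(t=slice(0, 5)), ds.isel(t=slice(5, 10))]
        paths = [str(tmp_path / "part0.nc"), str(tmp_path / "part1.nc")]
        xr.save_mfdataset(datasets, paths)
        # Round-trip to confirm the files were actually written.
        with xr.open_mfdataset(paths) as actual:
            xr.testing.assert_identical(ds, actual)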

Reactions: 👍 1
shoyer (MEMBER) · 2017-11-02T06:29:38Z · https://github.com/pydata/xarray/issues/1464#issuecomment-341329662

I did a little bit of digging here, using @mrocklin's Client(processes=False) trick.

The problem seems to be that the arrays that we add to the writer in AbstractWritableDataStore.set_variables are not pickleable. To be more concrete, consider these lines: https://github.com/pydata/xarray/blob/f83361c76b6aa8cdba8923080bb6b98560cf3a96/xarray/backends/common.py#L221-L232

target is currently a netCDF4.Variable object (or whatever the appropriate backend type is). Anything added to the writer eventually ends up as an argument to dask.array.store and hence gets put into the dask graph. When dask-distributed tries to pickle the dask graph, it fails on the netCDF4.Variable.

What we need to do instead is wrap these target arrays in appropriate array wrappers, e.g., NetCDF4ArrayWrapper, adding __setitem__ methods to the array wrappers if needed. Unlike most backend array types, our array wrappers are pickleable, which is essential for use with dask-distributed.

If anyone's curious, here's the traceback and code I used to debug this: https://gist.github.com/shoyer/4564971a4d030cd43bba8241d3b36c73
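
For illustration only, here is a rough sketch of that idea, not xarray's actual NetCDF4ArrayWrapper: a wrapper that stores only the file path and variable name and reopens the file on each access, so pickling it never touches a live netCDF4.Variable. The class name and the reopen-per-call strategy are assumptions; a real implementation would cache and lock the file handle.

import netCDF4


class PicklableVariableWrapper:
    """Stand-in for a backend array wrapper that survives pickling."""

    def __init__(self, filename, variable_name):
        # Only plain strings are stored, so pickle has nothing to choke on.
        self.filename = filename
        self.variable_name = variable_name

    def __getitem__(self, key):
        with netCDF4.Dataset(self.filename, mode="r") as ds:
            return ds.variables[self.variable_name][key]

    def __setitem__(self, key, value):
        # Reopen in append mode for each write; inefficient but picklable.
        with netCDF4.Dataset(self.filename, mode="a") as ds:
            ds.variables[self.variable_name][key] = value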

Reactions: 👍 1
shoyer (MEMBER) · 2017-06-26T17:10:07Z · https://github.com/pydata/xarray/issues/1464#issuecomment-311122109

I'm a little surprised that this doesn't work, because I thought we had made all our xarray datastore objects pickle-able.

The place to start is probably to write an integration test for this functionality. I notice now that our current tests only check reading netCDF files with dask-distributed: https://github.com/pydata/xarray/blob/master/xarray/tests/test_distributed.py

Reactions: none
mrocklin (MEMBER) · 2017-06-26T16:39:24Z · https://github.com/pydata/xarray/issues/1464#issuecomment-311114129

Presumably there is some object in the task graph that we don't know how to serialize. This can be fixed either in XArray, by not including such an object (recreating it each time or wrapping it instead), or in Dask, by learning how to (de)serialize it.
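
As a quick way to locate the offending object, one can try pickling each task in the graph individually; this helper is just a debugging sketch, not part of xarray or dask:

import pickle


def find_unpicklable_tasks(collection):
    """Return (key, error) pairs for graph entries that fail to pickle."""
    failures = []
    for key, task in dict(collection.__dask_graph__()).items():
        try:
            pickle.dumps(task)
        except Exception as err:  # any failure to pickle is what we're after
            failures.append((key, repr(err)))
    return failures

# Hypothetical usage: find_unpicklable_tasks(ds.x.data) for a dask-backed variable.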

Reactions: none
spencerkclark (MEMBER) · 2017-06-26T01:38:17Z · https://github.com/pydata/xarray/issues/1464#issuecomment-310943792

@shoyer @mrocklin thanks for your quick responses; I can confirm that both the workarounds you suggested work in my case.

Reactions: none
mrocklin (MEMBER) · 2017-06-24T06:17:52Z · https://github.com/pydata/xarray/issues/1464#issuecomment-310817771

It's failing to serialize something in the task graph; I'm not sure what (I'm also surprised that the except clause didn't trigger and log the input). My first guess is that there is an open netCDF file object floating around within the task graph. If so, then we should endeavor to avoid doing this (or have some file object proxy that is (de)serializable).

As a short-term workaround you might try starting a local cluster within the same process.

from dask.distributed import Client
client = Client(processes=False)

This might help you to avoid serialization issues. Generally we should resolve the issue regardless though.

cc'ing @rabernat, who seems to have the most experience here.

Reactions: none
shoyer (MEMBER) · 2017-06-24T06:05:09Z · https://github.com/pydata/xarray/issues/1464#issuecomment-310817117

Hmm. Can you try using scipy as an engine to write the netcdf file?

Honestly I've barely used dask distributed. Possibly @mrocklin has ideas.
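
For reference, that workaround looks roughly like this (the dataset and output path are illustrative; note that the scipy engine can only write netCDF3-classic files):

import numpy as np
import xarray as xr

ds = xr.Dataset({"x": ("t", np.arange(10.0))}).chunk({"t": 5})
# The scipy engine avoids netCDF4.Variable objects entirely.
ds.to_netcdf("output.nc", engine="scipy")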

Reactions: none

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);