issue_comments

5 rows where issue = 334633212 sorted by updated_at descending

id html_url issue_url node_id user created_at updated_at ▲ author_association body reactions performed_via_github_app issue
453866106 https://github.com/pydata/xarray/issues/2242#issuecomment-453866106 https://api.github.com/repos/pydata/xarray/issues/2242 MDEyOklzc3VlQ29tbWVudDQ1Mzg2NjEwNg== jhamman 2443309 2019-01-13T21:13:28Z 2019-01-13T21:13:28Z MEMBER

I just reran the example above and things seem to be resolved now. The write step for the two datasets is basically identical.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  to_netcdf(compute=False) can be slow 334633212
399503156 https://github.com/pydata/xarray/issues/2242#issuecomment-399503156 https://api.github.com/repos/pydata/xarray/issues/2242 MDEyOklzc3VlQ29tbWVudDM5OTUwMzE1Ng== shoyer 1217238 2018-06-22T16:33:11Z 2018-06-22T16:33:11Z MEMBER

This autoclose business is really hard to reason about in its current version, as part of the backend class. I'm hoping that refactoring it out into a separate object that we can use with composition instead of inheritance will help (e.g., alongside PickleByReconstructionWrapper).

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  to_netcdf(compute=False) can be slow 334633212
399495668 https://github.com/pydata/xarray/issues/2242#issuecomment-399495668 https://api.github.com/repos/pydata/xarray/issues/2242 MDEyOklzc3VlQ29tbWVudDM5OTQ5NTY2OA== neishm 1554921 2018-06-22T16:10:45Z 2018-06-22T16:10:45Z CONTRIBUTOR

True, I would expect some performance hit due to writing chunk-by-chunk; however, that same performance hit is present in both of the test cases.

In addition to the snippet @shoyer mentioned, I found that xarray also intentionally uses autoclose=True when writing chunks to netCDF: https://github.com/pydata/xarray/blob/73b476e4db6631b2203954dd5b138cb650e4fb8c/xarray/backends/netCDF4_.py#L45-L48

However, ensure_open only uses autoclose if the file isn't already open:

https://github.com/pydata/xarray/blob/73b476e4db6631b2203954dd5b138cb650e4fb8c/xarray/backends/common.py#L496-L503

So if the file is already open before getting to BaseNetCDF4Array.__setitem__, it will remain open. If the file isn't yet open, it will be opened, then immediately closed after writing the chunk. I suspect this is what's happening in the delayed version: the starting state of NetCDF4DataStore._isopen is False for some reason, and so it is doomed to re-close itself for each chunk processed.

If I remove the autoclose=True from BaseNetCDF4Array.__setitem__, the file remains open and performance is comparable between the two tests.
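
The open/close behaviour described above can be illustrated with a toy model. This is only a sketch, not xarray's code: `FakeStore`, `write_chunk`, and `write_all` are hypothetical names, and the "file" is just a flag plus two counters. It assumes, as described in this comment, that autoclose only takes effect when the store was closed on entry.

```python
from contextlib import contextmanager


class FakeStore:
    """Hypothetical stand-in for a netCDF data store; counts open/close calls."""

    def __init__(self, initially_open):
        self._isopen = initially_open
        self.open_count = 0
        self.close_count = 0

    def open(self):
        self._isopen = True
        self.open_count += 1

    def close(self):
        self._isopen = False
        self.close_count += 1

    @contextmanager
    def ensure_open(self, autoclose):
        # Assumption taken from the comment: autoclose is only honoured
        # when the store was NOT already open on entry.
        if not self._isopen:
            self.open()
            try:
                yield
            finally:
                if autoclose:
                    self.close()
        else:
            yield

    def write_chunk(self, chunk):
        # Mirrors the described __setitem__: always asks for autoclose=True.
        with self.ensure_open(autoclose=True):
            pass  # a real store would assign the chunk into the file here


def write_all(store, n_chunks):
    """Write n_chunks chunks and report (open_count, close_count)."""
    for i in range(n_chunks):
        store.write_chunk(i)
    return store.open_count, store.close_count
```

In this model a store that starts open is never reopened or closed, while a store that starts closed is opened and closed once per chunk, which matches the pattern suspected for the delayed write.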

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  to_netcdf(compute=False) can be slow 334633212
399320127 https://github.com/pydata/xarray/issues/2242#issuecomment-399320127 https://api.github.com/repos/pydata/xarray/issues/2242 MDEyOklzc3VlQ29tbWVudDM5OTMyMDEyNw== jhamman 2443309 2018-06-22T04:51:54Z 2018-06-22T04:51:54Z MEMBER

I think, at least to some extent, the performance hit is to be expected. I don't think we should be opening the file more than once when using the serial or threaded schedulers, so that may be a place where you can find some improvement. There will always be a performance hit when writing dask arrays to netcdf files chunk-by-chunk. For one, there is a threading lock that limits parallel throughput. More importantly, the chunked writes are always going to be slower than larger writes coming directly from numpy arrays.

In your example above, the snippet @shoyer mentions should evaluate to autoclose=False. However, the profiling you mention seems to indicate the opposite. Perhaps we should start by digging deeper on that point.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  to_netcdf(compute=False) can be slow 334633212
399275847 https://github.com/pydata/xarray/issues/2242#issuecomment-399275847 https://api.github.com/repos/pydata/xarray/issues/2242 MDEyOklzc3VlQ29tbWVudDM5OTI3NTg0Nw== shoyer 1217238 2018-06-21T23:37:10Z 2018-06-21T23:37:10Z MEMBER

I suspect this can be improved. Looking at the code, it appears that we only intentionally use autoclose=True for writes when using multiprocessing or the distributed dask scheduler. https://github.com/pydata/xarray/blob/73b476e4db6631b2203954dd5b138cb650e4fb8c/xarray/backends/api.py#L709-L710
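
As a rough illustration of the rule described above, the decision could be sketched as follows. The function name `should_autoclose`, its parameters, and the scheduler strings are assumptions for illustration only; the linked lines in api.py are the authoritative version.

```python
def should_autoclose(has_dask_chunks, scheduler):
    """Hypothetical helper: autoclose only for chunked writes on
    schedulers that cannot share one open file handle."""
    # Scheduler names here are assumed labels, not a confirmed API.
    return bool(has_dask_chunks) and scheduler in ("distributed", "multiprocessing")
```

Under this sketch, a chunked write on a threaded scheduler would keep the file open between chunks, which is the behaviour expected in this issue.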

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  to_netcdf(compute=False) can be slow 334633212

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);
Powered by Datasette · Queries took 13.992ms · About: xarray-datasette