
issue_comments

2 rows where author_association = "MEMBER", issue = 435535284 and user = 2443309 sorted by updated_at descending

id: 534855337
html_url: https://github.com/pydata/xarray/issues/2912#issuecomment-534855337
issue_url: https://api.github.com/repos/pydata/xarray/issues/2912
node_id: MDEyOklzc3VlQ29tbWVudDUzNDg1NTMzNw==
user: jhamman (2443309)
created_at: 2019-09-25T05:12:32Z
updated_at: 2019-09-25T05:12:32Z
author_association: MEMBER
body:

@fsteinmetz - in my experience, the main thing to consider here is how and when xarray's backends lock/block for certain operations. The HDF5 library is not thread-safe, so we implement a global lock around all HDF5 read/write operations. In most cases, this means we can only do one read or one write at a time per process. We have found that using Dask's distributed (or multiprocessing) scheduler allows us to bypass the thread locks required by HDF5 by using multiple processes. We also need a per-file lock when writing, so using multiple output datasets theoretically allows for concurrent writes (provided your filesystem and OS support this).
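A minimal sketch of what this multi-process, multi-file pattern can look like, assuming a dask-backed dataset split cleanly along time; the input/output filenames, chunk size, and worker counts here are illustrative assumptions, not part of the original comment.

```python
import xarray as xr
from dask.distributed import Client

# Start a local cluster of worker *processes*; each process has its own
# HDF5 global lock, so reads/writes can proceed concurrently across workers.
client = Client(n_workers=4, threads_per_worker=1)

# Hypothetical dask-backed dataset, chunked along "time".
ds = xr.open_mfdataset("input_*.nc", chunks={"time": 100})

# Split the dataset into independent pieces and write each to its own file;
# separate output files mean the per-file write lock is never contended.
years, datasets = zip(*ds.groupby("time.year"))
paths = [f"output_{year}.nc" for year in years]
xr.save_mfdataset(datasets, paths)
```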

Finally, it's best not to jump to the complicated explanations first. If you have many small dask chunks in your dataset, both reading and writing will be quite inefficient. This is simply because there is some non-trivial overhead when accessing partial datasets, and it is even worse when the dataset is chunked/compressed.
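As a hedged illustration of consolidating many small dask chunks before writing: the dataset name, input pattern, and the chunk size of 365 steps along time are assumptions chosen only to show the idea.

```python
import xarray as xr

# Hypothetical dask-backed dataset that ends up with many small chunks.
ds = xr.open_mfdataset("input_*.nc")

# Consolidate into fewer, larger chunks so each read/write touches the
# underlying HDF5 file far fewer times.
ds = ds.chunk({"time": 365})

ds.to_netcdf("output.nc")
```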

Hope that helps.

reactions:
{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
issue: Writing a netCDF file is unexpectedly slow (435535284)
id: 485497398
html_url: https://github.com/pydata/xarray/issues/2912#issuecomment-485497398
issue_url: https://api.github.com/repos/pydata/xarray/issues/2912
node_id: MDEyOklzc3VlQ29tbWVudDQ4NTQ5NzM5OA==
user: jhamman (2443309)
created_at: 2019-04-22T18:06:56Z
updated_at: 2019-04-22T18:06:56Z
author_association: MEMBER
body:

Since the final dataset size is quite manageable, I would start by forcing computation before the write step:

```python
ncdat.load().to_netcdf(...)
```

While writing xarray datasets backed by dask is possible, it is a poorly optimized operation. Most of this comes from constraints in netCDF4/HDF5. There are ways to sidestep some of these challenges (save_mfdataset and the distributed dask scheduler), but they are probably overkill for this use case.
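For concreteness, a short sketch of the load-then-write path recommended above; the variable name ncdat comes from the comment, while the input pattern and output filename are hypothetical.

```python
import xarray as xr

# Hypothetical dask-backed dataset; in the original issue this is ncdat.
ncdat = xr.open_mfdataset("input_*.nc")

# Pull the (manageably sized) result into memory first, then do a single
# plain netCDF write that never goes through the dask-backed write path.
ncdat.load()
ncdat.to_netcdf("output.nc")  # hypothetical output path
```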

reactions:
{
    "total_count": 2,
    "+1": 2,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
issue: Writing a netCDF file is unexpectedly slow (435535284)

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);
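Given the schema above, a hedged sketch of reproducing this page's filter locally with Python's sqlite3 module; the database filename github.db is an assumption about where the Datasette database was saved.

```python
import sqlite3

# Hypothetical local copy of the Datasette SQLite database.
conn = sqlite3.connect("github.db")

# Same filter the page applies: MEMBER comments by user 2443309 on issue
# 435535284, newest first.
rows = conn.execute(
    """
    SELECT id, [user], created_at, body
    FROM issue_comments
    WHERE author_association = 'MEMBER'
      AND issue = 435535284
      AND [user] = 2443309
    ORDER BY updated_at DESC
    """
).fetchall()

for comment_id, user_id, created_at, body in rows:
    print(comment_id, user_id, created_at, body[:80])
```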