issue_comments

3 rows where author_association = "CONTRIBUTOR", issue = 1581046647 and user = 39069044 sorted by updated_at descending


id html_url issue_url node_id user created_at updated_at ▲ author_association body reactions performed_via_github_app issue
1450841385 https://github.com/pydata/xarray/issues/7522#issuecomment-1450841385 https://api.github.com/repos/pydata/xarray/issues/7522 IC_kwDOAMm_X85WehUp slevang 39069044 2023-03-01T21:01:48Z 2023-03-01T21:01:48Z CONTRIBUTOR

Yeah, that seems to be it. Dask's write neatly packs all the needed metadata at the beginning of the file: we can scale this up to a file of many GB with dozens of variables and still read it in ~100 ms. xarray, by contrast, writes the metadata in a less organized way, so we have to go seeking in the middle of the byte range. cache_type="first" does provide some improvement, but it is still not as good as with the dask-written file.
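To make the point about metadata placement concrete, here is a toy simulation in plain Python. It is not fsspec's implementation of cache_type="first"; the class name `FirstBlockCache`, the block size, and the read offsets are all made up for illustration. It only counts how many simulated remote requests are needed when small reads stay inside the first block versus when they are scattered through the file.

```python
class FirstBlockCache:
    """Toy read cache that keeps only the first `block_size` bytes in memory.

    Hypothetical illustration only; this is not how any real library is
    implemented.
    """

    def __init__(self, data: bytes, block_size: int):
        self._data = data            # stands in for a remote object
        self._block_size = block_size
        self._first = None           # cached first block, fetched lazily
        self.remote_fetches = 0      # number of simulated remote requests

    def _fetch(self, start: int, end: int) -> bytes:
        self.remote_fetches += 1
        return self._data[start:end]

    def read(self, start: int, end: int) -> bytes:
        end = min(end, len(self._data))
        if start >= end:
            return b""
        if end <= self._block_size:
            # Entirely inside the first block: serve from the cache.
            if self._first is None:
                self._first = self._fetch(0, self._block_size)
            return self._first[start:end]
        # Anything else costs a separate remote request.
        return self._fetch(start, end)


payload = bytes(range(256)) * 4096  # 1 MiB of dummy data

# Metadata packed at the start: every small read hits the cached block.
packed = FirstBlockCache(payload, block_size=64 * 1024)
for offset in (0, 512, 4096, 30_000):
    packed.read(offset, offset + 256)

# Metadata scattered through the file: most reads miss the cache.
scattered = FirstBlockCache(payload, block_size=64 * 1024)
for offset in (0, 200_000, 500_000, 900_000):
    scattered.read(offset, offset + 256)

print(packed.remote_fetches, scattered.remote_fetches)
```

In this sketch the packed layout needs a single request while the scattered one needs one per out-of-block read, which matches the qualitative behaviour described above.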

FWIW, I inspected the actual bytes of the dask-written and xarray-written files: they are identical for a single variable but diverge when multiple variables are written. So the important differences are probably associated with this step:

It does set up the whole set of variables as an initialisation stage before writing any data - I don't know if xarray does this.
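For anyone wanting to repeat the byte-level comparison, a minimal sketch using only the standard library might look like the following. The helper name `first_difference` and the file paths in the usage comment are placeholders, not something from the original investigation.

```python
def first_difference(path_a, path_b, chunk_size=1 << 16):
    """Return the offset of the first differing byte, or None if identical.

    If one file is a strict prefix of the other, the length of the
    shorter file is returned.
    """
    offset = 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(chunk_size)
            b = fb.read(chunk_size)
            if a != b:
                # Locate the exact byte inside this chunk.
                for i, (x, y) in enumerate(zip(a, b)):
                    if x != y:
                        return offset + i
                # Chunks agree up to the shorter one: lengths differ.
                return offset + min(len(a), len(b))
            if not a:
                return None  # both files ended without a difference
            offset += len(a)


# Usage (placeholder paths):
# first_difference("dask_written.nc", "xarray_written.nc")
```

Running this on a single-variable pair and a multi-variable pair would show whether, and where, the two writes start to diverge.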

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Differences in `to_netcdf` for dask and numpy backed arrays 1581046647
1449302032 https://github.com/pydata/xarray/issues/7522#issuecomment-1449302032 https://api.github.com/repos/pydata/xarray/issues/7522 IC_kwDOAMm_X85WYpgQ slevang 39069044 2023-03-01T04:04:25Z 2023-03-01T04:04:25Z CONTRIBUTOR

The slow file:

And the fast file:

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Differences in `to_netcdf` for dask and numpy backed arrays 1581046647
1428872842 https://github.com/pydata/xarray/issues/7522#issuecomment-1428872842 https://api.github.com/repos/pydata/xarray/issues/7522 IC_kwDOAMm_X85VKt6K slevang 39069044 2023-02-13T23:49:31Z 2023-02-13T23:49:31Z CONTRIBUTOR

I did try many loops and different orders of operations to make sure this isn't a caching or auth issue. You can see that the std dev of the timeit calls above is pretty consistent.

For my actual use case, the difference is very apparent, with open_dataset taking about 9 seconds on the numpy-saved file and <1 second on the dask-saved one. I can also clearly see when monitoring network traffic that the slow version has to read in hundreds of MB of data to open the dataset, while the fast one only reads the tiny headers.
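A rough way to reproduce this kind of timing comparison is sketched below. The helper `time_call` is hypothetical, and the path in the usage comment is a placeholder; the actual measurements in this thread were made with timeit.

```python
import statistics
import time


def time_call(func, repeat=7):
    """Call `func` `repeat` times; return (mean, std dev) in seconds.

    `repeat` must be at least 2 so that a standard deviation is defined.
    """
    samples = []
    for _ in range(repeat):
        start = time.perf_counter()
        func()
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples), statistics.stdev(samples)


# Stand-in workload; for the real comparison the callable would wrap the
# open call on each file, e.g. lambda: xr.open_dataset(path)
# (path is a placeholder here).
mean, std = time_call(lambda: sum(range(100_000)))
print(f"{mean:.6f} s ± {std:.6f} s")
```

A small, stable std dev across repeats is what suggests the gap comes from the file layout rather than from caching or authentication.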

I also inspected the actual header bytes of these two files and can see that they are indeed different.
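One simple way to look at header bytes is to dump the first few dozen bytes of each file as hex. This is only a sketch: the helper names `read_header` and `hex_rows`, the 64-byte default, and the paths in the usage comment are assumptions for illustration.

```python
def read_header(path, n=64):
    """Return the first `n` bytes of the file at `path`."""
    with open(path, "rb") as f:
        return f.read(n)


def hex_rows(data: bytes, width: int = 16):
    """Format bytes as rows of 'offset  hex hex ...' strings."""
    rows = []
    for start in range(0, len(data), width):
        chunk = data[start:start + width]
        rows.append(f"{start:08x}  {chunk.hex(' ')}")
    return rows


# Usage (placeholder paths):
# for row in hex_rows(read_header("numpy_saved.nc")):
#     print(row)
# for row in hex_rows(read_header("dask_saved.nc")):
#     print(row)
```

Printing both dumps side by side makes it easy to spot where the two headers stop matching.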

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Differences in `to_netcdf` for dask and numpy backed arrays 1581046647

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);