
issue_comments


5 rows where issue = 304201107 sorted by updated_at descending


user 3

  • jhamman 2
  • jmunroe 2
  • shoyer 1

author_association 2

  • MEMBER 3
  • CONTRIBUTOR 2

issue 1

  • use dask to open datasets in parallel · 5

373806224 · jmunroe (6181563) · CONTRIBUTOR · created 2018-03-16T18:34:19Z
https://github.com/pydata/xarray/issues/1981#issuecomment-373806224

distributed

Reactions: +1 × 2

373802503 · jhamman (2443309) · MEMBER · created 2018-03-16T18:21:20Z
https://github.com/pydata/xarray/issues/1981#issuecomment-373802503

@jmunroe - this is good to know. Have you been using the default scheduler (multiprocessing for dask.bag) or the distributed scheduler?
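
For context (an aside, not from the thread): dask.bag defaults to the process-based scheduler, and the choice can be overridden per compute() call. A minimal sketch:

    import dask.bag as db

    b = db.from_sequence(range(4)).map(lambda x: x * 2)

    if __name__ == "__main__":
        print(b.compute())                     # bag default: multiprocessing
        print(b.compute(scheduler="threads"))  # explicit per-call override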

Reactions: none

373794415 · jmunroe (6181563) · CONTRIBUTOR · created 2018-03-16T17:53:44Z
https://github.com/pydata/xarray/issues/1981#issuecomment-373794415

For what it's worth, this is exactly the workflow I use (https://github.com/OceansAus/cosima-cookbook) when opening a large number of netCDF files:

    import dask.bag
    import xarray as xr

    # ncfiles, chunks, and variables are defined earlier in the cookbook
    def load_variable(ncfile):
        return xr.open_dataset(ncfile, chunks=chunks,
                               decode_times=False)[variables]

    bag = dask.bag.from_sequence(ncfiles)
    bag = bag.map(load_variable)
    dataarrays = bag.compute()

and then

    dataarray = xr.concat(dataarrays, dim='time', coords='all')

and it appears to work well.

Code snippets are from cosima-cookbook/cosima_cookbook/netcdf_index.py.

Reactions: +1 × 2, hooray × 1

372316094 · jhamman (2443309) · MEMBER · created 2018-03-12T13:51:07Z
https://github.com/pydata/xarray/issues/1981#issuecomment-372316094

@shoyer - we can sidestep the global HDF lock if we use multiprocessing (or the distributed scheduler, as you mentioned) together with the autoclose option. This is the approach I took during my initial tests. It would be great if we could use threading too, but that seems less applicable given the current state of the HDF library.
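
A minimal sketch of that combination (not code from this thread; the file pattern is made up, and autoclose is the xarray option of that era, since removed in favor of automatic file handling):

    import glob

    import dask
    import dask.bag
    import xarray as xr

    def open_and_load(path):
        # autoclose=True closes the file after each read so a large number
        # of files doesn't exhaust open-file handles
        return xr.open_dataset(path, autoclose=True).load()

    if __name__ == "__main__":  # needed when tasks run in child processes
        ncfiles = sorted(glob.glob("output/*.nc"))  # hypothetical file pattern
        # the multiprocessing scheduler sidesteps the global HDF lock that
        # would serialize threads
        with dask.config.set(scheduler="processes"):
            datasets = dask.bag.from_sequence(ncfiles).map(open_and_load).compute()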

Reactions: none

372195137 · shoyer (1217238) · MEMBER · created 2018-03-12T05:09:16Z
https://github.com/pydata/xarray/issues/1981#issuecomment-372195137

I think this is definitely worth exploring and could potentially be a large win.

One potential challenge is global locking with HDF5. If opening many datasets is slow because much data needs to get read with HDF5, then multiple threads will not help -- you'll need to use multiple processes, e.g., with dask-distributed.
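
For illustration only (file pattern, worker counts, and concat dimension are all assumptions, not from the thread), a sketch of pushing the file opens onto process-based dask-distributed workers:

    import glob

    import xarray as xr
    from dask.distributed import Client

    def open_and_load(path):
        return xr.open_dataset(path).load()  # read eagerly inside the worker

    if __name__ == "__main__":
        # processes=True gives each worker its own process (and its own HDF5
        # library state), so one worker's global lock can't stall the others
        client = Client(processes=True, n_workers=4, threads_per_worker=1)
        ncfiles = sorted(glob.glob("output/*.nc"))  # hypothetical file pattern
        datasets = client.gather(client.map(open_and_load, ncfiles))
        combined = xr.concat(datasets, dim="time")  # assumed concat dimension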

Reactions: none

Table schema

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);
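
Given that schema, the view at the top of this page ("5 rows where issue = 304201107 sorted by updated_at descending") presumably corresponds to a query along these lines; the database filename is a guess:

    import sqlite3

    conn = sqlite3.connect("github.db")  # hypothetical path to this database
    rows = conn.execute(
        """
        SELECT id, [user], created_at, body
        FROM issue_comments
        WHERE issue = 304201107      -- filter shown in the page heading;
                                     -- served by idx_issue_comments_issue
        ORDER BY updated_at DESC
        """
    ).fetchall()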