issue_comments

9 rows where author_association = "MEMBER" and issue = 107424151 sorted by updated_at descending

user 3

  • shoyer 5
  • rabernat 3
  • clarkfitzg 1

issue 1

  • Parallel map/apply powered by dask.array · 9

author_association 1

  • MEMBER · 9
id html_url issue_url node_id user created_at updated_at ▲ author_association body reactions performed_via_github_app issue
336494114 https://github.com/pydata/xarray/issues/585#issuecomment-336494114 https://api.github.com/repos/pydata/xarray/issues/585 MDEyOklzc3VlQ29tbWVudDMzNjQ5NDExNA== shoyer 1217238 2017-10-13T15:58:30Z 2017-10-13T15:58:30Z MEMBER

@rabernat Agreed, let's open a new issue for that.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Parallel map/apply powered by dask.array 107424151
336489532 https://github.com/pydata/xarray/issues/585#issuecomment-336489532 https://api.github.com/repos/pydata/xarray/issues/585 MDEyOklzc3VlQ29tbWVudDMzNjQ4OTUzMg== rabernat 1197350 2017-10-13T15:41:32Z 2017-10-13T15:41:32Z MEMBER

This issue was closed by #1517. But there was plenty of discussion above about parallelizing groupby. Does #1517 make parallel groupby automatically work? My understanding is no. If that's the case, we probably need to open a new issue for parallel groupby.

cc @mrocklin

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Parallel map/apply powered by dask.array 107424151
324518345 https://github.com/pydata/xarray/issues/585#issuecomment-324518345 https://api.github.com/repos/pydata/xarray/issues/585 MDEyOklzc3VlQ29tbWVudDMyNDUxODM0NQ== shoyer 1217238 2017-08-24T02:52:26Z 2017-08-24T02:52:26Z MEMBER

I have a preliminary implementation up in https://github.com/pydata/xarray/pull/1517

{
    "total_count": 2,
    "+1": 2,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Parallel map/apply powered by dask.array 107424151
142482576 https://github.com/pydata/xarray/issues/585#issuecomment-142482576 https://api.github.com/repos/pydata/xarray/issues/585 MDEyOklzc3VlQ29tbWVudDE0MjQ4MjU3Ng== shoyer 1217238 2015-09-23T03:49:46Z 2017-03-07T05:32:28Z MEMBER

Indeed, there's no need to load the entire dataset into memory first. I think open_mfdataset is the model to emulate here -- it's parallelism that just works.

I'm not quite sure how to do this transparently in groupby operations yet. The problem is that you do want to apply some groupby operations on dask arrays without loading the entire group into memory, if there are only a few groups in a large dataset and the function itself is written in terms of dask operations. I think we will probably need some syntax to disambiguate that scenario.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Parallel map/apply powered by dask.array 107424151
249011817 https://github.com/pydata/xarray/issues/585#issuecomment-249011817 https://api.github.com/repos/pydata/xarray/issues/585 MDEyOklzc3VlQ29tbWVudDI0OTAxMTgxNw== shoyer 1217238 2016-09-22T20:00:57Z 2016-09-22T20:00:57Z MEMBER

I think #964 provides a viable path forward here.

Previously, I was imagining that the user provides a function that maps xarray.DataArray -> xarray.DataArray. Such functions are tricky to parallelize with dask.array because we need to run them to figure out the result dimensions/coordinates.

In contrast, with a user-defined function ndarray -> ndarray, it's fairly straightforward to parallelize it with dask.array (e.g., using dask.array.elemwise or dask.array.map_blocks). Then we could add the metadata back in afterwards with #964.
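
A minimal sketch of the ndarray -> ndarray route, assuming dask and NumPy are installed. The function `demean` and the sample data are hypothetical, chosen only for illustration; `dask.array.from_array` and `dask.array.map_blocks` are the real dask calls involved. The chunking keeps each row whole inside one block so the per-row mean is computed correctly.

```python
import numpy as np
import dask.array as da


def demean(block: np.ndarray) -> np.ndarray:
    """Plain ndarray -> ndarray function; it knows nothing about dask."""
    return block - block.mean(axis=-1, keepdims=True)


# Hypothetical data: 4 rows of 3 values, split into two blocks of 2 rows each.
x = da.from_array(np.arange(12.0).reshape(4, 3), chunks=(2, 3))

# map_blocks applies demean to every block independently and lazily.
y = da.map_blocks(demean, x, dtype=x.dtype)

# Nothing has run yet; compute() executes the per-block tasks.
result = y.compute()
```

The function never sees a dask object. Dimension names and coordinates would have to be re-attached separately, which is the step the comment defers to #964.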

In principle, we could do this automatically -- especially if dask had a way to parallelize arbitrary NumPy generalized universal functions. Then the user could write something like xarray.apply(func, data, signature=signature, dask_array='auto') to automatically parallelize func over their data. In fact, I had this in some previous commits for #964, but took it out for now, just to reduce scope for the change.
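
The `xarray.apply(...)` call in the comment is a proposal from this thread, not necessarily a released signature. For comparison, here is a hedged sketch using `xarray.apply_ufunc`, which later xarray releases provide with a `dask` argument. It assumes xarray, dask, and NumPy are installed; the data, dimension names, and the `demean` function are invented for illustration.

```python
import numpy as np
import xarray as xr


def demean(arr: np.ndarray) -> np.ndarray:
    # apply_ufunc moves the core dimension ("time") to the last axis.
    return arr - arr.mean(axis=-1, keepdims=True)


# Hypothetical labeled data, chunked only along "x" so that the core
# dimension "time" stays inside a single chunk.
data = xr.DataArray(
    np.arange(12.0).reshape(4, 3),
    dims=("x", "time"),
    coords={"x": [10, 20, 30, 40]},
).chunk({"x": 2})

result = xr.apply_ufunc(
    demean,
    data,
    input_core_dims=[["time"]],
    output_core_dims=[["time"]],
    dask="parallelized",        # let xarray wrap the function for dask
    output_dtypes=[data.dtype],  # declared up front so nothing runs eagerly
)

computed = result.compute()
```

Here the signature is expressed through `input_core_dims` and `output_core_dims`, and the labels on `x` are carried through to the result without the user function handling them.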

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Parallel map/apply powered by dask.array 107424151
248979862 https://github.com/pydata/xarray/issues/585#issuecomment-248979862 https://api.github.com/repos/pydata/xarray/issues/585 MDEyOklzc3VlQ29tbWVudDI0ODk3OTg2Mg== rabernat 1197350 2016-09-22T18:00:24Z 2016-09-22T18:00:24Z MEMBER

Does #964 help on this?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Parallel map/apply powered by dask.array 107424151
226262120 https://github.com/pydata/xarray/issues/585#issuecomment-226262120 https://api.github.com/repos/pydata/xarray/issues/585 MDEyOklzc3VlQ29tbWVudDIyNjI2MjEyMA== shoyer 1217238 2016-06-15T17:37:11Z 2016-06-15T17:37:11Z MEMBER

With the single machine version of dask, we need to run one block first to infer the appropriate metadata for constructing the combined dataset.

Potentially a better approach would be to optionally leverage dask.distributed, which has the ability to run computation at the same time as graph construction. map_blocks could then kick off a bunch of map tasks to execute in parallel, and only worry about reassembling the blocks in a reduce after the results have come in.
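
The two strategies in this comment can be contrasted in a small sketch. It uses `concurrent.futures` from the standard library as a stand-in for a distributed scheduler, so it illustrates the idea only and is not dask's or dask.distributed's API. The function `func` and the blocks are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def func(block: np.ndarray) -> np.ndarray:
    # Hypothetical user function: output shape and dtype are not known up front.
    return block.sum(axis=1, keepdims=True).astype("float32")


blocks = [np.ones((2, 3)), np.ones((2, 3)) * 2, np.ones((1, 3)) * 3]

# Strategy 1: run one block eagerly to learn the output metadata
# before describing the rest of the computation.
probe = func(blocks[0])
meta = {"dtype": probe.dtype, "trailing_shape": probe.shape[1:]}

# Strategy 2: submit every block right away and only reassemble
# (the "reduce" step) once the results have come in.
with ThreadPoolExecutor(max_workers=3) as pool:
    futures = [pool.submit(func, b) for b in blocks]
    results = [f.result() for f in futures]

combined = np.concatenate(results, axis=0)
```

In the second strategy the metadata falls out of the finished results, so no separate probing run is needed, at the cost of starting computation before the full result is described.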

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Parallel map/apply powered by dask.array 107424151
143692567 https://github.com/pydata/xarray/issues/585#issuecomment-143692567 https://api.github.com/repos/pydata/xarray/issues/585 MDEyOklzc3VlQ29tbWVudDE0MzY5MjU2Nw== rabernat 1197350 2015-09-28T09:43:17Z 2015-09-28T09:43:17Z MEMBER

:+1: Very useful idea!

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Parallel map/apply powered by dask.array 107424151
142480620 https://github.com/pydata/xarray/issues/585#issuecomment-142480620 https://api.github.com/repos/pydata/xarray/issues/585 MDEyOklzc3VlQ29tbWVudDE0MjQ4MDYyMA== clarkfitzg 5356122 2015-09-23T03:32:23Z 2015-09-23T03:32:23Z MEMBER

But do the xray objects have to exist in memory? I was thinking this could also work along with open_mfdataset. It just loads and operates on the chunk it needs.

Like the idea of applying this to groupby objects. I wonder if it could be done transparently to the user...

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Parallel map/apply powered by dask.array 107424151

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);
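
As a rough illustration of the filter described at the top of this page (author_association = "MEMBER", issue = 107424151, sorted by updated_at descending), the sketch below runs the equivalent query with Python's built-in `sqlite3` module. It assumes an in-memory database with a reduced copy of the schema (only the columns the query touches); three rows reuse ids and timestamps from the table above, and one row is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Reduced version of the issue_comments schema: only the columns used here.
conn.execute(
    """
    CREATE TABLE [issue_comments] (
       [id] INTEGER PRIMARY KEY,
       [user] INTEGER,
       [updated_at] TEXT,
       [author_association] TEXT,
       [issue] INTEGER
    )
    """
)
conn.execute(
    "CREATE INDEX [idx_issue_comments_issue] ON [issue_comments] ([issue])"
)

rows = [
    (324518345, 1217238, "2017-08-24T02:52:26Z", "MEMBER", 107424151),
    (336494114, 1217238, "2017-10-13T15:58:30Z", "MEMBER", 107424151),
    (336489532, 1197350, "2017-10-13T15:41:32Z", "MEMBER", 107424151),
    # Hypothetical non-matching row, added only to show the filter working.
    (1, 42, "2018-01-01T00:00:00Z", "NONE", 107424151),
]
conn.executemany("INSERT INTO issue_comments VALUES (?, ?, ?, ?, ?)", rows)

# ISO 8601 timestamps stored as TEXT sort correctly as plain strings.
query = """
    SELECT id FROM issue_comments
    WHERE author_association = ? AND issue = ?
    ORDER BY updated_at DESC
"""
ids = [row[0] for row in conn.execute(query, ("MEMBER", 107424151))]
print(ids)
```

The index on `[issue]` mirrors `idx_issue_comments_issue` from the schema and lets SQLite narrow the scan to one issue before filtering and sorting.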