issue_comments

8 rows where issue = 711626733, sorted by updated_at descending

Facets:
  • user: shoyer 4, max-sixty 3, bmorris3 1
  • author_association: MEMBER 7, NONE 1
  • issue: Wrap numpy-groupies to speed up Xarray's groupby aggregations 8

bmorris3 (NONE) · 2022-02-21T14:25:47Z · https://github.com/pydata/xarray/issues/4473#issuecomment-1046935567

Hi @shoyer, thanks for this neat trick! What happens when bins is a sequence of bin edges, rather than a number of bins? Your example seems to break and I'm not sure how to fix it. Thanks again!
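
A hypothetical workaround, not from the thread: numpy-groupies expects integer group indices, so an explicit sequence of bin edges can be converted to a `group_idx` with `np.digitize` before aggregating. A minimal sketch:

```python
import numpy as np
import numpy_groupies as npg

# Hypothetical adaptation for explicit bin edges: np.digitize maps each
# value to a bin index, which numpy-groupies accepts as group_idx.
data = np.random.rand(1000)
edges = np.array([0.0, 0.25, 0.5, 1.0])   # sequence of bin edges
group_idx = np.digitize(data, edges) - 1  # 0-based bin index per data point
binned_sums = npg.aggregate(group_idx, data, func="sum")
```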

shoyer (MEMBER) · 2020-10-19T01:30:48Z · https://github.com/pydata/xarray/issues/4473#issuecomment-711461331

> What's the best way of reconstituting the coords etc., after npg produces the array?

I think we can reuse the existing logic from the _combine method here: https://github.com/pydata/xarray/blob/97e26257e81b0ba35af4a34be43a3e9cc666b9bc/xarray/core/groupby.py#L830

This just gives us an alternative way to calculate `applied` (a rough illustrative sketch follows at the end of this comment).

> Presumably we're going to have a fairly different design for this than the existing groupby operations — that design is very nested — wrapping functions and eventually calling .map to loop over each group in python.

Agreed. Hopefully this can live alongside in the GroupBy objects.

> Presumably we're going to need to keep the existing logic around for dask — is it reasonable for an initial version to defer to the existing logic for all dask arrays? (+ @shoyer's thoughts above on this)

Yes, I agree that we should do this incrementally.
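
For illustration, a minimal sketch of that first point, rebuilding the output coordinates by hand using only public API (hypothetical; the suggestion above is to reuse the internal `_combine` logic instead, and `npg_groupby_with_coords` is an invented name):

```python
import pandas as pd
import numpy_groupies as npg
import xarray as xr

def npg_groupby_with_coords(da: xr.DataArray, dim: str, func: str = "sum") -> xr.DataArray:
    # Integer code for each position along `dim`, plus the unique labels.
    # Note: factorize returns labels in order of first appearance, whereas
    # groupby returns them sorted.
    group_idx, labels = pd.factorize(da.indexes[dim])
    result = npg.aggregate(group_idx, da.values, func=func, axis=da.get_axis_num(dim))
    # Rebuild the DataArray: `dim` now holds the unique labels; the other
    # coordinates carry over (assuming none of them depend on `dim`).
    coords = {k: v for k, v in da.coords.items() if k != dim}
    coords[dim] = labels
    return xr.DataArray(result, dims=da.dims, coords=coords)
```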

shoyer (MEMBER) · 2020-10-19T01:27:50Z · https://github.com/pydata/xarray/issues/4473#issuecomment-711460703

Something like the resample test case from https://github.com/pydata/xarray/issues/4498 might be a good example for finding 100x speed-ups. The main feature of that case is that there are a very large number of groups (only slightly fewer groups than original data points).
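
A hypothetical example in that spirit (sizes and frequencies are illustrative, not taken from #4498):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Minute-frequency data resampled to 5-minute bins: 100,000 groups for
# 500,000 points. Pushing the rule toward the data's own frequency drives
# the group count toward the number of points, where the per-group Python
# overhead of the current path dominates.
times = pd.date_range("2000-01-01", periods=500_000, freq="1min")
da = xr.DataArray(np.random.randn(times.size), coords={"time": times}, dims="time")

result = da.resample(time="5min").mean()
```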

max-sixty (MEMBER) · 2020-10-19T01:19:08Z · https://github.com/pydata/xarray/issues/4473#issuecomment-711458608

Here's a very quick POC:

```python
import numpy as np
import pandas as pd
import xarray as xr
from numpy_groupies.aggregate_numba import aggregate

def npg_groupby(da: xr.DataArray, dim, func='sum'):
    group_idx, labels = pd.factorize(da.indexes[dim])
    axis = da.get_axis_num(dim)
    array = aggregate(group_idx=group_idx, a=da, func=func, axis=axis)
    return array
```

Run on this array:

```python
size_factor = 1000

da = xr.DataArray(
    np.arange(1440 * size_factor).reshape(45 * size_factor, 8, 4),
    dims=("x", "y", "z"),
    coords=dict(
        x=list(range(45)) * size_factor,
        y=[1, 2, 3, 4] * 2,
        z=[1, 2] * 2,
    ),
)
```

It's about 2x as fast, though only generates the numpy array:

```python
%%timeit
npg_groupby(da, 'x')
# 15 ms ± 130 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

```python
%%timeit
da.groupby('x').sum()
# 37.6 ms ± 244 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

Any thoughts on any of:

  • What's the best way of reconstituting the coords etc., after npg produces the array?
  • Presumably we're going to have a fairly different design for this than the existing groupby operations — that design is very nested — wrapping functions and eventually calling .map to loop over each group in python.
  • Presumably we're going to need to keep the existing logic around for dask — is it reasonable for an initial version to defer to the existing logic for all dask arrays? (+ @shoyer's thoughts above on this)

max-sixty (MEMBER) · 2020-10-01T19:07:39Z · https://github.com/pydata/xarray/issues/4473#issuecomment-702339435

> I'm not entirely sure, but I suspect something like the approach in #4184 might be more directly relevant for speeding up unstack (at least with NumPy arrays).

Great. I need to think through how to do that — the approach of using MultiIndex codes to index the array directly is very elegant — I'll try applying it to stack / unstack as a project.
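
A toy sketch of that "index the array directly with MultiIndex codes" idea (hypothetical; see #4184 for the actual implementation):

```python
import numpy as np
import pandas as pd

# A stacked 1-D vector labelled by a two-level MultiIndex.
idx = pd.MultiIndex.from_product([["a", "b", "c"], [0, 1]])
stacked = np.arange(6.0)

# Unstack by scattering with the integer codes: a single vectorized
# assignment instead of a Python loop over groups.
unstacked = np.full((len(idx.levels[0]), len(idx.levels[1])), np.nan)
unstacked[idx.codes[0], idx.codes[1]] = stacked
```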

shoyer (MEMBER) · 2020-10-01T07:57:58Z · https://github.com/pydata/xarray/issues/4473#issuecomment-701961598

> Highly speculative, but would this also be a faster approach to stacking & unstacking? "Form ~5~ 4" in the readme.

I'm not entirely sure, but I suspect something like the approach in https://github.com/pydata/xarray/pull/4184 might be more directly relevant for speeding up unstack (at least with NumPy arrays).

max-sixty (MEMBER) · created 2020-09-30T21:22:40Z, updated 2020-10-01T05:56:13Z · https://github.com/pydata/xarray/issues/4473#issuecomment-701653835

This looks amazing! Thanks for finding it.

Highly speculative, but would this also be a faster approach to stacking & unstacking? "Form ~5~ 4" in the readme.

shoyer (MEMBER) · 2020-09-30T19:52:05Z · https://github.com/pydata/xarray/issues/4473#issuecomment-701609035

A prototype implementation of the core functionality here can be found in: https://nbviewer.jupyter.org/gist/shoyer/6d6c82bbf383fb717cc8631869678737


CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);