issue_comments

9 rows where issue = 117039129 sorted by updated_at descending

user 5

  • shoyer 4
  • anntzer 2
  • andersy005 1
  • jjpr-mit 1
  • lanougue 1

author_association 3

  • MEMBER 5
  • CONTRIBUTOR 2
  • NONE 2

issue 1

  • groupby very slow compared to pandas · 9
id html_url issue_url node_id user created_at updated_at ▲ author_association body reactions performed_via_github_app issue
1126083413 https://github.com/pydata/xarray/issues/659#issuecomment-1126083413 https://api.github.com/repos/pydata/xarray/issues/659 IC_kwDOAMm_X85DHqtV andersy005 13301940 2022-05-13T13:55:20Z 2022-05-13T13:55:20Z MEMBER

#5734 has greatly improved the performance. Fantastic work, @dcherian 👏🏽

```python
In [13]: import xarray as xr, pandas as pd, numpy as np

In [14]: ds = xr.Dataset(
    ...:     {"a": xr.DataArray(np.r_[np.arange(500.), np.arange(500.)]),
    ...:      "b": xr.DataArray(np.arange(1000.))}
    ...: )

In [15]: ds
Out[15]:
<xarray.Dataset>
Dimensions:  (dim_0: 1000)
Dimensions without coordinates: dim_0
Data variables:
    a        (dim_0) float64 0.0 1.0 2.0 3.0 4.0 ... 496.0 497.0 498.0 499.0
    b        (dim_0) float64 0.0 1.0 2.0 3.0 4.0 ... 996.0 997.0 998.0 999.0
```

```python
In [16]: xr.set_options(use_flox=True)
Out[16]: <xarray.core.options.set_options at 0x104de21a0>

In [17]: %%timeit
    ...: ds.groupby("a").mean()
    ...:
1.5 ms ± 3.3 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [18]: xr.set_options(use_flox=False)
Out[18]: <xarray.core.options.set_options at 0x144382350>

In [19]: %%timeit
    ...: ds.groupby("a").mean()
    ...:
94 ms ± 715 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

{
    "total_count": 4,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 4,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  groupby very slow compared to pandas 117039129
522592263 https://github.com/pydata/xarray/issues/659#issuecomment-522592263 https://api.github.com/repos/pydata/xarray/issues/659 MDEyOklzc3VlQ29tbWVudDUyMjU5MjI2Mw== lanougue 32069530 2019-08-19T14:09:36Z 2019-08-19T14:09:36Z NONE

I had a look at functions such as np.add.at, which can be much faster than a home-made solution. The aggregate function of the "numpy-groupies" package is even faster (25x faster than np.add.at in my case). Maybe xarray's groupby functionality could rely on such an effective package.
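
A rough sketch of the comparison (my illustration, not code from this thread; the numpy-groupies package and all array names are assumptions):

```python
import numpy as np
import numpy_groupies as npg  # assumes the numpy-groupies package is installed

rng = np.random.default_rng(0)
values = rng.random(1_000_000)                   # illustrative data
labels = rng.integers(0, 50, size=values.size)   # 50 illustrative groups

# Pure-NumPy scatter-add: correct, but the unbuffered inner loop is slow.
sums_at = np.zeros(50)
np.add.at(sums_at, labels, values)

# numpy-groupies dispatches the same reduction to an optimized backend.
sums_npg = npg.aggregate(labels, values, func="sum", size=50)

np.testing.assert_allclose(sums_at, sums_npg)
```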

{
    "total_count": 2,
    "+1": 2,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  groupby very slow compared to pandas 117039129
334212532 https://github.com/pydata/xarray/issues/659#issuecomment-334212532 https://api.github.com/repos/pydata/xarray/issues/659 MDEyOklzc3VlQ29tbWVudDMzNDIxMjUzMg== jjpr-mit 25231875 2017-10-04T16:27:21Z 2017-10-04T16:27:21Z NONE

In case anyone gets here by Googling something like "xarray groupby slow" and you loaded data from a netCDF file: the slowness you see in a groupby aggregation on a Dataset or DataArray may be due not to this issue but to the lazy loading that is done by default. This can be fixed by calling .load() on the Dataset or DataArray. See the tip about lazy loading at http://xarray.pydata.org/en/stable/io.html#netcdf.
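
A minimal sketch of that fix (the file path and grouping key here are hypothetical):

```python
import xarray as xr

ds = xr.open_dataset("data.nc")   # hypothetical file; opened lazily by default
ds.load()                         # read the data into memory once, up front
result = ds.groupby("time.month").mean()  # aggregation no longer re-reads from disk
```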

{
    "total_count": 9,
    "+1": 6,
    "-1": 0,
    "laugh": 0,
    "hooray": 1,
    "confused": 0,
    "heart": 1,
    "rocket": 1,
    "eyes": 0
}
  groupby very slow compared to pandas 117039129
200417621 https://github.com/pydata/xarray/issues/659#issuecomment-200417621 https://api.github.com/repos/pydata/xarray/issues/659 MDEyOklzc3VlQ29tbWVudDIwMDQxNzYyMQ== shoyer 1217238 2016-03-23T16:13:32Z 2016-03-23T16:13:32Z MEMBER

Another approach here (rather than writing something new with Numba) would be to write a pure NumPy engine for groupby that relies on reordering data and np.add.accumulate. This could yield performance within a factor of 2-3x of pandas. See this comment for an example: https://github.com/numpy/numpy/issues/7265#issuecomment-198796408
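
A minimal sketch of that idea (my illustration, not the linked code): sort values so each group is contiguous, take a single running sum with np.add.accumulate, and difference it at the group boundaries:

```python
import numpy as np

def grouped_sum(values, labels):
    # Reorder so each group's values are contiguous.
    order = np.argsort(labels, kind="stable")
    sorted_vals = values[order]
    sorted_labels = labels[order]
    # Last position of each group in the sorted arrays.
    last_before = np.flatnonzero(np.diff(sorted_labels))
    ends = np.r_[last_before, len(sorted_labels) - 1]
    # One cumulative sum; per-group totals are its differences at the ends.
    cumsum = np.add.accumulate(sorted_vals)
    totals = cumsum[ends]             # fancy indexing copies, safe to mutate
    totals[1:] -= cumsum[last_before]
    return sorted_labels[ends], totals

rng = np.random.default_rng(0)
values = rng.random(1_000_000)
labels = rng.integers(0, 50, size=values.size)
uniques, totals = grouped_sum(values, labels)
```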

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  groupby very slow compared to pandas 117039129
157130467 https://github.com/pydata/xarray/issues/659#issuecomment-157130467 https://api.github.com/repos/pydata/xarray/issues/659 MDEyOklzc3VlQ29tbWVudDE1NzEzMDQ2Nw== shoyer 1217238 2015-11-16T18:37:51Z 2015-11-16T18:37:51Z MEMBER

Agreed! If you'd like to make a pull request, that would be greatly appreciated.


{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  groupby very slow compared to pandas 117039129
156925589 https://github.com/pydata/xarray/issues/659#issuecomment-156925589 https://api.github.com/repos/pydata/xarray/issues/659 MDEyOklzc3VlQ29tbWVudDE1NjkyNTU4OQ== anntzer 1322974 2015-11-16T06:10:25Z 2015-11-16T06:10:25Z CONTRIBUTOR

Perhaps worth mentioning in the docs? The difference turned out to be a major bottleneck in my code.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  groupby very slow compared to pandas 117039129
156921310 https://github.com/pydata/xarray/issues/659#issuecomment-156921310 https://api.github.com/repos/pydata/xarray/issues/659 MDEyOklzc3VlQ29tbWVudDE1NjkyMTMxMA== shoyer 1217238 2015-11-16T05:40:09Z 2015-11-16T05:40:09Z MEMBER

Yes, switching to pandas for these operations is certainly a recommended approach :).
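
A minimal sketch of that workaround (illustrative names, not code from the thread):

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {"a": ("x", np.repeat(np.arange(5.0), 200)),
     "b": ("x", np.arange(1000.0))}
)
# Do the grouped reduction in pandas, whose groupby runs in optimized C code.
result = ds.to_dataframe().groupby("a").mean()
```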

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  groupby very slow compared to pandas 117039129
156917053 https://github.com/pydata/xarray/issues/659#issuecomment-156917053 https://api.github.com/repos/pydata/xarray/issues/659 MDEyOklzc3VlQ29tbWVudDE1NjkxNzA1Mw== anntzer 1322974 2015-11-16T05:14:50Z 2015-11-16T05:14:50Z CONTRIBUTOR

In my case I could just switch to pandas, so I'll leave it as it is for now.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  groupby very slow compared to pandas 117039129
156915727 https://github.com/pydata/xarray/issues/659#issuecomment-156915727 https://api.github.com/repos/pydata/xarray/issues/659 MDEyOklzc3VlQ29tbWVudDE1NjkxNTcyNw== shoyer 1217238 2015-11-16T04:57:24Z 2015-11-16T04:57:24Z MEMBER

Yes, I'm afraid this is a known issue. Grouped aggregations are currently implemented with a loop in pure Python, which, of course, is pretty slow.

I've done some exploratory work to rewrite them in Numba, which shows some encouraging preliminary results:

```
from numba import guvectorize, jit
import pandas as pd
import numpy as np

@guvectorize(['(float64[:], int64[:], float64[:])'], '(x),(x),(y)', nopython=True)
def _grouped_mean(values, int_labels, target):
    count = np.zeros(len(target), np.int64)
    for i in range(len(values)):
        val = values[i]
        if not np.isnan(val):
            lab = int_labels[i]
            target[lab] += val
            count[lab] += 1
    target /= count

def move_axis_to_end(array, axis):
    array = np.asarray(array)
    return np.rollaxis(array, axis, start=array.ndim)

def grouped_mean(values, by, axis=-1):
    int_labels, uniques = pd.factorize(by, sort=True)
    values = move_axis_to_end(values, axis)
    target = np.zeros(values.shape[:-1] + uniques.shape)
    _grouped_mean(values, int_labels, target)
    return target, uniques

values = np.random.RandomState(0).rand(int(1e6))
values[::50] = np.nan
by = np.random.randint(50, size=int(1e6))
df = pd.DataFrame({'x': values, 'y': by})

np.testing.assert_allclose(grouped_mean(values, by)[0],
                           df.groupby('y')['x'].mean())

%timeit grouped_mean(values, by)  # 100 loops, best of 3: 15.3 ms per loop
%timeit df.groupby('y').mean()    # 10 loops, best of 3: 21.4 ms per loop
```

Unfortunately, I'm unlikely to have time to work on this in the near future. If you or anyone else is interested in taking the lead on this, it would be greatly appreciated!

Note that we can't reuse the routines from pandas because they are only designed for 1D or at most 2D data.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  groupby very slow compared to pandas 117039129

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);
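
Given that schema, the query behind this page (9 rows where issue = 117039129, sorted by updated_at descending) can be reproduced from Python; the database filename here is an assumption:

```python
import sqlite3

conn = sqlite3.connect("github.db")  # hypothetical local copy of this database
rows = conn.execute(
    """
    SELECT id, user, created_at, updated_at, author_association, body
    FROM issue_comments
    WHERE issue = ?
    ORDER BY updated_at DESC
    """,
    (117039129,),
).fetchall()
```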