issue_comments: 156915727


html_url issue_url id node_id user created_at updated_at author_association body reactions performed_via_github_app issue
https://github.com/pydata/xarray/issues/659#issuecomment-156915727 https://api.github.com/repos/pydata/xarray/issues/659 156915727 MDEyOklzc3VlQ29tbWVudDE1NjkxNTcyNw== 1217238 2015-11-16T04:57:24Z 2015-11-16T04:57:24Z MEMBER

Yes, I'm afraid this is a known issue. Grouped aggregations are currently implemented with a loop in pure Python, which, of course, is pretty slow.
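For illustration only, here is a minimal sketch of what a Python-level grouping loop looks like. This is not xarray's actual implementation; the function name `grouped_mean_loop` and the sample data are hypothetical. It shows where the overhead comes from: every group costs one interpreter-level iteration plus a full scan of the input.

```python
import numpy as np


def grouped_mean_loop(values, by):
    """Hypothetical illustration: NaN-skipping mean per group via a Python loop."""
    values = np.asarray(values, dtype=float)
    by = np.asarray(by)
    uniques = np.unique(by)
    means = np.empty(len(uniques))
    # One Python-level iteration per group; each builds a boolean mask
    # over the whole array before reducing the selected elements.
    for k, label in enumerate(uniques):
        means[k] = np.nanmean(values[by == label])
    return means, uniques


values = np.array([1.0, 2.0, np.nan, 4.0, 5.0, 6.0])
by = np.array([0, 1, 0, 1, 0, 1])
means, uniques = grouped_mean_loop(values, by)
```

With many groups, the per-group masking and interpreter overhead dominate the runtime, which is the cost a compiled single-pass kernel avoids.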

I've done some exploratory work to rewrite them in Numba, which shows some encouraging preliminary results:

```
from numba import guvectorize, jit
import pandas as pd
import numpy as np

@guvectorize(['(float64[:], int64[:], float64[:])'], '(x),(x),(y)', nopython=True)
def _grouped_mean(values, int_labels, target):
    count = np.zeros(len(target), np.int64)
    for i in range(len(values)):
        val = values[i]
        if not np.isnan(val):
            lab = int_labels[i]
            target[lab] += val
            count[lab] += 1
    target /= count

def move_axis_to_end(array, axis):
    array = np.asarray(array)
    return np.rollaxis(array, axis, start=array.ndim)

def grouped_mean(values, by, axis=-1):
    int_labels, uniques = pd.factorize(by, sort=True)
    values = move_axis_to_end(values, axis)
    target = np.zeros(values.shape[:-1] + uniques.shape)
    _grouped_mean(values, int_labels, target)
    return target, uniques

values = np.random.RandomState(0).rand(int(1e6))
values[::50] = np.nan
by = np.random.randint(50, size=int(1e6))
df = pd.DataFrame({'x': values, 'y': by})

np.testing.assert_allclose(grouped_mean(values, by)[0], df.groupby('y')['x'].mean())

%timeit grouped_mean(values, by)
# 100 loops, best of 3: 15.3 ms per loop
%timeit df.groupby('y').mean()
# 10 loops, best of 3: 21.4 ms per loop
```

Unfortunately, I'm unlikely to have time to work on this in the near future. If you or anyone else is interested in taking the lead on this, it would be greatly appreciated!

Note that we can't reuse the routines from pandas because they are only designed for 1D or at most 2D data.
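To make the N-dimensional requirement concrete, here is a NumPy-only sketch of a NaN-skipping grouped mean along one axis of an array of any rank. It is an assumption-laden illustration, not code from xarray or pandas: the name `grouped_nanmean_nd` is hypothetical, and it swaps the Numba kernel for a one-hot matrix product, which is simple but uses memory proportional to samples times groups.

```python
import numpy as np


def grouped_nanmean_nd(values, by, axis=-1):
    """Hypothetical sketch: NaN-skipping grouped mean along `axis` of an N-D array."""
    # Move the grouped axis to the end so the matrix product reduces over it.
    values = np.moveaxis(np.asarray(values, dtype=float), axis, -1)
    uniques, int_labels = np.unique(by, return_inverse=True)
    # One-hot membership matrix of shape (n_samples, n_groups).
    onehot = (int_labels[:, None] == np.arange(len(uniques))).astype(float)
    valid = ~np.isnan(values)
    # matmul broadcasts over the leading dimensions: (..., n) @ (n, g) -> (..., g).
    sums = np.where(valid, values, 0.0) @ onehot
    counts = valid.astype(float) @ onehot
    # A group with no valid entries yields 0 / 0, i.e. NaN (with a NumPy warning).
    return sums / counts, uniques


values = np.arange(24, dtype=float).reshape(2, 3, 4)
values[0, 0, 0] = np.nan
by = np.array(['a', 'b', 'a', 'b'])
means, uniques = grouped_nanmean_nd(values, by)
```

Here a 3D input of shape `(2, 3, 4)` grouped by two labels along the last axis gives a result of shape `(2, 3, 2)`, with the leading dimensions preserved.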

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  117039129