**vectorized groupby binary ops** (pydata/xarray#5804)

Opened 2021-09-20 by a MEMBER · closed as completed 2022-03-29 · 1 comment

By switching to numpy_groupies we are vectorizing our groupby reductions. I think we can do the same for groupby's binary ops.

Here's an example array:

```python
import numpy as np
import xarray as xr

%load_ext memory_profiler

N = 4 * 2000
da = xr.DataArray(
    np.random.random((N, N)),
    dims=("x", "y"),
    coords={
        "labels": ("x", np.repeat(["a", "b", "c", "d", "e", "f", "g", "h"], repeats=N // 8))
    },
)
```

Consider this "anomaly" calculation, where the anomaly is defined relative to the group mean:

```python
def anom_current(da):
    grouped = da.groupby("labels")
    mean = grouped.mean()
    anom = grouped - mean
    return anom
```

With this approach, we loop over each group and apply the binary operation: https://github.com/pydata/xarray/blob/a1635d324753588e353e4e747f6058936fa8cf1e/xarray/core/computation.py#L502-L525
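
For reference, the per-group loop amounts to something like this (a conceptual sketch of the pattern, not the linked implementation; `anom_loop` is a made-up name):

```python
def anom_loop(da):
    # Subtract each group's mean from that group, one group at a time,
    # then stitch the pieces back together along the grouped dimension.
    # (The labels in this example are already sorted, so concat restores
    # the original order.)
    pieces = [group - group.mean() for _, group in da.groupby("labels")]
    return xr.concat(pieces, dim="x")
```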

This saves some memory, but becomes slow for a large number of groups.

We could instead do:

```python
def anom_vectorized(da):
    mean = da.groupby("labels").mean()
    mean_expanded = mean.sel(labels=da.labels)
    anom = da - mean_expanded
    return anom
```
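
As a quick sanity check (assuming both functions above are defined in the same session), the two approaches should agree up to floating-point error:

```python
xr.testing.assert_allclose(anom_current(da), anom_vectorized(da))
```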

Now we are faster, but we construct an extra array as big as the original (I think this is an OK tradeoff).

```python
%timeit anom_current(da)
1.4 s ± 20.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit anom_vectorized(da)
937 ms ± 5.26 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
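
The memory half of the tradeoff can be checked the same way with memory_profiler's `%memit` magic (loaded in the setup above); this is illustrative only, with no numbers claimed, but the peak for `anom_vectorized` should reflect the extra `mean_expanded` array:

```python
%memit anom_current(da)
%memit anom_vectorized(da)
```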

(I haven't experimented with dask yet, so the following is just a theory.)

I think the real benefit comes with dask. Depending on where the groups fall relative to chunk boundaries, the looped approach could create a lot of tiny chunks by splitting up existing ones. With the vectorized approach we can do better.
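
To illustrate the chunk-splitting concern concretely, here's a sketch (assuming dask is installed, and picking a chunk size that doesn't line up with the eight label groups):

```python
dda = da.chunk({"x": 3000})  # 8000 rows -> chunks of (3000, 3000, 2000)
# Extracting each group slices through the existing chunks:
pieces = [group - group.mean() for _, group in dda.groupby("labels")]
print([piece.chunks[0] for piece in pieces])  # eight separate (1000,) chunks
```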

Ideally we would reindex the "mean" dask array with a numpy-array-of-repeated-ints such that the chunking of mean_expanded exactly matches the chunking of da along the grouped dimension.
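
In numpy terms, the repeated-int indexer could be built roughly like this (a hypothetical sketch; `groups` and `codes` are made-up names, not xarray API):

```python
labels = da.labels.values
groups = np.unique(labels)               # sorted group labels, matching groupby's order
codes = np.searchsorted(groups, labels)  # one integer per element along "x"
mean = da.groupby("labels").mean()
# Vectorized (pointwise) indexing expands the mean back to da's full size:
mean_expanded = mean.isel(labels=xr.DataArray(codes, dims="x"))
anom = da - mean_expanded
```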

~In practice, `dask.array.take` doesn't allow specifying "output chunks", so we'd end up chunking `mean_expanded` based on dask's automatic heuristics, and then rechunking again for the binary operation.~

Thoughts?

cc @rabernat

