# vectorized groupby binary ops (#5804)

By switching to `numpy_groupies`, we are vectorizing our groupby reductions. I think we can do the same for groupby's binary ops.

Here's an example array:

``` python
import numpy as np
import xarray as xr

%load_ext memory_profiler

N = 4 * 2000
da = xr.DataArray(
    np.random.random((N, N)),
    dims=("x", "y"),
    coords={"labels": ("x", np.repeat(["a", "b", "c", "d", "e", "f", "g", "h"], repeats=N // 8))},
)
```

Consider this "anomaly" calculation, where the anomaly is defined relative to the group mean:

``` python
def anom_current(da):
    grouped = da.groupby("labels")
    mean = grouped.mean()
    anom = grouped - mean
    return anom
```

With this approach, we loop over each group and apply the binary operation:
https://github.com/pydata/xarray/blob/a1635d324753588e353e4e747f6058936fa8cf1e/xarray/core/computation.py#L502-L525

This saves some memory but becomes slow for a large number of groups.

We could instead do:

``` python
def anom_vectorized(da):
    mean = da.groupby("labels").mean()
    mean_expanded = mean.sel(labels=da.labels)
    anom = da - mean_expanded
    return anom
```

Now we are faster, but we construct an extra array as big as the original (I think this is an acceptable tradeoff):

```
%timeit anom_current(da)
# 1.4 s ± 20.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit anom_vectorized(da)
# 937 ms ± 5.26 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

------

(I haven't experimented with dask yet, so the following is just a theory.)

I think the real benefit comes with dask. Depending on where the groups are located relative to the chunking, the current approach could end up creating a lot of tiny chunks by splitting up existing chunks. With the vectorized approach we can do better. Ideally we would reindex the "mean" dask array with a numpy-array-of-repeated-ints such that the chunking of `mean_expanded` exactly matches the chunking of `da` along the grouped dimension (rough sketches of this idea are appended at the end of this issue).

~~In practice, [dask.array.take](https://docs.dask.org/en/latest/_modules/dask/array/routines.html#take) doesn't allow specifying "output chunks", so we'd end up chunking `mean_expanded` based on dask's automatic heuristics, and then rechunking again for the binary operation.~~

Thoughts? cc @rabernat
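
------

For concreteness, here is a minimal numpy-level sketch of the "reindex with a numpy-array-of-repeated-ints" idea. All names here (`codes`, `means`, etc.) are illustrative, not an existing xarray API:

``` python
import numpy as np

values = np.random.random((16, 4))
labels = np.repeat(["a", "b"], repeats=8)

# codes[i] is the integer position of row i's group among the sorted unique labels
uniques, codes = np.unique(labels, return_inverse=True)

# one mean per group, stacked in the order of `uniques`
means = np.stack([values[labels == u].mean(axis=0) for u in uniques])

# expand the means back to the original shape with a single take,
# then subtract -- no per-group loop
anom = values - means[codes]
```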
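
And a rough dask version of the chunk-matching idea (untested, per the caveat above; whether `mean_expanded` inherits `da`'s chunking along `x`, rather than getting dask's automatic chunking for the take, is exactly the open question):

``` python
import dask.array
import numpy as np
import xarray as xr

N = 4 * 2000
labels = np.repeat(["a", "b", "c", "d", "e", "f", "g", "h"], repeats=N // 8)
da = xr.DataArray(
    dask.array.random.random((N, N), chunks=(N // 8, N)),
    dims=("x", "y"),
    coords={"labels": ("x", labels)},
)

mean = da.groupby("labels").mean()  # 8 rows, one per unique label

# integer positions of each row's group in `mean` (groupby sorts its
# groups by label, matching np.unique's ordering)
_, codes = np.unique(labels, return_inverse=True)

# vectorized take along the grouped dimension; indexing with a DataArray
# whose dimension is "x" puts the result back on dimension "x"
mean_expanded = mean.isel(labels=xr.DataArray(codes, dims="x"))
anom = da - mean_expanded
```

Comparing `mean_expanded.chunks` with `da.chunks` here would show whether a rechunk is still needed before the subtraction.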