issues: 1001197796
id | node_id | number | title | user | state | locked | assignee | milestone | comments | created_at | updated_at | closed_at | author_association | active_lock_reason | draft | pull_request | performed_via_github_app | state_reason | repo | type
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
1001197796 | I_kwDOAMm_X847rRDk | 5804 | vectorized groupby binary ops | 2448579 | closed | 0 | | | 1 | 2021-09-20T17:04:47Z | 2022-03-29T07:11:28Z | 2022-03-29T07:11:28Z | MEMBER | | | | | completed | 13221727 | issue

body:

By switching to a vectorized approach, we can speed up groupby binary operations. Here's an example array:

``` python
import numpy as np
import xarray as xr

%load_ext memory_profiler

N = 4 * 2000
da = xr.DataArray(
    np.random.random((N, N)),
    dims=("x", "y"),
    coords={"labels": ("x", np.repeat(["a", "b", "c", "d", "e", "f", "g", "h"], repeats=N // 8))},
)
```

Consider this "anomaly" calculation, where the anomaly is defined relative to the group mean:

``` python
def anom_current(da):
    grouped = da.groupby("labels")
    mean = grouped.mean()
    anom = grouped - mean
    return anom
```

With this approach, we loop over each group and apply the binary operation (see the schematic sketch below):

https://github.com/pydata/xarray/blob/a1635d324753588e353e4e747f6058936fa8cf1e/xarray/core/computation.py#L502-L525
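Schematically, the per-group loop behaves something like this sketch (a simplified stand-in for the linked implementation, reusing `da` and the imports from the setup above; `anom_loop` is a hypothetical name):

``` python
def anom_loop(da):
    grouped = da.groupby("labels")
    mean = grouped.mean()
    # one small binary op per group, then stitch the pieces back together
    anoms = [group - mean.sel(labels=label) for label, group in grouped]
    return xr.concat(anoms, dim="x")
```

Each iteration pays xarray's indexing and alignment overhead, so the cost grows with the number of groups.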
This saves some memory, but becomes slow for a large number of groups. We could instead do something like the following.
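One way to write that (a sketch; the `mean_expanded` name anticipates the dask discussion below, and the `.sel`-based expansion is one possible implementation):

``` python
def anom_vectorized(da):
    mean = da.groupby("labels").mean()
    # expand the per-group means back out to the original (N, N) shape
    mean_expanded = mean.sel(labels=da.labels)
    anom = da - mean_expanded
    return anom
```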
Now we are faster, but we construct an extra array as big as the original array (I think this is an OK tradeoff):

```
%timeit anom_current(da)
1.4 s ± 20.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit anom_vectorized(da)
937 ms ± 5.26 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

(I haven't experimented with dask yet, so the following is just a theory.)

I think the real benefit comes with dask. Depending on where the groups are located relative to the chunking, we could end up creating a lot of tiny chunks by splitting up existing chunks. With the vectorized approach we can do better: ideally we would reindex the "mean" dask array with a numpy-array-of-repeated-ints such that the chunking of `mean_expanded` exactly matches the chunking of `da`, as in the sketch below.

~~In practice, dask.array.take doesn't allow specifying "output chunks", so we'd end up chunking `mean_expanded` based on dask's automatic heuristics, and then rechunking again for the binary operation.~~
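A minimal sketch of that reindexing step in plain dask (the shapes, chunk sizes, and the `dsa` alias are illustrative assumptions, not part of the proposal):

``` python
import numpy as np
import dask.array as dsa

N = 8 * 1000
mean = dsa.random.random((8, N), chunks=(8, 1000))  # one row of means per group
idx = np.repeat(np.arange(8), N // 8)  # repeated ints mapping each output row to its group
mean_expanded = mean[idx, :]  # dask chooses the output chunks along axis 0 on its own
print(mean_expanded.chunks[0])  # these need not match da's chunks, forcing a rechunk
```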
{ "url": "https://api.github.com/repos/pydata/xarray/issues/5804/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |