issues: 288785270
This data as json
id | node_id | number | title | user | state | locked | assignee | milestone | comments | created_at | updated_at | closed_at | author_association | active_lock_reason | draft | pull_request | body | reactions | performed_via_github_app | state_reason | repo | type |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
288785270 | MDU6SXNzdWUyODg3ODUyNzA= | 1832 | groupby on dask objects doesn't handle chunks well | 1197350 | closed | 0 | 22 | 2018-01-16T04:50:22Z | 2019-11-27T16:45:14Z | 2019-06-06T20:01:40Z | MEMBER |

80% of climate data analysis begins with calculating the monthly-mean climatology and subtracting it from the dataset to get an anomaly. Unfortunately this is a fail case for xarray / dask with out-of-core datasets. This is becoming a serious problem for me.

**Code Sample**

```python
import xarray as xr
import dask.array as da
import pandas as pd

# construct an example dataset chunked in time
nt, ny, nx = 366, 180, 360
time = pd.date_range(start='1950-01-01', periods=nt, freq='10D')
ds = xr.DataArray(da.random.random((nt, ny, nx), chunks=(1, ny, nx)),
                  dims=('time', 'lat', 'lon'),
                  coords={'time': time}).to_dataset(name='field')

# monthly climatology
ds_mm = ds.groupby('time.month').mean(dim='time')

# anomaly
ds_anom = ds.groupby('time.month') - ds_mm
print(ds_anom)
```
**Problem description**

As we can see in the example above, the chunking has been lost: the resulting dataset contains just one single huge chunk. This happens with any non-reducing operation on the groupby, even …
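Not part of the original report, but a quick way to make the lost chunking visible, reusing the names from the example above:

```python
# Chunk layout before and after the groupby arithmetic: the input has one
# chunk per timestep, while the anomaly has collapsed into a single chunk.
print(ds['field'].data.chunks)       # ((1, 1, ..., 1), (180,), (360,))
print(ds_anom['field'].data.chunks)  # one chunk spanning the whole array
```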
Say we wanted to compute some statistics of the anomaly, like the variance:
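The snippet that followed this sentence is not preserved in this export; a minimal sketch of such a computation, with the `var` reduction chosen purely for illustration, might be:

```python
# Lazily reduce the anomaly over time. Because the chunking has collapsed,
# evaluating this forces dask to hold the whole array in memory, which is
# exactly what an out-of-core workflow needs to avoid.
anom_var = ds_anom['field'].var(dim='time')
print(anom_var)
```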
**Expected Output**

It seems like we should be able to do this lazily, maintaining a chunk size of …
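For illustration only (the exact chunk size the author expected is truncated above; chunks of size 1 along `time`, mirroring the input's `chunks=(1, ny, nx)`, are an assumption):

```python
# Assumed expectation: the anomaly keeps the input's per-timestep chunking
# instead of one giant chunk covering the whole array.
expected_chunks = ((1,) * nt, (ny,), (nx,))
print(ds_anom['field'].data.chunks == expected_chunks)  # False at the time of this report
```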
**Output of `xr.show_versions()`**
|
{ "url": "https://api.github.com/repos/pydata/xarray/issues/1832/reactions", "total_count": 1, "+1": 1, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
completed | 13221727 | issue |