
issues: 288785270


| field | value |
| --- | --- |
| id | 288785270 |
| node_id | MDU6SXNzdWUyODg3ODUyNzA= |
| number | 1832 |
| title | groupby on dask objects doesn't handle chunks well |
| user | 1197350 |
| state | closed |
| locked | 0 |
| comments | 22 |
| created_at | 2018-01-16T04:50:22Z |
| updated_at | 2019-11-27T16:45:14Z |
| closed_at | 2019-06-06T20:01:40Z |
| author_association | MEMBER |
| reactions | 1 (+1: 1) |
| state_reason | completed |
| repo | 13221727 |
| type | issue |

80% of climate data analysis begins with calculating the monthly-mean climatology and subtracting it from the dataset to get an anomaly. Unfortunately this is a fail case for xarray / dask with out-of-core datasets. This is becoming a serious problem for me.

Code Sample

```python
import xarray as xr
import dask.array as da
import pandas as pd

# construct an example dataset chunked in time
nt, ny, nx = 366, 180, 360
time = pd.date_range(start='1950-01-01', periods=nt, freq='10D')
ds = xr.DataArray(da.random.random((nt, ny, nx), chunks=(1, ny, nx)),
                  dims=('time', 'lat', 'lon'),
                  coords={'time': time}).to_dataset(name='field')

# monthly climatology
ds_mm = ds.groupby('time.month').mean(dim='time')

# anomaly
ds_anom = ds.groupby('time.month') - ds_mm
print(ds_anom)
```

```
<xarray.Dataset>
Dimensions:  (lat: 180, lon: 360, time: 366)
Coordinates:
  * time     (time) datetime64[ns] 1950-01-01 1950-01-11 1950-01-21 ...
    month    (time) int64 1 1 1 1 2 2 3 3 3 4 4 4 5 5 5 5 6 6 6 7 7 7 8 8 8 ...
Dimensions without coordinates: lat, lon
Data variables:
    field    (time, lat, lon) float64 dask.array<shape=(366, 180, 360), chunksize=(366, 180, 360)>
```

Problem description

As we can see in the example above, the chunking has been lost: the result contains a single huge chunk. This happens with any non-reducing operation on the groupby, even `ds.groupby('time.month').apply(lambda x: x)`.
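For illustration, one way to see this concretely is to compare the chunk layouts before and after the groupby arithmetic (the `chunks` property below is a standard attribute of dask-backed DataArrays; this check is not part of the original report):

```python
# compare chunk layouts of the input and the groupby result
print(ds.field.chunks)       # ((1, 1, ..., 1), (180,), (360,)) -- one chunk per timestep
print(ds_anom.field.chunks)  # ((366,), (180,), (360,))         -- one huge chunk
```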

Say we wanted to compute some statistics of the anomaly, like the variance: `(ds_anom.field**2).mean(dim='time').load()`. This triggers the whole big chunk (containing the entire timeseries) to be loaded into memory at once. For out-of-core datasets, this will crash our system.
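Before calling `.load()` on an out-of-core dataset, one can inspect the lazy result to confirm that the reduction would pull everything into memory; this is only an illustrative check, not part of the original report:

```python
# inspect the lazy result instead of loading it
variance = (ds_anom.field ** 2).mean(dim='time')
print(ds_anom.field.data.numblocks)  # (1, 1, 1): the whole array is a single dask block
print(variance.data.chunks)          # chunk layout of the lazy (lat, lon) result
```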

Expected Output

It seems like we should be able to do this lazily, maintaining a chunk size of `(1, 180, 360)` for `ds_anom`.
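A possible workaround, sketched here rather than taken from the issue thread, is to skip the groupby arithmetic and look up each timestep's month in the precomputed climatology with vectorized `.sel()`; whether the per-timestep chunking actually survives depends on how dask handles the fancy indexing, so treat this as an untested sketch:

```python
# workaround sketch (not from this issue): subtract the climatology via a
# vectorized month lookup instead of groupby arithmetic
ds_mm = ds.groupby('time.month').mean(dim='time')            # monthly climatology
clim_per_timestep = ds_mm.sel(month=ds['time'].dt.month)     # lookup along the 'time' dimension
ds_anom_alt = ds - clim_per_timestep

print(ds_anom_alt.field.chunks)  # check whether the (1, ny, nx) chunking survived
```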

Output of xr.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.2.final.0
python-bits: 64
OS: Darwin
OS-release: 16.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

xarray: 0.10.0+dev27.g049cbdd
pandas: 0.20.3
numpy: 1.13.1
scipy: 0.19.1
netCDF4: 1.3.1
h5netcdf: 0.4.1
Nio: None
zarr: 2.2.0a2.dev91
bottleneck: 1.2.1
cyordereddict: None
dask: 0.16.0
distributed: 1.20.1
matplotlib: 2.1.0
cartopy: 0.15.1
seaborn: 0.8.1
setuptools: 36.3.0
pip: 9.0.1
conda: None
pytest: 3.2.1
IPython: 6.1.0
sphinx: 1.6.5

Possibly related to #392.

cc @mrocklin

