Issue #2237: why time grouping doesn't preserve chunks

State: closed (completed) · Comments: 30 · Author association: MEMBER
Created: 2018-06-18T15:12:38Z · Closed: 2022-05-15T02:38:30Z

Code Sample, a copy-pastable example if possible

I am continuing my quest to obtain more efficient time grouping for calculation of climatologies and climatological anomalies. I believe this is one of the major performance bottlenecks facing xarray users today. I have raised this in other issues (e.g. #1832), but I believe I have narrowed it down here to a more specific problem.

The easiest way to summarize the problem is with an example. Consider the following dataset

```python
import xarray as xr

ds = xr.Dataset({'foo': (['x'], [1, 1, 1, 1])},
                coords={'x': (['x'], [0, 1, 2, 3]),
                        'bar': (['x'], ['a', 'a', 'b', 'b']),
                        'baz': (['x'], ['a', 'b', 'a', 'b'])})
ds = ds.chunk({'x': 2})
ds
```

```
<xarray.Dataset>
Dimensions:  (x: 4)
Coordinates:
  * x        (x) int64 0 1 2 3
    bar      (x) <U1 dask.array<shape=(4,), chunksize=(2,)>
    baz      (x) <U1 dask.array<shape=(4,), chunksize=(2,)>
Data variables:
    foo      (x) int64 dask.array<shape=(4,), chunksize=(2,)>
```

One non-dimension coordinate (bar) is contiguous with respect to x, while the other (baz) is not. This is important: baz is structured similarly to the way month is distributed along a timeseries dataset.
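Whether a grouping variable is "contiguous" in this sense can be checked mechanically: each distinct label must occupy a single unbroken run. A minimal pure-Python sketch (the helper name `is_contiguous` is hypothetical, not part of xarray):

```python
def is_contiguous(labels):
    # True if every distinct label occupies one unbroken run
    seen = set()
    prev = object()  # sentinel distinct from any label
    for lab in labels:
        if lab != prev:
            if lab in seen:  # label reappears after a gap
                return False
            seen.add(lab)
            prev = lab
    return True

print(is_contiguous(['a', 'a', 'b', 'b']))  # True  (like bar)
print(is_contiguous(['a', 'b', 'a', 'b']))  # False (like baz)
```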

Now let's do a trivial groupby operation on bar that does nothing, just returns the group unchanged:

```python
ds.foo.groupby('bar').apply(lambda x: x)
```

```
<xarray.DataArray 'foo' (x: 4)>
dask.array<shape=(4,), dtype=int64, chunksize=(2,)>
Coordinates:
  * x        (x) int64 0 1 2 3
    bar      (x) <U1 dask.array<shape=(4,), chunksize=(2,)>
    baz      (x) <U1 dask.array<shape=(4,), chunksize=(2,)>
```

This operation preserved the original chunks in foo. But if we group by baz we see something different:

```python
ds.foo.groupby('baz').apply(lambda x: x)
```

```
<xarray.DataArray 'foo' (x: 4)>
dask.array<shape=(4,), dtype=int64, chunksize=(4,)>
Coordinates:
  * x        (x) int64 0 1 2 3
    bar      (x) <U1 dask.array<shape=(4,), chunksize=(2,)>
    baz      (x) <U1 dask.array<shape=(4,), chunksize=(2,)>
```
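The difference between the two cases can be illustrated without dask at all. A groupby/apply round-trip effectively gathers the positions of each label and reassembles the result by indexing in that order; for contiguous labels the gather order is the identity, while for interleaved labels it is an out-of-order take that mixes positions from different chunks. A pure-Python sketch of this idea (`gather_order` is a hypothetical helper, not xarray's actual implementation):

```python
from collections import defaultdict

def gather_order(labels):
    # positions of each label, concatenated in sorted-label order;
    # this mimics the indexer a groupby -> apply -> concat round-trip uses
    positions = defaultdict(list)
    for i, lab in enumerate(labels):
        positions[lab].append(i)
    return [i for lab in sorted(positions) for i in positions[lab]]

bar = ['a', 'a', 'b', 'b']   # contiguous labels
baz = ['a', 'b', 'a', 'b']   # interleaved labels

print(gather_order(bar))  # [0, 1, 2, 3] -> identity take, chunks can survive
print(gather_order(baz))  # [0, 2, 1, 3] -> out-of-order take across chunks
```

With chunks of size 2, the baz order [0, 2, 1, 3] interleaves elements of chunk 0 (positions 0-1) and chunk 1 (positions 2-3), which is where the chunk structure gets lost.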

Problem description

When grouping over a non-contiguous variable (baz), the result comes back as a single chunk covering the whole array. That means we can't lazily access a single item without computing the whole array. This has major performance consequences that make it hard to calculate anomaly values in a more realistic case. What we really want to do is often something like:

```python
ds = xr.open_mfdataset('lots/of/files/*.nc')
ds_anom = ds.groupby('time.month').apply(lambda x: x - x.mean(dim='time'))
```

It is currently impossible to do this lazily due to the issue described above.
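One way to see why the anomaly computation could in principle stay lazy: the subtraction itself never needs to reorder the data. If the per-group means are computed first, the anomaly is just an elementwise operation in the original order, so a chunked backend would have no need to merge chunks for that step. A pure-Python sketch of the idea (the helper names `group_means` and `anomaly` are hypothetical, not xarray API):

```python
from collections import defaultdict

def group_means(labels, values):
    # per-label mean, e.g. a monthly climatology
    sums, counts = defaultdict(float), defaultdict(int)
    for lab, v in zip(labels, values):
        sums[lab] += v
        counts[lab] += 1
    return {lab: sums[lab] / counts[lab] for lab in sums}

def anomaly(labels, values):
    # subtract each element's group mean *in the original order*:
    # no reordering take is required, so chunk structure could be kept
    means = group_means(labels, values)
    return [v - means[lab] for lab, v in zip(labels, values)]

months = [1, 2, 1, 2]          # stand-in for time.month
data = [1.0, 3.0, 3.0, 7.0]
print(anomaly(months, data))   # [-1.0, -2.0, 1.0, 2.0]
```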

Expected Output

We would like to preserve the original chunk structure of foo.

Output of xr.show_versions()

xr.show_versions() is triggering a segfault right now on my system for unknown reasons! I am using xarray 0.10.7.
