Issue #2237: why time grouping doesn't preserve chunks

State: closed (completed) · Comments: 30 · Author association: MEMBER
Created: 2018-06-18T15:12:38Z · Closed: 2022-05-15T02:38:30Z

Code Sample, a copy-pastable example if possible

I am continuing my quest to obtain more efficient time grouping for calculation of climatologies and climatological anomalies. I believe this is one of the major performance bottlenecks facing xarray users today. I have raised this in other issues (e.g. #1832), but I believe I have narrowed it down here to a more specific problem.

The easiest way to summarize the problem is with an example. Consider the following dataset

```python
import xarray as xr

ds = xr.Dataset({'foo': (['x'], [1, 1, 1, 1])},
                coords={'x': (['x'], [0, 1, 2, 3]),
                        'bar': (['x'], ['a', 'a', 'b', 'b']),
                        'baz': (['x'], ['a', 'b', 'a', 'b'])})
ds = ds.chunk({'x': 2})
ds
```

```
<xarray.Dataset>
Dimensions:  (x: 4)
Coordinates:
  * x        (x) int64 0 1 2 3
    bar      (x) <U1 dask.array<shape=(4,), chunksize=(2,)>
    baz      (x) <U1 dask.array<shape=(4,), chunksize=(2,)>
Data variables:
    foo      (x) int64 dask.array<shape=(4,), chunksize=(2,)>
```

One non-dimension coordinate (bar) is contiguous with respect to x, while the other (baz) is not. This is important: baz is structured similarly to the way month is distributed along a timeseries dataset.
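Whether a grouping variable is "contiguous" in this sense can be checked mechanically: each distinct label must occupy a single unbroken run. A minimal pure-Python sketch (the helper name `is_contiguous` is hypothetical, not part of xarray):

```python
def is_contiguous(labels):
    # True if every distinct label occupies one unbroken run
    seen = set()
    prev = object()  # sentinel distinct from any label
    for lab in labels:
        if lab != prev:
            if lab in seen:  # label reappears after a gap
                return False
            seen.add(lab)
            prev = lab
    return True

print(is_contiguous(['a', 'a', 'b', 'b']))  # True  (like bar)
print(is_contiguous(['a', 'b', 'a', 'b']))  # False (like baz)
```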

Now let's do a trivial groupby operation on bar that does nothing, just returns the group unchanged:

```python
ds.foo.groupby('bar').apply(lambda x: x)
```

```
<xarray.DataArray 'foo' (x: 4)>
dask.array<shape=(4,), dtype=int64, chunksize=(2,)>
Coordinates:
  * x        (x) int64 0 1 2 3
    bar      (x) <U1 dask.array<shape=(4,), chunksize=(2,)>
    baz      (x) <U1 dask.array<shape=(4,), chunksize=(2,)>
```

This operation preserved the original chunks in foo. But if we group by baz we see something different:

```python
ds.foo.groupby('baz').apply(lambda x: x)
```

```
<xarray.DataArray 'foo' (x: 4)>
dask.array<shape=(4,), dtype=int64, chunksize=(4,)>
Coordinates:
  * x        (x) int64 0 1 2 3
    bar      (x) <U1 dask.array<shape=(4,), chunksize=(2,)>
    baz      (x) <U1 dask.array<shape=(4,), chunksize=(2,)>
```
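The difference between the two cases can be illustrated without dask at all. A groupby/apply round-trip effectively gathers the positions of each label and reassembles the result by indexing in that order; for contiguous labels the gather order is the identity, while for interleaved labels it is an out-of-order take that mixes positions from different chunks. A pure-Python sketch of this idea (`gather_order` is a hypothetical helper, not xarray's actual implementation):

```python
from collections import defaultdict

def gather_order(labels):
    # positions of each label, concatenated in sorted-label order;
    # this mimics the indexer a groupby -> apply -> concat round-trip uses
    positions = defaultdict(list)
    for i, lab in enumerate(labels):
        positions[lab].append(i)
    return [i for lab in sorted(positions) for i in positions[lab]]

bar = ['a', 'a', 'b', 'b']   # contiguous labels
baz = ['a', 'b', 'a', 'b']   # interleaved labels

print(gather_order(bar))  # [0, 1, 2, 3] -> identity take, chunks can survive
print(gather_order(baz))  # [0, 2, 1, 3] -> out-of-order take across chunks
```

With chunks of size 2, the baz order [0, 2, 1, 3] interleaves elements of chunk 0 (positions 0-1) and chunk 1 (positions 2-3), which is where the chunk structure gets lost.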

Problem description

When grouping over a non-contiguous variable (baz), the result comes back as a single chunk covering the whole array. That means we can't lazily access a single item without computing the whole array. This has major performance consequences that make it hard to calculate anomaly values in a more realistic case. What we really want to do is often something like:

```python
ds = xr.open_mfdataset('lots/of/files/*.nc')
ds_anom = ds.groupby('time.month').apply(lambda x: x - x.mean(dim='time'))
```

It is currently impossible to do this lazily due to the issue described above.
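One way to see why the anomaly computation could in principle stay lazy: the subtraction itself never needs to reorder the data. If the per-group means are computed first, the anomaly is just an elementwise operation in the original order, so a chunked backend would have no need to merge chunks for that step. A pure-Python sketch of the idea (the helper names `group_means` and `anomaly` are hypothetical, not xarray API):

```python
from collections import defaultdict

def group_means(labels, values):
    # per-label mean, e.g. a monthly climatology
    sums, counts = defaultdict(float), defaultdict(int)
    for lab, v in zip(labels, values):
        sums[lab] += v
        counts[lab] += 1
    return {lab: sums[lab] / counts[lab] for lab in sums}

def anomaly(labels, values):
    # subtract each element's group mean *in the original order*:
    # no reordering take is required, so chunk structure could be kept
    means = group_means(labels, values)
    return [v - means[lab] for lab, v in zip(labels, values)]

months = [1, 2, 1, 2]          # stand-in for time.month
data = [1.0, 3.0, 3.0, 7.0]
print(anomaly(months, data))   # [-1.0, -2.0, 1.0, 2.0]
```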

Expected Output

We would like to preserve the original chunk structure of foo.

Output of xr.show_versions()

xr.show_versions() is triggering a segfault right now on my system for unknown reasons! I am using xarray 0.10.7.
