
issues: 288785270


| field | value |
| --- | --- |
| id | 288785270 |
| node_id | MDU6SXNzdWUyODg3ODUyNzA= |
| number | 1832 |
| title | groupby on dask objects doesn't handle chunks well |
| user | 1197350 |
| state | closed |
| locked | 0 |
| comments | 22 |
| created_at | 2018-01-16T04:50:22Z |
| updated_at | 2019-11-27T16:45:14Z |
| closed_at | 2019-06-06T20:01:40Z |
| author_association | MEMBER |
| reactions | 1 (+1: 1) |
| state_reason | completed |
| repo | 13221727 |
| type | issue |

80% of climate data analysis begins with calculating the monthly-mean climatology and subtracting it from the dataset to get an anomaly. Unfortunately this is a fail case for xarray / dask with out-of-core datasets. This is becoming a serious problem for me.

Code Sample

```python
import xarray as xr
import dask.array as da
import pandas as pd

# construct an example dataset chunked in time
nt, ny, nx = 366, 180, 360
time = pd.date_range(start='1950-01-01', periods=nt, freq='10D')
ds = xr.DataArray(da.random.random((nt, ny, nx), chunks=(1, ny, nx)),
                  dims=('time', 'lat', 'lon'),
                  coords={'time': time}).to_dataset(name='field')

# monthly climatology
ds_mm = ds.groupby('time.month').mean(dim='time')

# anomaly
ds_anom = ds.groupby('time.month') - ds_mm
print(ds_anom)
```

```
<xarray.Dataset>
Dimensions:  (lat: 180, lon: 360, time: 366)
Coordinates:
  * time     (time) datetime64[ns] 1950-01-01 1950-01-11 1950-01-21 ...
    month    (time) int64 1 1 1 1 2 2 3 3 3 4 4 4 5 5 5 5 6 6 6 7 7 7 8 8 8 ...
Dimensions without coordinates: lat, lon
Data variables:
    field    (time, lat, lon) float64 dask.array<shape=(366, 180, 360), chunksize=(366, 180, 360)>
```

Problem description

As we can see in the example above, the chunking has been lost: the result contains a single huge chunk. This happens with any non-reducing operation on the groupby, even `ds.groupby('time.month').apply(lambda x: x)`.
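For illustration, one way to see this concretely is to compare the chunk layouts before and after the groupby arithmetic (the `chunks` property below is a standard attribute of dask-backed DataArrays; this check is not part of the original report):

```python
# compare chunk layouts of the input and the groupby result
print(ds.field.chunks)       # ((1, 1, ..., 1), (180,), (360,)) -- one chunk per timestep
print(ds_anom.field.chunks)  # ((366,), (180,), (360,))         -- one huge chunk
```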

Say we wanted to compute some statistics of the anomaly, like the variance: `(ds_anom.field**2).mean(dim='time').load()`. This triggers the whole big chunk (containing the entire timeseries) to be loaded into memory at once. For out-of-core datasets, this will crash our system.
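Before calling `.load()` on an out-of-core dataset, one can inspect the lazy result to confirm that the reduction would pull everything into memory; this is only an illustrative check, not part of the original report:

```python
# inspect the lazy result instead of loading it
variance = (ds_anom.field ** 2).mean(dim='time')
print(ds_anom.field.data.numblocks)  # (1, 1, 1): the whole array is a single dask block
print(variance.data.chunks)          # chunk layout of the lazy (lat, lon) result
```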

Expected Output

It seems like we should be able to do this lazily, maintaining a chunk size of `(1, 180, 360)` for `ds_anom`.
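A possible workaround, sketched here rather than taken from the issue thread, is to skip the groupby arithmetic and look up each timestep's month in the precomputed climatology with vectorized `.sel()`; whether the per-timestep chunking actually survives depends on how dask handles the fancy indexing, so treat this as an untested sketch:

```python
# workaround sketch (not from this issue): subtract the climatology via a
# vectorized month lookup instead of groupby arithmetic
ds_mm = ds.groupby('time.month').mean(dim='time')            # monthly climatology
clim_per_timestep = ds_mm.sel(month=ds['time'].dt.month)     # lookup along the 'time' dimension
ds_anom_alt = ds - clim_per_timestep

print(ds_anom_alt.field.chunks)  # check whether the (1, ny, nx) chunking survived
```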

Output of xr.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.2.final.0
python-bits: 64
OS: Darwin
OS-release: 16.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

xarray: 0.10.0+dev27.g049cbdd
pandas: 0.20.3
numpy: 1.13.1
scipy: 0.19.1
netCDF4: 1.3.1
h5netcdf: 0.4.1
Nio: None
zarr: 2.2.0a2.dev91
bottleneck: 1.2.1
cyordereddict: None
dask: 0.16.0
distributed: 1.20.1
matplotlib: 2.1.0
cartopy: 0.15.1
seaborn: 0.8.1
setuptools: 36.3.0
pip: 9.0.1
conda: None
pytest: 3.2.1
IPython: 6.1.0
sphinx: 1.6.5

Possibly related to #392.

cc @mrocklin

