
issues: 859218255


id: 859218255
node_id: MDU6SXNzdWU4NTkyMTgyNTU=
number: 5165
title: Poor memory management with dask=2021.4.0
user: 39069044
state: closed
locked: 0
assignee: (none)
milestone: (none)
comments: 4
created_at: 2021-04-15T20:19:05Z
updated_at: 2021-04-21T12:16:31Z
closed_at: 2021-04-21T10:17:40Z
author_association: CONTRIBUTOR
active_lock_reason: (none)
draft: (none)
pull_request: (none)
body: (full text below)
reactions: (JSON below)
performed_via_github_app: (none)
state_reason: completed
repo: 13221727
type: issue

What happened: With the latest dask release (2021.4.0), there seems to be a regression in memory management that has broken some of my standard climate-science workflows, like the simple anomaly calculation below. Rather than intelligently handling chunks for independent time slices, the code below now loads the entire ~30 GB `x` array into memory before writing to zarr.

What you expected to happen: Dask would intelligently manage chunks and not fill up memory. This works fine in 2021.3.0.
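
For scale, a rough back-of-the-envelope check (my own arithmetic, not output from the example): with chunks of `(1, -1, -1)` over the `(14610, 360, 720)` float64 array built below, each chunk is only about 2 MB while the full array is roughly 30 GB, so a chunk-at-a-time write should need only a tiny fraction of total memory:

```python
import numpy as np

# rough sizing, assuming the (1, 360, 720) float64 chunks used in the example below
chunk_bytes = 1 * 360 * 720 * np.dtype('float64').itemsize      # ~2.1 MB per chunk
total_bytes = 14610 * 360 * 720 * np.dtype('float64').itemsize  # ~30.3 GB for all of 'x'
print(f"{chunk_bytes / 1e6:.1f} MB per chunk, {total_bytes / 1e9:.1f} GB total")
```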

Minimal Complete Verifiable Example: Generate a synthetic dataset with a time/lat/lon variable and an associated climatology, store it to disk, then calculate the anomaly:

```python
import xarray as xr
import pandas as pd
import numpy as np
import dask.array as da

dates = pd.date_range('1980-01-01', '2019-12-31', freq='D')
ds = xr.Dataset(
    data_vars={
        'x': (('time', 'lat', 'lon'),
              da.random.random(size=(dates.size, 360, 720), chunks=(1, -1, -1))),
        'clim': (('dayofyear', 'lat', 'lon'),
                 da.random.random(size=(366, 360, 720), chunks=(1, -1, -1))),
    },
    coords={
        'time': dates,
        'dayofyear': np.arange(1, 367, 1),
        'lat': np.arange(-90, 90, .5),
        'lon': np.arange(-180, 180, .5),
    },
)

# My original use case was pulling this data from disk, but it doesn't actually seem to matter
ds.to_zarr('test-data', mode='w')
ds = xr.open_zarr('test-data')

ds['anom'] = ds.x.groupby('time.dayofyear') - ds.clim
ds[['anom']].to_zarr('test-anom', mode='w')
```
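
A minimal sketch of how to check the lazy result's chunking and watch memory during the write, using dask's local diagnostics (this assumes the `test-data` store written above; `ResourceProfiler` only covers the local, non-distributed schedulers, and the `test-anom-check` path is just illustrative):

```python
from dask.diagnostics import ProgressBar, ResourceProfiler

ds = xr.open_zarr('test-data')
anom = ds.x.groupby('time.dayofyear') - ds.clim
print(anom.chunks)  # expect a chunk size of 1 along the time dimension

# sample memory/CPU once per second while the zarr store is written
with ResourceProfiler(dt=1) as rprof, ProgressBar():
    anom.to_dataset(name='anom').to_zarr('test-anom-check', mode='w')

print(rprof.results[-5:])  # timestamped memory/CPU samples taken during the write
```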

Anything else we need to know?: Neither the scheduler (distributed vs. local) nor the file backend (e.g. zarr vs. netCDF) seems to affect this.
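
A sketch of those variations (scheduler and backend), reusing `ds` from the example above; the output paths here are just illustrative:

```python
import dask

# force the single-threaded local scheduler instead of the default/distributed one
with dask.config.set(scheduler='synchronous'):
    ds[['anom']].to_zarr('test-anom-sync', mode='w')

# write through the netCDF backend instead of zarr
ds[['anom']].to_netcdf('test-anom.nc')
```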

Dask task graphs (images attached in the original issue) look the same for both 2021.3.0 and 2021.4.0.
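
Something along these lines can be used to dump the task graph for comparison between the two versions (a sketch; it assumes graphviz is installed):

```python
# render the optimized task graph behind the anomaly variable to an image
ds['anom'].data.visualize(filename='anom-graph.png', optimize_graph=True)
```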

Environment:

Output of `xr.show_versions()`:

```
INSTALLED VERSIONS
------------------
commit: None
python: 3.8.6 | packaged by conda-forge | (default, Dec 26 2020, 05:05:16) [GCC 9.3.0]
python-bits: 64
OS: Linux
OS-release: 5.8.0-48-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.6
libnetcdf: 4.7.4

xarray: 0.17.1.dev52+ge5690588
pandas: 1.2.1
numpy: 1.19.5
scipy: 1.6.0
netCDF4: 1.5.5.1
pydap: None
h5netcdf: 0.8.1
h5py: 2.10.0
Nio: None
zarr: 2.6.1
cftime: 1.3.1
nc_time_axis: 1.2.0
PseudoNetCDF: None
rasterio: 1.1.8
cfgrib: 0.9.8.5
iris: None
bottleneck: 1.3.2
dask: 2021.04.0
distributed: 2021.04.0
matplotlib: 3.3.3
cartopy: 0.18.0
seaborn: None
numbagg: None
pint: 0.16.1
setuptools: 49.6.0.post20210108
pip: 20.3.3
conda: None
pytest: None
IPython: 7.20.0
sphinx: None
```
Reactions:

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/5165/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}

Links from other tables

  • 1 row from issues_id in issues_labels
  • 4 rows from issue in issue_comments