issues: 753517739


id: 753517739
node_id: MDU6SXNzdWU3NTM1MTc3Mzk=
number: 4625
title: Non lazy behavior for weighted average when using resampled data
user: 14314623
state: closed
locked: 0
comments: 13
created_at: 2020-11-30T14:19:48Z
updated_at: 2020-12-16T19:05:30Z
closed_at: 2020-12-16T19:05:30Z
author_association: CONTRIBUTOR

I am trying to apply an averaging function to multi-year chunks of monthly model data. At its core the function performs a weighted average (followed by some coordinate manipulations). I am using `resample(time='1AS')` and then try to map my custom function onto the data (see the example below). Even without actually loading the data, this step is prohibitively slow in my workflow (20-30 min depending on the model). Is there a way to apply this step completely lazily, as in the case where a simple unweighted `.mean()` is used?

```python
from dask.diagnostics import ProgressBar
import xarray as xr
import numpy as np

# simple customized weighted mean function
def mean_func(ds):
    return ds.weighted(ds.weights).mean('time')

# example dataset
t = xr.cftime_range(start='2000', periods=1000, freq='1AS')
weights = xr.DataArray(np.random.rand(len(t)), dims=['time'], coords={'time': t})
data = xr.DataArray(np.random.rand(len(t)), dims=['time'], coords={'time': t, 'weights': weights})
ds = xr.Dataset({'data': data}).chunk({'time': 1})
ds
```

Using resample with a simple mean works without any computation being triggered:

```python
with ProgressBar():
    ds.resample(time='3AS').mean('time')
```

But when I do the same step with my custom function, some computations show up:

```python
with ProgressBar():
    ds.resample(time='3AS').map(mean_func)
```

```
[########################################] | 100% Completed |  0.1s
[########################################] | 100% Completed |  0.1s
[########################################] | 100% Completed |  0.1s
[########################################] | 100% Completed |  0.1s
```

I am quite sure these are the same kind of computations that make my real-world workflow so slow.

I also confirmed that this is not happening when I do not use resample first:

```python
with ProgressBar():
    mean_func(ds)
```

This does not trigger a computation either. So this must somehow be related to resample? I would be happy to dig deeper into this if somebody with more knowledge could point me to the right place.
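Not part of the original report, but for context: one way to keep such a computation lazy is to expand the weighted mean into two built-in reductions, `sum(data * weights) / sum(weights)`, since built-in resample reductions dispatch to dask without triggering compute. This is a sketch under the assumption that the weights can be carried as a data variable rather than a coordinate; it uses a plain `pd.date_range` index instead of the issue's cftime index.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical workaround (a sketch, not from the issue): rewrite the weighted
# mean as sum(data * weights) / sum(weights). Both pieces use only built-in
# resample reductions, so the whole expression stays lazy on dask-backed data.
t = pd.date_range('2000-01-01', periods=12, freq='AS')
weights = xr.DataArray(np.random.rand(len(t)), dims=['time'], coords={'time': t})
ds = xr.Dataset(
    {'data': (('time',), np.random.rand(len(t))), 'weights': weights}
).chunk({'time': 1})

num = (ds['data'] * ds['weights']).resample(time='3AS').sum('time')
den = ds['weights'].resample(time='3AS').sum('time')
wmean = num / den  # still a lazy dask-backed DataArray; nothing computed yet
```

Within each bin this is algebraically the same as the `ds.weighted(...).mean('time')` in `mean_func` (ignoring NaN handling, which `weighted` treats specially by masking the weights).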

Environment:

Output of <tt>xr.show_versions()</tt>

```
INSTALLED VERSIONS
------------------
commit: None
python: 3.8.6 | packaged by conda-forge | (default, Oct  7 2020, 19:08:05) [GCC 7.5.0]
python-bits: 64
OS: Linux
OS-release: 3.10.0-1160.2.2.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.6
libnetcdf: 4.7.4
xarray: 0.16.2.dev77+g1a4f7bd
pandas: 1.1.3
numpy: 1.19.2
scipy: 1.5.2
netCDF4: 1.5.4
pydap: None
h5netcdf: 0.8.1
h5py: 2.10.0
Nio: None
zarr: 2.4.0
cftime: 1.2.1
nc_time_axis: 1.2.0
PseudoNetCDF: None
rasterio: 1.1.3
cfgrib: None
iris: None
bottleneck: None
dask: 2.30.0
distributed: 2.30.0
matplotlib: 3.3.2
cartopy: 0.18.0
seaborn: None
numbagg: None
pint: 0.16.1
setuptools: 49.6.0.post20201009
pip: 20.2.4
conda: None
pytest: 6.1.2
IPython: 7.18.1
sphinx: None
```
state_reason: completed
repo: 13221727
type: issue

Links from other tables

  • 1 row from issues_id in issues_labels
  • 13 rows from issue in issue_comments