issues: 969079775
This data as json
id | node_id | number | title | user | state | locked | assignee | milestone | comments | created_at | updated_at | closed_at | author_association | active_lock_reason | draft | pull_request | body | reactions | performed_via_github_app | state_reason | repo | type |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
969079775 | MDU6SXNzdWU5NjkwNzk3NzU= | 5701 | Performance issues using map_blocks with cftime indexes. | 20629530 | open | 0 | 1 | 2021-08-12T15:47:29Z | 2022-04-19T02:44:37Z | CONTRIBUTOR | What happened:
When using What you expected to happen: I would understand a performance difference since numpy/pandas objects are usually more optimized than cftime/xarray objects, but the difference is quite large here. Minimal Complete Verifiable Example: Here is a MCVE that I ran in a jupyter notebook. Performance is basically measured by execution time (wall time). I included the current workaround I have for my usecase. ```python import numpy as np import pandas as pd import xarray as xr import dask.array as da from dask.distributed import Client c = Client(n_workers=1, threads_per_worker=8) Test DataNt = 10_000 Nx = Ny = 100 chks = (Nt, 10, 10) A = xr.DataArray( da.zeros((Nt, Ny, Nx), chunks=chks), dims=('time', 'y', 'x'), coords={'time': pd.date_range('1900-01-01', freq='D', periods=Nt), 'x': np.arange(Nx), 'y': np.arange(Ny) }, name='data' ) Copy of a, but with a cftime coordinateB = A.copy() B['time'] = xr.cftime_range('1900-01-01', freq='D', periods=Nt, calendar='noleap') A dumb function to applydef func(data): return data + data Test 1 : numpy-backed time coordinate%time outA = A.map_blocks(func, template=A) # %time outA.load(); Res on my machine:CPU times: user 130 ms, sys: 6.87 ms, total: 136 msWall time: 127 msCPU times: user 3.01 s, sys: 8.09 s, total: 11.1 sWall time: 13.4 sTest 2 : cftime-backed time coordinate%time outB = B.map_blocks(func, template=B) %time outB.load(); Res on my machineCPU times: user 4.42 s, sys: 219 ms, total: 4.64 sWall time: 4.48 sCPU times: user 13.2 s, sys: 3.1 s, total: 16.3 sWall time: 26 sWorkaround in my codedef func_cf(data): data['time'] = xr.decode_cf(data.coords.to_dataset()).time return data + data def map_blocks_cf(func, data): data2 = data.copy() data2['time'] = xr.conventions.encode_cf_variable(data.time) return data2.map_blocks(func, template=data) Test 3 : cftime time coordinate with encoding-decoding%time outB2 = map_blocks_cf(func_cf, B) %time outB2.load(); ResCPU times: user 536 ms, sys: 10.5 ms, total: 546 msWall time: 528 msCPU times: user 9.57 s, sys: 2.23 s, total: 11.8 sWall time: 21.7 s``` Anything else we need to know?:
After exploration I found 2 culprits for this slowness. I used
My workaround is not the best, but it was easy to code without touching xarray. The encoding of the time coordinate changes it to an integer array, which is super fast to tokenize. And the speed up of the construction phase is because there is only one call to I do not know for sure how/why this tokenization works, but I guess the best improvment in xarray could be to:
- Look into the inputs of I have no idea if that would work, but if it does that would be the best speed-up I think. Environment: Output of <tt>xr.show_versions()</tt>INSTALLED VERSIONS ------------------ commit: None python: 3.9.6 | packaged by conda-forge | (default, Jul 11 2021, 03:39:48) [GCC 9.3.0] python-bits: 64 OS: Linux OS-release: 3.10.0-514.2.2.el7.x86_64 machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: en_CA.UTF-8 LOCALE: ('en_CA', 'UTF-8') libhdf5: 1.10.6 libnetcdf: 4.7.4 xarray: 0.19.1.dev18+g4bb9d9c.d20210810 pandas: 1.3.1 numpy: 1.21.1 scipy: 1.7.0 netCDF4: 1.5.6 pydap: None h5netcdf: None h5py: None Nio: None zarr: 2.8.3 cftime: 1.5.0 nc_time_axis: None PseudoNetCDF: None rasterio: None cfgrib: None iris: None bottleneck: 1.3.2 dask: 2021.07.1 distributed: 2021.07.1 matplotlib: 3.4.2 cartopy: 0.19.0 seaborn: None numbagg: None pint: 0.17 setuptools: 49.6.0.post20210108 pip: 21.2.1 conda: None pytest: None IPython: 7.25.0 sphinx: None |
{ "url": "https://api.github.com/repos/pydata/xarray/issues/5701/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
13221727 | issue |