issues: 1575938277
id | node_id | number | title | user | state | locked | assignee | milestone | comments | created_at | updated_at | closed_at | author_association | active_lock_reason | draft | pull_request | body | reactions | performed_via_github_app | state_reason | repo | type |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1575938277 | I_kwDOAMm_X85d7ujl | 7516 | Dataset.where performances regression. | 1492047 | open | 0 | 10 | 2023-02-08T11:19:34Z | 2023-05-03T12:58:14Z | CONTRIBUTOR | What happened?
Hello, I'm using the Dataset.where function to select data based on some field values, and it takes way too much time! The dask dashboard seems to show some tasks repeating themselves many times. The provided example uses a 1D array, for which the selection could be done with Dataset.sel, but in our real use case we make selections on 2D variables. This problem seems to have appeared with the 2022.6.0 xarray release; 2022.3.0 works as expected.
What did you expect to happen?
Using the 2022.3.0 release, this selection takes 1.37 seconds. Using 2022.6.0 up to 2023.2.0 (released yesterday), it takes 8.47 seconds. This example is very simple and small; with real data and our real use case, we simply cannot use this function anymore.
Minimal Complete Verifiable Example
```Python
import dask.array as da
import distributed as dist
import xarray as xr

client = dist.Client()

# Using small chunks emphasizes the problem
ds = xr.Dataset(
    {"field": xr.DataArray(data=da.empty(shape=10000, chunks=10), dims=("x"))}
)

sel = ds["field"] > 0
ds.where(sel, drop=True)
```
MVCE confirmation
Relevant log output
No response
Anything else we need to know?
No response
Environment
Problematic version
INSTALLED VERSIONS
------------------
commit: None
python: 3.10.9 | packaged by conda-forge | (main, Feb 2 2023, 20:20:04) [GCC 11.3.0]
python-bits: 64
OS: Linux
OS-release: 5.15.0-58-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: fr_FR.UTF-8
LOCALE: ('fr_FR', 'UTF-8')
libhdf5: 1.12.2
libnetcdf: 4.8.1
xarray: 2023.2.0
pandas: 1.5.3
numpy: 1.23.5
scipy: 1.8.1
netCDF4: 1.6.2
pydap: None
h5netcdf: 1.1.0
h5py: 3.8.0
Nio: None
zarr: 2.13.6
cftime: 1.6.2
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.3.4
cfgrib: 0.9.10.3
iris: None
bottleneck: None
dask: 2023.1.1
distributed: 2023.1.1
matplotlib: 3.6.3
cartopy: 0.21.1
seaborn: None
numbagg: None
fsspec: 2023.1.0
cupy: None
pint: 0.20.1
sparse: None
flox: None
numpy_groupies: None
setuptools: 67.1.0
pip: 23.0
conda: 22.11.1
pytest: 7.2.1
mypy: None
IPython: 8.7.0
sphinx: 5.3.0
Working version
INSTALLED VERSIONS
------------------
commit: None
python: 3.10.9 | packaged by conda-forge | (main, Feb 2 2023, 20:20:04) [GCC 11.3.0]
python-bits: 64
OS: Linux
OS-release: 5.15.0-58-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: fr_FR.UTF-8
LOCALE: ('fr_FR', 'UTF-8')
libhdf5: 1.12.2
libnetcdf: 4.8.1
xarray: 2022.3.0
pandas: 1.5.3
numpy: 1.23.5
scipy: 1.8.1
netCDF4: 1.6.2
pydap: None
h5netcdf: 1.1.0
h5py: 3.8.0
Nio: None
zarr: 2.13.6
cftime: 1.6.2
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.3.4
cfgrib: 0.9.10.3
iris: None
bottleneck: None
dask: 2023.1.1
distributed: 2023.1.1
matplotlib: 3.6.3
cartopy: 0.21.1
seaborn: None
numbagg: None
fsspec: 2023.1.0
cupy: None
pint: 0.20.1
sparse: None
setuptools: 67.1.0
pip: 23.0
conda: 22.11.1
pytest: 7.2.1
IPython: 8.7.0
sphinx: 5.3.0
|
{ "url": "https://api.github.com/repos/pydata/xarray/issues/7516/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
reopened | 13221727 | issue |