
issues: 1987770706


id: 1987770706
node_id: I_kwDOAMm_X852evlS
number: 8440
title: nD integer indexing on dask data is very slow
user: 13662783
state: closed
locked: 0
comments: 2
created_at: 2023-11-10T14:47:08Z
updated_at: 2023-11-12T04:56:23Z
closed_at: 2023-11-12T04:56:22Z
author_association: CONTRIBUTOR

What happened?

I ran into a situation where I was indexing with a 2D integer array into some chunked netCDF data. This indexing operation is extremely slow. Using a flat 1D index instead is as fast as expected.

What did you expect to happen?

I would expect indexing on dask data to be very quick since the work is delayed, and indeed it is so in the 1D case. However, the 2D case is very slow -- slower than actually doing all the work with numpy arrays!

Minimal Complete Verifiable Example

```Python
import dask.array
import numpy as np
import xarray as xr

# %%

da = xr.DataArray(
    data=np.random.rand(100, 1_000_000),
    dims=("time", "x"),
)
dask_da = xr.DataArray(
    data=dask.array.from_array(da.to_numpy(), chunks=(1, 1_000_000)),
    dims=("time", "x"),
)

indexer = np.random.randint(0, 1_000_000, size=100_000)
indexer2d = xr.DataArray(
    data=indexer.reshape((4, -1)),
    dims=("a", "b"),
)

# %%

%timeit da.isel(x=indexer)  # 162 ms
%timeit da.isel(x=indexer2d)  # 164 ms
%timeit dask_da.isel(x=indexer)  # 5.3 ms
%timeit dask_da.isel(x=indexer2d)  # 860 ms according to timeit, but 6 to 14 (!) seconds in interactive use
```
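Since the flat 1D indexer is fast, one possible workaround is to index with the flattened indexer and reshape the result afterwards. The sketch below illustrates that idea on a small array. `isel_flat_then_reshape` is a hypothetical helper, not part of xarray. It assumes a single integer indexer along one dimension, does not carry over coordinates, and has not been benchmarked against the timings above.

```python
import dask.array
import numpy as np
import xarray as xr


def isel_flat_then_reshape(da, dim, indexer):
    """Hypothetical helper: index `da` along `dim` with an n-D integer
    indexer by flattening it, taking the 1D path, and reshaping.

    Assumption: coordinates are not needed; only dims and data are kept.
    """
    flat = np.asarray(indexer).ravel()
    picked = da.isel({dim: flat})  # 1D integer indexing along `dim`
    axis = picked.get_axis_num(dim)
    # Replace the indexed axis with the indexer's own shape and dims.
    new_shape = picked.shape[:axis] + indexer.shape + picked.shape[axis + 1:]
    new_dims = picked.dims[:axis] + indexer.dims + picked.dims[axis + 1:]
    return xr.DataArray(picked.data.reshape(new_shape), dims=new_dims)


# Small stand-in for the arrays in the report.
rng = np.random.default_rng(0)
base = rng.random((3, 50))
small = xr.DataArray(
    data=dask.array.from_array(base, chunks=(1, 50)),
    dims=("time", "x"),
)
idx2d = xr.DataArray(
    data=rng.integers(0, 50, size=12).reshape((3, 4)),
    dims=("a", "b"),
)

result = isel_flat_then_reshape(small, "x", idx2d)  # still lazy (dask-backed)
```

The reshape relies on C-order flattening, so element `[t, i, j]` of the result corresponds to `idx2d[i, j]` along `x`. Whether this avoids the slowdown in a given setup would need to be checked with the same `%timeit` calls as above.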

MVCE confirmation

  • [x] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • [x] Complete example — the example is self-contained, including all data and the text of any traceback.
  • [x] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • [x] New issue — a search of GitHub Issues suggests this is not a duplicate.
  • [x] Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Relevant log output

No response

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS
------------------
commit: None
python: 3.10.12 | packaged by conda-forge | (main, Jun 23 2023, 22:34:57) [MSC v.1936 64 bit (AMD64)]
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 151 Stepping 2, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: ('English_Netherlands', '1252')
libhdf5: 1.14.2
libnetcdf: 4.9.2

xarray: 2023.10.2.dev31+ge5d163a8.d20231110
pandas: 2.1.2
numpy: 1.26.0
scipy: 1.11.3
netCDF4: 1.6.5
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: 1.6.3
nc_time_axis: None
PseudoNetCDF: None
iris: None
bottleneck: None
dask: 2023.10.1
distributed: 2023.10.1
matplotlib: 3.8.1
cartopy: None
seaborn: None
numbagg: None
fsspec: 2023.10.0
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 68.1.2
pip: 23.2.1
conda: 23.3.1
pytest: None
mypy: None
IPython: 8.17.2
sphinx: None
state_reason: completed
repo: 13221727
type: issue
