
issue 6413: Lazy label-based .isel() using 1-D Boolean array, followed by .load() is very slow

  • id: 1181573623
  • node_id: I_kwDOAMm_X85GbWH3
  • user: 28287009
  • state: closed (state_reason: not_planned)
  • comments: 1
  • created_at: 2022-03-26T07:31:00Z
  • updated_at: 2023-11-06T06:10:01Z
  • closed_at: 2023-11-06T06:10:00Z
  • author_association: NONE
  • repo: 13221727
  • type: issue

What is your issue?

Info about my dataset

I have a large (~20 GB, ~27,000 x ~300,000, int16) netCDF-4 file written to disk incrementally along the first (unlimited) dimension without using dask (using code adapted from this comment). The DataArray stored in this file also has ~50 coordinates along the first dimension and ~300 coordinates along the second dimension.
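
For context, the incremental-write pattern described above looks roughly like the sketch below. This is a minimal illustration using the netCDF4 package (not the code from the linked comment); the file name, variable name, and sizes are all made up.

```python
import netCDF4
import numpy as np

# Hypothetical file, variable, and sizes, for illustration only.
with netCDF4.Dataset("example.nc", mode="w", format="NETCDF4") as nc:
    nc.createDimension("first_dim", None)   # unlimited, so the file can grow
    nc.createDimension("second_dim", 300)
    var = nc.createVariable("data", "i2", ("first_dim", "second_dim"))
    for i in range(10):
        # Append one block of rows at a time instead of holding everything in memory.
        block = np.random.randint(0, 100, size=(100, 300), dtype=np.int16)
        var[i * 100:(i + 1) * 100, :] = block
```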

Trying to load a subset of the data into memory

I have a 1-D Boolean mask my_mask (with ~15,000 True values) along the second dimension that I'd like to use to index the array. When I do the following, the operation is very slow (I haven't seen it complete):

```python
import xarray as xr

x = xr.open_dataarray(path_to_file)
x = x.isel({"second_dim": my_mask})
x = x.load()
```

However, I can load the entire array and then index (this is slow-ish, but works):

```python
import xarray as xr

x = xr.load_dataarray(path_to_file)
x = x.isel({"second_dim": my_mask})
```

Is this vectorized indexing?

I'm not sure if this is expected behavior: according to the Tip here in the User Guide, indexing is slow when using vectorized indexing, which I assumed to mean indexing along multiple dimensions (outer indexing, in numpy parlance). Is indexing using a 1D Boolean mask (or equivalently a 1D integer array) also slow?
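
For reference, here is a small toy example of the two indexing modes this question distinguishes; the array, dimension names, and index values are invented and have nothing to do with the original dataset.

```python
import numpy as np
import xarray as xr

da = xr.DataArray(
    np.arange(12).reshape(3, 4),
    dims=("first_dim", "second_dim"),
)

# Indexing one dimension with a 1-D Boolean mask (equivalent to a 1-D integer
# array): orthogonal/outer indexing in numpy parlance.
mask = np.array([True, False, True, False])
outer = da.isel({"second_dim": mask})

# Vectorized (pointwise) indexing: index arrays sharing a new common dimension
# pick out individual points rather than whole rows/columns.
rows = xr.DataArray([0, 2], dims="points")
cols = xr.DataArray([1, 3], dims="points")
pointwise = da.isel(first_dim=rows, second_dim=cols)
```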

What to do for larger datasets that don't fit in RAM?

Right now, I can load and then isel because my array fits in RAM. I have other datasets that don't fit in RAM: how would you recommend I load a subset of such data from disk?

In the event that I have to use dask, I will be writing along the first dimension (and hence probably chunking along that dimension) and reading along the second dimension: is that going to be efficient (or at least more efficient than whatever xarray is doing sans dask)?
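
If dask does end up being necessary, one possible shape for that workflow is sketched below. This is an assumption rather than a confirmed recommendation; the chunk size and the name of the first dimension are placeholders.

```python
import xarray as xr

# Open lazily with dask chunks along the first (append) dimension, then select
# along the second dimension and compute only what is actually needed.
x = xr.open_dataarray(path_to_file, chunks={"first_dim": 1_000})
subset = x.isel({"second_dim": my_mask}).compute()
```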

