home / github / issues

Menu
  • GraphQL API
  • Search all tables

issues: 834972299

This data as json

id node_id number title user state locked assignee milestone comments created_at updated_at closed_at author_association active_lock_reason draft pull_request body reactions performed_via_github_app state_reason repo type
834972299 MDU6SXNzdWU4MzQ5NzIyOTk= 5054 Fancy indexing a Dataset with dask DataArray causes excessive memory usage 703554 closed 0     3 2021-03-18T15:45:08Z 2021-03-18T16:58:44Z 2021-03-18T16:04:36Z CONTRIBUTOR      

I have a dataset comprising several variables. All variables are dask arrays (e.g., backed by zarr). I would like to use one of these variables, which is a 1d boolean array, to index the other variables along a large single dimension. The boolean indexing array is about ~40 million items long, with ~20 million true values.

If I do this all via dask (i.e., not using xarray) then I can index one dask array with another dask array via fancy indexing. The indexing array is not loaded into memory or computed. If I need to know the shape and chunks of the resulting arrays I can call compute_chunk_sizes(), but still very little memory is required.

If I do this via xarray.Dataset.isel() then a substantial amount of memory (several GB) is allocated during isel() and retained. This is problematic as in a real-world use case there are many arrays to be indexed and memory runs out on standard systems.

There is a follow-on issue which is if I then want to run a computation over one of the indexed arrays, if the indexing was done via xarray then that leads to a further blow-up of multiple GB of memory usage, if using dask distributed cluster.

I think the underlying issue here is that the indexing array is loaded into memory, and then gets copied multiple times when the dask graph is constructed. If using a distributed scheduler, further copies get made during scheduling of any subsequent computation.

I made a notebook which illustrates the increased memory usage during Dataset.isel() here:

https://colab.research.google.com/drive/1bn7Sj0An7TehwltWizU8j_l2OvPeoJyo?usp=sharing

This is possibly the same underlying issue (and use case) as raised by @eric-czech in #4663, so feel free to close this if you think it's a duplicate.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/5054/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed 13221727 issue

Links from other tables

  • 0 rows from issues_id in issues_labels
  • 3 rows from issue in issue_comments
Powered by Datasette · Queries took 0.638ms · About: xarray-datasette