Issue #7918 (id 1755610168) · pydata/xarray: xarray lazy indexing/loading is not sufficiently documented

State: open · Comments: 1 · Author association: CONTRIBUTOR
Created: 2023-06-13T20:35:38Z · Updated: 2023-06-13T21:29:53Z

What is your issue?

The default behavior of opening datasets lazily instead of loading them into memory urgently needs more documentation, or at least more extensive linking of the existing docs.

I have seen tons of examples where the 'laziness' of the loading is not apparent to users.

The workflow commonly looks something like this:

  1. Open some 'larger-than-memory' dataset, e.g. from a cloud bucket with xr.open_dataset('gs://your/bucket/store.zarr', engine='zarr').
  2. Print the dataset repr and see no indication that the dataset is not in memory (I am not sure if there is somewhere to check this that I might have missed?).
  3. Layer some calculations on it; at some point a calculation is triggered and blows up the user's machine/kernel.
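For concreteness, a minimal sketch of that workflow (the bucket URL and the reduction are placeholders, not taken from a real report):

```python
import xarray as xr

# 1. Open a (hypothetical) larger-than-memory zarr store from a cloud bucket.
ds = xr.open_dataset("gs://your/bucket/store.zarr", engine="zarr")

# 2. The repr gives no obvious hint that the values are not in memory yet.
print(ds)

# 3. Without dask, computations like this load the full variables into memory
#    and can blow up the machine/kernel if the store is larger than RAM.
anom = ds - ds.mean("time")
```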

To start with, the docstring of open_dataset does not mention at all what happens when the default chunks=None is used! This would be easy to fix if more extensive text on lazy loading existed somewhere to link to. I was also not able to find any more descriptive docs on this feature, though I might have missed something.
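If such text existed, the docstring could point to it with a short contrast like this sketch (the store path and variable name are placeholders, and ._data is private API, used here only to peek at the wrapper types):

```python
import xarray as xr

# chunks=None (the default): variables are wrapped in xarray's internal lazy
# indexing classes; nothing is read from the store until values are needed.
lazy = xr.open_dataset("store.zarr", engine="zarr")

# chunks={} wraps each variable in a dask array instead, so further operations
# are also deferred, but via dask's task graph rather than lazy indexing.
chunked = xr.open_dataset("store.zarr", engine="zarr", chunks={})

print(type(lazy["temperature"].variable._data))     # lazy indexing wrapper
print(type(chunked["temperature"].variable._data))  # dask array
```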

Up until a chat I had with @TomNicholas today, I honestly did not understand why this feature even existed.

His explanation (below) was very good, however, and if something similar is not in the docs yet, it should probably be added.

When you open a dataset from disk (/zarr), often the first thing you want to do is concatenate it and subset it by indexing (the concatenation may even happen automatically in open_mfdataset). If you do not have dask (or choose not to use dask), opening a file would load all its data into a numpy array. You might then want to concatenate with other numpy arrays from opening other files, but then subset to only some time steps (for example). If everything is immediately loaded as numpy arrays this would be extremely wasteful - you would load all these values into memory even though you're about to drop them by slicing.

This is why xarray has internal lazy indexing classes - they do the indexing lazily without actually loading the data as numpy arrays. If instead you load with dask, then because dask does everything lazily, you get basically the same features. But the implementation of those features is completely different (dask's delayed array objects vs. xarray's internal lazy indexing classes).
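As an illustration of that last point, a small sketch of how the lazy indexing classes behave without dask (the file and dimension names are hypothetical):

```python
import xarray as xr

# Without dask, open_dataset wraps each variable in xarray's lazy indexing
# classes instead of reading it into a numpy array right away.
ds = xr.open_dataset("big_file.nc")

# Indexing is deferred: selecting one time step does not read the rest.
first = ds.isel(time=0)

# Only when values are actually needed is that slice read from disk.
first.load()  # or first.values, plotting, arithmetic, ...
```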

I think this is overall a giant pitfall, particularly for xarray beginners, and thus deserves some thought. While I am sure the choices made so far have some large functional upsides, I wonder about three things:

  1. How can we improve the docs to at least make this behavior more obvious?
  2. Is making loading lazy by default a required choice?
  3. Is there a way we could indicate 'laziness' in the dataset repr? I do not know how to check at this point whether a dataset is actually in memory. Or maybe there is a clever way to raise a warning if the size of the dataset is larger than the system memory? (A rough sketch follows this list.)
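For question 3, a rough sketch of what such a check could look like, assuming psutil is available and leaning on the private Variable._in_memory attribute (illustrative only, not a proposed API):

```python
import psutil
import xarray as xr

def report_laziness(ds: xr.Dataset) -> None:
    # Flag which variables are plain numpy arrays in memory vs. lazy/dask-backed.
    for name, var in ds.variables.items():
        state = "in memory" if var._in_memory else "lazy or dask-backed"
        print(f"{name}: {state}")
    # Warn if loading everything eagerly would exceed currently available RAM.
    if ds.nbytes > psutil.virtual_memory().available:
        print("warning: dataset is larger than available memory; "
              "loading it eagerly will likely crash this process")
```

Something along these lines could also feed an extra hint into the repr.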

Happy to work on this, since it is very relevant for many members of projects I work with, but I first wanted to check whether there are existing docs that I missed.

