Issue #7918 (id 1755610168) · pydata/xarray: xarray lazy indexing/loading is not sufficiently documented

State: open · Comments: 1 · Author association: CONTRIBUTOR
Created: 2023-06-13T20:35:38Z · Updated: 2023-06-13T21:29:53Z

What is your issue?

The default behavior of opening datasets lazily instead of loading them into memory urgently needs more documentation, or at least more extensive linking of the existing docs.

I have seen tons of examples where the 'laziness' of the loading is not apparent to users.

The workflow commonly looks something like this:

  1. Open some 'larger-than-memory' dataset, e.g. from a cloud bucket with xr.open_dataset('gs://your/bucket/store.zarr', engine='zarr').
  2. Print the dataset repr and see no indication that the dataset is not in memory (I am not sure if there is somewhere to check this that I might have missed?).
  3. Layer some calculations on it; at some point a calculation is triggered and blows up the user's machine/kernel.
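For concreteness, a minimal sketch of that workflow (the bucket URL and the reduction are placeholders, not taken from a real report):

```python
import xarray as xr

# 1. Open a (hypothetical) larger-than-memory zarr store from a cloud bucket.
ds = xr.open_dataset("gs://your/bucket/store.zarr", engine="zarr")

# 2. The repr gives no obvious hint that the values are not in memory yet.
print(ds)

# 3. Without dask, computations like this load the full variables into memory
#    and can blow up the machine/kernel if the store is larger than RAM.
anom = ds - ds.mean("time")
```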

To start with, the docstring of open_dataset does not mention at all what happens when the default chunks=None is used! This would be easy to fix if more extensive text on lazy loading existed somewhere to link to. I was also not able to find any more descriptive docs on this feature, though I might have missed something.
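If such text existed, the docstring could point to it with a short contrast like this sketch (the store path and variable name are placeholders, and ._data is private API, used here only to peek at the wrapper types):

```python
import xarray as xr

# chunks=None (the default): variables are wrapped in xarray's internal lazy
# indexing classes; nothing is read from the store until values are needed.
lazy = xr.open_dataset("store.zarr", engine="zarr")

# chunks={} wraps each variable in a dask array instead, so further operations
# are also deferred, but via dask's task graph rather than lazy indexing.
chunked = xr.open_dataset("store.zarr", engine="zarr", chunks={})

print(type(lazy["temperature"].variable._data))     # lazy indexing wrapper
print(type(chunked["temperature"].variable._data))  # dask array
```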

Up until a chat I had with @TomNicholas today, I honestly did not understand why this feature even existed.

His explanation (below) was very good, however, and if something similar is not in the docs yet, it should probably be added.

When you open a dataset from disk (/zarr), often the first thing you want to do is concatenate it and subset it by indexing (the concatenation may even happen automatically in open_mfdataset). If you do not have dask (or choose not to use dask), opening a file would load all its data into a numpy array. You might then want to concatenate with other numpy arrays from opening other files, but then subset to only some time steps (for example). If everything is immediately loaded as numpy arrays this would be extremely wasteful - you would load all these values into memory even though you're about to drop them by slicing.

This is why xarray has internal lazy indexing classes - they do the indexing lazily without actually loading the data as numpy arrays. If instead you load with dask, then because dask does everything lazily, you get basically the same features. But the implementation of those features is completely different (dask's delayed array objects vs. xarray's internal lazy indexing classes).
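As an illustration of that last point, a small sketch of how the lazy indexing classes behave without dask (the file and dimension names are hypothetical):

```python
import xarray as xr

# Without dask, open_dataset wraps each variable in xarray's lazy indexing
# classes instead of reading it into a numpy array right away.
ds = xr.open_dataset("big_file.nc")

# Indexing is deferred: selecting one time step does not read the rest.
first = ds.isel(time=0)

# Only when values are actually needed is that slice read from disk.
first.load()  # or first.values, plotting, arithmetic, ...
```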

I think this is overall a giant pitfall, particularly for xarray beginners, and thus deserves some thought. While I am sure the choices made so far have some large functional upsides, I wonder about three things:

  1. How can we improve the docs to at least make this behavior more obvious?
  2. Is making loading lazy by default a required choice?
  3. Is there a way we could indicate 'laziness' in the dataset repr? I do not know how to check at this point whether a dataset is actually in memory. Or maybe there is a clever way to raise a warning if the size of the dataset is larger than the system memory? (A rough sketch follows this list.)
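For question 3, a rough sketch of what such a check could look like, assuming psutil is available and leaning on the private Variable._in_memory attribute (illustrative only, not a proposed API):

```python
import psutil
import xarray as xr

def report_laziness(ds: xr.Dataset) -> None:
    # Flag which variables are plain numpy arrays in memory vs. lazy/dask-backed.
    for name, var in ds.variables.items():
        state = "in memory" if var._in_memory else "lazy or dask-backed"
        print(f"{name}: {state}")
    # Warn if loading everything eagerly would exceed currently available RAM.
    if ds.nbytes > psutil.virtual_memory().available:
        print("warning: dataset is larger than available memory; "
              "loading it eagerly will likely crash this process")
```

Something along these lines could also feed an extra hint into the repr.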

Happy to work on this, since it is very relevant for many members of projects I work with, but I first wanted to check whether there are existing docs that I missed.

