id,node_id,number,title,user,state,locked,assignee,milestone,comments,created_at,updated_at,closed_at,author_association,active_lock_reason,draft,pull_request,body,reactions,performed_via_github_app,state_reason,repo,type
2136709010,I_kwDOAMm_X85_W5eS,8753,Lazy Loading with `DataArray` vs. `Variable`,2448579,closed,0,,,0,2024-02-15T14:42:24Z,2024-04-04T16:46:54Z,2024-04-04T16:46:54Z,MEMBER,,,,"### Discussed in https://github.com/pydata/xarray/discussions/8751
Originally posted by **ilan-gold** February 15, 2024
My goal is to get a dataset from [custom io-zarr backend lazy-loaded](https://docs.xarray.dev/en/stable/internals/how-to-add-new-backend.html#how-to-support-lazy-loading). But when I declare a `DataArray` based on the `Variable` which uses `LazilyIndexedArray`, everything is read in. Is this expected? I specifically don't want to have to use dask if possible. I have seen https://github.com/aurghs/xarray-backend-tutorial/blob/main/2.Backend_with_Lazy_Loading.ipynb but it's a little bit different.
While I have a custom backend array inheriting from `ZarrArrayWrapper`, this example using `ZarrArrayWrapper` directly still highlights the same unexpected behavior of everything being read in.
```python
import zarr
import xarray as xr
from tempfile import mkdtemp
import numpy as np
from pathlib import Path
from collections import defaultdict
class AccessTrackingStore(zarr.DirectoryStore):
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
self._access_count = {}
self._accessed = defaultdict(set)
def __getitem__(self, key):
for tracked in self._access_count:
if tracked in key:
self._access_count[tracked] += 1
self._accessed[tracked].add(key)
return super().__getitem__(key)
def get_access_count(self, key):
return self._access_count[key]
def set_key_trackers(self, keys_to_track):
if isinstance(keys_to_track, str):
keys_to_track = [keys_to_track]
for k in keys_to_track:
self._access_count[k] = 0
def get_subkeys_accessed(self, key):
return self._accessed[key]
orig_path = Path(mkdtemp())
z = zarr.group(orig_path / ""foo.zarr"")
z['array'] = np.random.randn(1000, 1000)
store = AccessTrackingStore(orig_path / ""foo.zarr"")
store.set_key_trackers(['array'])
z = zarr.group(store)
arr = xr.backends.zarr.ZarrArrayWrapper(z['array'])
lazy_arr = xr.core.indexing.LazilyIndexedArray(arr)
# just `.zarray`
var = xr.Variable(('x', 'y'), lazy_arr)
print('Variable read in ', store.get_subkeys_accessed('array'))
# now everything is read in
da = xr.DataArray(var)
print('DataArray read in ', store.get_subkeys_accessed('array'))
```
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/8753/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue
2136724736,PR_kwDOAMm_X85m_MtN,8754,Don't access data when creating DataArray from Variable.,2448579,closed,0,,,2,2024-02-15T14:48:32Z,2024-04-04T16:46:54Z,2024-04-04T16:46:53Z,MEMBER,,0,pydata/xarray/pulls/8754,"
- [x] Closes #8753
This seems to have been around since 2016-ish, so presumably our backend code path is passing arrays around, not Variables.
cc @ilan-gold","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/8754/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,pull