issues: 1277437106
id: 1277437106
node_id: I_kwDOAMm_X85MJCSy
number: 6709
title: Means of zarr arrays cause a memory overload in dask workers
user: 74916839
state: closed
locked: 0
comments: 17
created_at: 2022-06-20T22:24:39Z
updated_at: 2023-10-09T15:39:35Z
closed_at: 2023-10-09T15:39:35Z
author_association: NONE
body:

What is your issue?

Hello everyone! I am submitting this issue here, although it is not entirely clear whether my problem comes from xarray, dask, or zarr.

The goal is to compute means from the GCM anomalies of SSH. The following simple code creates an artificial dataset (one variable is about 90 GB) with the anomaly fields and computes the means of the cross products:

```python
import dask.array as da
import numpy as np
import xarray as xr

ds = xr.Dataset(
    dict(
        anom_u=(["time", "face", "j", "i"], da.ones((10311, 1, 987, 1920), chunks=(10, 1, 987, 1920))),
        anom_v=(["time", "face", "j", "i"], da.ones((10311, 1, 987, 1920), chunks=(10, 1, 987, 1920))),
    )
)

ds["anom_uu_mean"] = (["face", "j", "i"], np.mean(ds.anom_u.data ** 2, axis=0))
ds["anom_vv_mean"] = (["face", "j", "i"], np.mean(ds.anom_v.data ** 2, axis=0))
ds["anom_uv_mean"] = (["face", "j", "i"], np.mean(ds.anom_u.data * ds.anom_v.data, axis=0))

ds[["anom_uu_mean", "anom_vv_mean", "anom_uv_mean"]].compute()
```

I was expecting low memory usage, because once a single chunk of anom_u and anom_v has been consumed by one step of the mean, those two chunks can be forgotten. A plot of worker memory for this run confirms that usage stays very low, so all is well.

The matter becomes more complicated when the dataset is opened from a zarr store. We simply dumped the previously generated data to a temporary store and reloaded it:

```python
import dask.array as da
import numpy as np
import xarray as xr

ds = xr.Dataset(
    dict(
        anom_u=(["time", "face", "j", "i"], da.ones((10311, 1, 987, 1920), chunks=(10, 1, 987, 1920))),
        anom_v=(["time", "face", "j", "i"], da.ones((10311, 1, 987, 1920), chunks=(10, 1, 987, 1920))),
    )
)

store = "/work/scratch/test_zarr_graph"
ds.to_zarr(store, compute=False, mode="a")
ds = xr.open_zarr(store)

ds["anom_uu_mean"] = (["face", "j", "i"], np.mean(ds.anom_u.data ** 2, axis=0))
ds["anom_vv_mean"] = (["face", "j", "i"], np.mean(ds.anom_v.data ** 2, axis=0))
ds["anom_uv_mean"] = (["face", "j", "i"], np.mean(ds.anom_u.data * ds.anom_v.data, axis=0))

ds[["anom_uu_mean", "anom_vv_mean", "anom_uv_mean"]].compute()
```

I was expecting similar behavior between a dataset created from scratch and one created from a zarr store, but that does not seem to be the case. I tried using inline_array=True with xr.open_dataset, to no avail. I also tried computing 2 variables instead of 3, and that works properly, so the behavior seems strange to me. Do you see any reason why I am seeing such a memory load on my workers?

Software versions:

- xarray: 2022.6.0rc0
- dask: 2022.04.1
- zarr: 2.11.1
- numpy: 1.21.6
{ "url": "https://api.github.com/repos/pydata/xarray/issues/6709/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
state_reason: completed
repo: 13221727
type: issue
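A minimal diagnostic sketch for the comparison described in the body above: counting the tasks in the dask graph of the in-memory pipeline versus the zarr-backed pipeline makes the structural difference visible without running the full 90 GB computation. The helper name graph_size, the small array shape, and the store path example_store.zarr are illustrative assumptions, not part of the original report.

```python
import dask.array as da
import numpy as np
import xarray as xr


def graph_size(dataset, names):
    """Count the tasks in the combined dask graph of the selected variables."""
    # Dataset implements the dask collection protocol, so __dask_graph__()
    # returns the graph backing the selected variables.
    return len(dataset[names].__dask_graph__())


# Small stand-in for the dataset in the report (shape shrunk for illustration).
ds_mem = xr.Dataset(
    dict(
        anom_u=(["time", "face", "j", "i"], da.ones((100, 1, 9, 19), chunks=(10, 1, 9, 19))),
        anom_v=(["time", "face", "j", "i"], da.ones((100, 1, 9, 19), chunks=(10, 1, 9, 19))),
    )
)
ds_mem["anom_uv_mean"] = (["face", "j", "i"], np.mean(ds_mem.anom_u.data * ds_mem.anom_v.data, axis=0))

# Round-trip the inputs through a zarr store and rebuild the same computation.
store = "example_store.zarr"  # hypothetical path
ds_mem[["anom_u", "anom_v"]].to_zarr(store, mode="w")
ds_zarr = xr.open_zarr(store)
ds_zarr["anom_uv_mean"] = (["face", "j", "i"], np.mean(ds_zarr.anom_u.data * ds_zarr.anom_v.data, axis=0))

print("in-memory graph tasks: ", graph_size(ds_mem, ["anom_uv_mean"]))
print("zarr-backed graph tasks:", graph_size(ds_zarr, ["anom_uv_mean"]))
```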
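The report mentions trying inline_array=True with xr.open_dataset. A sketch of how that variant is usually spelled, assuming an xarray version that accepts the keyword (the 2022.6.0rc0 used in the report does) and reusing the store path from the report:

```python
import numpy as np
import xarray as xr

store = "/work/scratch/test_zarr_graph"  # path taken from the report

# inline_array=True asks dask to embed the on-disk array object inside each read
# task instead of adding it as a separate key in the graph; chunks={} keeps the
# on-disk chunking.
ds = xr.open_dataset(store, engine="zarr", chunks={}, inline_array=True)

ds["anom_uv_mean"] = (["face", "j", "i"], np.mean(ds.anom_u.data * ds.anom_v.data, axis=0))
ds[["anom_uv_mean"]].compute()
```

Whether this changes worker memory depends on how the scheduler orders the zarr read tasks, so it is a knob worth testing rather than a guaranteed fix.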