# Means of zarr arrays cause a memory overload in dask workers (#6709)

**State:** closed (completed) · **Comments:** 17 · **Created:** 2022-06-20 · **Closed:** 2023-10-09

### What is your issue?

Hello everyone!

I am submitting this issue here, but it is not entirely clear whether my problem comes from xarray, dask, or zarr.

The goal is to compute means from the GCM anomalies of SSH. The following simple code creates an artificial dataset (a variable is about 90 GB) with the anomaly fields and computes the means of the cross products:

```python
import dask.array as da
import numpy as np
import xarray as xr

ds = xr.Dataset(
    dict(
        anom_u=(["time", "face", "j", "i"], da.ones((10311, 1, 987, 1920), chunks=(10, 1, 987, 1920))),
        anom_v=(["time", "face", "j", "i"], da.ones((10311, 1, 987, 1920), chunks=(10, 1, 987, 1920))),
    )
)

ds["anom_uu_mean"] = (["face", "j", "i"], np.mean(ds.anom_u.data**2, axis=0))
ds["anom_vv_mean"] = (["face", "j", "i"], np.mean(ds.anom_v.data**2, axis=0))
ds["anom_uv_mean"] = (["face", "j", "i"], np.mean(ds.anom_u.data * ds.anom_v.data, axis=0))

ds[["anom_uu_mean", "anom_vv_mean", "anom_uv_mean"]].compute()
```

I was expecting low memory usage, because once a single chunk of `anom_u` and `anom_v` has been used for one step of the mean, those two chunks can be forgotten. The following figure confirms that memory usage stays very low, so all is well.

![image](https://user-images.githubusercontent.com/74916839/174682620-b5c330d2-0fb2-43b7-b3fe-950ecdeec9a5.png)

The matter becomes more complicated when the dataset is opened from a zarr store. We simply dumped the previously generated artificial data to a temporary store and reloaded it:

```python
import dask.array as da
import numpy as np
import xarray as xr

ds = xr.Dataset(
    dict(
        anom_u=(["time", "face", "j", "i"], da.ones((10311, 1, 987, 1920), chunks=(10, 1, 987, 1920))),
        anom_v=(["time", "face", "j", "i"], da.ones((10311, 1, 987, 1920), chunks=(10, 1, 987, 1920))),
    )
)

store = "/work/scratch/test_zarr_graph"
ds.to_zarr(store, compute=False, mode="a")
ds = xr.open_zarr(store)

ds["anom_uu_mean"] = (["face", "j", "i"], np.mean(ds.anom_u.data**2, axis=0))
ds["anom_vv_mean"] = (["face", "j", "i"], np.mean(ds.anom_v.data**2, axis=0))
ds["anom_uv_mean"] = (["face", "j", "i"], np.mean(ds.anom_u.data * ds.anom_v.data, axis=0))

ds[["anom_uu_mean", "anom_vv_mean", "anom_uv_mean"]].compute()
```

![image](https://user-images.githubusercontent.com/74916839/174683111-dd6719b8-0d1b-467e-bdab-1c08bf8a1215.png)

I was expecting similar behavior between a dataset created from scratch and one created from a zarr store, but that does not seem to be the case. I tried passing `inline_array=True` to `xr.open_dataset` (roughly as in the sketch below), but to no avail. I also tried computing two of the variables instead of three, and that works properly, so the behavior seems strange to me. Do you see any reason why I am seeing such memory load on my workers?
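For completeness, this is roughly what the `inline_array` attempt looked like; the `engine` and `chunks` arguments shown here are my reconstruction rather than the exact call:

```python
import xarray as xr

# Sketch of the inline_array attempt (exact kwargs are approximate):
# open the same zarr store via open_dataset so dask inlines the array
# keys into the graph instead of referencing a separate input layer.
store = "/work/scratch/test_zarr_graph"
ds = xr.open_dataset(store, engine="zarr", chunks={}, inline_array=True)
```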
Here are the software versions I use:

- xarray: 2022.6.0rc0
- dask: 2022.04.1
- zarr: 2.11.1
- numpy: 1.21.6
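For reference, a quick way to report these versions (the printed layout is just for readability):

```python
import dask
import numpy as np
import zarr
import xarray as xr

print("xarray :", xr.__version__)
print("dask   :", dask.__version__)
print("zarr   :", zarr.__version__)
print("numpy  :", np.__version__)

# xr.show_versions() prints a fuller environment report if that helps.
```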