issue_comments: 1164064469
html_url | issue_url | id | node_id | user | created_at | updated_at | author_association | reactions | performed_via_github_app | issue
---|---|---|---|---|---|---|---|---|---|---
https://github.com/pydata/xarray/issues/6709#issuecomment-1164064469 | https://api.github.com/repos/pydata/xarray/issues/6709 | 1164064469 | IC_kwDOAMm_X85FYjbV | 74916839 | 2022-06-23T07:40:33Z | 2022-06-23T07:43:20Z | NONE | { "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |  | 1277437106

body:

Thanks for the tips, I was investigating `inline_array=True` and still had no luck. The graph seems OK though; I can attach it if you want, but I think zarr is not the culprit. Here is why: in the first case, where we build the array from scratch, the `ones` array is simple, and Dask seems to understand that it does not have to make many copies of it. When we replace `ones` with random data, we observe the same behavior as when opening the dataset from a Zarr store (high memory usage on a worker):

```python
import dask.array as da
import numpy as np
import xarray as xr

ds = xr.Dataset(
    dict(
        anom_u=(
            ["time", "face", "j", "i"],
            da.random.random((10311, 1, 987, 1920), chunks=(10, 1, 987, 1920)),
        ),
        anom_v=(
            ["time", "face", "j", "i"],
            da.random.random((10311, 1, 987, 1920), chunks=(10, 1, 987, 1920)),
        ),
    )
)

# Means over the time axis of u**2, v**2 and u*v
ds["anom_uu_mean"] = (["face", "j", "i"], np.mean(ds.anom_u.data ** 2, axis=0))
ds["anom_vv_mean"] = (["face", "j", "i"], np.mean(ds.anom_v.data ** 2, axis=0))
ds["anom_uv_mean"] = (["face", "j", "i"], np.mean(ds.anom_u.data * ds.anom_v.data, axis=0))

ds[["anom_uu_mean", "anom_vv_mean", "anom_uv_mean"]].compute()
```

I think the question now is why Dask must load so much data when doing this operation. If we take the computation graph (I've put the non-optimized version), my understanding is that it could do the following:

- Load the first chunk of `anom_u`
- Load the corresponding chunk of `anom_v`
- Do the multiplications `anom_u * anom_v`, `anom_u ** 2` and `anom_v ** 2`
- Do the mean-chunk task
- Unload all the previous tasks
- Redo the same for the next chunks and combine the mean-chunk tasks

For information, one chunk is about 1.4 GB, so I expect to see peaks of 5 * 1.4 GB = 7 GB in memory (plus what is needed to store the mean-chunk results), but I instead got 15 GB+ on my worker, most of it taken by the random-sample tasks.

Is my understanding of the distributed mean wrong? Why are the random-sample chunks not flushed from memory?
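Below is a rough, hand-rolled sketch of the chunk-by-chunk schedule described in the list above, in plain NumPy. The `load_u_chunk` / `load_v_chunk` helpers are hypothetical stand-ins for reading one in-memory chunk at a time (here they just draw random data with the same chunk shape as the example); the point is only to illustrate the assumed memory profile, a handful of chunk-sized arrays alive at any moment, not how Dask actually schedules the graph:

```python
import numpy as np

CHUNK_SHAPE = (10, 1, 987, 1920)   # one (time, face, j, i) chunk from the example
N_CHUNKS = 1031                    # roughly 10311 time steps / 10 per chunk

def load_u_chunk(k):
    """Hypothetical loader: return one in-memory chunk of anom_u."""
    return np.random.random(CHUNK_SHAPE)

def load_v_chunk(k):
    """Hypothetical loader: return the matching chunk of anom_v."""
    return np.random.random(CHUNK_SHAPE)

# Accumulators have no time dimension, so they stay small: shape ("face", "j", "i")
sum_uu = np.zeros(CHUNK_SHAPE[1:])
sum_vv = np.zeros(CHUNK_SHAPE[1:])
sum_uv = np.zeros(CHUNK_SHAPE[1:])
n_steps = 0

for k in range(N_CHUNKS):
    u = load_u_chunk(k)                 # load one chunk of anom_u
    v = load_v_chunk(k)                 # load the corresponding chunk of anom_v
    sum_uu += (u ** 2).sum(axis=0)      # per-chunk contribution to mean(u**2)
    sum_vv += (v ** 2).sum(axis=0)      # per-chunk contribution to mean(v**2)
    sum_uv += (u * v).sum(axis=0)       # per-chunk contribution to mean(u*v)
    n_steps += u.shape[0]
    del u, v                            # nothing from this block is needed any more

anom_uu_mean = sum_uu / n_steps
anom_vv_mean = sum_vv / n_steps
anom_uv_mean = sum_uv / n_steps
```

With this ordering, only the two input chunks and one chunk-sized temporary are live at any time, plus three face-sized accumulators, which is why a peak of a few chunk sizes (rather than 15 GB+) seems achievable.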