html_url,issue_url,id,node_id,user,created_at,updated_at,author_association,body,reactions,performed_via_github_app,issue
https://github.com/pydata/xarray/issues/6709#issuecomment-1164686511,https://api.github.com/repos/pydata/xarray/issues/6709,1164686511,IC_kwDOAMm_X85Fa7Sv,74916839,2022-06-23T17:33:48Z,2022-06-23T17:33:48Z,NONE,Thanks @gjoseph92 and @dcherian. I'll try the different approaches in the links you provided to see if I can improve my current solution (I compute the fields separately which means more IO and more operations).,"{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1277437106
https://github.com/pydata/xarray/issues/6709#issuecomment-1164064469,https://api.github.com/repos/pydata/xarray/issues/6709,1164064469,IC_kwDOAMm_X85FYjbV,74916839,2022-06-23T07:40:33Z,2022-06-23T07:43:20Z,NONE,"Thanks for the tips, I was investigating inline_array=True and still no luck. The graph seems OK though. I can attach it if you want but I think zarr is not the culprit. 

Here is why:

In the first case, where we build the array from scratch, the ones array is simple, and Dask seems to understand that it does not have to make many copies of it. When replacing ones with random data, we observe the same behavior as when opening the dataset from a Zarr store (high memory usage on a worker):

```python
import dask.array as da
import numpy as np
import xarray as xr

ds = xr.Dataset(
    dict(
        anom_u=([""time"", ""face"", ""j"", ""i""], da.random.random((10311, 1, 987, 1920), chunks=(10, 1, 987, 1920))),
        anom_v=([""time"", ""face"", ""j"", ""i""], da.random.random((10311, 1, 987, 1920), chunks=(10, 1, 987, 1920))),
    )
)

ds[""anom_uu_mean""] = ([""face"", ""j"", ""i""], np.mean(ds.anom_u.data**2, axis=0))
ds[""anom_vv_mean""] = ([""face"", ""j"", ""i""], np.mean(ds.anom_v.data**2, axis=0))
ds[""anom_uv_mean""] = ([""face"", ""j"", ""i""], np.mean(ds.anom_u.data * ds.anom_v.data, axis=0))

ds[[""anom_uu_mean"", ""anom_vv_mean"", ""anom_uv_mean""]].compute()
```

I think the question now is why Dask has to load so much data for this operation:

![graph](https://user-images.githubusercontent.com/74916839/175235250-db086230-10e5-496c-94e8-9fb32f3c64dc.png)

Looking at the computation graph (I've posted the non-optimized version), my understanding is that we could do the following:
- Load the first chunk of anom_u
- Load the first chunk of anom_v
- Do the multiplications `anom_u * anom_v`, `anom_u ** 2` and `anom_v ** 2`
- Do the mean-chunk task
- Release the results of all the previous tasks
- Repeat for the remaining chunks and combine the mean-chunk results
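
The steps above can be sketched in plain NumPy. This is only an illustration of the memory pattern I would expect, not a description of how Dask actually schedules the graph; `streaming_second_moments` and `random_chunk_pairs` are hypothetical names, and the shapes are deliberately tiny:

```python
import numpy as np


def streaming_second_moments(chunk_pairs):
    # Running sums over the time axis (axis 0). Apart from temporaries,
    # only these three accumulators and the current (u, v) pair are alive.
    sum_uu = sum_vv = sum_uv = None
    n_time = 0
    for u, v in chunk_pairs:
        uu = (u * u).sum(axis=0)
        vv = (v * v).sum(axis=0)
        uv = (u * v).sum(axis=0)
        if sum_uu is None:
            sum_uu, sum_vv, sum_uv = uu, vv, uv
        else:
            sum_uu += uu
            sum_vv += vv
            sum_uv += uv
        n_time += u.shape[0]
    return sum_uu / n_time, sum_vv / n_time, sum_uv / n_time


def random_chunk_pairs(n_time, chunk, shape, seed=0):
    # Hypothetical stand-in for loading chunks: yields one (u, v) pair at a time.
    rng = np.random.default_rng(seed)
    for start in range(0, n_time, chunk):
        size = min(chunk, n_time - start)
        yield rng.random((size, *shape)), rng.random((size, *shape))


uu_mean, vv_mean, uv_mean = streaming_second_moments(
    random_chunk_pairs(n_time=25, chunk=10, shape=(1, 4, 6))
)
print(uu_mean.shape)
```

With this pattern the peak memory is roughly two input chunks plus the products and the accumulators, independent of the number of time steps.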

For information, one chunk is about 1.4G, so I expect to see peaks of 5*1.4 = 7G in memory (plus what's needed to store the mean_chunk results), but I instead got 15G+ on my worker, most of it taken by the random-sample chunks.
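
The per-chunk figure can be checked from the chunk shape. A small sketch, assuming float64 data (`chunk_gib` is a hypothetical helper): with 10 time steps per chunk, as in the snippet above, a chunk comes to about 0.14 GiB, while 1.4G would correspond to 100 time steps per chunk, so which chunking the 1.4G figure refers to is an assumption here.

```python
from math import prod


def chunk_gib(time_steps, face=1, j=987, i=1920, itemsize=8):
    # itemsize=8 assumes float64, the dtype produced by da.random.random
    return prod((time_steps, face, j, i)) * itemsize / 2**30


print(f'{chunk_gib(10):.2f} GiB')   # chunks=(10, 1, 987, 1920)
print(f'{chunk_gib(100):.2f} GiB')  # hypothetical 100 time steps per chunk
```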

![image](https://user-images.githubusercontent.com/74916839/175243482-7bc86068-1beb-4be1-81b0-3100c1e07125.png)

Is my understanding of the distributed mean wrong? Why are the random-sample chunks not flushed from memory?

","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1277437106
https://github.com/pydata/xarray/issues/6709#issuecomment-1162833362,https://api.github.com/repos/pydata/xarray/issues/6709,1162833362,IC_kwDOAMm_X85FT23S,74916839,2022-06-22T08:54:18Z,2022-06-22T09:10:04Z,NONE,"Hi @TomNicholas

I've reduced the original dataset to 11 chunks over the time dimension so that we can see the graph properly. I also replaced the .compute operation with to_zarr(compute=False) because I don't know how to visualize xarray operations without generating a Delayed object (comments are welcome on this point!).

Anyway, here are the graphs. The first one is the graph where the means are built from dask ones arrays:

![graph_no_zarr_source](https://user-images.githubusercontent.com/74916839/174985986-41313ac3-e8dc-43b8-96be-3a3ea1115237.png)

The second one is the graph where the means are built from the same arrays, but opened from a zarr store:

![graph_zarr_source](https://user-images.githubusercontent.com/74916839/174985918-45219be5-d748-4071-b765-88f479835d5c.png)

I am quite new to debugging Dask graphs, but everything seems OK in the second graph, apart from the open_dataset tasks that are linked to a parent task. Also, I noticed that Dask has fused the 'ones' operations in the first graph. Would it help if I generated other arrays with zeros instead?","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1277437106