issue_comments: 850542572

Comment by user 5700886 (CONTRIBUTOR) on pydata/xarray PR #5390, created 2021-05-28T16:45:55Z:
https://github.com/pydata/xarray/pull/5390#issuecomment-850542572

@AndrewWilliams3142 @dcherian Looks like I broke the first Gist. :(

Your example above does not quite get there, because `xr.DataArray(np...).chunk()` just leads to one chunk per data array.
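
(A minimal sketch, not from the original comment, to make that point concrete: with no arguments, `.chunk()` wraps the whole array in a single dask chunk, so explicit chunk sizes are needed to actually split it.)

```python
import numpy as np
import xarray as xr

da = xr.DataArray(np.zeros((1000, 100, 100)), dims=("t", "y", "x"))

# No arguments: the whole array becomes a single dask chunk.
print(da.chunk().chunks)            # ((1000,), (100,), (100,))

# Explicit sizes split the array into several chunks along "t".
print(da.chunk({"t": 100}).chunks)  # ((100, 100, ..., 100), (100,), (100,))
```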

Here's a Gist that explains the idea for the correlations: https://nbviewer.jupyter.org/gist/willirath/c5c5274f31c98e8452548e8571158803

With

```python
import dask.array as darr
import xarray as xr

# array_size and chunk_size are defined in the linked notebook
X = xr.DataArray(
    darr.random.normal(size=array_size, chunks=chunk_size),
    dims=("t", "y", "x"),
    name="X",
)

Y = xr.DataArray(
    darr.random.normal(size=array_size, chunks=chunk_size),
    dims=("t", "y", "x"),
    name="Y",
)
```

the "bad" / explicit way of calculating the correlation

```python
corr_exp = ((X - X.mean("t")) * (Y - Y.mean("t"))).mean("t")
```

leads to a graph like this (the rendered graphs are in the linked notebook):

Dask won't release any of the tasks defining X and Y until the marked subtraction tasks are done.

The "good" / aggregating way of calculting the correlation python corr_agg = (X * Y).mean("t") - X.mean("t") * Y.mean("t") has the following graph where the marked multiplication and mean_chunk tasks are acting on only pairs of chunks and individual chunks and then release the original chunks of X and Y. This graph can be evaluated with a much smaller memory foot print than the other one. (It's not certain that this is always leading to lower memory use, however. But this is a different issue ...)
