html_url,issue_url,id,node_id,user,created_at,updated_at,author_association,body,reactions,performed_via_github_app,issue
https://github.com/pydata/xarray/pull/5390#issuecomment-850820173,https://api.github.com/repos/pydata/xarray/issues/5390,850820173,MDEyOklzc3VlQ29tbWVudDg1MDgyMDE3Mw==,5700886,2021-05-29T11:51:50Z,2021-05-29T11:51:59Z,CONTRIBUTOR,"I think the problem with

> `cov = _mean(da_a * da_b) - da_a.mean(dim=dim) * da_b.mean(dim=dim)`

is that the `da_a.mean()` and the `da_b.mean()` calls don't know about each other's missing data.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,904153867
https://github.com/pydata/xarray/pull/5390#issuecomment-850819741,https://api.github.com/repos/pydata/xarray/issues/5390,850819741,MDEyOklzc3VlQ29tbWVudDg1MDgxOTc0MQ==,5700886,2021-05-29T11:48:02Z,2021-05-29T11:48:02Z,CONTRIBUTOR,"Shouldn't the following do?

```python
cov = (
    (da_a * da_b).mean(dim)
    - (
        da_a.where(da_b.notnull()).mean(dim)
        * da_b.where(da_a.notnull()).mean(dim)
    )
)
```

(See here: https://nbviewer.jupyter.org/gist/willirath/cfaa8fb1b53fcb8dcb05ddde839c794c )","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,904153867
https://github.com/pydata/xarray/pull/5390#issuecomment-850542572,https://api.github.com/repos/pydata/xarray/issues/5390,850542572,MDEyOklzc3VlQ29tbWVudDg1MDU0MjU3Mg==,5700886,2021-05-28T16:45:55Z,2021-05-28T16:45:55Z,CONTRIBUTOR,"@AndrewWilliams3142 @dcherian Looks like I broke the first Gist. :(

Your example above does not quite get there, because the `xr.DataArray(np...).chunk()` call just leads to one chunk per data array.

Here's a Gist that explains the idea for the correlations: https://nbviewer.jupyter.org/gist/willirath/c5c5274f31c98e8452548e8571158803

With

```python
X = xr.DataArray(
    darr.random.normal(size=array_size, chunks=chunk_size),
    dims=(""t"", ""y"", ""x""),
    name=""X"",
)
Y = xr.DataArray(
    darr.random.normal(size=array_size, chunks=chunk_size),
    dims=(""t"", ""y"", ""x""),
    name=""Y"",
)
```

the ""bad"" / explicit way of calculating the correlation

```python
corr_exp = ((X - X.mean(""t"")) * (Y - Y.mean(""t""))).mean(""t"")
```

leads to a graph like this:

![image](https://user-images.githubusercontent.com/5700886/120015561-bd56dd00-bfe3-11eb-8ced-63c0b3ce7508.png)

Dask won't release any of the tasks defining `X` and `Y` until the marked `sub`traction tasks are done.

The ""good"" / aggregating way of calculating the correlation

```python
corr_agg = (X * Y).mean(""t"") - X.mean(""t"") * Y.mean(""t"")
```

has the following graph

![image](https://user-images.githubusercontent.com/5700886/120016247-a4026080-bfe4-11eb-8d42-be5346496af6.png)

where the marked `mul`tiplication and `mean_chunk` tasks act only on pairs of chunks or on individual chunks and then release the original chunks of `X` and `Y`. This graph _can_ be evaluated with a much smaller memory footprint than the other one.

(It's not certain that this always leads to lower memory use, however, but that is a different issue ...)","{""total_count"": 1, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 1}",,904153867
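To make the point of the first two comments concrete, here is a minimal sketch that is not taken from the PR; the array values and the names `da_a`/`da_b` are illustrative assumptions. It shows that the naive aggregating formula and the `.where(... .notnull())`-masked variant proposed above disagree once the two arrays have NaNs in different places, while the masked variant agrees with xarray's pairwise-complete `xr.cov(..., ddof=0)`:

```python
# Minimal sketch, not part of the PR: arrays and values below are made up.
import numpy as np
import xarray as xr

da_a = xr.DataArray([1.0, 2.0, np.nan, 4.0], dims="t")
da_b = xr.DataArray([2.0, np.nan, 6.0, 8.0], dims="t")

# Naive aggregating form: each mean() skips only its *own* NaNs, so the two
# means are taken over different samples than the product term.
cov_naive = (da_a * da_b).mean("t") - da_a.mean("t") * da_b.mean("t")

# Masked form from the comment above: each array is restricted to the points
# where the *other* array is also valid before averaging.
cov_masked = (da_a * da_b).mean("t") - (
    da_a.where(da_b.notnull()).mean("t") * da_b.where(da_a.notnull()).mean("t")
)

print(float(cov_naive))                            # ~4.556, biased by the mismatched means
print(float(cov_masked))                           # 4.5
print(float(xr.cov(da_a, da_b, dim="t", ddof=0)))  # 4.5, pairwise-complete reference
```

Under these assumptions the masked form reproduces the population (ddof=0) covariance over the points where both arrays are valid, which is what the plain `E[ab] - E[a]E[b]` rewrite misses when NaNs differ between the inputs.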