html_url,issue_url,id,node_id,user,created_at,updated_at,author_association,body,reactions,performed_via_github_app,issue
https://github.com/pydata/xarray/pull/5390#issuecomment-850820173,https://api.github.com/repos/pydata/xarray/issues/5390,850820173,MDEyOklzc3VlQ29tbWVudDg1MDgyMDE3Mw==,5700886,2021-05-29T11:51:50Z,2021-05-29T11:51:59Z,CONTRIBUTOR,"I think the problem with

> `cov = _mean(da_a * da_b) - da_a.mean(dim=dim) * da_b.mean(dim=dim)`

is that the `da_a.mean()` and the `da_b.mean()` calls don't know about each other's missing data.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,904153867
https://github.com/pydata/xarray/pull/5390#issuecomment-850819741,https://api.github.com/repos/pydata/xarray/issues/5390,850819741,MDEyOklzc3VlQ29tbWVudDg1MDgxOTc0MQ==,5700886,2021-05-29T11:48:02Z,2021-05-29T11:48:02Z,CONTRIBUTOR,"Shouldn't the following do?

```python
cov = (
    (da_a * da_b).mean(dim)
    - (
        da_a.where(da_b.notnull()).mean(dim)
        * da_b.where(da_a.notnull()).mean(dim)
    )
)
```

(See here: https://nbviewer.jupyter.org/gist/willirath/cfaa8fb1b53fcb8dcb05ddde839c794c )","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,904153867
https://github.com/pydata/xarray/pull/5390#issuecomment-850542572,https://api.github.com/repos/pydata/xarray/issues/5390,850542572,MDEyOklzc3VlQ29tbWVudDg1MDU0MjU3Mg==,5700886,2021-05-28T16:45:55Z,2021-05-28T16:45:55Z,CONTRIBUTOR,"@AndrewWilliams3142 @dcherian Looks like I broke the first Gist. :(

Your example above does not quite get there, because the `xr.DataArray(np...).chunk()` call just leads to one chunk per data array.

Here's a Gist that explains the idea for the correlations: https://nbviewer.jupyter.org/gist/willirath/c5c5274f31c98e8452548e8571158803

With

```python
X = xr.DataArray(
    darr.random.normal(size=array_size, chunks=chunk_size),
    dims=(""t"", ""y"", ""x""),
    name=""X"",
)
Y = xr.DataArray(
    darr.random.normal(size=array_size, chunks=chunk_size),
    dims=(""t"", ""y"", ""x""),
    name=""Y"",
)
```

the ""bad"" / explicit way of calculating the correlation

```python
corr_exp = ((X - X.mean(""t"")) * (Y - Y.mean(""t""))).mean(""t"")
```

leads to a graph like this:

![image](https://user-images.githubusercontent.com/5700886/120015561-bd56dd00-bfe3-11eb-8ced-63c0b3ce7508.png)

Dask won't release any of the tasks defining `X` and `Y` until the marked `sub`traction tasks are done.

The ""good"" / aggregating way of calculating the correlation

```python
corr_agg = (X * Y).mean(""t"") - X.mean(""t"") * Y.mean(""t"")
```

has the following graph

![image](https://user-images.githubusercontent.com/5700886/120016247-a4026080-bfe4-11eb-8d42-be5346496af6.png)

where the marked `mul`tiplication and `mean_chunk` tasks act only on pairs of chunks or on individual chunks and then release the original chunks of `X` and `Y`. This graph _can_ be evaluated with a much smaller memory footprint than the other one.

(It's not certain that this always leads to lower memory use, however, but that is a different issue ...)","{""total_count"": 1, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 1}",,904153867
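To make the point of the first two comments concrete, here is a minimal sketch that is not taken from the PR; the array values and the names `da_a`/`da_b` are illustrative assumptions. It shows that the naive aggregating formula and the `.where(... .notnull())`-masked variant proposed above disagree once the two arrays have NaNs in different places, while the masked variant agrees with xarray's pairwise-complete `xr.cov(..., ddof=0)`:

```python
# Minimal sketch, not part of the PR: arrays and values below are made up.
import numpy as np
import xarray as xr

da_a = xr.DataArray([1.0, 2.0, np.nan, 4.0], dims="t")
da_b = xr.DataArray([2.0, np.nan, 6.0, 8.0], dims="t")

# Naive aggregating form: each mean() skips only its *own* NaNs, so the two
# means are taken over different samples than the product term.
cov_naive = (da_a * da_b).mean("t") - da_a.mean("t") * da_b.mean("t")

# Masked form from the comment above: each array is restricted to the points
# where the *other* array is also valid before averaging.
cov_masked = (da_a * da_b).mean("t") - (
    da_a.where(da_b.notnull()).mean("t") * da_b.where(da_a.notnull()).mean("t")
)

print(float(cov_naive))                            # ~4.556, biased by the mismatched means
print(float(cov_masked))                           # 4.5
print(float(xr.cov(da_a, da_b, dim="t", ddof=0)))  # 4.5, pairwise-complete reference
```

Under these assumptions the masked form reproduces the population (ddof=0) covariance over the points where both arrays are valid, which is what the plain `E[ab] - E[a]E[b]` rewrite misses when NaNs differ between the inputs.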