html_url,issue_url,id,node_id,user,created_at,updated_at,author_association,body,reactions,performed_via_github_app,issue
https://github.com/pydata/xarray/pull/5390#issuecomment-850843957,https://api.github.com/repos/pydata/xarray/issues/5390,850843957,MDEyOklzc3VlQ29tbWVudDg1MDg0Mzk1Nw==,56925856,2021-05-29T14:37:48Z,2021-05-31T10:27:06Z,CONTRIBUTOR,"@willirath this is cool, but I think it doesn't explain why the tests fail. Currently `da_a.mean()` and the `da_b.mean()` calls *do* know about each other's missing data! That's what we're doing in [these lines](https://github.com/pydata/xarray/blob/master/xarray/core/computation.py#L1370_L1371).
@dcherian, I think I've got it to work, but you need to account for the length(s) of the dimension you're calculating the correlation over. (i.e. `(da-da.mean('time')).sum('time')` is not the same as `da.sum('time') - da.mean('time')` because you should _actually_ do `da.sum('time') - da.mean('time')*length_of_time_dim`)
This latest commit does this, but I'm not sure whether the added complication is worth it yet? Thoughts welcome.
```python3
def _mean(da):
return (da.sum(dim=dim, skipna=True, min_count=1) / (valid_count))
dim_length = da_a.notnull().sum(dim=dim, skipna=True)
def _mean_detrended_term(da):
return (dim_length * da / (valid_count))
cov = _mean(da_a * da_b) - _mean_detrended_term(da_a.mean(dim=dim) * da_b.mean(dim=dim))
```","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,904153867
https://github.com/pydata/xarray/pull/5390#issuecomment-849923230,https://api.github.com/repos/pydata/xarray/issues/5390,849923230,MDEyOklzc3VlQ29tbWVudDg0OTkyMzIzMA==,24736507,2021-05-27T20:34:28Z,2021-05-29T14:35:16Z,NONE,"Hello @AndrewWilliams3142! Thanks for updating this PR. We checked the lines you've touched for [PEPĀ 8](https://www.python.org/dev/peps/pep-0008) issues, and found:
There are currently no PEP 8 issues detected in this Pull Request. Cheers! :beers:
##### Comment last updated at 2021-05-29 14:35:16 UTC","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,904153867
https://github.com/pydata/xarray/pull/5390#issuecomment-850820173,https://api.github.com/repos/pydata/xarray/issues/5390,850820173,MDEyOklzc3VlQ29tbWVudDg1MDgyMDE3Mw==,5700886,2021-05-29T11:51:50Z,2021-05-29T11:51:59Z,CONTRIBUTOR,"I think the problem with
> `cov = _mean(da_a * da_b) - da_a.mean(dim=dim) * da_b.mean(dim=dim)`
is that the `da_a.mean()` and the `da_b.mean()` calls don't know about each other's missing data.
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,904153867
https://github.com/pydata/xarray/pull/5390#issuecomment-850819741,https://api.github.com/repos/pydata/xarray/issues/5390,850819741,MDEyOklzc3VlQ29tbWVudDg1MDgxOTc0MQ==,5700886,2021-05-29T11:48:02Z,2021-05-29T11:48:02Z,CONTRIBUTOR,"Shouldn't the following do?
```python
cov = (
(da_a * da_b).mean(dim)
- (
da_a.where(da_b.notnull()).mean(dim)
* da_b.where(da_a.notnull()).mean(dim)
)
)
```
(See here: https://nbviewer.jupyter.org/gist/willirath/cfaa8fb1b53fcb8dcb05ddde839c794c )","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,904153867
https://github.com/pydata/xarray/pull/5390#issuecomment-850690985,https://api.github.com/repos/pydata/xarray/issues/5390,850690985,MDEyOklzc3VlQ29tbWVudDg1MDY5MDk4NQ==,56925856,2021-05-28T21:43:52Z,2021-05-28T21:44:12Z,CONTRIBUTOR,"> is it just
>
> ```
> cov = _mean(da_a * da_b) - da_a.mean(dim=dim) * da_b.mean(dim=dim)
> ```
I think you'd still have to normalize the second term by `1 / (valid_count)`. However, I just tried both of these approaches and neither pass the test suite, so we may need to do more thinking... ","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,904153867
https://github.com/pydata/xarray/pull/5390#issuecomment-850650732,https://api.github.com/repos/pydata/xarray/issues/5390,850650732,MDEyOklzc3VlQ29tbWVudDg1MDY1MDczMg==,2448579,2021-05-28T20:19:56Z,2021-05-28T20:20:52Z,MEMBER,"``` python
# 3. Detrend along the given dim
# 4. Compute covariance along the given dim
# N.B. `skipna=False` is required or there is a bug when computing
# auto-covariance. E.g. Try xr.cov(da,da) for
# da = xr.DataArray([[1, 2], [1, np.nan]], dims=[""x"", ""time""])
def _mean(da):
return da.sum(dim=dim, skipna=True, min_count=1) / (valid_count)
cov = _mean(da_a * da_b) - _mean(da_a.mean(dim=dim) * da_b.mean(dim=dim))
```
This second term looks very weird to me, it should be a no-op
``` python
_mean(da_a.mean(dim=dim) * da_b.mean(dim=dim))
```
is it just
```
cov = _mean(da_a * da_b) - da_a.mean(dim=dim) * da_b.mean(dim=dim)
```
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,904153867
https://github.com/pydata/xarray/pull/5390#issuecomment-850556738,https://api.github.com/repos/pydata/xarray/issues/5390,850556738,MDEyOklzc3VlQ29tbWVudDg1MDU1NjczOA==,56925856,2021-05-28T17:12:52Z,2021-05-28T17:14:08Z,CONTRIBUTOR,"@willirath this is great stuff, thanks again! So generally it looks like the graph is more efficient when doing operations of the form:
```python3
(X * Y).mean('time') - (X.mean('time') * Y.mean('time'))
```
than doing
```python3
((X - X.mean('time')) * (Y-Y.mean('time'))).mean('time')
```
or like what I've implemented (see screenshot)?
```python3
intermediate = (X * Y) - (X.mean('time') * Y.mean('time'))
intermediate.mean('time')
```

If so, it seems like the most efficient(?) way to do the computation in [_cov_corr()](https://github.com/pydata/xarray/blob/master/xarray/core/computation.py#L1373_L1383) is to combine it all into one line? I can't think of how to do this though...
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,904153867
https://github.com/pydata/xarray/pull/5390#issuecomment-850542572,https://api.github.com/repos/pydata/xarray/issues/5390,850542572,MDEyOklzc3VlQ29tbWVudDg1MDU0MjU3Mg==,5700886,2021-05-28T16:45:55Z,2021-05-28T16:45:55Z,CONTRIBUTOR,"@AndrewWilliams3142 @dcherian Looks like I broke the first Gist. :(
Your Example above does not quite get there, because the `xr.DataArray(np...).chunk()` just leads to one chunk per data array.
Here's a Gist that explains the idea for the correlations: https://nbviewer.jupyter.org/gist/willirath/c5c5274f31c98e8452548e8571158803
With
```python
X = xr.DataArray(
darr.random.normal(size=array_size, chunks=chunk_size),
dims=(""t"", ""y"", ""x""),
name=""X"",
)
Y = xr.DataArray(
darr.random.normal(size=array_size, chunks=chunk_size),
dims=(""t"", ""y"", ""x""),
name=""Y"",
)
```
the ""bad"" / explicit way of calculating the correlation
```python
corr_exp = ((X - X.mean(""t"")) * (Y - Y.mean(""t""))).mean(""t"")
```
leads to a graph like this:

Dask won't release any of the tasks defining `X` and `Y` until the marked `sub`straction tasks are done.
The ""good"" / aggregating way of calculting the correlation
```python
corr_agg = (X * Y).mean(""t"") - X.mean(""t"") * Y.mean(""t"")
```
has the following graph

where the marked `mul`tiplication and `mean_chunk` tasks are acting on only pairs of chunks and individual chunks and then release the original chunks of `X` and `Y`. This graph _can_ be evaluated with a much smaller memory foot print than the other one. (It's not certain that this is always leading to lower memory use, however. But this is a different issue ...)","{""total_count"": 1, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 1}",,904153867
https://github.com/pydata/xarray/pull/5390#issuecomment-850535282,https://api.github.com/repos/pydata/xarray/issues/5390,850535282,MDEyOklzc3VlQ29tbWVudDg1MDUzNTI4Mg==,2448579,2021-05-28T16:31:36Z,2021-05-28T16:31:36Z,MEMBER,"@AndrewWilliams3142 I think that's right. You can confirm these ideas by profiling a test problem: https://docs.dask.org/en/latest/diagnostics-local.html#example
It does seem like with the new version dask will hold on on to `da_a*da_b` for a while, which is an improvement over holding `da_a` and `da_b` separately for a while","{""total_count"": 1, ""+1"": 1, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,904153867
https://github.com/pydata/xarray/pull/5390#issuecomment-850276619,https://api.github.com/repos/pydata/xarray/issues/5390,850276619,MDEyOklzc3VlQ29tbWVudDg1MDI3NjYxOQ==,56925856,2021-05-28T09:15:30Z,2021-05-28T09:17:48Z,CONTRIBUTOR,"@willirath , thanks for your example notebook! I'm still trying to get my head around this a bit though.
Say you have `da_a` and `da_b` defined as:
```python3
da_a = xr.DataArray(
np.array([[1, 2, 3, 4], [1, 0.1, 0.2, 0.3], [2, 3.2, 0.6, 1.8]]),
dims=(""space"", ""time""),
coords=[
(""space"", [""IA"", ""IL"", ""IN""]),
(""time"", pd.date_range(""2000-01-01"", freq=""1D"", periods=4)),
],
).chunk()
da_b = xr.DataArray(
np.array([[0.2, 0.4, 0.6, 2], [15, 10, 5, 1], [1, 3.2, np.nan, 1.8]]),
dims=(""space"", ""time""),
coords=[
(""space"", [""IA"", ""IL"", ""IN""]),
(""time"", pd.date_range(""2000-01-01"", freq=""1D"", periods=4)),
],
).chunk()
```
The [original computation](https://github.com/pydata/xarray/blob/master/xarray/core/computation.py#L1373_L1375) in `_cov_corr` has a graph something like:

Whereas my alteration now has a graph more like this:

Am I correct in thinking that this is a 'better' computational graph? Because the original chunks are not passed onto later points in the computation? ","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,904153867