
issue_comments

7 rows where author_association = "CONTRIBUTOR" and issue = 904153867 sorted by updated_at descending

id html_url issue_url node_id user created_at updated_at ▲ author_association body reactions performed_via_github_app issue
850843957 https://github.com/pydata/xarray/pull/5390#issuecomment-850843957 https://api.github.com/repos/pydata/xarray/issues/5390 MDEyOklzc3VlQ29tbWVudDg1MDg0Mzk1Nw== AndrewILWilliams 56925856 2021-05-29T14:37:48Z 2021-05-31T10:27:06Z CONTRIBUTOR

@willirath this is cool, but I think it doesn't explain why the tests fail. Currently da_a.mean() and the da_b.mean() calls do know about each other's missing data! That's what we're doing in these lines.

@dcherian, I think I've got it to work, but you need to account for the length(s) of the dimension you're calculating the correlation over. (i.e. `(da - da.mean('time')).sum('time')` is not the same as `da.sum('time') - da.mean('time')`, because you should actually do `da.sum('time') - da.mean('time') * length_of_time_dim`.)

This latest commit does this, but I'm not sure whether the added complication is worth it yet? Thoughts welcome.

```python3
def _mean(da):
    return (da.sum(dim=dim, skipna=True, min_count=1) / (valid_count))

dim_length = da_a.notnull().sum(dim=dim, skipna=True)

def _mean_detrended_term(da):
    return (dim_length * da / (valid_count))

cov = _mean(da_a * da_b) - _mean_detrended_term(da_a.mean(dim=dim) * da_b.mean(dim=dim))
```
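A minimal sketch (made-up values, not code from this PR) of the dimension-length point: rearranging the anomaly sum only works if the mean is scaled by the number of valid samples along the dimension.

```python3
import numpy as np
import xarray as xr

da = xr.DataArray(np.array([1.0, 2.0, np.nan, 4.0]), dims="time")

anom_sum = (da - da.mean("time")).sum("time")         # sum of anomalies -> ~0
naive = da.sum("time") - da.mean("time")              # drops the length factor
length_of_time_dim = da.notnull().sum("time")         # number of valid samples
scaled = da.sum("time") - da.mean("time") * length_of_time_dim

print(float(anom_sum), float(naive), float(scaled))   # ~0.0, ~4.67, ~0.0
```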

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Improvements to lazy behaviour of `xr.cov()` and `xr.corr()` 904153867
850820173 https://github.com/pydata/xarray/pull/5390#issuecomment-850820173 https://api.github.com/repos/pydata/xarray/issues/5390 MDEyOklzc3VlQ29tbWVudDg1MDgyMDE3Mw== willirath 5700886 2021-05-29T11:51:50Z 2021-05-29T11:51:59Z CONTRIBUTOR

I think the problem with

cov = _mean(da_a * da_b) - da_a.mean(dim=dim) * da_b.mean(dim=dim)

is that the da_a.mean() and the da_b.mean() calls don't know about each other's missing data.
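A small, made-up example (not from the gist) of what that means: with the naive formula, the product term and the two mean terms are normalized over different sets of samples.

```python
import numpy as np
import xarray as xr

da_a = xr.DataArray([1.0, 2.0, 3.0, 4.0], dims="time")
da_b = xr.DataArray([1.0, np.nan, 3.0, 4.0], dims="time")

# (da_a * da_b).mean() only sees the three positions where both are valid,
# but da_a.mean() still averages over all four values of da_a.
naive_cov = (da_a * da_b).mean("time") - da_a.mean("time") * da_b.mean("time")

print(float(naive_cov))                    # 2.0
print(float(xr.cov(da_a, da_b, ddof=0)))   # ~1.556, the pairwise-complete value
```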

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Improvements to lazy behaviour of `xr.cov()` and `xr.corr()` 904153867
850819741 https://github.com/pydata/xarray/pull/5390#issuecomment-850819741 https://api.github.com/repos/pydata/xarray/issues/5390 MDEyOklzc3VlQ29tbWVudDg1MDgxOTc0MQ== willirath 5700886 2021-05-29T11:48:02Z 2021-05-29T11:48:02Z CONTRIBUTOR

Shouldn't the following do?

```python
cov = (
    (da_a * da_b).mean(dim)
    - (
        da_a.where(da_b.notnull()).mean(dim)
        * da_b.where(da_a.notnull()).mean(dim)
    )
)
```

(See here: https://nbviewer.jupyter.org/gist/willirath/cfaa8fb1b53fcb8dcb05ddde839c794c )
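A quick way to sanity-check a candidate formula like this against the existing eager behaviour (a sketch, not part of the gist; `dim` and the toy arrays are made up):

```python
import numpy as np
import xarray as xr

dim = "time"
da_a = xr.DataArray([1.0, 2.0, 3.0, 4.0], dims=dim)
da_b = xr.DataArray([1.0, np.nan, 3.0, 5.0], dims=dim)

cov = (
    (da_a * da_b).mean(dim)
    - (
        da_a.where(da_b.notnull()).mean(dim)
        * da_b.where(da_a.notnull()).mean(dim)
    )
)

# xr.cov defaults to the unbiased estimator (ddof=1), so compare with ddof=0.
print(float(cov), float(xr.cov(da_a, da_b, dim=dim, ddof=0)))  # both 2.0
```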

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Improvements to lazy behaviour of `xr.cov()` and `xr.corr()` 904153867
850690985 https://github.com/pydata/xarray/pull/5390#issuecomment-850690985 https://api.github.com/repos/pydata/xarray/issues/5390 MDEyOklzc3VlQ29tbWVudDg1MDY5MDk4NQ== AndrewILWilliams 56925856 2021-05-28T21:43:52Z 2021-05-28T21:44:12Z CONTRIBUTOR

is it just

cov = _mean(da_a * da_b) - da_a.mean(dim=dim) * da_b.mean(dim=dim)

I think you'd still have to normalize the second term by `1 / (valid_count)`. However, I just tried both of these approaches and neither passes the test suite, so we may need to do more thinking...

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Improvements to lazy behaviour of `xr.cov()` and `xr.corr()` 904153867
850556738 https://github.com/pydata/xarray/pull/5390#issuecomment-850556738 https://api.github.com/repos/pydata/xarray/issues/5390 MDEyOklzc3VlQ29tbWVudDg1MDU1NjczOA== AndrewILWilliams 56925856 2021-05-28T17:12:52Z 2021-05-28T17:14:08Z CONTRIBUTOR

@willirath this is great stuff, thanks again! So generally it looks like the graph is more efficient when doing operations of the form:

```python3
(X * Y).mean('time') - (X.mean('time') * Y.mean('time'))
```

than doing

```python3
((X - X.mean('time')) * (Y - Y.mean('time'))).mean('time')
```

or like what I've implemented (see screenshot)?

```python3
intermediate = (X * Y) - (X.mean('time') * Y.mean('time'))

intermediate.mean('time')
```

If so, it seems like the most efficient(?) way to do the computation in _cov_corr() is to combine it all into one line? I can't think of how to do this though...
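One way to compare the three formulations without screenshots is to count the tasks in each dask graph; a rough sketch (array shape, chunk sizes, and dimension names are made up):

```python3
import dask.array as darr
import xarray as xr

X = xr.DataArray(darr.random.normal(size=(100, 50, 50), chunks=(10, 50, 50)), dims=("t", "y", "x"))
Y = xr.DataArray(darr.random.normal(size=(100, 50, 50), chunks=(10, 50, 50)), dims=("t", "y", "x"))

explicit = ((X - X.mean("t")) * (Y - Y.mean("t"))).mean("t")
aggregating = (X * Y).mean("t") - (X.mean("t") * Y.mean("t"))
intermediate = ((X * Y) - (X.mean("t") * Y.mean("t"))).mean("t")

# Task counts are only a crude proxy; the important difference is how early
# the chunks of X and Y can be released during execution.
for name, expr in [("explicit", explicit), ("aggregating", aggregating), ("intermediate", intermediate)]:
    print(name, len(dict(expr.__dask_graph__())))
```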

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Improvements to lazy behaviour of `xr.cov()` and `xr.corr()` 904153867
850542572 https://github.com/pydata/xarray/pull/5390#issuecomment-850542572 https://api.github.com/repos/pydata/xarray/issues/5390 MDEyOklzc3VlQ29tbWVudDg1MDU0MjU3Mg== willirath 5700886 2021-05-28T16:45:55Z 2021-05-28T16:45:55Z CONTRIBUTOR

@AndrewWilliams3142 @dcherian Looks like I broke the first Gist. :(

Your example above does not quite get there, because `xr.DataArray(np...).chunk()` just leads to one chunk per data array.

Here's a Gist that explains the idea for the correlations: https://nbviewer.jupyter.org/gist/willirath/c5c5274f31c98e8452548e8571158803

With

```python
X = xr.DataArray(
    darr.random.normal(size=array_size, chunks=chunk_size),
    dims=("t", "y", "x"),
    name="X",
)

Y = xr.DataArray(
    darr.random.normal(size=array_size, chunks=chunk_size),
    dims=("t", "y", "x"),
    name="Y",
)
```

the "bad" / explicit way of calculating the correlation

```python
corr_exp = ((X - X.mean("t")) * (Y - Y.mean("t"))).mean("t")
```

leads to a graph like this:

Dask won't release any of the tasks defining X and Y until the marked subtraction tasks are done.

The "good" / aggregating way of calculting the correlation python corr_agg = (X * Y).mean("t") - X.mean("t") * Y.mean("t") has the following graph where the marked multiplication and mean_chunk tasks are acting on only pairs of chunks and individual chunks and then release the original chunks of X and Y. This graph can be evaluated with a much smaller memory foot print than the other one. (It's not certain that this is always leading to lower memory use, however. But this is a different issue ...)

{
    "total_count": 1,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 1
}
  Improvements to lazy behaviour of `xr.cov()` and `xr.corr()` 904153867
850276619 https://github.com/pydata/xarray/pull/5390#issuecomment-850276619 https://api.github.com/repos/pydata/xarray/issues/5390 MDEyOklzc3VlQ29tbWVudDg1MDI3NjYxOQ== AndrewILWilliams 56925856 2021-05-28T09:15:30Z 2021-05-28T09:17:48Z CONTRIBUTOR

@willirath , thanks for your example notebook! I'm still trying to get my head around this a bit though.

Say you have da_a and da_b defined as:

```python3
da_a = xr.DataArray(
    np.array([[1, 2, 3, 4], [1, 0.1, 0.2, 0.3], [2, 3.2, 0.6, 1.8]]),
    dims=("space", "time"),
    coords=[
        ("space", ["IA", "IL", "IN"]),
        ("time", pd.date_range("2000-01-01", freq="1D", periods=4)),
    ],
).chunk()

da_b = xr.DataArray(
    np.array([[0.2, 0.4, 0.6, 2], [15, 10, 5, 1], [1, 3.2, np.nan, 1.8]]),
    dims=("space", "time"),
    coords=[
        ("space", ["IA", "IL", "IN"]),
        ("time", pd.date_range("2000-01-01", freq="1D", periods=4)),
    ],
).chunk()
```

The original computation in _cov_corr has a graph something like:

Whereas my alteration now has a graph more like this:

Am I correct in thinking that this is a 'better' computational graph, because the original chunks are not passed on to later points in the computation?
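The graph screenshots referenced above are not part of this export; with `da_a` and `da_b` as defined above, something like the following (a sketch; needs `graphviz` installed) regenerates comparable images, using the demean-then-average and aggregate-then-combine forms as stand-ins for the two versions of `_cov_corr`:

```python3
import dask

# A bare .chunk() gives a single chunk per array, so re-chunk along "time"
# to make the per-chunk structure of the graphs visible.
a = da_a.chunk({"time": 1})
b = da_b.chunk({"time": 1})

original_style = ((a - a.mean("time")) * (b - b.mean("time"))).mean("time")
altered_style = (a * b).mean("time") - a.mean("time") * b.mean("time")

dask.visualize(original_style, filename="cov_original.svg")
dask.visualize(altered_style, filename="cov_altered.svg")
```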

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Improvements to lazy behaviour of `xr.cov()` and `xr.corr()` 904153867

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);