
issue_comments


40 rows where user = 56925856 sorted by updated_at descending

Columns: id, html_url, issue_url, node_id, user, created_at, updated_at ▲, author_association, body, reactions, performed_via_github_app, issue
850843957 https://github.com/pydata/xarray/pull/5390#issuecomment-850843957 https://api.github.com/repos/pydata/xarray/issues/5390 MDEyOklzc3VlQ29tbWVudDg1MDg0Mzk1Nw== AndrewILWilliams 56925856 2021-05-29T14:37:48Z 2021-05-31T10:27:06Z CONTRIBUTOR

@willirath this is cool, but I think it doesn't explain why the tests fail. Currently the `da_a.mean()` and `da_b.mean()` calls do know about each other's missing data! That's what we're doing in these lines.

@dcherian, I think I've got it to work, but you need to account for the length(s) of the dimension you're calculating the correlation over: `(da - da.mean('time')).sum('time')` is not the same as `da.sum('time') - da.mean('time')`, because you actually need `da.sum('time') - da.mean('time') * length_of_time_dim`.

This latest commit does this, but I'm not sure whether the added complication is worth it yet? Thoughts welcome.

```python3
def _mean(da):
    return da.sum(dim=dim, skipna=True, min_count=1) / valid_count

dim_length = da_a.notnull().sum(dim=dim, skipna=True)

def _mean_detrended_term(da):
    return dim_length * da / valid_count

cov = _mean(da_a * da_b) - _mean_detrended_term(da_a.mean(dim=dim) * da_b.mean(dim=dim))
```
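As a sanity check on that identity (a minimal numpy sketch of my own, not code from the PR): with no missing values, `sum(x - mean(x))` equals `sum(x) - mean(x) * N`, and the factor of `N` is exactly what the naive rearrangement drops.

```python3
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
n = len(x)

print((x - x.mean()).sum())    # 0.0 -- demean first, then sum
print(x.sum() - x.mean())      # 7.5 -- naive rearrangement, missing the factor of N
print(x.sum() - x.mean() * n)  # 0.0 -- matches once the dimension length is included
```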

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Improvements to lazy behaviour of `xr.cov()` and `xr.corr()` 904153867
850690985 https://github.com/pydata/xarray/pull/5390#issuecomment-850690985 https://api.github.com/repos/pydata/xarray/issues/5390 MDEyOklzc3VlQ29tbWVudDg1MDY5MDk4NQ== AndrewILWilliams 56925856 2021-05-28T21:43:52Z 2021-05-28T21:44:12Z CONTRIBUTOR

> is it just
>
> ```python3
> cov = _mean(da_a * da_b) - da_a.mean(dim=dim) * da_b.mean(dim=dim)
> ```

I think you'd still have to normalize the second term by `1 / valid_count`. However, I just tried both of these approaches and neither passes the test suite, so we may need to do more thinking...

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Improvements to lazy behaviour of `xr.cov()` and `xr.corr()` 904153867
850556738 https://github.com/pydata/xarray/pull/5390#issuecomment-850556738 https://api.github.com/repos/pydata/xarray/issues/5390 MDEyOklzc3VlQ29tbWVudDg1MDU1NjczOA== AndrewILWilliams 56925856 2021-05-28T17:12:52Z 2021-05-28T17:14:08Z CONTRIBUTOR

@willirath this is great stuff, thanks again! So generally it looks like the graph is more efficient when doing operations of the form:

```python3
(X * Y).mean('time') - (X.mean('time') * Y.mean('time'))
```

than doing

```python3
((X - X.mean('time')) * (Y - Y.mean('time'))).mean('time')
```

or like what I've implemented (see screenshot)?

```python3
intermediate = (X * Y) - (X.mean('time') * Y.mean('time'))

intermediate.mean('time')
```

If so, it seems like the most efficient(?) way to do the computation in `_cov_corr()` is to combine it all into one line? I can't think of how to do this though...
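For intuition, one rough way to compare the two formulations is to count the tasks in each dask graph (an illustrative sketch of my own, not code from the PR; the shapes and chunk sizes are made up):

```python3
import numpy as np
import xarray as xr

X = xr.DataArray(np.random.rand(10, 100), dims=("space", "time")).chunk({"time": 10})
Y = xr.DataArray(np.random.rand(10, 100), dims=("space", "time")).chunk({"time": 10})

one_pass = (X * Y).mean("time") - X.mean("time") * Y.mean("time")
demeaned = ((X - X.mean("time")) * (Y - Y.mean("time"))).mean("time")

# Fewer tasks generally means a graph that is cheaper to schedule.
print(len(one_pass.__dask_graph__()), len(demeaned.__dask_graph__()))
```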

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Improvements to lazy behaviour of `xr.cov()` and `xr.corr()` 904153867
850276619 https://github.com/pydata/xarray/pull/5390#issuecomment-850276619 https://api.github.com/repos/pydata/xarray/issues/5390 MDEyOklzc3VlQ29tbWVudDg1MDI3NjYxOQ== AndrewILWilliams 56925856 2021-05-28T09:15:30Z 2021-05-28T09:17:48Z CONTRIBUTOR

@willirath, thanks for your example notebook! I'm still trying to get my head around this a bit though.

Say you have da_a and da_b defined as:

```python3
da_a = xr.DataArray(
    np.array([[1, 2, 3, 4], [1, 0.1, 0.2, 0.3], [2, 3.2, 0.6, 1.8]]),
    dims=("space", "time"),
    coords=[
        ("space", ["IA", "IL", "IN"]),
        ("time", pd.date_range("2000-01-01", freq="1D", periods=4)),
    ],
).chunk()

da_b = xr.DataArray(
    np.array([[0.2, 0.4, 0.6, 2], [15, 10, 5, 1], [1, 3.2, np.nan, 1.8]]),
    dims=("space", "time"),
    coords=[
        ("space", ["IA", "IL", "IN"]),
        ("time", pd.date_range("2000-01-01", freq="1D", periods=4)),
    ],
).chunk()
```

The original computation in `_cov_corr` has a graph something like: [screenshot]

Whereas my alteration now has a graph more like this: [screenshot]

Am I correct in thinking that this is a 'better' computational graph, because the original chunks are not passed on to later points in the computation?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Improvements to lazy behaviour of `xr.cov()` and `xr.corr()` 904153867
848612330 https://github.com/pydata/xarray/pull/5284#issuecomment-848612330 https://api.github.com/repos/pydata/xarray/issues/5284 MDEyOklzc3VlQ29tbWVudDg0ODYxMjMzMA== AndrewILWilliams 56925856 2021-05-26T09:19:50Z 2021-05-26T09:19:50Z CONTRIBUTOR

Hey both, I've added a test to check that dask doesn't compute when calling either `xr.corr()` or `xr.cov()`, and also that the end result is still a dask array. Let me know if there's anything I've missed though! Thanks for the help :)

@dcherian, regarding the `apply_ufunc` approach, I might leave that for now, but as you said it can always be a future PR.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Dask-friendly nan check in xr.corr() and xr.cov() 882876804
838231568 https://github.com/pydata/xarray/pull/5284#issuecomment-838231568 https://api.github.com/repos/pydata/xarray/issues/5284 MDEyOklzc3VlQ29tbWVudDgzODIzMTU2OA== AndrewILWilliams 56925856 2021-05-11T10:28:08Z 2021-05-12T20:45:00Z CONTRIBUTOR

Thanks for that @dcherian! I didn't know you could use print debugging on chunked operations like this!

One thing actually: if I change `da = da.where(missing_vals)` to `da = da.where(~missing_vals)` then we get the results we'd expect. Do you think this fixes the problem?

```
def _get_valid_values(da, other):
    da1, da2 = xr.align(da, other, join="outer", copy=False)

    # 2. Ignore the nans
    missing_vals = np.logical_or(da1.isnull(), da2.isnull())

    if missing_vals.any():
        da = da.where(~missing_vals)
        return da
    else:
        return da
```

```
print(da_a.map_blocks(_get_valid_values, args=[da_b]).compute())

<xarray.DataArray (space: 3, time: 4)>
array([[1. , 2. , 3. , 4. ],
       [1. , 0.1, 0.2, 0.3],
       [2. , 3.2, nan, 1.8]])
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
  * space    (space) object 'IA' 'IL' 'IN'
```
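As a tiny illustration of why the `~` flip matters (my own sketch, not from the PR): `da.where(cond)` keeps values where `cond` is True and fills NaN elsewhere, so masking out the missing entries needs the negated mask.

```python
import xarray as xr

da = xr.DataArray([1.0, 2.0, 3.0])
missing = xr.DataArray([False, True, False])

print(da.where(missing).values)   # [nan  2. nan] -- keeps only the "missing" slot
print(da.where(~missing).values)  # [ 1. nan  3.] -- keeps the valid values
```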

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Dask-friendly nan check in xr.corr() and xr.cov() 882876804
837032429 https://github.com/pydata/xarray/pull/5284#issuecomment-837032429 https://api.github.com/repos/pydata/xarray/issues/5284 MDEyOklzc3VlQ29tbWVudDgzNzAzMjQyOQ== AndrewILWilliams 56925856 2021-05-10T17:44:29Z 2021-05-10T17:44:29Z CONTRIBUTOR

Hi @dcherian, just thinking about your suggestion for using `map_blocks` on the actual valid-values check. I've tested this and was wondering if you could maybe point to where I'm going wrong? It does mask out some of the values in a lazy way, but not the correct ones.

```python3
da_a = xr.DataArray(
    np.array([[1, 2, 3, 4], [1, 0.1, 0.2, 0.3], [2, 3.2, 0.6, 1.8]]),
    dims=("space", "time"),
    coords=[
        ("space", ["IA", "IL", "IN"]),
        ("time", pd.date_range("2000-01-01", freq="1D", periods=4)),
    ],
).chunk({'time': 1})

da_b = xr.DataArray(
    np.array([[0.2, 0.4, 0.6, 2], [15, 10, 5, 1], [1, 3.2, np.nan, 1.8]]),
    dims=("space", "time"),
    coords=[
        ("space", ["IA", "IL", "IN"]),
        ("time", pd.date_range("2000-01-01", freq="1D", periods=4)),
    ],
).chunk({'time': 1})

print(da_a)
# <xarray.DataArray (space: 3, time: 4)>
# array([[1. , 2. , 3. , 4. ],
#        [1. , 0.1, 0.2, 0.3],
#        [2. , 3.2, 0.6, 1.8]])
# Coordinates:
#   * space    (space) <U2 'IA' 'IL' 'IN'
#   * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04

print(da_b)
# <xarray.DataArray (space: 3, time: 4)>
# array([[ 0.2,  0.4,  0.6,  2. ],
#        [15. , 10. ,  5. ,  1. ],
#        [ 1. ,  3.2,  nan,  1.8]])
# Coordinates:
#   * space    (space) <U2 'IA' 'IL' 'IN'
#   * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04

# Define function to use in map_blocks
def _get_valid_values(da, other):
    da1, da2 = xr.align(da, other, join="inner", copy=False)

    # 2. Ignore the nans
    missing_vals = np.logical_or(da1.isnull(), da2.isnull())

    if missing_vals.any():
        da = da.where(missing_vals)
        return da
    else:
        return da

# test
outp = da_a.map_blocks(_get_valid_values, args=[da_b])
print(outp.compute())
# <xarray.DataArray (space: 3, time: 4)>
# array([[1. , 2. , nan, 4. ],
#        [1. , 0.1, nan, 0.3],
#        [2. , 3.2, 0.6, 1.8]])
# Coordinates:
#   * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
#   * space    (space) object 'IA' 'IL' 'IN'
```

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Dask-friendly nan check in xr.corr() and xr.cov() 882876804
760905338 https://github.com/pydata/xarray/issues/4816#issuecomment-760905338 https://api.github.com/repos/pydata/xarray/issues/4816 MDEyOklzc3VlQ29tbWVudDc2MDkwNTMzOA== AndrewILWilliams 56925856 2021-01-15T12:09:32Z 2021-01-15T12:09:32Z CONTRIBUTOR

Oh actually, I don't think this is a bug. Because I'm only interpolating over 1D coordinates, `scipy.interpolate.interp1d()` is called, which requires `'fill_value': 'extrapolate'` in order to extrapolate rather than returning nan. If I was interpolating over multidimensional coordinates, then `scipy.interpolate.interpnd()` would have been called, which requires `'fill_value': None`.
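A small standalone demonstration of that interp1d behaviour (my own sketch, not from the issue):

```python
from scipy.interpolate import interp1d

f_nan = interp1d([0.0, 1.0], [0.0, 2.0], bounds_error=False)
f_ext = interp1d([0.0, 1.0], [0.0, 2.0], bounds_error=False, fill_value="extrapolate")

print(f_nan(2.0))  # nan -- out-of-range points are filled with nan by default
print(f_ext(2.0))  # 4.0 -- linear extrapolation beyond the data range
```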

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Possible bug with da.interp_like() 786839234
676494680 https://github.com/pydata/xarray/pull/4351#issuecomment-676494680 https://api.github.com/repos/pydata/xarray/issues/4351 MDEyOklzc3VlQ29tbWVudDY3NjQ5NDY4MA== AndrewILWilliams 56925856 2020-08-19T15:25:16Z 2020-08-19T15:25:16Z CONTRIBUTOR

> The test is not in a function. I think the link to the test looks like it is indented but it's not...

My bad ;)

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Allow cov & corr to handle missing values 681528611
673593388 https://github.com/pydata/xarray/issues/4300#issuecomment-673593388 https://api.github.com/repos/pydata/xarray/issues/4300 MDEyOklzc3VlQ29tbWVudDY3MzU5MzM4OA== AndrewILWilliams 56925856 2020-08-13T16:59:30Z 2020-08-13T16:59:30Z CONTRIBUTOR

Cheers @TomNicholas, that's helpful. :) I've started messing with the idea in this Gist if you want to have a look.

It's pretty hacky at the moment, but might be helpful as a testbed. (And a way of getting my head around how apply_ufunc would work in this context)

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  General curve fitting method 671609109
672084619 https://github.com/pydata/xarray/issues/4300#issuecomment-672084619 https://api.github.com/repos/pydata/xarray/issues/4300 MDEyOklzc3VlQ29tbWVudDY3MjA4NDYxOQ== AndrewILWilliams 56925856 2020-08-11T16:49:00Z 2020-08-11T16:49:29Z CONTRIBUTOR

@TomNicholas I'm a bit confused about how the fit_along argument would work actually. If you had 2D data and wanted to fit a 1D function to one of the dimensions, wouldn't you have to either take a mean (or slice?) across the other dimension?

Edit: It's been a hot day here, so apologies if this turns out to be a dumb q haha

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  General curve fitting method 671609109
671442450 https://github.com/pydata/xarray/issues/4300#issuecomment-671442450 https://api.github.com/repos/pydata/xarray/issues/4300 MDEyOklzc3VlQ29tbWVudDY3MTQ0MjQ1MA== AndrewILWilliams 56925856 2020-08-10T16:01:06Z 2020-08-10T16:01:06Z CONTRIBUTOR

This sounds very cool! :) I'm not sure that I have much to add, but given @aulemahal's good point about the complexity of rewriting curve_fit from scratch, it seems that maybe a good first step would just be to wrap the existing scipy functionality?

Alternatively, given that xr.apply_ufunc can already do this (though it's probably complicated), perhaps it would be good to just have an example in the documentation?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  General curve fitting method 671609109
634184361 https://github.com/pydata/xarray/pull/4096#issuecomment-634184361 https://api.github.com/repos/pydata/xarray/issues/4096 MDEyOklzc3VlQ29tbWVudDYzNDE4NDM2MQ== AndrewILWilliams 56925856 2020-05-26T18:02:14Z 2020-05-26T18:02:14Z CONTRIBUTOR

Also, could I ask a git question? Is there a way of getting a "clean" version of xarray to do PR branches off of without just re-forking? I've tried a few different suggestions on StackOverflow, but I'm just keen to know what other people's workflow is :) Thanks again!

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Corrcov typo fix 625064501
634165154 https://github.com/pydata/xarray/pull/4089#issuecomment-634165154 https://api.github.com/repos/pydata/xarray/issues/4089 MDEyOklzc3VlQ29tbWVudDYzNDE2NTE1NA== AndrewILWilliams 56925856 2020-05-26T17:26:34Z 2020-05-26T17:26:34Z CONTRIBUTOR

@kefirbandi I didn't want to step on your toes, but I'm happy to put in a PR to fix the typo. :)

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  xr.cov() and xr.corr() 623751213
633710066 https://github.com/pydata/xarray/pull/4064#issuecomment-633710066 https://api.github.com/repos/pydata/xarray/issues/4064 MDEyOklzc3VlQ29tbWVudDYzMzcxMDA2Ng== AndrewILWilliams 56925856 2020-05-25T20:38:49Z 2020-05-25T20:38:49Z CONTRIBUTOR

No problem ! Thanks everyone for helping me get up to speed :)

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Auto chunk 618828102
633592183 https://github.com/pydata/xarray/pull/4089#issuecomment-633592183 https://api.github.com/repos/pydata/xarray/issues/4089 MDEyOklzc3VlQ29tbWVudDYzMzU5MjE4Mw== AndrewILWilliams 56925856 2020-05-25T14:13:46Z 2020-05-25T14:13:46Z CONTRIBUTOR

If you insist ;)

```python
da_a -= da_a.mean(dim=dim)
```

is indeed marginally faster. As they are already aligned, we don't have to worry about this.

Sweet! On second thought, I might leave it for now... the sun is too nice today. Can always have it as a future PR or something. :)

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  xr.cov() and xr.corr() 623751213
633582528 https://github.com/pydata/xarray/pull/4089#issuecomment-633582528 https://api.github.com/repos/pydata/xarray/issues/4089 MDEyOklzc3VlQ29tbWVudDYzMzU4MjUyOA== AndrewILWilliams 56925856 2020-05-25T13:50:08Z 2020-05-25T13:50:08Z CONTRIBUTOR

One more thing actually: is there an argument for not defining `da_a_std` and `demeaned_da_a` and just performing the operations in place? Defining these variables makes the code more readable, but in https://github.com/pydata/xarray/pull/3550#discussion_r355157809 and https://github.com/pydata/xarray/pull/3550#discussion_r355157888 the reviewer suggests this is inefficient?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  xr.cov() and xr.corr() 623751213
633456698 https://github.com/pydata/xarray/pull/4089#issuecomment-633456698 https://api.github.com/repos/pydata/xarray/issues/4089 MDEyOklzc3VlQ29tbWVudDYzMzQ1NjY5OA== AndrewILWilliams 56925856 2020-05-25T08:44:36Z 2020-05-25T10:55:29Z CONTRIBUTOR

> Could you also add a test for the TypeError?
>
> ```python
> with raises_regex(TypeError, "Only xr.DataArray is supported"):
>     xr.corr(xr.Dataset(), xr.Dataset())
> ```

Where do you mean, sorry? Isn't this already there in `corr()`?

```python3
if any(not isinstance(arr, (Variable, DataArray)) for arr in [da_a, da_b]):
    raise TypeError(
        "Only xr.DataArray and xr.Variable are supported."
        "Given {}.".format([type(arr) for arr in [da_a, da_b]])
    )
```

EDIT: Scratch that, I get what you mean :)

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  xr.cov() and xr.corr() 623751213
633441816 https://github.com/pydata/xarray/pull/4089#issuecomment-633441816 https://api.github.com/repos/pydata/xarray/issues/4089 MDEyOklzc3VlQ29tbWVudDYzMzQ0MTgxNg== AndrewILWilliams 56925856 2020-05-25T08:11:05Z 2020-05-25T08:12:48Z CONTRIBUTOR

Cheers! I've got a day off today, so I'll do another pass through the changes and see if there's any low-hanging fruit I can improve (in addition to `np.random`, the `_cov_corr` internal methods, and maybe `apply_ufunc()`) :)

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  xr.cov() and xr.corr() 623751213
633286352 https://github.com/pydata/xarray/pull/4089#issuecomment-633286352 https://api.github.com/repos/pydata/xarray/issues/4089 MDEyOklzc3VlQ29tbWVudDYzMzI4NjM1Mg== AndrewILWilliams 56925856 2020-05-24T19:49:30Z 2020-05-24T19:56:24Z CONTRIBUTOR

One problem I came across here is that pandas automatically ignores `np.nan` values in any corr or cov calculation. This is hard-coded into the package and there's sadly no `skipna=False` option, so what I've done in the tests is to use the numpy implementation that pandas is built on (see, for example, here). A short illustration of this follows the list below.

Current tests implemented are (in pseudocode...):

- [x] `assert_allclose(xr.cov(a, b) / (a.std() * b.std()), xr.corr(a, b))`
- [x] `assert_allclose(xr.cov(a, a) * (N - 1), ((a - a.mean())**2).sum())`
- [x] For the example in my previous comment, I now have a loop over all values of (a, x) to reconstruct the covariance/correlation matrix, and check it with an `assert_allclose(...)`.
- [x] Add more test arrays, with/without np.nans -- done

@keewis I tried reading the Hypothesis docs and got a bit overwhelmed, so I've stuck with example-based tests for now.
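The illustration mentioned above (my own sketch, not one of the PR's tests): pandas silently drops NaN pairs, while the underlying numpy routine propagates them.

```python3
import numpy as np
import pandas as pd

s1 = pd.Series([1.0, 2.0, np.nan, 4.0])
s2 = pd.Series([2.0, 4.0, 6.0, 8.0])

print(s1.cov(s2))            # computed over the 3 valid pairs only
print(np.cov(s1, s2)[0, 1])  # nan -- numpy propagates the missing value
```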

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  xr.cov() and xr.corr() 623751213
633213547 https://github.com/pydata/xarray/pull/4089#issuecomment-633213547 https://api.github.com/repos/pydata/xarray/issues/4089 MDEyOklzc3VlQ29tbWVudDYzMzIxMzU0Nw== AndrewILWilliams 56925856 2020-05-24T10:59:43Z 2020-05-24T11:00:53Z CONTRIBUTOR

The current problem is that we can't use pandas to fully test `xr.cov()` or `xr.corr()`, because once you convert the DataArrays to a series or a dataframe for testing, you can't easily index them with a `dim` parameter. See @r-beer's comment here: https://github.com/pydata/xarray/pull/3550#issuecomment-557895005.

As such, I think it maybe just makes sense to test a few low-dimensional cases? E.g.

```python3
da_a = xr.DataArray(
    np.random.random((3, 21, 4)),
    coords={"time": pd.date_range("2000-01-01", freq="1D", periods=21)},
    dims=("a", "time", "x"),
)

da_b = xr.DataArray(
    np.random.random((3, 21, 4)),
    coords={"time": pd.date_range("2000-01-01", freq="1D", periods=21)},
    dims=("a", "time", "x"),
)

>>> xr.cov(da_a, da_b, 'time')
<xarray.DataArray (a: 3, x: 4)>
array([[-0.01824046,  0.00373796, -0.00601642, -0.00108818],
       [ 0.00686132, -0.02680119, -0.00639433, -0.00868691],
       [-0.00889806,  0.02622817, -0.01022208, -0.00101257]])
Dimensions without coordinates: a, x

>>> xr.cov(da_a, da_b, 'time').sel(a=0, x=0)
<xarray.DataArray ()>
array(-0.01824046)

>>> da_a.sel(a=0, x=0).to_series().cov(da_b.sel(a=0, x=0).to_series())
-0.018240458880158048
```

So, while it's easy to check that a few individual points from `xr.cov()` agree with the pandas implementation, it would require a loop over (a, x) in order to check all of the points for this example. Do people have thoughts about this?

I think it would also make sense to have some test cases where we don't use pandas at all, but where we specify the output manually?

```python3
da_a = xr.DataArray([[1, 2], [1, np.nan]], dims=["x", "time"])
expected = [1, np.nan]
actual = xr.corr(da_a, da_a, dim='time')
assert_allclose(actual, expected)
```

Does this seem like a good way forward?
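For concreteness, the loop over (a, x) could look something like this sketch (illustrative only, reusing `da_a` and `da_b` from the example above; not the final test code):

```python3
import numpy as np

cov = xr.cov(da_a, da_b, "time")
for a in range(da_a.sizes["a"]):
    for x in range(da_a.sizes["x"]):
        ts1 = da_a.isel(a=a, x=x).to_series()
        ts2 = da_b.isel(a=a, x=x).to_series()
        # pandas drops NaN pairs, matching xr.cov's skipna behaviour
        np.testing.assert_allclose(cov.isel(a=a, x=x).item(), ts1.cov(ts2))
```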

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  xr.cov() and xr.corr() 623751213
633145887 https://github.com/pydata/xarray/issues/3784#issuecomment-633145887 https://api.github.com/repos/pydata/xarray/issues/3784 MDEyOklzc3VlQ29tbWVudDYzMzE0NTg4Nw== AndrewILWilliams 56925856 2020-05-23T21:58:49Z 2020-05-23T21:58:49Z CONTRIBUTOR

In a fit of covid-induced insanity, I've decided to have a crack at finishing up #3550! I'm playing around with the changes made by @r-beer at the moment, but I'm finding the tests quite confusing - I think they're wrong? But maybe someone could help me out with this?

Here's something from test_computation.py in #3550:

```python
def test_cov(da_a, da_b, dim):
    def pandas_cov(ts1, ts2):
        """Ensure the ts are aligned and missing values ignored"""
        ts1, ts2 = xr.align(ts1, ts2)
        valid_values = ts1.notnull() & ts2.notnull()

        ts1 = ts1.where(valid_values, drop=True)
        ts2 = ts2.where(valid_values, drop=True)

        return ts1.to_series().cov(ts2.to_series())

    expected = pandas_cov(da_a, da_b)
    actual = xr.cov(da_a, da_b, dim)

    assert_allclose(actual, expected)
```

What I don't understand is: why would we expect the pandas covariance or correlation functions to return anything remotely like the output of `xr.cov()`? The line `ts1.to_series().cov(ts2.to_series())` always produces a scalar value, whereas in most reasonable use cases `xr.cov(da_a, da_b, dim)` would be producing a matrix of values (e.g. the pixel-wise correlation in time between two DataArrays).

I wasn't sure whether to open a PR for this or not? I'm working on it, but I'd need some help to set up appropriate tests...

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Function for regressing/correlating multiple fields? 568378007
632128807 https://github.com/pydata/xarray/pull/4064#issuecomment-632128807 https://api.github.com/repos/pydata/xarray/issues/4064 MDEyOklzc3VlQ29tbWVudDYzMjEyODgwNw== AndrewILWilliams 56925856 2020-05-21T14:49:37Z 2020-05-21T14:49:37Z CONTRIBUTOR

@keewis thanks for this! I've added what I think is a suitable test for DataArrays, do you think it's also a good idea to have a Dataset test?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Auto chunk 618828102
632090407 https://github.com/pydata/xarray/pull/4064#issuecomment-632090407 https://api.github.com/repos/pydata/xarray/issues/4064 MDEyOklzc3VlQ29tbWVudDYzMjA5MDQwNw== AndrewILWilliams 56925856 2020-05-21T13:36:41Z 2020-05-21T13:36:41Z CONTRIBUTOR

> This could test that `dataarray.chunk("auto").data` is the same as `dataarray.data.rechunk("auto")` (or something like that).

@dcherian Thanks for the tip :) Quick question: is there a reason why you're specifying the `.data` here? Also, I think I'm missing something, because I don't get what the difference between `.chunk()` and `.rechunk()` would be in this case.
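My reading of the suggested test, as a sketch (this assumes the `chunk("auto")` support that this PR is adding):

```python
import numpy as np
import xarray as xr

da = xr.DataArray(np.zeros((100, 100)), dims=("x", "y")).chunk(10)

# The DataArray-level "auto" chunking should agree with rechunking the
# underlying dask array directly.
assert da.chunk("auto").data.chunks == da.data.rechunk("auto").chunks
```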

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Auto chunk 618828102
632035116 https://github.com/pydata/xarray/pull/4064#issuecomment-632035116 https://api.github.com/repos/pydata/xarray/issues/4064 MDEyOklzc3VlQ29tbWVudDYzMjAzNTExNg== AndrewILWilliams 56925856 2020-05-21T11:30:01Z 2020-05-21T11:30:01Z CONTRIBUTOR

Cheers! I forgot about the tests, will add them this week or next hopefully

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Auto chunk 618828102
629390609 https://github.com/pydata/xarray/pull/4064#issuecomment-629390609 https://api.github.com/repos/pydata/xarray/issues/4064 MDEyOklzc3VlQ29tbWVudDYyOTM5MDYwOQ== AndrewILWilliams 56925856 2020-05-15T17:40:39Z 2020-05-15T17:41:25Z CONTRIBUTOR

@dcherian do you have any idea about this mypy type error? I can't find much (accessible) documentation on how the `Union[]` is working in this context.

```
xarray/core/dataset.py:1737: error: Argument 2 to "fromkeys" of "dict" has incompatible type "Union[Number, Mapping[Hashable, Union[None, Number, Tuple[Number, ...]]]]"; expected "Union[None, Number, Tuple[Number, ...]]"
xarray/core/dataset.py:1740: error: Item "Number" of "Union[Number, Mapping[Hashable, Union[None, Number, Tuple[Number, ...]]]]" has no attribute "keys"
```

Edit: thanks to everyone for your help so far!

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Auto chunk 618828102
629346101 https://github.com/pydata/xarray/pull/4064#issuecomment-629346101 https://api.github.com/repos/pydata/xarray/issues/4064 MDEyOklzc3VlQ29tbWVudDYyOTM0NjEwMQ== AndrewILWilliams 56925856 2020-05-15T16:11:04Z 2020-05-15T16:22:24Z CONTRIBUTOR

Okay, so I've traced the error back to the `map_blocks()` function. I don't fully understand the code for this function in xarray/core/parallel.py, but here's a quick report on the different behaviours.

Normally, when using the `make_ds()` and `make_da()` functions in test_dask.py, without any changes to `ds.chunk()` we have:

```python
>>> def func(obj):
...     result = obj + obj.x + 5 * obj.y
...     return result
...
>>> xr.map_blocks(func, ds).unify_chunks().chunks
Frozen(SortedKeysDict({'x': (4, 4, 2), 'y': (5, 5, 5, 5), 'z': (4,)}))
>>> func(ds).chunk().unify_chunks().chunks
Frozen(SortedKeysDict({'x': (4, 4, 2), 'y': (5, 5, 5, 5), 'z': (4,)}))
```

However, when I use the changes I've made to dataset.py (changing `isinstance(chunks, Number)` to `is_scalar(chunks)`), the behaviour becomes:

```python
>>> xr.map_blocks(func, ds).unify_chunks().chunks
Frozen(SortedKeysDict({'x': (4, 4, 2), 'y': (5, 5, 5, 5), 'z': (4,)}))
>>> func(ds).chunk().unify_chunks().chunks
Frozen(SortedKeysDict({'x': (10,), 'y': (20,), 'z': (4,)}))
```

Which means that it now fails the `test_map_blocks()` call in test_dask.py line 1077.

I've tried to follow through the code and see what is actually happening when this change is made, but I'm out of my depth here. My guess is that `is_scalar(chunks)` is giving the wrong behaviour when `chunks=None`?

Edit: I think that's the problem!

```python
>>> isinstance(None, numbers.Number)
False
>>> is_scalar(None)
True
```

I'll add in something to catch Nones and see if it fixes the error...
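Something like this guard, perhaps (a sketch of one possible fix, not necessarily what was merged; the `is_scalar` import path is my assumption):

```python
from xarray.core.utils import is_scalar  # import path is an assumption

def _normalize_chunks(chunks, dims):
    # None means "keep existing chunks", so handle it before the scalar
    # check: is_scalar(None) is True, unlike isinstance(None, Number).
    if chunks is not None and is_scalar(chunks):
        chunks = dict.fromkeys(dims, chunks)
    return chunks
```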

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Auto chunk 618828102
629197362 https://github.com/pydata/xarray/pull/4064#issuecomment-629197362 https://api.github.com/repos/pydata/xarray/issues/4064 MDEyOklzc3VlQ29tbWVudDYyOTE5NzM2Mg== AndrewILWilliams 56925856 2020-05-15T12:05:22Z 2020-05-15T12:05:22Z CONTRIBUTOR

No unpushed commits

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Auto chunk 618828102
629191037 https://github.com/pydata/xarray/pull/4064#issuecomment-629191037 https://api.github.com/repos/pydata/xarray/issues/4064 MDEyOklzc3VlQ29tbWVudDYyOTE5MTAzNw== AndrewILWilliams 56925856 2020-05-15T11:49:23Z 2020-05-15T11:49:23Z CONTRIBUTOR

Do you mean the master merge? If that's wrong, would you be able to fix it for me? My bad, hopefully I'll be able to do it more cleanly in future.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Auto chunk 618828102
629168282 https://github.com/pydata/xarray/pull/4064#issuecomment-629168282 https://api.github.com/repos/pydata/xarray/issues/4064 MDEyOklzc3VlQ29tbWVudDYyOTE2ODI4Mg== AndrewILWilliams 56925856 2020-05-15T10:49:43Z 2020-05-15T10:49:43Z CONTRIBUTOR

Okay, that makes sense. Though it seems that I forked the master branch before @kmuehlbauer's commit, which fixed this flake8 issue? So I think I need to make a new fork?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Auto chunk 618828102
629154336 https://github.com/pydata/xarray/pull/4064#issuecomment-629154336 https://api.github.com/repos/pydata/xarray/issues/4064 MDEyOklzc3VlQ29tbWVudDYyOTE1NDMzNg== AndrewILWilliams 56925856 2020-05-15T10:15:50Z 2020-05-15T10:17:38Z CONTRIBUTOR

Okay cheers both! I'll have a look at these now.

@keewis sorry, I'm still getting used to using this side of Git at the moment. Could you clarify what you mean by merge master? Do you mean merge with my local master?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Auto chunk 618828102
629147818 https://github.com/pydata/xarray/pull/4064#issuecomment-629147818 https://api.github.com/repos/pydata/xarray/issues/4064 MDEyOklzc3VlQ29tbWVudDYyOTE0NzgxOA== AndrewILWilliams 56925856 2020-05-15T10:00:35Z 2020-05-15T10:01:43Z CONTRIBUTOR

In my git clone, when I run the `flake8` and `black .` checks, I get the following messages:

```
(xarray-tests) Andrews-MacBook-Pro-2:xarray andrewwilliams$ black .
All done! ✨ 🍰 ✨
143 files left unchanged.
(xarray-tests) Andrews-MacBook-Pro-2:xarray andrewwilliams$ flake8
./xarray/backends/memory.py:43:32: E741 ambiguous variable name 'l'
./xarray/backends/common.py:244:32: E741 ambiguous variable name 'l'
./xarray/backends/.ipynb_checkpoints/memory-checkpoint.py:43:32: E741 ambiguous variable name 'l'
```

I'm not sure why something has changed in these files (I haven't touched them), and I also can't work out what the `l` variable is meant to be doing there.

Could this somehow be associated with loads of the checks failing below? Thanks! :)

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Auto chunk 618828102
628797255 https://github.com/pydata/xarray/issues/4055#issuecomment-628797255 https://api.github.com/repos/pydata/xarray/issues/4055 MDEyOklzc3VlQ29tbWVudDYyODc5NzI1NQ== AndrewILWilliams 56925856 2020-05-14T18:01:45Z 2020-05-14T18:01:45Z CONTRIBUTOR

I also thought that: after the dask error message, it's pretty easy to then look at the dataset and check what the problem dimension is.

In general though, is that the type of layout you'd suggest for catching and re-raising errors? Using `raise Exception()`?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Automatic chunking of arrays ? 617476316
628616379 https://github.com/pydata/xarray/issues/4055#issuecomment-628616379 https://api.github.com/repos/pydata/xarray/issues/4055 MDEyOklzc3VlQ29tbWVudDYyODYxNjM3OQ== AndrewILWilliams 56925856 2020-05-14T12:57:21Z 2020-05-14T17:50:31Z CONTRIBUTOR

Nice, that's neater! Would this work, in the `maybe_chunk()` call? Sorry about the basic questions!

```python
def maybe_chunk(name, var, chunks):
    chunks = selkeys(chunks, var.dims)
    if not chunks:
        chunks = None
    if var.ndim > 0:
        # when rechunking by different amounts, make sure dask names change
        # by providing chunks as an input to tokenize.
        # subtle bugs result otherwise. see GH3350
        token2 = tokenize(name, token if token else var._data, chunks)
        name2 = f"{name_prefix}{name}-{token2}"
        try:
            return var.chunk(chunks, name=name2, lock=lock)
        except NotImplementedError:
            raise Exception(
                "Automatic chunking fails for object arrays. "
                "These include cftime DataArrays."
            )
    else:
        return var
```

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Automatic chunking of arrays ? 617476316
628513777 https://github.com/pydata/xarray/issues/4055#issuecomment-628513777 https://api.github.com/repos/pydata/xarray/issues/4055 MDEyOklzc3VlQ29tbWVudDYyODUxMzc3Nw== AndrewILWilliams 56925856 2020-05-14T09:26:24Z 2020-05-14T09:26:24Z CONTRIBUTOR

Also, the contributing docs have been super clear so far! Thanks! :)

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Automatic chunking of arrays ? 617476316
628513443 https://github.com/pydata/xarray/issues/4055#issuecomment-628513443 https://api.github.com/repos/pydata/xarray/issues/4055 MDEyOklzc3VlQ29tbWVudDYyODUxMzQ0Mw== AndrewILWilliams 56925856 2020-05-14T09:25:48Z 2020-05-14T09:25:48Z CONTRIBUTOR

Cheers! Just had a look. Is it as simple as just changing this line to the following, @dcherian?

```python
if isinstance(chunks, Number) or chunks == 'auto':
    chunks = dict.fromkeys(self.dims, chunks)
```

This seems to work fine in a lot of cases, except that automatic chunking isn't implemented for object dtypes at the moment, so it fails if you pass a cftime coordinate, for example.

One option is to automatically use `self = xr.decode_cf(self)` if the input dataset is cftime? Or we could just throw an error.
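To illustrate the failure mode (my own sketch, not from the thread): dask refuses to auto-chunk object-dtype arrays, and un-decoded cftime coordinates end up as object dtype.

```python
import numpy as np
import dask.array

obj_arr = np.array([object() for _ in range(4)], dtype=object)

try:
    dask.array.from_array(obj_arr, chunks="auto")
except NotImplementedError as err:
    print(err)  # dask cannot estimate chunk sizes for object dtypes
```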

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Automatic chunking of arrays ? 617476316
628212516 https://github.com/pydata/xarray/issues/4055#issuecomment-628212516 https://api.github.com/repos/pydata/xarray/issues/4055 MDEyOklzc3VlQ29tbWVudDYyODIxMjUxNg== AndrewILWilliams 56925856 2020-05-13T19:56:34Z 2020-05-13T19:56:34Z CONTRIBUTOR

Oh ok, I didn't know about this. I'll take a look and read the contribution docs tomorrow! It'll be my first PR, so I may need a bit of hand-holding when it comes to tests. Willing to try though!

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Automatic chunking of arrays ? 617476316
625345337 https://github.com/pydata/xarray/issues/3784#issuecomment-625345337 https://api.github.com/repos/pydata/xarray/issues/3784 MDEyOklzc3VlQ29tbWVudDYyNTM0NTMzNw== AndrewILWilliams 56925856 2020-05-07T16:02:43Z 2020-05-07T16:02:43Z CONTRIBUTOR

Hi @max-sixty, just coming back to this now. It seems @r-beer isn't available... do you know roughly how far away his PR was from completion? I'm getting a little bit lost trying to follow #3550, sorry!

Was the main todo to avoid the `drop=True` after broadcasting? Is there any idea about what to do instead?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Function for regressing/correlating multiple fields? 568378007
589368049 https://github.com/pydata/xarray/issues/3784#issuecomment-589368049 https://api.github.com/repos/pydata/xarray/issues/3784 MDEyOklzc3VlQ29tbWVudDU4OTM2ODA0OQ== AndrewILWilliams 56925856 2020-02-20T22:08:01Z 2020-02-20T22:08:01Z CONTRIBUTOR

@max-sixty Just had a peruse through a few of the relevant issues, do we know what the status of #3550 is? It seems like @r-beer was pretty close on this, right?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Function for regressing/correlating multiple fields? 568378007
589328765 https://github.com/pydata/xarray/issues/3784#issuecomment-589328765 https://api.github.com/repos/pydata/xarray/issues/3784 MDEyOklzc3VlQ29tbWVudDU4OTMyODc2NQ== AndrewILWilliams 56925856 2020-02-20T21:32:26Z 2020-02-20T21:32:26Z CONTRIBUTOR

I'll take a look at them!

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Function for regressing/correlating multiple fields? 568378007

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);