html_url,issue_url,id,node_id,user,created_at,updated_at,author_association,body,reactions,performed_via_github_app,issue
https://github.com/pydata/xarray/pull/2922#issuecomment-601612380,https://api.github.com/repos/pydata/xarray/issues/2922,601612380,MDEyOklzc3VlQ29tbWVudDYwMTYxMjM4MA==,10194086,2020-03-20T09:45:23Z,2020-10-27T14:47:22Z,MEMBER,"tldr: if someone knows how to do memory profiling with reasonable effort this can still be changed
It's certainly not too late to change the ""backend"" of the weighting functions. I once tried to profile the memory usage but I gave up at some point (I think I would have needed to annotate a ton of functions, also in numpy).
@fujiisoup suggested using `xr.dot(a, b)` (instead of `(a * b).sum()`) to ameliorate part of the memory footprint. ~Which is done, however, this comes at a performance penalty. So an alternative code path might actually be beneficial.~ (edit: I now think `xr.dot` is actually faster, except for very small arrays).
Also `mask` is an array of dtype bool, which should help.
I think it should not be very difficult to write something that can be passed to `apply_ufunc`, probably similar to:
https://github.com/pydata/xarray/blob/e8a284f341645a63a4d83676a6b268394c721bbc/xarray/tests/test_weighted.py#L161
So there would be three possibilities: (1) the current implementation (using `xr.dot(a, b)`) (2) something similar to `expected_weighted` (using `(a * b).sum()`) (3) `xr.apply_ufunc(a, b, expected_weighted, ...)`. I assume (2) is fastest with the largest memory footprint, but I cannot tell about (1) and (3).
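A minimal sketch of the three options (toy shapes, NaN handling omitted; the `apply_ufunc` kernel is a stand-in, not the actual `expected_weighted`):

```python
import numpy as np
import xarray as xr

da = xr.DataArray(np.random.rand(4, 5), dims=('x', 'y'))
weights = xr.DataArray(np.random.rand(5), dims='y')

# (1) current implementation: contracts over the shared dim via np.einsum
s1 = xr.dot(da, weights)

# (2) elementwise product then reduce: materializes the full da * weights array
s2 = (da * weights).sum('y')

# (3) a plain-numpy kernel pushed through apply_ufunc
s3 = xr.apply_ufunc(
    lambda a, w: a @ w,  # weighted sum over the core (last) axis
    da,
    weights,
    input_core_dims=[['y'], ['y']],
)
```
All three give the same result here; they differ in which temporaries they allocate along the way.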
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,437765416
https://github.com/pydata/xarray/pull/2922#issuecomment-601885539,https://api.github.com/repos/pydata/xarray/issues/2922,601885539,MDEyOklzc3VlQ29tbWVudDYwMTg4NTUzOQ==,7441788,2020-03-20T19:57:54Z,2020-03-20T20:00:20Z,CONTRIBUTOR,"All good points:
> What could be done, though is to only do da = da.fillna(0.0) if da contains NaNs.
Good idea, though I don't know what the performance hit would be of the extra check (in the case that da does contain NaNs, so the check is for naught).
> I assume so. I don't know what kind of temporary variables np.einsum creates. Also np.einsum is wrapped in xr.apply_ufunc so all kinds of magic is going on.
Well, `(da * weights)` will be at least as large as `da`. I'm not certain, but I don't think np.einsum creates huge temporary arrays.
> Do you want to leave it away for performance reasons? Because it was a deliberate decision to not support NaNs in the weights and I don't think this is going to change.
Yes. You can continue not supporting NaNs in the weights, yet not explicitly check that there are no NaNs (optionally, if the caller assures you that there are no NaNs).
> None of your suggested functions support NaNs so they won't work.
Correct. These have nothing to do with the NaNs issue.
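On the memory-profiling question from the top of the thread, a minimal stdlib-only sketch (Linux/macOS only; `psutil` can additionally report current usage):

```python
import resource

def peak_rss_mib():
    # peak resident set size of this process, in MiB
    # note: ru_maxrss is reported in KiB on Linux, but in bytes on macOS
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024
```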
For profiling memory usage, I use `psutil.Process(os.getpid()).memory_info().rss` for current usage and `resource.getrusage(resource.RUSAGE_SELF).ru_maxrss` for peak usage (on Linux).","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,437765416
https://github.com/pydata/xarray/pull/2922#issuecomment-601824129,https://api.github.com/repos/pydata/xarray/issues/2922,601824129,MDEyOklzc3VlQ29tbWVudDYwMTgyNDEyOQ==,10194086,2020-03-20T17:31:15Z,2020-03-20T17:31:15Z,MEMBER,"There is some stuff I can do to reduce the memory footprint if `skipna=False` or `not da.isnull().any()`. Also, the functions should support `dask` arrays out of the box.
---
> ideally `dot()` would support `skipna`, so you could eliminate the `da = da.fillna(0.0)` and pass the `skipna` down the line. But alas it doesn't...
Yes, this would be nice. `xr.dot` uses `np.einsum` which is [quite a beast](https://github.com/numpy/numpy/blob/v1.17.0/numpy/core/einsumfunc.py) that I don't entirely see through. I don't expect it to support `NaN`s any time soon.
What could be done, though is to only do `da = da.fillna(0.0)` if `da` contains `NaN`s.
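That guard could look something like the following sketch (note the `isnull().any()` check itself costs a full pass over `da`, as discussed in this thread):

```python
import numpy as np
import xarray as xr

da = xr.DataArray([1.0, np.nan, 3.0], dims='x')

# only pay for the fillna copy when da actually contains NaNs
if da.isnull().any():
    da = da.fillna(0.0)
```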
> `(da * weights).sum(dim=dim, skipna=skipna)` would likely make things worse, I think, as it would necessarily create a temporary array of size at least that of `da`, no?
I assume so. I don't know what kind of temporary variables `np.einsum` creates. Also `np.einsum` is wrapped in `xr.apply_ufunc` so all kinds of magic is going on.
> Either way, this only addresses the `da = da.fillna(0.0)`, not the `mask = da.notnull()`.
Again this could be avoided if `skipna=False` or if (and only if) there are no `NaN`s in da.
> Also, perhaps the test `if weights.isnull().any()` in `Weighted.__init__()` should be optional?
Do you want to leave it away for performance reasons? Because it was a deliberate decision to *not* support `NaN`s in the weights and I don't think this is going to change.
> Maybe I'm more sensitive to this than others, but I regularly deal with 10-100GB arrays.
No it's important to make sure this stuff works for large arrays. However, using `xr.dot` already gives quite a performance penalty, which I am not super happy about.
> have you considered using these functions? [...]
None of your suggested functions support `NaNs` so they won't work.
I am all for supporting more functions, but currently I am happy we got a weighted sum and mean into xarray after 5(!) years!
Further libraries that support weighted operations:
* [esmlab](https://github.com/NCAR/esmlab/blob/master/esmlab/statistics.py) (xarray-based, supports NaN)
* [statsmodels](https://www.statsmodels.org/dev/_modules/statsmodels/stats/weightstats.html) (numpy-based, does not support NaN)
","{""total_count"": 1, ""+1"": 1, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,437765416
https://github.com/pydata/xarray/pull/2922#issuecomment-601709733,https://api.github.com/repos/pydata/xarray/issues/2922,601709733,MDEyOklzc3VlQ29tbWVudDYwMTcwOTczMw==,7441788,2020-03-20T13:47:39Z,2020-03-20T16:31:14Z,CONTRIBUTOR,"@mathause, have you considered using these functions?
- [np.average()](https://docs.scipy.org/doc/numpy/reference/generated/numpy.average.html) to calculate weighted `mean()`.
- [np.cov()](https://docs.scipy.org/doc/numpy/reference/generated/numpy.cov.html) to calculate weighted `cov()`, `var()`, and `std()`.
- [sp.stats.cumfreq()](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.cumfreq.html) to calculate weighted `median()` (I haven't thought this through).
- [sp.spatial.distance.correlation()](https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.correlation.html) to calculate weighted `corrcoef()`. (Of course one could also calculate this from weighted `cov()` (see above), but first need to mask the two arrays simultaneously.)
- [sklearn.utils.extmath.weighted_mode()](https://scikit-learn.org/stable/modules/generated/sklearn.utils.extmath.weighted_mode.html) to calculate weighted `mode()`.
- [gmisclib.weighted_percentile.{wp,wtd_median}()](http://kochanski.org/gpk/code/speechresearch/gmisclib/gmisclib.weighted_percentile-module.html) to calculate weighted `quantile()` and `median()`.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,437765416
https://github.com/pydata/xarray/pull/2922#issuecomment-601708110,https://api.github.com/repos/pydata/xarray/issues/2922,601708110,MDEyOklzc3VlQ29tbWVudDYwMTcwODExMA==,7441788,2020-03-20T13:44:03Z,2020-03-20T13:52:06Z,CONTRIBUTOR,"@mathause, ideally `dot()` would support `skipna`, so you could eliminate the `da = da.fillna(0.0)` and pass the `skipna` down the line. But alas it doesn't...
`(da * weights).sum(dim=dim, skipna=skipna)` would likely make things worse, I think, as it would necessarily create a temporary array of size at least that of `da`, no?
Either way, this only addresses the `da = da.fillna(0.0)`, not the `mask = da.notnull()`.
Also, perhaps the test `if weights.isnull().any()` in `Weighted.__init__()` should be optional?
Maybe I'm more sensitive to this than others, but I regularly deal with 10-100GB arrays.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,437765416
https://github.com/pydata/xarray/pull/2922#issuecomment-601699091,https://api.github.com/repos/pydata/xarray/issues/2922,601699091,MDEyOklzc3VlQ29tbWVudDYwMTY5OTA5MQ==,7441788,2020-03-20T13:25:21Z,2020-03-20T13:25:21Z,CONTRIBUTOR,"@max-sixty, I wish I could, but I'm afraid that I cannot submit code due to employer limitations.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,437765416
https://github.com/pydata/xarray/pull/2922#issuecomment-601514904,https://api.github.com/repos/pydata/xarray/issues/2922,601514904,MDEyOklzc3VlQ29tbWVudDYwMTUxNDkwNA==,5635139,2020-03-20T04:01:34Z,2020-03-20T04:01:34Z,MEMBER,"We do those sorts of operations fairly frequently, so it's not unique here. Generally users should expect to have available ~3x the memory of an array for most operations.
@seth-p it's great you've taken an interest in the project! Is there any chance we could harness that into some contributions? 😄 ","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,437765416
https://github.com/pydata/xarray/pull/2922#issuecomment-601496897,https://api.github.com/repos/pydata/xarray/issues/2922,601496897,MDEyOklzc3VlQ29tbWVudDYwMTQ5Njg5Nw==,7441788,2020-03-20T02:11:53Z,2020-03-20T02:12:24Z,CONTRIBUTOR,"I realize this is a bit late, but I'm still concerned about memory usage, specifically in https://github.com/pydata/xarray/blob/master/xarray/core/weighted.py#L130 and https://github.com/pydata/xarray/blob/master/xarray/core/weighted.py#L143.
If `da.sizes = {'dim_0': 100000, 'dim_1': 100000}`, the two lines above will cause `da.weighted(weights).mean('dim_0')` to create two simultaneous temporary 100000x100000 arrays, which could be problematic.
I would have implemented this using ``apply_ufunc``, so that one creates these temporary variables only on as small an array as absolutely necessary -- in this case just of size `sizes['dim_0'] = 100000`. (Much as I would like to, I'm afraid I'm not able to contribute code.) Of course this won't help in the case one is summing over all dimensions, but might as well minimize memory usage in some cases even if not in all.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,437765416
https://github.com/pydata/xarray/pull/2922#issuecomment-601377953,https://api.github.com/repos/pydata/xarray/issues/2922,601377953,MDEyOklzc3VlQ29tbWVudDYwMTM3Nzk1Mw==,5635139,2020-03-19T19:34:42Z,2020-03-19T19:34:42Z,MEMBER,"> #422 was opened in June of 2015, amazing.
😂
@mathause props for the persistence...","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,437765416
https://github.com/pydata/xarray/pull/2922#issuecomment-601298407,https://api.github.com/repos/pydata/xarray/issues/2922,601298407,MDEyOklzc3VlQ29tbWVudDYwMTI5ODQwNw==,2443309,2020-03-19T16:58:57Z,2020-03-19T16:58:57Z,MEMBER,"Big time!!!! Thanks @mathause! #422 was opened in June of 2015, amazing.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,437765416
https://github.com/pydata/xarray/pull/2922#issuecomment-601283025,https://api.github.com/repos/pydata/xarray/issues/2922,601283025,MDEyOklzc3VlQ29tbWVudDYwMTI4MzAyNQ==,5635139,2020-03-19T16:37:43Z,2020-03-19T16:37:43Z,MEMBER,Thanks @mathause !,"{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,437765416
https://github.com/pydata/xarray/pull/2922#issuecomment-601214104,https://api.github.com/repos/pydata/xarray/issues/2922,601214104,MDEyOklzc3VlQ29tbWVudDYwMTIxNDEwNA==,10194086,2020-03-19T14:35:25Z,2020-03-19T14:35:25Z,MEMBER,Great! Thanks for all the feedback and support!,"{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,437765416
https://github.com/pydata/xarray/pull/2922#issuecomment-601210885,https://api.github.com/repos/pydata/xarray/issues/2922,601210885,MDEyOklzc3VlQ29tbWVudDYwMTIxMDg4NQ==,2448579,2020-03-19T14:29:42Z,2020-03-19T14:29:42Z,MEMBER,"This is going in.
Thanks @mathause. This is a major contribution!","{""total_count"": 1, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 1, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,437765416
https://github.com/pydata/xarray/pull/2922#issuecomment-487130921,https://api.github.com/repos/pydata/xarray/issues/2922,487130921,MDEyOklzc3VlQ29tbWVudDQ4NzEzMDkyMQ==,24736507,2019-04-26T17:09:07Z,2020-03-18T01:42:06Z,NONE,"Hello @mathause! Thanks for updating this PR. We checked the lines you've touched for [PEP 8](https://www.python.org/dev/peps/pep-0008) issues, and found:
There are currently no PEP 8 issues detected in this Pull Request. Cheers! :beers:
##### Comment last updated at 2020-03-18 01:42:05 UTC","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,437765416
https://github.com/pydata/xarray/pull/2922#issuecomment-595373665,https://api.github.com/repos/pydata/xarray/issues/2922,595373665,MDEyOklzc3VlQ29tbWVudDU5NTM3MzY2NQ==,10194086,2020-03-05T18:18:22Z,2020-03-05T18:18:22Z,MEMBER,I updated this once more. Mostly moved the example to a notebook.,"{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,437765416
https://github.com/pydata/xarray/pull/2922#issuecomment-562206026,https://api.github.com/repos/pydata/xarray/issues/2922,562206026,MDEyOklzc3VlQ29tbWVudDU2MjIwNjAyNg==,10194086,2019-12-05T16:29:51Z,2019-12-05T16:29:51Z,MEMBER,This is now ready for a full review. I added tests for weighted reductions over several dimensions and docs.,"{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,437765416
https://github.com/pydata/xarray/pull/2922#issuecomment-545512847,https://api.github.com/repos/pydata/xarray/issues/2922,545512847,MDEyOklzc3VlQ29tbWVudDU0NTUxMjg0Nw==,10194086,2019-10-23T15:55:35Z,2019-10-23T15:55:35Z,MEMBER,">> I decided to replace all NaN in the weights with 0.
> Can we raise an error instead? It should be easy for the user to do weights.fillna(0) instead of relying on xarray's magical behaviour.
I agree, requiring valid weights is a sensible choice.
>> if weights sum to 0 it returns NaN (and not inf)
> Should we raise an error here?
I'm not sure... Assume I want to do a meridional mean and only have data over land; this would then raise an error, which is not what I want.
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,437765416
https://github.com/pydata/xarray/pull/2922#issuecomment-545200082,https://api.github.com/repos/pydata/xarray/issues/2922,545200082,MDEyOklzc3VlQ29tbWVudDU0NTIwMDA4Mg==,2448579,2019-10-22T23:35:52Z,2019-10-22T23:35:52Z,MEMBER,"> I decided to replace all NaN in the weights with 0.
Can we raise an error instead? It should be easy for the user to do `weights.fillna(0)` instead of relying on xarray's magical behaviour.
> if weights sum to 0 it returns NaN (and not inf)
Should we raise an error here?
> The following returns NaN (could be 1)
I think NaN is fine since that's the result of `1 + 0*np.nan`
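A tiny numpy illustration of that arithmetic (hypothetical values: one valid entry, one NaN entry with weight 0):

```python
import numpy as np

data = np.array([1.0, np.nan])
weights = np.array([1.0, 0.0])

# the naive weighted sum is 1*1 + 0*nan, and 0 * nan is still nan,
# so the NaN propagates through the reduction
naive = (data * weights).sum() / weights.sum()
```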
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,437765416
https://github.com/pydata/xarray/pull/2922#issuecomment-543358453,https://api.github.com/repos/pydata/xarray/issues/2922,543358453,MDEyOklzc3VlQ29tbWVudDU0MzM1ODQ1Mw==,10194086,2019-10-17T20:56:32Z,2019-10-17T20:59:08Z,MEMBER,"I finally made some time to work on this - although I feel far from finished...
* added a `DatasetWeighted` class
* for this I pulled the functionality out of the `DataArrayWeighted` class into standalone functions taking `da` and `weights` as input
* the tests need more work
* implemented the functionality using `xr.dot` -> this makes the logic a bit more complicated
* I think the failure in `Linux py37-upstream-dev` is unrelated
Questions:
* does this implementation look reasonable to you?
* `xr.dot` does not have an `axis` keyword; is it fine if I leave it out in my functions?
* flake8 fails because I use `@overload` for typing -> should I remove this?
* Currently I have the functionality three times: once as `_weighted_sum`, once as `da.weighted().sum()`, and once as `ds.weighted().sum()`; how do I best test this?","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,437765416
https://github.com/pydata/xarray/pull/2922#issuecomment-512243216,https://api.github.com/repos/pydata/xarray/issues/2922,512243216,MDEyOklzc3VlQ29tbWVudDUxMjI0MzIxNg==,10194086,2019-07-17T12:59:16Z,2019-07-17T12:59:16Z,MEMBER,"Thanks, I am still very interested to get this in. I don't think I'll manage before my holidays, though.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,437765416
https://github.com/pydata/xarray/pull/2922#issuecomment-511002355,https://api.github.com/repos/pydata/xarray/issues/2922,511002355,MDEyOklzc3VlQ29tbWVudDUxMTAwMjM1NQ==,1197350,2019-07-12T19:16:16Z,2019-07-12T19:16:16Z,MEMBER,"Hi @mathause - We really appreciate your contribution. Sorry your PR has stalled!
Do you think you can respond to @fujiisoup's review and add documentation? Then we can get this merged.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,437765416
https://github.com/pydata/xarray/pull/2922#issuecomment-488031173,https://api.github.com/repos/pydata/xarray/issues/2922,488031173,MDEyOklzc3VlQ29tbWVudDQ4ODAzMTE3Mw==,10194086,2019-04-30T16:57:05Z,2019-04-30T16:57:05Z,MEMBER,"I updated the PR
* added a weighted `sum` method
* tried to clean the tests
Before I continue, it would be nice to get some feedback.
* Currently I only do very simple tests - is that enough? How could these be generalized without re-implementing the functions to obtain `expected`?
* Eventually it would be cool to be able to do `da.groupby(regions).weighted(weights).mean()` - do you see a possibility for this with the current implementation?
As mentioned by @aaronspring, esmlab already implemented [weighted statistic functions](
https://github.com/NCAR/esmlab/blob/master/esmlab/statistics.py). Similarly, statsmodels for 1D data without handling of NaNs ([docs](https://www.statsmodels.org/dev/generated/statsmodels.stats.weightstats.DescrStatsW.html#statsmodels.stats.weightstats.DescrStatsW) / [code](https://www.statsmodels.org/dev/_modules/statsmodels/stats/weightstats.html#DescrStatsW)). Thus it should be feasible to implement further statistics here (weighted `std`).
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,437765416