html_url,issue_url,id,node_id,user,created_at,updated_at,author_association,body,reactions,performed_via_github_app,issue
https://github.com/pydata/xarray/pull/2922#issuecomment-601612380,https://api.github.com/repos/pydata/xarray/issues/2922,601612380,MDEyOklzc3VlQ29tbWVudDYwMTYxMjM4MA==,10194086,2020-03-20T09:45:23Z,2020-10-27T14:47:22Z,MEMBER,"tldr: if someone knows how to do memory profiling with reasonable effort this can still be changed
It's certainly not too late to change the ""backend"" of the weighting functions. I once tried to profile the memory usage but I gave up at some point (I think I would have needed to annotate a ton of functions, also in numpy).
@fujiisoup suggested using `xr.dot(a, b)` (instead of `(a * b).sum()`) to ameliorate part of the memory footprint. ~Which is done, however, this comes at a performance penalty. So an alternative code path might actually be beneficial.~ (edit: I now think `xr.dot` is actually faster, except for very small arrays).
Also `mask` is an array of dtype bool, which should help.
I think it should not be very difficult to write something that can be passed to `apply_ufunc`, probably similar to:
https://github.com/pydata/xarray/blob/e8a284f341645a63a4d83676a6b268394c721bbc/xarray/tests/test_weighted.py#L161
So there would be three possibilities: (1) the current implementation (using `xr.dot(a, b)`); (2) something similar to `expected_weighted` (using `(a * b).sum()`); (3) `xr.apply_ufunc(expected_weighted, a, b, ...)`. I assume (2) is fastest but has the largest memory footprint; I cannot tell about (1) and (3).
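For concreteness, a toy sketch of the three code paths (array names, shapes, and values are made up purely for illustration; the real `expected_weighted` kernel is more involved):

```python
import numpy as np
import xarray as xr

# tiny example data, just to compare the three code paths
da = xr.DataArray(np.arange(6.0).reshape(2, 3), dims=('x', 'y'))
weights = xr.DataArray([1.0, 2.0], dims='x')

# (1) current implementation: xr.dot contracts the shared dim via np.einsum
path1 = xr.dot(da, weights)

# (2) broadcast-multiply then sum: creates a temporary the size of da
path2 = (da * weights).sum(dim='x')

# (3) apply_ufunc over a plain-numpy kernel (core dim is moved to the last axis)
path3 = xr.apply_ufunc(
    lambda a, w: (a * w).sum(axis=-1),
    da,
    weights,
    input_core_dims=[['x'], ['x']],
)

# all three agree for NaN-free input
assert np.allclose(path1, path2) and np.allclose(path1, path3)
```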
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,437765416
https://github.com/pydata/xarray/pull/2922#issuecomment-601824129,https://api.github.com/repos/pydata/xarray/issues/2922,601824129,MDEyOklzc3VlQ29tbWVudDYwMTgyNDEyOQ==,10194086,2020-03-20T17:31:15Z,2020-03-20T17:31:15Z,MEMBER,"There is some stuff I can do to reduce the memory footprint if `skipna=False` or `not da.isnull().any()`. Also, the functions should support `dask` arrays out of the box.
---
> ideally `dot()` would support `skipna`, so you could eliminate the `da = da.fillna(0.0)` and pass the `skipna` down the line. But alas it doesn't...
Yes, this would be nice. `xr.dot` uses `np.einsum` which is [quite a beast](https://github.com/numpy/numpy/blob/v1.17.0/numpy/core/einsumfunc.py) that I don't entirely see through. I don't expect it to support `NaN`s any time soon.
What could be done, though, is to only do `da = da.fillna(0.0)` if `da` actually contains `NaN`s.
> `(da * weights).sum(dim=dim, skipna=skipna)` would likely make things worse, I think, as it would necessarily create a temporary array of sized at least `da`, no?
I assume so. I don't know what kind of temporary variables `np.einsum` creates. Also, `np.einsum` is wrapped in `xr.apply_ufunc`, so all kinds of magic are going on.
> Either way, this only addresses the `da = da.fillna(0.0)`, not the `mask = da.notnull()`.
Again, this could be avoided if `skipna=False` or if (and only if) there are no `NaN`s in `da`.
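A sketch of how the mask could be skipped for NaN-free data (`sum_of_weights` here is a hypothetical helper, not the actual implementation):

```python
import numpy as np
import xarray as xr

def sum_of_weights(da, weights, dim):
    # hypothetical sketch: only build the bool mask when da has NaNs;
    # for NaN-free data the sum of weights is independent of da
    if bool(da.isnull().any()):
        return xr.dot(da.notnull(), weights)  # contracts the shared dim
    return weights.sum(dim)

da = xr.DataArray([[1.0, np.nan], [2.0, 3.0]], dims=('y', 'x'))
weights = xr.DataArray([1.0, 2.0], dims='x')

per_y = sum_of_weights(da, weights, 'x')  # per-y sums, NaN excluded
```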
> Also, perhaps the test `if weights.isnull().any()` in `Weighted.__init__()` should be optional?
Do you want to leave it out for performance reasons? Because it was a deliberate decision to *not* support `NaN`s in the weights and I don't think this is going to change.
> Maybe I'm more sensitive to this than others, but I regularly deal with 10-100GB arrays.
No, it's important to make sure this stuff works for large arrays. However, using `xr.dot` already incurs quite a performance penalty, which I am not super happy about.
> have you considered using these functions? [...]
None of your suggested functions support `NaNs` so they won't work.
I am all for supporting more functions, but currently I am happy we got a weighted sum and mean into xarray after 5(!) years!
Further libraries that support weighted operations:
* [esmlab](https://github.com/NCAR/esmlab/blob/master/esmlab/statistics.py) (xarray-based, supports NaN)
* [statsmodels](https://www.statsmodels.org/dev/_modules/statsmodels/stats/weightstats.html) (numpy-based, does not support NaN)
","{""total_count"": 1, ""+1"": 1, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,437765416
https://github.com/pydata/xarray/pull/2922#issuecomment-601214104,https://api.github.com/repos/pydata/xarray/issues/2922,601214104,MDEyOklzc3VlQ29tbWVudDYwMTIxNDEwNA==,10194086,2020-03-19T14:35:25Z,2020-03-19T14:35:25Z,MEMBER,Great! Thanks for all the feedback and support!,"{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,437765416
https://github.com/pydata/xarray/pull/2922#issuecomment-595373665,https://api.github.com/repos/pydata/xarray/issues/2922,595373665,MDEyOklzc3VlQ29tbWVudDU5NTM3MzY2NQ==,10194086,2020-03-05T18:18:22Z,2020-03-05T18:18:22Z,MEMBER,I updated this once more. Mostly moved the example to a notebook.,"{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,437765416
https://github.com/pydata/xarray/pull/2922#issuecomment-562206026,https://api.github.com/repos/pydata/xarray/issues/2922,562206026,MDEyOklzc3VlQ29tbWVudDU2MjIwNjAyNg==,10194086,2019-12-05T16:29:51Z,2019-12-05T16:29:51Z,MEMBER,This is now ready for a full review. I added tests for weighted reductions over several dimensions and docs.,"{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,437765416
https://github.com/pydata/xarray/pull/2922#issuecomment-545512847,https://api.github.com/repos/pydata/xarray/issues/2922,545512847,MDEyOklzc3VlQ29tbWVudDU0NTUxMjg0Nw==,10194086,2019-10-23T15:55:35Z,2019-10-23T15:55:35Z,MEMBER,">> I decided to replace all NaN in the weights with 0.
> Can we raise an error instead? It should be easy for the user to do weights.fillna(0) instead of relying on xarray's magical behaviour.
I agree, requiring valid weights is a sensible choice.
>> if weights sum to 0 it returns NaN (and not inf)
> Should we raise an error here?
I'm not sure... Assume I want to do a meridional mean but only have data over land: this would then raise an error, which is not what I want.
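For illustration, assuming the behaviour discussed above (NaN in the weights raises, zero-sum weights yield NaN), the two cases would look like:

```python
import numpy as np
import xarray as xr

da = xr.DataArray([1.0, 2.0], dims='x')

# NaN in the weights raises - users call weights.fillna(0) themselves
try:
    da.weighted(xr.DataArray([1.0, np.nan], dims='x'))
except ValueError:
    pass

# weights summing to 0 (e.g. a land mask over open ocean) give NaN, not an error
result = da.weighted(xr.DataArray([0.0, 0.0], dims='x')).mean('x')
assert bool(result.isnull())
```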
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,437765416
https://github.com/pydata/xarray/pull/2922#issuecomment-543358453,https://api.github.com/repos/pydata/xarray/issues/2922,543358453,MDEyOklzc3VlQ29tbWVudDU0MzM1ODQ1Mw==,10194086,2019-10-17T20:56:32Z,2019-10-17T20:59:08Z,MEMBER,"I finally made some time to work on this - although I feel far from finished...
* added a `DatasetWeighted` class
* for this I pulled the functionality out of the `DataArrayWeighted` class into standalone functions taking `da` and `weights` as input
* the tests need more work
* implemented the functionality using `xr.dot` -> this makes the logic a bit more complicated
* I think the failure in `Linux py37-upstream-dev` is unrelated
Questions:
* does this implementation look reasonable to you?
* `xr.dot` does not have an `axis` keyword - is it fine if I leave it out in my functions?
* flake8 fails because I use `@overload` for typing -> should I remove this?
* Currently I have the functionality three times: once as `_weighted_sum`, once as `da.weighted.sum()` and once as `ds.weighted().sum` - how do I best test this?","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,437765416
https://github.com/pydata/xarray/pull/2922#issuecomment-512243216,https://api.github.com/repos/pydata/xarray/issues/2922,512243216,MDEyOklzc3VlQ29tbWVudDUxMjI0MzIxNg==,10194086,2019-07-17T12:59:16Z,2019-07-17T12:59:16Z,MEMBER,"Thanks, I am still very interested to get this in. I don't think I'll manage before my holidays, though.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,437765416
https://github.com/pydata/xarray/pull/2922#issuecomment-488031173,https://api.github.com/repos/pydata/xarray/issues/2922,488031173,MDEyOklzc3VlQ29tbWVudDQ4ODAzMTE3Mw==,10194086,2019-04-30T16:57:05Z,2019-04-30T16:57:05Z,MEMBER,"I updated the PR
* added a weighted `sum` method
* tried to clean the tests
Before I continue, it would be nice to get some feedback.
* Currently I only do very simple tests - is that enough? How could these be generalized without re-implementing the functions to obtain `expected`?
* Eventually it would be cool to be able to do `da.groupby(regions).weighted(weights).mean()` - do you see a possibility for this with the current implementation?
As mentioned by @aaronspring, esmlab already implemented [weighted statistic functions](
https://github.com/NCAR/esmlab/blob/master/esmlab/statistics.py). Similarly, statsmodels implements them for 1D data, without handling of NaNs ([docs](https://www.statsmodels.org/dev/generated/statsmodels.stats.weightstats.DescrStatsW.html#statsmodels.stats.weightstats.DescrStatsW) / [code](https://www.statsmodels.org/dev/_modules/statsmodels/stats/weightstats.html#DescrStatsW)). Thus it should be feasible to implement further statistics here (weighted `std`).
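On top of the `weighted` accessor, a weighted `std` could later be sketched roughly like this (`weighted_std` is a hypothetical helper, giving the biased/population-style estimate; a proper implementation would need to think about the bias correction):

```python
import numpy as np
import xarray as xr

def weighted_std(da, weights, dim):
    # hypothetical sketch: sqrt of the weighted mean of squared anomalies
    mean = da.weighted(weights).mean(dim)
    var = ((da - mean) ** 2).weighted(weights).mean(dim)
    return np.sqrt(var)

da = xr.DataArray([1.0, 2.0, 3.0], dims='x')
weights = xr.DataArray([1.0, 1.0, 1.0], dims='x')

# with equal weights this reduces to the ordinary population std (ddof=0)
assert abs(float(weighted_std(da, weights, 'x')) - float(da.std('x'))) < 1e-12
```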
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,437765416