html_url,issue_url,id,node_id,user,created_at,updated_at,author_association,body,reactions,performed_via_github_app,issue
https://github.com/pydata/xarray/pull/2922#issuecomment-601612380,https://api.github.com/repos/pydata/xarray/issues/2922,601612380,MDEyOklzc3VlQ29tbWVudDYwMTYxMjM4MA==,10194086,2020-03-20T09:45:23Z,2020-10-27T14:47:22Z,MEMBER,"tl;dr: if someone knows how to do memory profiling with reasonable effort, this can still be changed.

It's certainly not too late to change the ""backend"" of the weighting functions. I once tried to profile the memory usage but gave up at some point (I think I would have needed to annotate a ton of functions, also in numpy).

@fujiisoup suggested using `xr.dot(a, b)` (instead of `(a * b).sum()`) to ameliorate part of the memory footprint. ~This is done; however, it comes at a performance penalty, so an alternative code path might actually be beneficial.~ (edit: I now think `xr.dot` is actually faster, except for very small arrays). Also, `mask` is an array of dtype bool, which should help.

I think it should not be very difficult to write something that can be passed to `apply_ufunc`, probably similar to: https://github.com/pydata/xarray/blob/e8a284f341645a63a4d83676a6b268394c721bbc/xarray/tests/test_weighted.py#L161

So there would be three possibilities:
(1) the current implementation (using `xr.dot(a, b)`)
(2) something similar to `expected_weighted` (using `(a * b).sum()`)
(3) `xr.apply_ufunc(a, b, expected_weighted, ...)`

I assume (2) is fastest with the largest memory footprint, but I cannot tell about (1) and (3).","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,437765416
https://github.com/pydata/xarray/pull/2922#issuecomment-601824129,https://api.github.com/repos/pydata/xarray/issues/2922,601824129,MDEyOklzc3VlQ29tbWVudDYwMTgyNDEyOQ==,10194086,2020-03-20T17:31:15Z,2020-03-20T17:31:15Z,MEMBER,"There is some stuff I can do to reduce the memory footprint if `skipna=False` or `not da.isnull().any()`. Also, the functions should support `dask` arrays out of the box.

---

> ideally `dot()` would support `skipna`, so you could eliminate the `da = da.fillna(0.0)` and pass the `skipna` down the line. But alas it doesn't...

Yes, this would be nice. `xr.dot` uses `np.einsum`, which is [quite a beast](https://github.com/numpy/numpy/blob/v1.17.0/numpy/core/einsumfunc.py) that I don't entirely see through. I don't expect it to support `NaN`s any time soon. What could be done, though, is to only do `da = da.fillna(0.0)` if `da` actually contains `NaN`s.

> `(da * weights).sum(dim=dim, skipna=skipna)` would likely make things worse, I think, as it would necessarily create a temporary array of size at least `da`, no?

I assume so. I don't know what kind of temporary variables `np.einsum` creates. Also, `np.einsum` is wrapped in `xr.apply_ufunc`, so all kinds of magic is going on.

> Either way, this only addresses the `da = da.fillna(0.0)`, not the `mask = da.notnull()`.

Again, this could be avoided if `skipna=False` or if (and only if) there are no `NaN`s in `da`.

> Also, perhaps the test `if weights.isnull().any()` in `Weighted.__init__()` should be optional?

Do you want to leave it out for performance reasons? It was a deliberate decision to *not* support `NaN`s in the weights, and I don't think this is going to change.

> Maybe I'm more sensitive to this than others, but I regularly deal with 10-100GB arrays.

No, it's important to make sure this stuff works for large arrays.
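To make the two code paths concrete, a minimal comparison might look like this (illustrative sizes and names; not code from this PR):

```python
import numpy as np
import xarray as xr

da = xr.DataArray(np.random.randn(2000, 2000), dims=('x', 'y'))
weights = xr.DataArray(np.random.rand(2000), dims='y')

# (a * b).sum(): materializes a temporary array of the same shape as da
res_mul = (da * weights).sum(dim='y')

# xr.dot: dispatches to np.einsum and avoids that large temporary
res_dot = xr.dot(da, weights, dims='y')

np.testing.assert_allclose(res_mul.values, res_dot.values)
```

Both paths give the same result; they differ only in peak memory and speed, which is exactly what would need profiling.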
However, using `xr.dot` already gives quite a performance penalty, which I am not super happy about.

> have you considered using these functions? [...]

None of your suggested functions support `NaN`s, so they won't work. I am all for supporting more functions, but currently I am happy we got a weighted sum and mean into xarray after 5(!) years!

Further libraries that support weighted operations:
* [esmlab](https://github.com/NCAR/esmlab/blob/master/esmlab/statistics.py) (xarray-based, supports NaN)
* [statsmodels](https://www.statsmodels.org/dev/_modules/statsmodels/stats/weightstats.html) (numpy-based, does not support NaN)","{""total_count"": 1, ""+1"": 1, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,437765416
https://github.com/pydata/xarray/pull/2922#issuecomment-601214104,https://api.github.com/repos/pydata/xarray/issues/2922,601214104,MDEyOklzc3VlQ29tbWVudDYwMTIxNDEwNA==,10194086,2020-03-19T14:35:25Z,2020-03-19T14:35:25Z,MEMBER,Great! Thanks for all the feedback and support!,"{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,437765416
https://github.com/pydata/xarray/pull/2922#issuecomment-595373665,https://api.github.com/repos/pydata/xarray/issues/2922,595373665,MDEyOklzc3VlQ29tbWVudDU5NTM3MzY2NQ==,10194086,2020-03-05T18:18:22Z,2020-03-05T18:18:22Z,MEMBER,I updated this once more. Mostly moved the example to a notebook.,"{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,437765416
https://github.com/pydata/xarray/pull/2922#issuecomment-562206026,https://api.github.com/repos/pydata/xarray/issues/2922,562206026,MDEyOklzc3VlQ29tbWVudDU2MjIwNjAyNg==,10194086,2019-12-05T16:29:51Z,2019-12-05T16:29:51Z,MEMBER,This is now ready for a full review. I added tests for weighted reductions over several dimensions and docs.,"{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,437765416
https://github.com/pydata/xarray/pull/2922#issuecomment-545512847,https://api.github.com/repos/pydata/xarray/issues/2922,545512847,MDEyOklzc3VlQ29tbWVudDU0NTUxMjg0Nw==,10194086,2019-10-23T15:55:35Z,2019-10-23T15:55:35Z,MEMBER,">> I decided to replace all NaN in the weights with 0.
>
> Can we raise an error instead? It should be easy for the user to do weights.fillna(0) instead of relying on xarray's magical behaviour.

I agree, requiring valid weights is a sensible choice.

>> if weights sum to 0 it returns NaN (and not inf)
>
> Should we raise an error here?

I'm not sure... Assume I want to do a meridional mean and only have data over land; this would then raise an error, which is not what I want.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,437765416
https://github.com/pydata/xarray/pull/2922#issuecomment-543358453,https://api.github.com/repos/pydata/xarray/issues/2922,543358453,MDEyOklzc3VlQ29tbWVudDU0MzM1ODQ1Mw==,10194086,2019-10-17T20:56:32Z,2019-10-17T20:59:08Z,MEMBER,"I finally made some time to work on this - although I feel far from finished...
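The standalone helpers described in the list below might look roughly like this (a sketch with hypothetical names and simplified NaN handling, not the actual PR code; assumes float weights that contain no NaN):

```python
import xarray as xr

def _weighted_sum(da, weights, dim=None):
    # zero out NaNs so they do not contribute to the dot product
    return xr.dot(da.fillna(0.0), weights, dims=dim)

def _weighted_mean(da, weights, dim=None):
    # the dot product of the validity mask with the weights yields the
    # sum of weights belonging to the non-NaN values of da
    mask = da.notnull()
    sum_of_weights = xr.dot(mask, weights, dims=dim)
    return _weighted_sum(da, weights, dim) / sum_of_weights

# usage, e.g.: _weighted_mean(da, weights, dim='time')
```

(The actual PR additionally returns NaN rather than inf where the weights sum to 0.)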
* added a `DatasetWeighted` class
* for this, I pulled the functionality out of the `DataArrayWeighted` class into standalone functions taking `da` and `weights` as input
* the tests need more work
* implemented the functionality using `xr.dot` -> this makes the logic a bit more complicated
* I think the failure in `Linux py37-upstream-dev` is unrelated

Questions:
* does this implementation look reasonable to you?
* `xr.dot` does not have an `axis` keyword; is it fine if I leave it out in my functions?
* flake8 fails because I use `@overload` for typing -> should I remove this?
* currently I have the functionality three times: once as `_weighted_sum`, once as `da.weighted.sum()`, and once as `ds.weighted().sum` - how do I best test this?","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,437765416
https://github.com/pydata/xarray/pull/2922#issuecomment-512243216,https://api.github.com/repos/pydata/xarray/issues/2922,512243216,MDEyOklzc3VlQ29tbWVudDUxMjI0MzIxNg==,10194086,2019-07-17T12:59:16Z,2019-07-17T12:59:16Z,MEMBER,"Thanks, I am still very interested to get this in. I don't think I'll manage before my holidays, though.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,437765416
https://github.com/pydata/xarray/pull/2922#issuecomment-488031173,https://api.github.com/repos/pydata/xarray/issues/2922,488031173,MDEyOklzc3VlQ29tbWVudDQ4ODAzMTE3Mw==,10194086,2019-04-30T16:57:05Z,2019-04-30T16:57:05Z,MEMBER,"I updated the PR:
* added a weighted `sum` method
* tried to clean up the tests

Before I continue, it would be nice to get some feedback.
* Currently I only do very simple tests - is that enough? How could these be generalized without re-implementing the functions to obtain `expected`?
* Eventually it would be cool to be able to do `da.groupby(regions).weighted(weights).mean()` - do you see a possibility for this with the current implementation?

As mentioned by @aaronspring, esmlab already implemented [weighted statistic functions](https://github.com/NCAR/esmlab/blob/master/esmlab/statistics.py). Similarly, statsmodels implements them for 1D data without handling of NaNs ([docs](https://www.statsmodels.org/dev/generated/statsmodels.stats.weightstats.DescrStatsW.html#statsmodels.stats.weightstats.DescrStatsW) / [code](https://www.statsmodels.org/dev/_modules/statsmodels/stats/weightstats.html#DescrStatsW)). Thus it should be feasible to implement further statistics here (e.g. a weighted `std`).","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,437765416