html_url,issue_url,id,node_id,user,created_at,updated_at,author_association,body,reactions,performed_via_github_app,issue
https://github.com/pydata/xarray/pull/2922#issuecomment-601885539,https://api.github.com/repos/pydata/xarray/issues/2922,601885539,MDEyOklzc3VlQ29tbWVudDYwMTg4NTUzOQ==,7441788,2020-03-20T19:57:54Z,2020-03-20T20:00:20Z,CONTRIBUTOR,"All good points:
> What could be done, though is to only do da = da.fillna(0.0) if da contains NaNs.
Good idea, though I don't know what the performance hit of the extra check would be in the case where `da` does contain NaNs (there the check is for naught, since the `fillna` has to happen anyway).
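A minimal sketch of what I understand the suggestion to be (illustrative only, not the actual `weighted.py` code):
```python
import numpy as np
import xarray as xr

da = xr.DataArray(np.array([[1.0, np.nan], [2.0, 3.0]]))

# Only pay for fillna (which allocates a full copy) when NaNs are actually
# present; if da is NaN-free, the cheap scan avoids the copy entirely.
if da.isnull().any():
    da = da.fillna(0.0)
```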
> I assume so. I don't know what kind of temporary variables np.einsum creates. Also np.einsum is wrapped in xr.apply_ufunc so all kinds of magic is going on.
Well, `(da * weights)` will be at least as large as `da`. I'm not certain, but I don't think `np.einsum` creates huge temporary arrays.
> Do you want to leave it away for performance reasons? Because it was a deliberate decision to not support NaNs in the weights and I don't think this is going to change.
Yes. You can continue not to support NaNs in the weights and yet skip the explicit check for them (optionally, when the caller assures you that the weights are NaN-free).
> None of your suggested functions support NaNs so they won't work.
Correct. These have nothing to do with the NaNs issue.
For profiling memory usage, I use `psutil.Process(os.getpid()).memory_info().rss` for current usage and `resource.getrusage(resource.RUSAGE_SELF).ru_maxrss` for peak usage (on Linux).","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,437765416
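For completeness, a small sketch of the profiling approach described in the comment above (the helper name is mine):
```python
import os
import resource

import psutil

def report_memory(label=''):
    # Current resident set size of this process, in bytes.
    rss = psutil.Process(os.getpid()).memory_info().rss
    # Peak resident set size so far; on Linux ru_maxrss is reported in KiB.
    peak_kib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print(f'{label}: current={rss / 2**20:.1f} MiB, peak={peak_kib / 2**10:.1f} MiB')
```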
https://github.com/pydata/xarray/pull/2922#issuecomment-601709733,https://api.github.com/repos/pydata/xarray/issues/2922,601709733,MDEyOklzc3VlQ29tbWVudDYwMTcwOTczMw==,7441788,2020-03-20T13:47:39Z,2020-03-20T16:31:14Z,CONTRIBUTOR,"@mathause, have you considered using these functions?
- [np.average()](https://docs.scipy.org/doc/numpy/reference/generated/numpy.average.html) to calculate weighted `mean()`.
- [np.cov()](https://docs.scipy.org/doc/numpy/reference/generated/numpy.cov.html) to calculate weighted `cov()`, `var()`, and `std()`.
- [sp.stats.cumfreq()](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.cumfreq.html) to calculate weighted `median()` (I haven't thought this through).
- [sp.spatial.distance.correlation()](https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.correlation.html) to calculate weighted `corrcoef()`. (Of course one could also calculate this from the weighted `cov()` (see above), but one would first need to mask the two arrays simultaneously.)
- [sklearn.utils.extmath.weighted_mode()](https://scikit-learn.org/stable/modules/generated/sklearn.utils.extmath.weighted_mode.html) to calculate weighted `mode()`.
- [gmisclib.weighted_percentile.{wp,wtd_median}()](http://kochanski.org/gpk/code/speechresearch/gmisclib/gmisclib.weighted_percentile-module.html) to calculate weighted `quantile()` and `median()`.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,437765416
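For example, the first of these already gives a weighted mean (and, with a second call, a biased weighted variance) in a couple of lines; a rough sketch with no NaN handling:
```python
import numpy as np

values = np.array([1.0, 2.0, 4.0, 8.0])
weights = np.array([1.0, 1.0, 2.0, 0.5])

# Weighted mean straight from np.average.
w_mean = np.average(values, weights=weights)

# Biased weighted variance/std via a second np.average call (no NaN support).
w_var = np.average((values - w_mean) ** 2, weights=weights)
w_std = np.sqrt(w_var)
```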
https://github.com/pydata/xarray/pull/2922#issuecomment-601708110,https://api.github.com/repos/pydata/xarray/issues/2922,601708110,MDEyOklzc3VlQ29tbWVudDYwMTcwODExMA==,7441788,2020-03-20T13:44:03Z,2020-03-20T13:52:06Z,CONTRIBUTOR,"@mathause, ideally `dot()` would support `skipna`, so you could eliminate the `da = da.fillna(0.0)` and pass the `skipna` down the line. But alas it doesn't...
`(da * weights).sum(dim=dim, skipna=skipna)` would likely make things worse, as it would necessarily create a temporary array at least the size of `da`, no?
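To make the comparison concrete (a sketch, not the actual implementation):
```python
import numpy as np
import xarray as xr

da = xr.DataArray(np.random.rand(2_000, 2_000), dims=('x', 'y'))
weights = xr.DataArray(np.random.rand(2_000), dims='y')

# Materializes the full da-sized product before reducing it away.
product_path = (da * weights).sum(dim='y', skipna=True)

# Reduces via np.einsum, which as far as I know avoids a da-sized
# intermediate, but xr.dot has no skipna argument, hence the fillna(0.0).
dot_path = xr.dot(da, weights, dims='y')
```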
Either way, this only addresses the `da = da.fillna(0.0)`, not the `mask = da.notnull()`.
Also, perhaps the test `if weights.isnull().any()` in `Weighted.__init__()` should be optional?
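E.g. something along these lines (purely hypothetical; `skip_nan_check` is an argument I am making up to illustrate the idea):
```python
class Weighted:
    def __init__(self, obj, weights, skip_nan_check=False):
        # Scanning a huge weights array for NaNs costs a full pass over the
        # data; let callers who can guarantee NaN-free weights opt out.
        if not skip_nan_check and weights.isnull().any():
            raise ValueError('weights cannot contain missing values')
        self.obj = obj
        self.weights = weights
```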
Maybe I'm more sensitive to this than others, but I regularly deal with 10-100GB arrays.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,437765416
https://github.com/pydata/xarray/pull/2922#issuecomment-601699091,https://api.github.com/repos/pydata/xarray/issues/2922,601699091,MDEyOklzc3VlQ29tbWVudDYwMTY5OTA5MQ==,7441788,2020-03-20T13:25:21Z,2020-03-20T13:25:21Z,CONTRIBUTOR,"@max-sixty, I wish I could, but I'm afraid that I cannot submit code due to employer limitations.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,437765416
https://github.com/pydata/xarray/pull/2922#issuecomment-601496897,https://api.github.com/repos/pydata/xarray/issues/2922,601496897,MDEyOklzc3VlQ29tbWVudDYwMTQ5Njg5Nw==,7441788,2020-03-20T02:11:53Z,2020-03-20T02:12:24Z,CONTRIBUTOR,"I realize this is a bit late, but I'm still concerned about memory usage, specifically in https://github.com/pydata/xarray/blob/master/xarray/core/weighted.py#L130 and https://github.com/pydata/xarray/blob/master/xarray/core/weighted.py#L143.
If `da.sizes = {'dim_0': 100000, 'dim_1': 100000}`, the two lines above will cause `da.weighted(weights).mean('dim_0')` to create two simultaneous temporary 100000x100000 arrays, which could be problematic.
I would have implemented this using `apply_ufunc`, so that the temporary variables are created only for as small an array as absolutely necessary -- in this case just of size `sizes['dim_0'] = 100000`. (Much as I would like to, I'm afraid I'm not able to contribute code.) Of course this won't help when summing over all dimensions, but one might as well minimize memory usage in the cases where it does help.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,437765416
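A rough sketch of the kind of thing meant above (illustrative only; the function names are hypothetical, and `vectorize=True` trades a Python-level loop for the smaller per-slice temporaries):
```python
import numpy as np
import xarray as xr

def _weighted_mean_1d(values, weights):
    # Works on one 1-D slice at a time, so the temporaries (mask, filled
    # values, products) are only as large as the dimension being reduced.
    mask = np.isnan(values)
    filled = np.where(mask, 0.0, values)
    sum_of_weights = np.sum(np.where(mask, 0.0, weights))
    return np.sum(filled * weights) / sum_of_weights

def weighted_mean(da, weights, dim):
    return xr.apply_ufunc(
        _weighted_mean_1d,
        da,
        weights,
        input_core_dims=[[dim], [dim]],
        vectorize=True,  # loop over the non-reduced dims slice by slice
    )
```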