issue_comments

22 rows where issue = 437765416 sorted by updated_at descending

id html_url issue_url node_id user created_at updated_at ▲ author_association body reactions performed_via_github_app issue
601612380 https://github.com/pydata/xarray/pull/2922#issuecomment-601612380 https://api.github.com/repos/pydata/xarray/issues/2922 MDEyOklzc3VlQ29tbWVudDYwMTYxMjM4MA== mathause 10194086 2020-03-20T09:45:23Z 2020-10-27T14:47:22Z MEMBER

tl;dr: if someone knows how to do memory profiling with reasonable effort, this can still be changed.

It's certainly not too late to change the "backend" of the weighting functions. I once tried to profile the memory usage but gave up at some point (I think I would have needed to annotate a ton of functions, including in numpy).

@fujiisoup suggested using xr.dot(a, b) (instead of (a * b).sum()) to reduce the memory footprint. ~This is done; however, it comes at a performance penalty, so an alternative code path might actually be beneficial.~ (edit: I now think xr.dot is actually faster, except for very small arrays).

Also, mask is an array of dtype bool, which should help.

I think it should not be very difficult to write something that can be passed to apply_ufunc, probably similar to:

https://github.com/pydata/xarray/blob/e8a284f341645a63a4d83676a6b268394c721bbc/xarray/tests/test_weighted.py#L161

So there would be three possibilities: (1) the current implementation (using xr.dot(a, b)); (2) something similar to expected_weighted (using (a * b).sum()); (3) xr.apply_ufunc(expected_weighted, a, b, ...). I assume (2) is fastest but has the largest memory footprint; I cannot tell about (1) and (3).
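
For illustration, a minimal sketch of the three code paths on a toy 1-D array (names and data are illustrative, not the PR's code):

import numpy as np
import xarray as xr

da = xr.DataArray(np.random.randn(1000), dims="x")
weights = xr.DataArray(np.random.rand(1000), dims="x")

# (1) current implementation: xr.dot dispatches to np.einsum
# (recent xarray releases spell the keyword `dim` instead of `dims`)
s1 = xr.dot(da, weights, dims="x")

# (2) multiply-then-sum: creates a temporary array as large as da
s2 = (da * weights).sum(dim="x")

# (3) pushing a plain-numpy reduction through apply_ufunc
s3 = xr.apply_ufunc(
    lambda a, w: (a * w).sum(axis=-1),
    da,
    weights,
    input_core_dims=[["x"], ["x"]],
)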

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Feature/weighted 437765416
601885539 https://github.com/pydata/xarray/pull/2922#issuecomment-601885539 https://api.github.com/repos/pydata/xarray/issues/2922 MDEyOklzc3VlQ29tbWVudDYwMTg4NTUzOQ== seth-p 7441788 2020-03-20T19:57:54Z 2020-03-20T20:00:20Z CONTRIBUTOR

All good points:

What could be done, though, is to only do da = da.fillna(0.0) if da contains NaNs.

Good idea, though I don't know what the performance cost of the extra check would be (in the case that da does contain NaNs, the check is for naught).

I assume so. I don't know what kind of temporary variables np.einsum creates. Also, np.einsum is wrapped in xr.apply_ufunc, so all kinds of magic is going on.

Well, (da * weights) will be at least as large as da. I'm not certain, but I don't think np.einsum creates huge temporary arrays.

Do you want to leave it out for performance reasons? Because it was a deliberate decision to not support NaNs in the weights and I don't think this is going to change.

Yes. You can continue not to support NaNs in the weights, yet skip the explicit check that there are none (optionally, if the caller assures you that there are no NaNs).

None of your suggested functions support NaNs, so they won't work.

Correct. These have nothing to do with the NaNs issue.

For profiling memory usage, I use psutil.Process(os.getpid()).memory_info().rss for current usage and resource.getrusage(resource.RUSAGE_SELF).ru_maxrss for peak usage (on Linux).
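
For reference, a minimal sketch of that measurement (ru_maxrss is reported in kilobytes on Linux):

import os
import resource

import psutil

rss_now = psutil.Process(os.getpid()).memory_info().rss        # bytes, current usage
rss_peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss  # kilobytes on Linux, peak usage

print(f"current: {rss_now / 1e6:.1f} MB, peak: {rss_peak / 1e3:.1f} MB")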

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Feature/weighted 437765416
601824129 https://github.com/pydata/xarray/pull/2922#issuecomment-601824129 https://api.github.com/repos/pydata/xarray/issues/2922 MDEyOklzc3VlQ29tbWVudDYwMTgyNDEyOQ== mathause 10194086 2020-03-20T17:31:15Z 2020-03-20T17:31:15Z MEMBER

There is some stuff I can do to reduce the memory footprint if skipna=False or not da.isnull().any(). Also, the functions should support dask arrays out of the box.


ideally dot() would support skipna, so you could eliminate the da = da.fillna(0.0) and pass the skipna down the line. But alas it doesn't...

Yes, this would be nice. xr.dot uses np.einsum, which is quite a beast that I don't fully understand. I don't expect it to support NaNs any time soon.

What could be done, though, is to only do da = da.fillna(0.0) if da contains NaNs.

(da * weights).sum(dim=dim, skipna=skipna) would likely make things worse, I think, as it would necessarily create a temporary array at least as large as da, no?

I assume so. I don't know what kind of temporary variables np.einsum creates. Also, np.einsum is wrapped in xr.apply_ufunc, so all kinds of magic is going on.

Either way, this only addresses the da = da.fillna(0.0), not the mask = da.notnull().

Again, this could be avoided if skipna=False or if (and only if) there are no NaNs in da.
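
A minimal sketch of that conditional fill (a hypothetical helper, not the PR's code):

import xarray as xr

def _maybe_fillna(da: xr.DataArray) -> xr.DataArray:
    # only pay for the fillna copy when da actually contains NaNs;
    # note the isnull().any() check itself allocates a boolean mask
    if da.isnull().any():
        return da.fillna(0.0)
    return da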

Also, perhaps the test if weights.isnull().any() in Weighted.__init__() should be optional?

Do you want to leave it out for performance reasons? Because it was a deliberate decision to not support NaNs in the weights and I don't think this is going to change.

Maybe I'm more sensitive to this than others, but I regularly deal with 10-100GB arrays.

No, it's important to make sure this stuff works for large arrays. However, using xr.dot already incurs quite a performance penalty, which I am not super happy about.

have you considered using these functions? [...]

None of your suggested functions support NaNs, so they won't work.

I am all for supporting more functions, but currently I am happy we got a weighted sum and mean into xarray after 5(!) years!

Further libraries that support weighted operations:

  • esmlab (xarray-based, supports NaN)
  • statsmodels (numpy-based, does not support NaN)
{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Feature/weighted 437765416
601709733 https://github.com/pydata/xarray/pull/2922#issuecomment-601709733 https://api.github.com/repos/pydata/xarray/issues/2922 MDEyOklzc3VlQ29tbWVudDYwMTcwOTczMw== seth-p 7441788 2020-03-20T13:47:39Z 2020-03-20T16:31:14Z CONTRIBUTOR

@mathause, have you considered using these functions?

  • np.average() to calculate weighted mean().
  • np.cov() to calculate weighted cov(), var(), and std().
  • sp.stats.cumfreq() to calculate weighted median() (I haven't thought this through).
  • sp.spatial.distance.correlation() to calculate weighted corrcoef(). (Of course one could also calculate this from weighted cov() (see above), but one would first need to mask the two arrays simultaneously.)
  • sklearn.utils.extmath.weighted_mode() to calculate weighted mode().
  • gmisclib.weighted_percentile.{wp,wtd_median}() to calculate weighted quantile() and median().
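
For example, np.average computes a weighted mean directly (illustrative data):

import numpy as np

values = np.array([1.0, 2.0, 4.0])
weights = np.array([3.0, 1.0, 0.0])

# sum(w * x) / sum(w) = (3*1 + 1*2 + 0*4) / 4 = 1.25
print(np.average(values, weights=weights))  # 1.25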

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Feature/weighted 437765416
601708110 https://github.com/pydata/xarray/pull/2922#issuecomment-601708110 https://api.github.com/repos/pydata/xarray/issues/2922 MDEyOklzc3VlQ29tbWVudDYwMTcwODExMA== seth-p 7441788 2020-03-20T13:44:03Z 2020-03-20T13:52:06Z CONTRIBUTOR

@mathause, ideally dot() would support skipna, so you could eliminate the da = da.fillna(0.0) and pass the skipna down the line. But alas it doesn't...

(da * weights).sum(dim=dim, skipna=skipna) would likely make things worse, I think, as it would necessarily create a temporary array at least as large as da, no?

Either way, this only addresses the da = da.fillna(0.0), not the mask = da.notnull().

Also, perhaps the test if weights.isnull().any() in Weighted.__init__() should be optional?

Maybe I'm more sensitive to this than others, but I regularly deal with 10-100GB arrays.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Feature/weighted 437765416
601699091 https://github.com/pydata/xarray/pull/2922#issuecomment-601699091 https://api.github.com/repos/pydata/xarray/issues/2922 MDEyOklzc3VlQ29tbWVudDYwMTY5OTA5MQ== seth-p 7441788 2020-03-20T13:25:21Z 2020-03-20T13:25:21Z CONTRIBUTOR

@max-sixty, I wish I could, but I'm afraid that I cannot submit code due to employer limitations.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Feature/weighted 437765416
601514904 https://github.com/pydata/xarray/pull/2922#issuecomment-601514904 https://api.github.com/repos/pydata/xarray/issues/2922 MDEyOklzc3VlQ29tbWVudDYwMTUxNDkwNA== max-sixty 5635139 2020-03-20T04:01:34Z 2020-03-20T04:01:34Z MEMBER

We do those sorts of operations fairly frequently, so it's not unique here. Generally users should expect to have available ~3x the memory of an array for most operations.

@seth-p it's great you've taken an interest in the project! Is there any chance we could harness that into some contributions? 😄

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Feature/weighted 437765416
601496897 https://github.com/pydata/xarray/pull/2922#issuecomment-601496897 https://api.github.com/repos/pydata/xarray/issues/2922 MDEyOklzc3VlQ29tbWVudDYwMTQ5Njg5Nw== seth-p 7441788 2020-03-20T02:11:53Z 2020-03-20T02:12:24Z CONTRIBUTOR

I realize this is a bit late, but I'm still concerned about memory usage, specifically in https://github.com/pydata/xarray/blob/master/xarray/core/weighted.py#L130 and https://github.com/pydata/xarray/blob/master/xarray/core/weighted.py#L143. If da.sizes = {'dim_0': 100000, 'dim_1': 100000}, the two lines above will cause da.weighted(weights).mean('dim_0') to create two simultaneous temporary 100000x100000 arrays, which could be problematic.

I would have implemented this using apply_ufunc, so that one creates these temporary variables only on as small an array as absolutely necessary -- in this case just of size sizes['dim_0'] = 100000. (Much as I would like to, I'm afraid I'm not able to contribute code.) Of course this won't help when one is summing over all dimensions, but we might as well minimize memory usage in some cases even if not in all.
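
A minimal sketch of that apply_ufunc approach for a weighted mean reduced over one dimension (an illustrative helper, not the PR's code):

import numpy as np
import xarray as xr

def _weighted_mean_1d(a, w):
    # operates on one 1-D slice at a time, so temporaries are 1-D
    mask = ~np.isnan(a)
    return np.dot(np.where(mask, a, 0.0), w) / np.dot(mask, w)

da = xr.DataArray(np.random.randn(1000, 500), dims=("dim_0", "dim_1"))
weights = xr.DataArray(np.random.rand(1000), dims="dim_0")

result = xr.apply_ufunc(
    _weighted_mean_1d,
    da,
    weights,
    input_core_dims=[["dim_0"], ["dim_0"]],
    vectorize=True,  # loop over dim_1, creating only 1-D temporaries
)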

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Feature/weighted 437765416
601377953 https://github.com/pydata/xarray/pull/2922#issuecomment-601377953 https://api.github.com/repos/pydata/xarray/issues/2922 MDEyOklzc3VlQ29tbWVudDYwMTM3Nzk1Mw== max-sixty 5635139 2020-03-19T19:34:42Z 2020-03-19T19:34:42Z MEMBER

#422 was opened in June of 2015, amazing.

😂

@mathause props for the persistence...

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Feature/weighted 437765416
601298407 https://github.com/pydata/xarray/pull/2922#issuecomment-601298407 https://api.github.com/repos/pydata/xarray/issues/2922 MDEyOklzc3VlQ29tbWVudDYwMTI5ODQwNw== jhamman 2443309 2020-03-19T16:58:57Z 2020-03-19T16:58:57Z MEMBER

Big time!!!! Thanks @mathause! #422 was opened in June of 2015, amazing.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Feature/weighted 437765416
601283025 https://github.com/pydata/xarray/pull/2922#issuecomment-601283025 https://api.github.com/repos/pydata/xarray/issues/2922 MDEyOklzc3VlQ29tbWVudDYwMTI4MzAyNQ== max-sixty 5635139 2020-03-19T16:37:43Z 2020-03-19T16:37:43Z MEMBER

Thanks @mathause !

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Feature/weighted 437765416
601214104 https://github.com/pydata/xarray/pull/2922#issuecomment-601214104 https://api.github.com/repos/pydata/xarray/issues/2922 MDEyOklzc3VlQ29tbWVudDYwMTIxNDEwNA== mathause 10194086 2020-03-19T14:35:25Z 2020-03-19T14:35:25Z MEMBER

Great! Thanks for all the feedback and support!

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Feature/weighted 437765416
601210885 https://github.com/pydata/xarray/pull/2922#issuecomment-601210885 https://api.github.com/repos/pydata/xarray/issues/2922 MDEyOklzc3VlQ29tbWVudDYwMTIxMDg4NQ== dcherian 2448579 2020-03-19T14:29:42Z 2020-03-19T14:29:42Z MEMBER

This is going in.

Thanks @mathause. This is a major contribution!

{
    "total_count": 1,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 1,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Feature/weighted 437765416
487130921 https://github.com/pydata/xarray/pull/2922#issuecomment-487130921 https://api.github.com/repos/pydata/xarray/issues/2922 MDEyOklzc3VlQ29tbWVudDQ4NzEzMDkyMQ== pep8speaks 24736507 2019-04-26T17:09:07Z 2020-03-18T01:42:06Z NONE

Hello @mathause! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! :beers:

Comment last updated at 2020-03-18 01:42:05 UTC
{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Feature/weighted 437765416
595373665 https://github.com/pydata/xarray/pull/2922#issuecomment-595373665 https://api.github.com/repos/pydata/xarray/issues/2922 MDEyOklzc3VlQ29tbWVudDU5NTM3MzY2NQ== mathause 10194086 2020-03-05T18:18:22Z 2020-03-05T18:18:22Z MEMBER

I updated this once more. Mostly moved the example to a notebook.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Feature/weighted 437765416
562206026 https://github.com/pydata/xarray/pull/2922#issuecomment-562206026 https://api.github.com/repos/pydata/xarray/issues/2922 MDEyOklzc3VlQ29tbWVudDU2MjIwNjAyNg== mathause 10194086 2019-12-05T16:29:51Z 2019-12-05T16:29:51Z MEMBER

This is now ready for a full review. I added tests for weighted reductions over several dimensions and docs.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Feature/weighted 437765416
545512847 https://github.com/pydata/xarray/pull/2922#issuecomment-545512847 https://api.github.com/repos/pydata/xarray/issues/2922 MDEyOklzc3VlQ29tbWVudDU0NTUxMjg0Nw== mathause 10194086 2019-10-23T15:55:35Z 2019-10-23T15:55:35Z MEMBER

I decided to replace all NaN in the weights with 0. Can we raise an error instead? It should be easy for the user to do weights.fillna(0) instead of relying on xarray's magical behaviour.

I agree, requiring valid weights is a sensible choice.

if weights sum to 0 it returns NaN (and not inf). Should we raise an error here?

I'm not sure... Assume I want to do a meridional mean and only have data over land; this would then raise an error, which is not what I want.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Feature/weighted 437765416
545200082 https://github.com/pydata/xarray/pull/2922#issuecomment-545200082 https://api.github.com/repos/pydata/xarray/issues/2922 MDEyOklzc3VlQ29tbWVudDU0NTIwMDA4Mg== dcherian 2448579 2019-10-22T23:35:52Z 2019-10-22T23:35:52Z MEMBER

I decided to replace all NaN in the weights with 0.

Can we raise an error instead? It should be easy for the user to do weights.fillna(0) instead of relying on xarray's magical behaviour.

if weights sum to 0 it returns NaN (and not inf)

Should we raise an error here?

The following returns NaN (could be 1)

I think NaN is fine since that's the result of 1 + 0*np.nan
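
A quick check of both points (a zero-weighted NaN still propagates, and weights summing to 0 give 0/0):

import numpy as np

print(1 + 0 * np.nan)  # nan: 0 * nan is nan, so the NaN propagates

with np.errstate(invalid="ignore"):
    print(np.float64(0.0) / np.float64(0.0))  # nan, not an error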

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Feature/weighted 437765416
543358453 https://github.com/pydata/xarray/pull/2922#issuecomment-543358453 https://api.github.com/repos/pydata/xarray/issues/2922 MDEyOklzc3VlQ29tbWVudDU0MzM1ODQ1Mw== mathause 10194086 2019-10-17T20:56:32Z 2019-10-17T20:59:08Z MEMBER

I finally made some time to work on this - although I feel far from finished...

  • added a DatasetWeighted class
  • for this, I pulled the functionality out of the DataArrayWeighted class into standalone functions taking da and weights as input
  • the tests need more work
  • implemented the functionality using xr.dot -> this makes the logic a bit more complicated
  • I think the failure in Linux py37-upstream-dev is unrelated

Questions:

  • does this implementation look reasonable to you?
  • xr.dot does not have an axis keyword - is it fine if I leave it out in my functions?
  • flake8 fails because I use @overload for typing -> should I remove this?
  • currently I have the functionality three times: once as _weighted_sum, once as da.weighted().sum(), and once as ds.weighted().sum() - how do I best test this?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Feature/weighted 437765416
512243216 https://github.com/pydata/xarray/pull/2922#issuecomment-512243216 https://api.github.com/repos/pydata/xarray/issues/2922 MDEyOklzc3VlQ29tbWVudDUxMjI0MzIxNg== mathause 10194086 2019-07-17T12:59:16Z 2019-07-17T12:59:16Z MEMBER

Thanks, I am still very interested to get this in. I don't think I'll manage before my holidays, though.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Feature/weighted 437765416
511002355 https://github.com/pydata/xarray/pull/2922#issuecomment-511002355 https://api.github.com/repos/pydata/xarray/issues/2922 MDEyOklzc3VlQ29tbWVudDUxMTAwMjM1NQ== rabernat 1197350 2019-07-12T19:16:16Z 2019-07-12T19:16:16Z MEMBER

Hi @mathause - We really appreciate your contribution. Sorry your PR has stalled!

Do you think you can respond to @fujiisoup's review and add documentation? Then we can get this merged.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Feature/weighted 437765416
488031173 https://github.com/pydata/xarray/pull/2922#issuecomment-488031173 https://api.github.com/repos/pydata/xarray/issues/2922 MDEyOklzc3VlQ29tbWVudDQ4ODAzMTE3Mw== mathause 10194086 2019-04-30T16:57:05Z 2019-04-30T16:57:05Z MEMBER

I updated the PR:

  • added a weighted sum method
  • tried to clean the tests

Before I continue, it would be nice to get some feedback.

  • Currently I only do very simple tests - is that enough? How could these be generalized without re-implementing the functions to obtain the expected values?
  • Eventually it would be cool to be able to do da.groupby(regions).weighted(weights).mean() - do you see a possibility for this with the current implementation?

As mentioned by @aaronspring, esmlab has already implemented weighted statistics functions. Similarly, statsmodels implements them for 1D data, without handling of NaNs (docs / code). Thus it should be feasible to implement further statistics here (e.g. a weighted std).

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Feature/weighted 437765416

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);