issue_comments


9 rows where author_association = "MEMBER", issue = 437765416 and user = 10194086 sorted by updated_at descending

id html_url issue_url node_id user created_at updated_at ▲ author_association body reactions performed_via_github_app issue
601612380 https://github.com/pydata/xarray/pull/2922#issuecomment-601612380 https://api.github.com/repos/pydata/xarray/issues/2922 MDEyOklzc3VlQ29tbWVudDYwMTYxMjM4MA== mathause 10194086 2020-03-20T09:45:23Z 2020-10-27T14:47:22Z MEMBER

tl;dr: if someone knows how to do memory profiling with reasonable effort, this can still be changed.

It's certainly not too late to change the "backend" of the weighting functions. I once tried to profile the memory usage but gave up at some point (I think I would have needed to annotate a ton of functions, including some in numpy).

@fujiisoup suggested using xr.dot(a, b) (instead of (a * b).sum()) to ameliorate part of the memory footprint. ~Which is done, however, this comes at a performance penalty. So an alternative code path might actually be beneficial.~ (edit: I now think xr.dot is actually faster, except for very small arrays).

Also mask is an array of dtype bool, which should help.

I think it should not be very difficult to write something that can be passed to apply_ufunc, probably similar to:

https://github.com/pydata/xarray/blob/e8a284f341645a63a4d83676a6b268394c721bbc/xarray/tests/test_weighted.py#L161

So there would be three possibilities: (1) the current implementation (using xr.dot(a, b)); (2) something similar to expected_weighted (using (a * b).sum()); (3) xr.apply_ufunc(expected_weighted, a, b, ...). I assume (2) is fastest but has the largest memory footprint; I cannot tell about (1) and (3).
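The three possibilities can be sketched with plain numpy as a stand-in for the xarray versions (`expected_weighted` here is just an illustrative reduction, not the test helper linked above):

```python
import numpy as np

a = np.arange(6.0).reshape(2, 3)   # stand-in for the data
b = np.full((2, 3), 2.0)           # stand-in for the weights

# (1) einsum, which is what xr.dot uses under the hood; it can avoid
#     materializing the full temporary a * b
res_dot = np.einsum('ij,ij->', a, b)

# (2) elementwise multiply then sum; creates a temporary the size of a
res_sum = (a * b).sum()

# (3) the same reduction wrapped in a function, as one would hand to
#     xr.apply_ufunc (an illustrative sketch, not xarray's actual helper)
def expected_weighted(a, b):
    return (a * b).sum()

res_ufunc = expected_weighted(a, b)

assert np.allclose(res_dot, res_sum) and np.allclose(res_sum, res_ufunc)
```

All three agree numerically; the trade-off under discussion is only about temporaries and speed.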

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Feature/weighted 437765416
601824129 https://github.com/pydata/xarray/pull/2922#issuecomment-601824129 https://api.github.com/repos/pydata/xarray/issues/2922 MDEyOklzc3VlQ29tbWVudDYwMTgyNDEyOQ== mathause 10194086 2020-03-20T17:31:15Z 2020-03-20T17:31:15Z MEMBER

There is some stuff I can do to reduce the memory footprint if skipna=False or not da.isnull().any(). Also, the functions should support dask arrays out of the box.


ideally dot() would support skipna, so you could eliminate the da = da.fillna(0.0) and pass the skipna down the line. But alas it doesn't...

Yes, this would be nice. xr.dot uses np.einsum which is quite a beast that I don't entirely see through. I don't expect it to support NaNs any time soon.

What could be done, though, is to only do da = da.fillna(0.0) if da actually contains NaNs.

(da * weights).sum(dim=dim, skipna=skipna) would likely make things worse, I think, as it would necessarily create a temporary array at least the size of da, no?

I assume so. I don't know what kind of temporary variables np.einsum creates. Also np.einsum is wrapped in xr.apply_ufunc so all kinds of magic is going on.

Either way, this only addresses the da = da.fillna(0.0), not the mask = da.notnull().

Again this could be avoided if skipna=False or if (and only if) there are no NaNs in da.
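A minimal numpy sketch of that conditional path (the function name and 1-D layout are assumptions, not xarray's internals):

```python
import numpy as np

def masked_weighted_sum(da, weights, skipna=True):
    # hypothetical helper: only pay for the mask / fillna steps when
    # da actually contains NaNs, as suggested above
    if skipna and np.isnan(da).any():
        mask = ~np.isnan(da)            # the mask = da.notnull() step
        da = np.where(mask, da, 0.0)    # the da = da.fillna(0.0) step
    return np.einsum('i,i->', da.ravel(), weights.ravel())
```

For NaN-free input (or skipna=False) no temporary copy of da is created at all.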

Also, perhaps the test if weights.isnull().any() in Weighted.__init__() should be optional?

Do you want to leave it out for performance reasons? Because it was a deliberate decision not to support NaNs in the weights, and I don't think this is going to change.

Maybe I'm more sensitive to this than others, but I regularly deal with 10-100GB arrays.

No, it's important to make sure this stuff works for large arrays. However, using xr.dot already comes with quite a performance penalty, which I am not super happy about.

have you considered using these functions? [...]

None of your suggested functions support NaNs, so they won't work.

I am all for supporting more functions, but for now I am happy we got a weighted sum and mean into xarray after 5(!) years!

Further libraries that support weighted operations:

  • esmlab (xarray-based, supports NaN)
  • statsmodels (numpy-based, does not support NaN)
{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
601214104 https://github.com/pydata/xarray/pull/2922#issuecomment-601214104 https://api.github.com/repos/pydata/xarray/issues/2922 MDEyOklzc3VlQ29tbWVudDYwMTIxNDEwNA== mathause 10194086 2020-03-19T14:35:25Z 2020-03-19T14:35:25Z MEMBER

Great! Thanks for all the feedback and support!

595373665 https://github.com/pydata/xarray/pull/2922#issuecomment-595373665 https://api.github.com/repos/pydata/xarray/issues/2922 MDEyOklzc3VlQ29tbWVudDU5NTM3MzY2NQ== mathause 10194086 2020-03-05T18:18:22Z 2020-03-05T18:18:22Z MEMBER

I updated this once more. Mostly moved the example to a notebook.

562206026 https://github.com/pydata/xarray/pull/2922#issuecomment-562206026 https://api.github.com/repos/pydata/xarray/issues/2922 MDEyOklzc3VlQ29tbWVudDU2MjIwNjAyNg== mathause 10194086 2019-12-05T16:29:51Z 2019-12-05T16:29:51Z MEMBER

This is now ready for a full review. I added tests for weighted reductions over several dimensions and docs.

545512847 https://github.com/pydata/xarray/pull/2922#issuecomment-545512847 https://api.github.com/repos/pydata/xarray/issues/2922 MDEyOklzc3VlQ29tbWVudDU0NTUxMjg0Nw== mathause 10194086 2019-10-23T15:55:35Z 2019-10-23T15:55:35Z MEMBER

I decided to replace all NaN in the weights with 0. Can we raise an error instead? It should be easy for the user to do weights.fillna(0) instead of relying on xarray's magical behaviour.

I agree, requiring valid weights is a sensible choice.

if weights sum to 0 it returns NaN (and not inf) Should we raise an error here?

I'm not sure... Assume I want to compute a meridional mean but only have data over land; this would then raise an error, which is not what I want.
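In numpy terms, the behaviour under discussion might look like this (a sketch under the assumptions above, not xarray's actual implementation):

```python
import numpy as np

def weighted_mean(da, weights):
    # sketch: NaN in the weights is rejected, while a zero sum of weights
    # yields NaN (not inf), so an all-ocean row in a land-only meridional
    # mean simply comes out as NaN instead of raising
    if np.isnan(weights).any():
        raise ValueError("weights cannot contain missing values")
    sum_of_weights = weights.sum(axis=-1)
    sum_of_weights = np.where(sum_of_weights != 0, sum_of_weights, np.nan)
    return (da * weights).sum(axis=-1) / sum_of_weights
```

Replacing a zero denominator by NaN before dividing also avoids the divide-by-zero warning.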

543358453 https://github.com/pydata/xarray/pull/2922#issuecomment-543358453 https://api.github.com/repos/pydata/xarray/issues/2922 MDEyOklzc3VlQ29tbWVudDU0MzM1ODQ1Mw== mathause 10194086 2019-10-17T20:56:32Z 2019-10-17T20:59:08Z MEMBER

I finally made some time to work on this, although it feels far from finished...

  • added a DatasetWeighted class
  • for this I pulled the functionality out of the DataArrayWeighted class into standalone functions taking da and weights as input
  • the tests need more work
  • implemented the functionality using xr.dot -> this makes the logic a bit more complicated
  • I think the failure in Linux py37-upstream-dev is unrelated

Questions:

  • Does this implementation look reasonable to you?
  • xr.dot does not have an axis keyword; is it fine if I leave it out of my functions?
  • flake8 fails because I use @overload for typing -> should I remove this?
  • Currently the functionality exists three times: once as _weighted_sum, once as da.weighted().sum() and once as ds.weighted().sum(). How do I best test this?

512243216 https://github.com/pydata/xarray/pull/2922#issuecomment-512243216 https://api.github.com/repos/pydata/xarray/issues/2922 MDEyOklzc3VlQ29tbWVudDUxMjI0MzIxNg== mathause 10194086 2019-07-17T12:59:16Z 2019-07-17T12:59:16Z MEMBER

Thanks, I am still very interested to get this in. I don't think I'll manage before my holidays, though.

488031173 https://github.com/pydata/xarray/pull/2922#issuecomment-488031173 https://api.github.com/repos/pydata/xarray/issues/2922 MDEyOklzc3VlQ29tbWVudDQ4ODAzMTE3Mw== mathause 10194086 2019-04-30T16:57:05Z 2019-04-30T16:57:05Z MEMBER

I updated the PR:

  • added a weighted sum method
  • tried to clean up the tests

Before I continue, it would be nice to get some feedback.

  • Currently I only do very simple tests. Is that enough? How could these be generalized without re-implementing the functions to obtain the expected results?
  • Eventually it would be cool to be able to do da.groupby(regions).weighted(weights).mean() - do you see a possibility for this with the current implementation?

As mentioned by @aaronspring, esmlab has already implemented weighted statistic functions. Similarly, statsmodels implements them for 1D data, without handling of NaNs (docs / code). Thus it should be feasible to implement further statistics here (e.g. a weighted std).
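The groupby question above can be sketched in plain numpy; grouped_weighted_mean and the integer group labels are illustrative stand-ins for the hoped-for da.groupby(regions).weighted(weights).mean():

```python
import numpy as np

def grouped_weighted_mean(da, weights, groups):
    # np.bincount stands in for xarray's groupby machinery: sum the
    # weighted values and the weights per group label, then divide
    weighted = np.bincount(groups, weights=weights * da)
    sum_of_weights = np.bincount(groups, weights=weights)
    return weighted / sum_of_weights

# two regions: group 0 holds values [1, 3] with weights [1, 3]
da = np.array([1.0, 3.0, 2.0, 4.0])
weights = np.array([1.0, 3.0, 1.0, 1.0])
groups = np.array([0, 0, 1, 1])
means = grouped_weighted_mean(da, weights, groups)  # group means: 2.5 and 3.0
```

This is only the dense 1-D case; the open design question is whether the accessor chaining can reuse the existing implementation.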

