
issue_comments


10 rows where issue = 1575938277 (Dataset.where performances regression), sorted by updated_at descending

Thomas-Z (CONTRIBUTOR) · 2023-05-03T07:58:22Z · https://github.com/pydata/xarray/issues/7516#issuecomment-1532601237

Hello,

I'm not sure the performance problems were fully addressed (we're now forced to fully compute/load the selection expression), but the changes made in the last versions make this issue irrelevant, and I think we can close it.

Thank you!

Thomas-Z (CONTRIBUTOR) · 2023-03-02T11:59:47Z · https://github.com/pydata/xarray/issues/7516#issuecomment-1451754167

The .variable computation is fast, but it cannot be used directly as you suggest:

```
dsx.where(sel.variable, drop=True)

TypeError: cond argument is <xarray.Variable (num_lines: 5761870, num_pixels: 71)> ... but must be a <class 'xarray.core.dataset.Dataset'> or <class 'xarray.core.dataarray.DataArray'>
```

Doing it like this seems to work correctly (and is fast enough):

```
dsx["x"] = sel.variable.compute()
dsx.where(dsx["x"], drop=True)
```

The "_nadir" variables have the same chunks and are much faster to read than the other ones (they are a lot smaller).
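A minimal, self-contained sketch of this workaround, using a small synthetic dataset in place of the reporter's files (the `ssh`/`points` names are illustrative, not from the thread):

```python
import numpy as np
import xarray as xr

# Small stand-in for the reporter's dataset; chunking makes it
# dask-backed, as in the issue.
dsx = xr.Dataset(
    {
        "longitude": ("points", np.linspace(-180, 180, 10)),
        "ssh": ("points", np.arange(10.0)),
    }
).chunk({"points": 5})

sel = (dsx["longitude"] > 0) & (dsx["longitude"] < 100)

# Compute the boolean mask eagerly once, assign it as a plain
# variable, and pass the now numpy-backed mask to .where():
dsx["x"] = sel.variable.compute()
result = dsx.where(dsx["x"], drop=True)
```

With the ten longitudes above, only 20 and 60 fall in (0, 100), so `result` keeps two points; the key property is that the mask's dask graph runs exactly once.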

dcherian (MEMBER) · 2023-03-01T19:10:15Z · https://github.com/pydata/xarray/issues/7516#issuecomment-1450712889

Yeah, that was another change I guess. We could extract the variable using .variable:

```
.where(sel2.variable.compute(), drop=True)
```

Do your "_nadir" variables have smaller chunk sizes, or are they slower to read for some reason?

Thomas-Z (CONTRIBUTOR) · 2023-03-01T09:43:27Z · https://github.com/pydata/xarray/issues/7516#issuecomment-1449714522

```
sel = (dsx["longitude"] > 0) & (dsx["longitude"] < 100)
sel.compute()
```

This compute finishes, but it takes more than 80 seconds on both versions, with huge memory consumption (it loads the 4 coordinates and the result itself).

I know xarray has to keep more information regarding coordinates and dimensions, but doing the same thing with just dask arrays:

```
sel2 = (dsx["longitude"].data > 0) & (dsx["longitude"].data < 100)
sel2.compute()
```

takes less than 6 seconds.
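The two timings contrast a DataArray-level comparison, which propagates dims, coordinates, and alignment through every operation, with the same arithmetic on the raw dask arrays. A sketch of the two forms on a synthetic dataset (names illustrative):

```python
import numpy as np
import xarray as xr

dsx = xr.Dataset(
    {"longitude": ("points", np.linspace(-180, 180, 1000))}
).chunk({"points": 100})

# DataArray comparison: each step carries xarray metadata along.
sel = (dsx["longitude"] > 0) & (dsx["longitude"] < 100)

# Raw-array comparison: the same arithmetic on the underlying
# dask arrays, with no xarray bookkeeping.
sel2 = (dsx["longitude"].data > 0) & (dsx["longitude"].data < 100)

# Both describe the same boolean mask once computed.
same = bool((sel.values == sel2.compute()).all())
```

Both expressions build equivalent dask graphs for the mask itself; the overhead Thomas-Z measured comes from the coordinate handling layered on top, not from the mask arithmetic.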

dcherian (MEMBER) · 2023-02-28T23:30:59Z · https://github.com/pydata/xarray/issues/7516#issuecomment-1449085012

Does sel.compute() not finish?

Thomas-Z (CONTRIBUTOR) · 2023-02-28T08:54:16Z (edited 2023-02-28T11:24:11Z) · https://github.com/pydata/xarray/issues/7516#issuecomment-1447798846

Just tried it, and it does not seem identical at all to what was happening earlier.

This is the kind of dataset I'm working with.

With this selection:

```
sel = (dsx["longitude"] > 0) & (dsx["longitude"] < 100)
```

the old xarray takes a little less than 1 minute and less than 6 GB of memory. The new xarray with compute did not finish and had to be stopped before it consumed my 16 GB of memory.

dcherian (MEMBER) · 2023-02-28T04:41:03Z · https://github.com/pydata/xarray/issues/7516#issuecomment-1447565936

The old code had:

```
nonzeros = zip(clipcond.dims, np.nonzero(clipcond.values))
```

This loaded the array once and then passed numpy values to the indexing code.

Now the dask array is passed to the indexing code and is computed many times. #5873 raises an error saying boolean indexing with dask arrays is not allowed.

For this case, just do `ds.where(sel.compute(), drop=True)`. It's identical to what was happening earlier.

I think we should close this.
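The recommendation can be sketched end to end: computing the condition first hands a numpy-backed mask to the indexing machinery, mirroring what the pre-regression `np.nonzero(clipcond.values)` path did internally (synthetic dataset, names illustrative):

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {
        "longitude": ("points", np.linspace(-180, 180, 20)),
        "data": ("points", np.arange(20.0)),
    }
).chunk({"points": 5})

sel = (ds["longitude"] > 0) & (ds["longitude"] < 100)

# Compute the condition once; .where() then indexes with a concrete
# numpy boolean array instead of re-evaluating a dask graph.
subset = ds.where(sel.compute(), drop=True)
```

With 20 evenly spaced longitudes, five of them (roughly 9.5 through 85.3) fall strictly inside (0, 100), so `subset` keeps five points.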

headtr1ck (COLLABORATOR) · 2023-02-27T20:27:52Z · https://github.com/pydata/xarray/issues/7516#issuecomment-1447037080

I am a bit puzzled here... The dask graph looks identical, so it must be the way the indexers are constructed.

The major difference I can find is that the old version used np.unique, while the new version uses xarray's cond.any(...).

Maybe someone with more experience in dask can help out?

headtr1ck (COLLABORATOR) · 2023-02-26T21:16:35Z · https://github.com/pydata/xarray/issues/7516#issuecomment-1445469752

Git bisect pinpoints this to https://github.com/pydata/xarray/pull/6690, which, funnily enough, is my own PR haha. I will look into it when I find time :)

headtr1ck (COLLABORATOR) · 2023-02-26T21:07:56Z · https://github.com/pydata/xarray/issues/7516#issuecomment-1445467918

Can confirm: on my machine it went from 520 ms to 5 s.



CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);
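For reference, the listing above is just a query over this table. A minimal sketch with Python's stdlib sqlite3 module, using the schema as shown (the REFERENCES clauses are dropped so the snippet is self-contained, and the inserted row is a cut-down copy of the first comment above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER,
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER
);
CREATE INDEX [idx_issue_comments_issue] ON [issue_comments] ([issue]);
""")

conn.execute(
    "INSERT INTO issue_comments (id, user, updated_at, author_association, issue)"
    " VALUES (?, ?, ?, ?, ?)",
    (1532601237, 1492047, "2023-05-03T07:58:22Z", "CONTRIBUTOR", 1575938277),
)

# The page above corresponds to this query (filter on the issue id,
# newest comment first); idx_issue_comments_issue covers the WHERE clause.
rows = conn.execute(
    "SELECT id, author_association FROM issue_comments"
    " WHERE issue = ? ORDER BY updated_at DESC",
    (1575938277,),
).fetchall()
```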
Powered by Datasette · Queries took 798.735ms · About: xarray-datasette