
issue_comments


10 rows where issue = 1495605827 sorted by updated_at descending


Issue: groupby+map performance regression on MultiIndex dataset (pydata/xarray#7376)
ravwojdyla (NONE) · 2022-12-15T13:33:44Z
https://github.com/pydata/xarray/issues/7376#issuecomment-1353074006

@benbovy thanks for the context and the PR #7382, exciting to see the improvement!

benbovy (MEMBER) · 2022-12-15T12:27:37Z
https://github.com/pydata/xarray/issues/7376#issuecomment-1352989233

> Thanks @benbovy! Are you also aware of the issue with plain assign being slower on MultiIndex (comment above: https://github.com/pydata/xarray/issues/7376#issuecomment-1350446546)? Do you know what could be the issue there by any chance?

I see that in ds.assign(foo=~ds["d3"]), the coordinates of ~ds["d3"] are dropped (#2087), which triggers re-indexing of the multi-index when aligning ds with ~ds["d3"]. This is quite an expensive operation.

It is not clear to me what a clean fix would look like (see, e.g., #2180), but we could probably optimize the alignment logic so that no re-indexing is performed when all unindexed dimension sizes match the indexed dimension sizes (as in your example).
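A minimal sketch of the alignment-avoiding alternative, on hypothetical toy data (the dataset shape and names are assumptions, not the issue's actual reproduction): passing a bare (dims, data) tuple to assign attaches the variable without triggering alignment, since the operand carries no indexed coordinates.

```python
import numpy as np
import xarray as xr

# Hypothetical small dataset shaped like the issue's example: a boolean
# variable "d3" on a dimension "z" stacked from ("i1", "i2").
ds = xr.Dataset(
    {"d3": (("i1", "i2"), np.arange(8).reshape(2, 4) % 2 == 0)},
    coords={"i1": [0, 1], "i2": [0, 1, 2, 3]},
).stack(z=("i1", "i2"))

# Path discussed above: the operand's coordinates may be dropped,
# forcing assign() to re-align (and re-index the MultiIndex).
slow = ds.assign(foo=~ds["d3"])

# Alignment-free alternative: pass a bare (dims, data) tuple, so there
# is nothing to align against.
fast = ds.assign(foo=(ds["d3"].dims, ~ds["d3"].data))
```

Both produce the same "foo" values; only the second avoids the alignment step.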

ravwojdyla (NONE) · 2022-12-15T01:18:32Z
https://github.com/pydata/xarray/issues/7376#issuecomment-1352434183

Thanks @benbovy! Are you also aware of the issue with plain assign being slower on MultiIndex (comment above: https://github.com/pydata/xarray/issues/7376#issuecomment-1350446546)? Do you know what could be the issue there by any chance?

benbovy (MEMBER) · 2022-12-14T22:43:11Z (edited 2022-12-14T22:47:37Z)
https://github.com/pydata/xarray/issues/7376#issuecomment-1352318926

> Are you aware of any workarounds for this issue with the current code (assuming I would like to preserve the MultiIndex)?

Unfortunately I don't know of any workaround that would preserve the MultiIndex. Depending on how you use the multi-index, you could instead set two single indexes for "i1" and "i2" respectively (this is supported now via set_xindex()). I think that groupby will work well in that case. If you really need a multi-index, you could still build it afterwards from the groupby result.

ravwojdyla (NONE) · 2022-12-14T14:40:18Z (edited 2022-12-14T19:47:12Z)
https://github.com/pydata/xarray/issues/7376#issuecomment-1351540523

👋 @benbovy thanks for the update. Looking at https://github.com/pydata/xarray/pull/5692, it must have been a huge effort, thank you for your work on that! Coming back to this issue: in the example above, version 2022.6.0 is about 600x slower, and in our internal code it would not finish in a reasonable time, which forced us to downgrade to 2022.3.0. Are you aware of any workarounds for this issue with the current code (assuming I would like to preserve the MultiIndex)?

benbovy (MEMBER) · 2022-12-14T09:40:57Z
https://github.com/pydata/xarray/issues/7376#issuecomment-1350738301

Thanks for the report @ravwojdyla.

Since #5692, multi-index levels each have their own coordinate variable, so copying takes a bit more time as we need to create more variables. Not sure what's happening with _maybe_cast_to_cftimeindex, though.

The real issue here, however, is the same as in #6836. In your example, .groupby("i1") creates 400,000 groups whereas it should create only 4.
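The expected grouping semantics can be illustrated with plain pandas (a toy index, not the 400,000-group reproduction): grouping by one level of a MultiIndex should yield one group per unique level value.

```python
import pandas as pd

# Toy MultiIndex: 4 unique "i1" values, 10 "i2" values each.
mi = pd.MultiIndex.from_product([range(4), range(10)], names=["i1", "i2"])
s = pd.Series(range(len(mi)), index=mi)

# Grouping by the "i1" level gives one group per unique value, not one
# group per row.
print(s.groupby(level="i1").ngroups)  # → 4
```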

Reactions: 👀 1
ravwojdyla (NONE) · 2022-12-14T06:03:15Z (edited 2022-12-14T06:06:02Z)
https://github.com/pydata/xarray/issues/7376#issuecomment-1350446546

FYI, this might warrant a separate issue(?), but assigning a new DataArray, e.g. ds.assign(foo=~ds["d3"]), is also several times slower since 2022.6.0 (same commit): on 4M elements with the same keys as above it is about 7x slower.

ravwojdyla (NONE) · 2022-12-14T04:44:04Z (edited 2022-12-14T04:52:26Z)
https://github.com/pydata/xarray/issues/7376#issuecomment-1350390046

And I just want to point out that the stack traces/profiles look very different between 2022.3.0 and main/latest. It looks like https://github.com/pydata/xarray/blob/021c73e12cccb06c017ce6420dd043a0cfbf9f08/xarray/core/indexes.py#L185 might be a fairly expensive operation. Separately, there seems to be quite a bit of time spent in the copy -> copy_indexes path (deep copy?).

ravwojdyla (NONE) · 2022-12-14T04:22:28Z (edited 2022-12-14T04:27:52Z)
https://github.com/pydata/xarray/issues/7376#issuecomment-1350378571

3ead17ea9e99283e2511b65b9d864d1c7b10b3c4 (https://github.com/pydata/xarray/pull/5692) seems to be the commit that introduced this regression (cc: @benbovy).

ravwojdyla (NONE) · 2022-12-14T04:04:16Z
https://github.com/pydata/xarray/issues/7376#issuecomment-1350366220

Also recorded py-spy flamegraphs and exported them in speedscope format at https://gist.github.com/ravwojdyla/3b791debd3f97707d84748446dc07e39; you can view them at https://www.speedscope.app/.
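For reference, a py-spy invocation along these lines produces a speedscope-format profile (the script name is a placeholder, not from this thread):

```shell
# Record a profile of a reproduction script in speedscope format;
# the resulting file can be opened at https://www.speedscope.app/.
py-spy record --format speedscope -o profile.speedscope.json -- python repro.py
```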


CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);
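As a sketch, the row selection described at the top of this page ("10 rows where issue = 1495605827 sorted by updated_at descending") can be reproduced against this schema with sqlite3. The two inserted rows below are placeholders built from ids and timestamps in the listing above; the foreign-key REFERENCES clauses are dropped since the users and issues tables are not part of this page.

```python
import sqlite3

# Build the schema shown above in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER,
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER
);
CREATE INDEX [idx_issue_comments_issue] ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user] ON [issue_comments] ([user]);
""")

# Placeholder rows using two comment ids from the listing above.
conn.executemany(
    "INSERT INTO issue_comments (id, user, created_at, updated_at, "
    "author_association, body, issue) VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        (1350366220, 1419010, "2022-12-14T04:04:16Z",
         "2022-12-14T04:04:16Z", "NONE", "...", 1495605827),
        (1353074006, 1419010, "2022-12-15T13:33:44Z",
         "2022-12-15T13:33:44Z", "NONE", "...", 1495605827),
    ],
)

# The page's query: comments on this issue, newest first.
rows = conn.execute(
    "SELECT id FROM issue_comments WHERE issue = 1495605827 "
    "ORDER BY updated_at DESC"
).fetchall()
print([r[0] for r in rows])  # → [1353074006, 1350366220]
```

ISO-8601 timestamps sort correctly as text, so the ORDER BY needs no date parsing.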