issue_comments


5 rows where author_association = "MEMBER" and issue = 864249974 sorted by updated_at descending


user 3

  • benbovy 2
  • max-sixty 2
  • shoyer 1

issue 1

  • Make creating a MultiIndex in stack optional · 5 ✖

author_association 1

  • MEMBER · 5 ✖
id html_url issue_url node_id user created_at updated_at ▲ author_association body reactions performed_via_github_app issue
935683056 https://github.com/pydata/xarray/issues/5202#issuecomment-935683056 https://api.github.com/repos/pydata/xarray/issues/5202 IC_kwDOAMm_X843xWPw benbovy 4160723 2021-10-06T07:50:59Z 2021-10-06T07:50:59Z MEMBER

From https://github.com/pydata/xarray/pull/5692#issuecomment-925718593:

One change is that a multi-index is not always created with stack. It is created only if each of the dimensions to stack together has one and only one coordinate with a pandas index (this could be a non-dimension coordinate).

This could maybe address #5202, since we could simply drop the indexes before stacking the dimensions in order to avoid the creation of a multi-index. I don't think it's a big breaking change either unless there are users who rely on default multi-indexes with range (0, 1, 2...) levels. Looking at #5202, however, those default multi-indexes seem more problematic than something really useful, but I might be wrong here. Also, range-based indexes can still be created explicitly before stacking the dimensions if needed.

Another consequence is that stack is not always reversible, since unstack still requires a pandas multi-index (one and only one multi-index per dimension to unstack).

cc @pydata/xarray as this is an improvement regarding this issue but also a sensitive change. To ensure a smoother transition, we could maybe add a create_index option to stack which accepts these values:

  • True: always create a multi-index
  • False: never create a multi-index
  • None: create a multi-index only if we can unambiguously pick one index for each of the dimensions to stack

We can default to True now to avoid breaking changes and maybe later default to None. If we eventually add support for custom (non-pandas backed) indexes, we could also allow passing an xarray.Index class.
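The three proposed values can be read as a small decision rule. The following is only a sketch of that rule under stated assumptions: `should_create_multiindex` and its `indexed_coords_per_dim` argument are hypothetical names invented for illustration, not part of xarray, and the mapping simply counts how many pandas-indexed coordinates each dimension to stack has.

```python
from typing import Mapping, Optional


def should_create_multiindex(
    create_index: Optional[bool],
    indexed_coords_per_dim: Mapping[str, int],
) -> bool:
    """Hypothetical helper (not xarray API) mirroring the proposed option values."""
    if create_index is True:
        # True: always create a multi-index.
        return True
    if create_index is False:
        # False: never create a multi-index.
        return False
    # None: create one only if every dimension to stack has exactly one
    # indexed coordinate, so the choice of index is unambiguous.
    return all(count == 1 for count in indexed_coords_per_dim.values())


# Illustrative inputs: number of indexed coordinates per stacked dimension.
unambiguous = {"x": 1, "y": 1}
no_index_on_y = {"x": 1, "y": 0}

decision_unambiguous = should_create_multiindex(None, unambiguous)
decision_missing = should_create_multiindex(None, no_index_on_y)
```

Under this reading, `None` behaves like `True` when the indexes are unambiguous and like `False` otherwise, which is why it is a candidate for a later default.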

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Make creating a MultiIndex in stack optional 864249974
856296662 https://github.com/pydata/xarray/issues/5202#issuecomment-856296662 https://api.github.com/repos/pydata/xarray/issues/5202 MDEyOklzc3VlQ29tbWVudDg1NjI5NjY2Mg== benbovy 4160723 2021-06-07T22:10:15Z 2021-06-07T22:10:15Z MEMBER

> it seems like it could be a good idea to allow stack to skip creating a MultiIndex for the new dimension, via a new keyword argument such as ds.stack(index=False)

Dataset.stack might eventually accept any custom index (that supports it) if that makes sense. Would index=None be slightly better than index=False in that case? (considering that the default value would be index=PandasMultiIndex or something like that).

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Make creating a MultiIndex in stack optional 864249974
825494167 https://github.com/pydata/xarray/issues/5202#issuecomment-825494167 https://api.github.com/repos/pydata/xarray/issues/5202 MDEyOklzc3VlQ29tbWVudDgyNTQ5NDE2Nw== max-sixty 5635139 2021-04-23T08:30:55Z 2021-04-23T08:30:55Z MEMBER

Great, this seems like a good idea — at the very least an index=False option

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Make creating a MultiIndex in stack optional 864249974
824459878 https://github.com/pydata/xarray/issues/5202#issuecomment-824459878 https://api.github.com/repos/pydata/xarray/issues/5202 MDEyOklzc3VlQ29tbWVudDgyNDQ1OTg3OA== shoyer 1217238 2021-04-22T00:57:56Z 2021-04-22T00:57:56Z MEMBER

> Do we have any ideas on how expensive the MultiIndex creation is as a share of stack?

It depends, but it can easily be 50% to nearly 100% of the runtime. stack() uses reshape() on data variables, which is either free (for arrays that are still contiguous and can use views) or can be delayed until compute-time (with dask). In contrast, the MultiIndex is always created eagerly.
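The claim that reshape() can be free for contiguous arrays can be checked with plain NumPy. This is a minimal sketch that only looks at NumPy's behaviour, not at xarray's stack itself; the array shape and the variable names are arbitrary choices for illustration.

```python
import numpy as np

# C (row-major) order is NumPy's default memory layout.
c_arr = np.ones((4, 5))
# Fortran (column-major) order stores the same values column by column.
f_arr = np.ones((4, 5), order="F")

# Flatten both arrays using the default C-order traversal.
c_flat = c_arr.reshape(-1)
f_flat = f_arr.reshape(-1)

# A view shares memory with the original; a copy does not.
c_is_view = np.shares_memory(c_arr, c_flat)
f_is_view = np.shares_memory(f_arr, f_flat)
```

Here `c_is_view` is True (no data is moved), while `f_is_view` is False because the Fortran-ordered data has to be copied into the new layout. That is why a Fortran-ordered input makes the reshape part of stack do real work.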

If we use Fortran order arrays, we can get a rough lower bound on the time for MultiIndex creation, e.g., consider:

```python
import xarray
import numpy as np

a = xarray.DataArray(np.ones((5000, 5000), order='F'), dims=['x', 'y'])
%prun a.stack(z=['x', 'y'])
```

Not surprisingly, making the multi-index takes about half the runtime here.

Pandas does delay creating the actual hash-table behind a MultiIndex until it's needed, so I guess the main expense here is just allocating the new coordinate arrays.
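To get a feel for why allocating the new coordinate arrays is costly, it helps to see how large they are. The sketch below is not how pandas builds a MultiIndex internally; `stacked_level_codes` is a hypothetical helper that only shows that stacking dimensions of sizes n_x and n_y needs one integer array of length n_x * n_y per level.

```python
import numpy as np


def stacked_level_codes(n_x: int, n_y: int):
    """Hypothetical illustration: per-level position codes for a stacked (x, y) dimension."""
    # x varies slowest: each x position is repeated once per y position.
    x_codes = np.repeat(np.arange(n_x), n_y)
    # y varies fastest: the full y range is tiled once per x position.
    y_codes = np.tile(np.arange(n_y), n_x)
    return x_codes, y_codes


x_codes, y_codes = stacked_level_codes(3, 2)
```

For the 5000 by 5000 example above, each such array would hold 25 million entries, which is the same number of elements as the data variable being reshaped.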

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Make creating a MultiIndex in stack optional 864249974
824388578 https://github.com/pydata/xarray/issues/5202#issuecomment-824388578 https://api.github.com/repos/pydata/xarray/issues/5202 MDEyOklzc3VlQ29tbWVudDgyNDM4ODU3OA== max-sixty 5635139 2021-04-21T22:05:53Z 2021-04-21T22:05:53Z MEMBER

Do we have any ideas on how expensive the MultiIndex creation is as a share of stack?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Make creating a MultiIndex in stack optional 864249974


CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);