issue_comments

12 rows where author_association = "MEMBER", issue = 262642978 and user = 4160723 sorted by updated_at descending

id html_url issue_url node_id user created_at updated_at ▲ author_association body reactions performed_via_github_app issue
1259326037 https://github.com/pydata/xarray/issues/1603#issuecomment-1259326037 https://api.github.com/repos/pydata/xarray/issues/1603 IC_kwDOAMm_X85LD8pV benbovy 4160723 2022-09-27T10:50:36Z 2022-09-27T10:50:36Z MEMBER

Should we close this issue and continue the discussion in #6293?

For anyone who wants to track the progress on this topic: https://github.com/pydata/xarray/projects/1

{
    "total_count": 2,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 2,
    "eyes": 0
}
  Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
949494376 https://github.com/pydata/xarray/issues/1603#issuecomment-949494376 https://api.github.com/repos/pydata/xarray/issues/1603 IC_kwDOAMm_X844mCJo benbovy 4160723 2021-10-22T10:27:26Z 2021-10-22T10:27:26Z MEMBER

> well, both "contain the origin dims" or just "generate another one" have its benefit.

Agreed, and both are supported by xarray actually. In case we want to keep the original dimensions like ("x", "y") in the example above, it's better to use masking.
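
To make the two options concrete, here is a toy sketch in plain Python. It uses nested lists rather than xarray objects, and all values as well as the names `C2`, `A1` and `wanted` are invented for illustration. Masking keeps the ("x", "y") shape and blanks out the cells that don't match, while packing produces a new 1-d result.

```python
# Toy data: a 2-d label coordinate C2 and a data variable A1, both on (x, y).
# All values are invented for illustration.
C2 = [["a", "b", "c"],
      ["d", "e", "f"]]
A1 = [[4.0, 1.0, 2.0],
      [5.0, 3.0, 6.0]]
wanted = {"a", "e"}

# Masking: same shape as the input, None where the label is not selected.
masked = [
    [value if label in wanted else None
     for label, value in zip(label_row, value_row)]
    for label_row, value_row in zip(C2, A1)
]

# Packing: a new 1-d result, one entry per matching cell (row-major order).
packed = [
    value
    for label_row, value_row in zip(C2, A1)
    for label, value in zip(label_row, value_row)
    if label in wanted
]

print(masked)  # [[4.0, None, None], [None, 3.0, None]]
print(packed)  # [4.0, 3.0]
```

With real xarray objects the masking variant would presumably go through where(), which keeps the original dimensions.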

This discussion is broader than the topic covered in this issue so I'd suggest you start a new discussion if you want to further discuss this with the xarray community. Thanks.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
949449312 https://github.com/pydata/xarray/issues/1603#issuecomment-949449312 https://api.github.com/repos/pydata/xarray/issues/1603 IC_kwDOAMm_X844l3Jg benbovy 4160723 2021-10-22T09:28:01Z 2021-10-22T09:28:01Z MEMBER

For such a case you could already do ds.stack(z=("t", "x")).set_index(z="C2").sel(z=["a", "e", "h"]).

After the explicit index refactor, we could imagine a custom index that supports multi-dimension coordinates such that you would only need to do something like

```python
>>> S_res = S4.sel(C2=("z", ["a", "e", "h"]))
>>> S_res
<xarray.Dataset>
Dimensions:  (z: 3)
Coordinates:
  * C2       (z) <U1 'a' 'e' 'h'
Data variables:
    A1       (z) float64 4 3 3
```

or without explicitly providing the name of the packed dimension:

```python
>>> S_res = S4.sel(C2=["a", "e", "h"])
>>> S_res
<xarray.Dataset>
Dimensions:  (C2: 3)
Coordinates:
  * C2       (C2) <U1 'a' 'e' 'h'
Data variables:
    A1       (C2) float64 4 3 3
```
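
For what it's worth, the lookup part of such a custom index can be pictured with a small plain-Python sketch. This is only a toy model under my own assumptions: `Toy2DLabelIndex` is a hypothetical name, not an xarray class, the data are invented, and labels are assumed to be unique.

```python
# Hypothetical toy index over a 2-d coordinate; not an xarray API.
class Toy2DLabelIndex:
    """Maps each (unique) label of a 2-d coordinate to its (row, column)."""

    def __init__(self, labels):
        self._positions = {
            label: (i, j)
            for i, row in enumerate(labels)
            for j, label in enumerate(row)
        }

    def sel(self, wanted):
        # Return positions in the order the labels were requested.
        return [self._positions[label] for label in wanted]


# Invented data on dimensions (t, x).
C2 = [["a", "b", "c", "d"],
      ["e", "f", "g", "h"]]
A1 = [[4.0, 0.0, 1.0, 2.0],
      [3.0, 5.0, 6.0, 3.0]]

index = Toy2DLabelIndex(C2)
positions = index.sel(["a", "e", "h"])

# Gather the data at those positions along a new, packed dimension.
packed = [A1[t][x] for t, x in positions]

print(positions)  # [(0, 0), (1, 0), (1, 3)]
print(packed)     # [4.0, 3.0, 3.0]
```

The index is built once from the coordinate and then reused for every selection, which is the point of keeping it as a separate object.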

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
949413144 https://github.com/pydata/xarray/issues/1603#issuecomment-949413144 https://api.github.com/repos/pydata/xarray/issues/1603 IC_kwDOAMm_X844luUY benbovy 4160723 2021-10-22T08:41:36Z 2021-10-22T08:41:36Z MEMBER

Sorry, but this is confusing. To me it still looks like you want implicit broadcasting of the A3 variable along the x dimension. In your last comment you depict A3 inconsistently, with a 2-d shape but with only the t dimension. I'm also not sure how your suggestion relates to the issue here.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
949358898 https://github.com/pydata/xarray/issues/1603#issuecomment-949358898 https://api.github.com/repos/pydata/xarray/issues/1603 IC_kwDOAMm_X844lhEy benbovy 4160723 2021-10-22T07:22:24Z 2021-10-22T07:22:24Z MEMBER

Thanks for the detailed description @weipeng1999. For the first 4 slides, I don't see how this differs from what S_res = S1.sel(C1=['a', 'b']) and S_res = S2.sel(C1=['a', 'b']) currently do. And for the last 2 slides, I don't think that we always want such implicit broadcasting for dimensions that are not involved in the indexed coordinates.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
946474674 https://github.com/pydata/xarray/issues/1603#issuecomment-946474674 https://api.github.com/repos/pydata/xarray/issues/1603 IC_kwDOAMm_X844ag6y benbovy 4160723 2021-10-19T08:19:54Z 2021-10-19T08:19:54Z MEMBER

Hi @weipeng1999,

I'm not sure I fully understand your suggestion. Would you mind sharing some illustrative examples?

It is useful to have two distinct concepts, coordinate variables and data variables. Although both are data arrays, the former are used to locate data in the dimensional space(s) defined by all dimensions in the dataset, while the latter are used to store field data.

It also helps to have a clear separation between the coordinate variable and index concepts. An index is a specific data structure or object that allows efficient data extraction or alignment based on one or more coordinate labels. Sometimes an index object may be handled like a data array (like pandas indexes) but this is not always the case (e.g., a KD-Tree).

Currently in Xarray the index concept is hidden behind "dimension" coordinate variables. The goal of the explicit index refactor is to bring it to light and make it available to any coordinate (and also open it to custom index structures, not only pandas indexes).

It looks like what you suggest is some kind of implicit (co-)indexes hidden behind any dataset variable(s)? We actually took the opposite direction, trying to make everything explicit.
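
A toy way to picture the separation, in plain Python (this is not xarray code, and the station names and values are invented): the coordinate is just labelled data, and two different index structures built from it answer the same lookup.

```python
import bisect

# A coordinate is just labelled data along a dimension (invented values).
station = ["bern", "geneva", "lugano", "zurich"]   # coordinate (sorted here)
temperature = [11.5, 13.0, 15.2, 12.1]             # data variable

# Index 1: a hash table built from the coordinate.
hash_index = {label: pos for pos, label in enumerate(station)}


# Index 2: binary search on the sorted coordinate.
def sorted_lookup(labels, label):
    pos = bisect.bisect_left(labels, label)
    if pos == len(labels) or labels[pos] != label:
        raise KeyError(label)
    return pos


pos_a = hash_index["lugano"]
pos_b = sorted_lookup(station, "lugano")
print(pos_a, pos_b, temperature[pos_a])  # 2 2 15.2
```

Both structures return the same position, but neither of them is the coordinate itself, and a structure like a KD-Tree would not even look like an array.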

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
444403484 https://github.com/pydata/xarray/issues/1603#issuecomment-444403484 https://api.github.com/repos/pydata/xarray/issues/1603 MDEyOklzc3VlQ29tbWVudDQ0NDQwMzQ4NA== benbovy 4160723 2018-12-05T08:39:35Z 2018-12-05T08:39:35Z MEMBER

> I guess the error is probably the best idea.

Agreed. It seems very strict indeed, but it will be easier to relax this later than the other way around. There is also a (very rare?) case where the two indexed coordinates have the same labels but are named differently in the two datasets (e.g., station_name and sname). In that case an error is probably better too. It would be a sort of indication that the most useful thing to do for future operations is to rename one of those coordinates first.

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
444132393 https://github.com/pydata/xarray/issues/1603#issuecomment-444132393 https://api.github.com/repos/pydata/xarray/issues/1603 MDEyOklzc3VlQ29tbWVudDQ0NDEzMjM5Mw== benbovy 4160723 2018-12-04T15:06:21Z 2018-12-04T15:19:08Z MEMBER

> It occurs to me that for the case of "multiple single indexes" along the same dimension there is no good way to use them simultaneously for indexing/reindexing at the same time.

Sorry for maybe asking this again, but I'm a bit confused now: is there any good reason for supporting "multiple single indexes" along the same dimension?

After all, perhaps better defaults would be to set indexes (pandas.Index) only for 1-d coordinates matching dimension names, as is the case now.

If you want a different behavior, then you need to use .set_index(), which would raise if it results in multiple single indexes along a dimension. We could also add a new indexes argument to the Dataset / DataArray constructors to save some typing (and avoid the creation of in-memory pandas.Index for very long coordinates if an out-of-core alternative is later supported).

> da[dim_name] should return all the indexes on that dimension

I think that one big source of confusion so far has been the mixing of coordinates/variables and indexes. These are really two separate concepts, and the indexes refactoring should address that IMHO.

For example, I think that da[some_name] should never return indexes but only coordinates (and/or data variables for Dataset). That would be much simpler.

Take for example

```python
>>> import numpy as np
>>> import xarray as xr
>>> da = xr.DataArray(np.random.rand(2, 2),
...                   dims=('one', 'two'),
...                   coords={'one_labels': ('one', ['a', 'b'])})
>>> da
<xarray.DataArray (one: 2, two: 2)>
array([[ 0.536028,  0.291895],
       [ 0.682108,  0.926003]])
Coordinates:
    one_labels  (one) <U1 'a' 'b'
Dimensions without coordinates: one, two
```

I find it so weird being able to do this:

```python
>>> da['one']
<xarray.DataArray 'one' (one: 2)>
array([0, 1])
Coordinates:
    one_labels  (one) <U1 'a' 'b'
Dimensions without coordinates: one
```

Where does array([0, 1]) come from? I wouldn't have been surprised if a KeyError was raised instead. Perhaps this specific case was initially for backward compatibility when the "dimensions without indexes" feature was introduced, but that was a long time ago and I'm not sure this is still necessary.

It might be a good thing to explicitly require da.set_index('one_labels') to enable indexing/alignment (edit: label indexing/alignment) along dimension one in the example above.
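
To spell out the behavior I have in mind, here is a rough plain-Python model. `ToyArray`, its `set_index` and `sel_position` are hypothetical stand-ins that I made up for this sketch; they are not xarray's DataArray API and only mimic the rules described above.

```python
class ToyArray:
    """Hypothetical model of the proposed behaviour; not xarray's DataArray."""

    def __init__(self, dims, coords):
        self.dims = dims          # e.g. ("one", "two")
        self.coords = coords      # name -> (dim, labels)
        self.indexes = {}         # dim -> {label: position}

    def __getitem__(self, name):
        # Only real coordinates are returned; a bare dimension name is an error.
        if name not in self.coords:
            raise KeyError(name)
        return self.coords[name][1]

    def set_index(self, coord_name):
        # Explicitly build an index for the coordinate's dimension.
        dim, labels = self.coords[coord_name]
        self.indexes[dim] = {label: pos for pos, label in enumerate(labels)}

    def sel_position(self, dim, label):
        # Label-based lookup only works once an index has been set.
        if dim not in self.indexes:
            raise KeyError(f"no index set along dimension {dim!r}")
        return self.indexes[dim][label]


da = ToyArray(dims=("one", "two"), coords={"one_labels": ("one", ["a", "b"])})

print(da["one_labels"])             # ['a', 'b']
try:
    da["one"]
except KeyError as err:
    print("KeyError:", err)         # KeyError: 'one'

da.set_index("one_labels")
print(da.sel_position("one", "b"))  # 1
```

Nothing is created implicitly here: name lookup only sees coordinates, and label lookup only works after an explicit set_index call.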

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
443172604 https://github.com/pydata/xarray/issues/1603#issuecomment-443172604 https://api.github.com/repos/pydata/xarray/issues/1603 MDEyOklzc3VlQ29tbWVudDQ0MzE3MjYwNA== benbovy 4160723 2018-11-30T11:14:24Z 2018-11-30T11:14:24Z MEMBER

A couple of thoughts:

If nothing useful can be done in the case of "multiple single indexes", would it make sense to discourage users from explicitly creating multiple single indexes along a dimension? "Multiple single indexes" would be just a default situation when nothing specific has been defined yet, or the result of a fallback.

For example, why not require that set_index(['x', 'y']) (with a list as argument) should always result in a multi-index regardless of the kind argument, i.e., raise if a single index is given? This is close to the current behavior, I think. This would require calling set_index for each single index that we want to (re)define, but I don't think setting a lot of single indexes at the same time is something that often happens.

Hence, would it be possible to avoid append=None and instead change the default to append=True?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
442907394 https://github.com/pydata/xarray/issues/1603#issuecomment-442907394 https://api.github.com/repos/pydata/xarray/issues/1603 MDEyOklzc3VlQ29tbWVudDQ0MjkwNzM5NA== benbovy 4160723 2018-11-29T16:49:12Z 2018-11-29T17:18:10Z MEMBER

> ds.sel(multi=list_of_pairs) can probably be replaced by ds.sel(x=..., y=...), but how about reindex along MultiIndex?

Indeed I haven't really thought about reindex and alignment in my suggestion above.

How do you currently reindex along a multi-index dimension?

Contrary to .sel, ds.reindex(multi=list_of_pairs) doesn't seem to work (the list of n-length tuples being interpreted as a ~~n-dim~~ 2-d array). The only way I've found to make it work is to pass another pandas.MultiIndex. Wouldn't it be rather confusing if we chose to go with our own implementation of MultiIndex for xarray instead of pandas.MultiIndex?

Wouldn't it be possible to easily support ds.reindex(x=..., y=...) within the new data model proposed here?

> Am I right in thinking the Multi-indexes is only a helpful note to users, rather than conveying anything about how data is accessed?

This is a good question.

A related question: apart from the ds.sel(multi=list_of_pairs) and ds.reindex(multi=list_of_pairs) use cases discussed so far, are there other reasons for having a variable for a multi-index?

> I think we can do much of this before adding the ability to set custom indexes, which would be cool but further from where we are, I think.

I agree, although whether or not we will eventually support custom indexes might influence the design choices that we have to make now, IMO.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
442797084 https://github.com/pydata/xarray/issues/1603#issuecomment-442797084 https://api.github.com/repos/pydata/xarray/issues/1603 MDEyOklzc3VlQ29tbWVudDQ0Mjc5NzA4NA== benbovy 4160723 2018-11-29T11:15:17Z 2018-11-29T11:15:17Z MEMBER

> we will definitely have to make some intentional deviations from the behavior of pandas

Looking at the reported issues related to multi-indexes in xarray, I have the same feeling. Simply reusing pandas.MultiIndex in xarray, where slightly different semantics are generally expected, has proven painful. It seems easier to have our own baked solution and deal with differences during xarray <-> pandas conversion if needed.

If we re-design indexes so that we allow 3rd-party indexes, maybe we could support both and let users choose the one (xarray or pandas baked) that best suits their needs?

Regarding MultiIndex as part of the data schema vs an implementation detail, if we support extending indexes (and already given the different kinds of multi-coordinate indexes: MultiIndex, KDTree, etc.), then I think that it should be transparent to the user.

However, I don't really see why a multi-coordinate index should have its own variable (with tuples of values). I don't want to speak for others, but IMHO ds.sel(multi=list_of_pairs) is rather an edge case and I'm not sure if we really need to support it. Using ds.sel(x=..., y=...) with DataArray objects is certainly more code to write, but this form of indexing is very powerful and it might not be a bad idea to encourage it.

If a variable for each multi-coordinate index is "just" for data schema consistency, then why not show all those indexes in a separate section of the repr? For example:

    Coordinates:
      * level_1  (x) object 'a' 'a' 'b' 'b'
      * level_2  (x) int64 1 2 1 2
    Multi-indexes:
        pandas.MultiIndex [level_1, level_2]

It is equally transparent, not more verbose, and it is clear that multi-indexes are not part of the coordinates (in fact there is no need for "virtual" coordinates either, nor to name the index). I don't think single indexes should be shown here as it would result in duplicated, uninformative lines.

More generally, here is how I would see indexes handled in xarray (I might be missing important aspects, though):

  • Default behavior: all 1-dimensional coordinates each have their own, single index (pandas.Index), unless explicitly stated.
  • Explicit API is used for setting new, possibly multi-coordinate indexes (see the examples just below). Note the absence of a keyword argument to specify the variables: this is actually more consistent with the pandas API, but it would be a breaking change and I don't know what a smooth transition could look like.
    • set_index(['x', 'y'], kind='multiindex') # xarray built-in index
    • set_index(['x', 'y'], kind='kdtree') # xarray built-in index
    • set_index('x', kind=ASingleIndexWrapperClass) # 3rd-party index
  • If a coordinate is removed from the Dataset or if its index is reset or changed:
    • If the coordinate had a single index, no problem
    • If the coordinate was part of a multi-coordinate index: a new index is built from all remaining coordinates that were also part of the original index, if it is supported. Otherwise, the original index is removed and the default behavior (single pandas.Index) is reset for all those remaining coordinates.
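
The rules above could be modelled roughly as follows. This is only a bookkeeping sketch in plain Python under my own assumptions: `ToyIndexes`, `drop_coord` and the `rebuildable` parameter are hypothetical names, nothing here is xarray code, and only the grouping of coordinates into indexes is tracked, not the index data itself.

```python
class ToyIndexes:
    """Hypothetical bookkeeping for the rules listed above; not xarray code."""

    def __init__(self, coord_names):
        # Default: every 1-d coordinate gets its own single index.
        self.groups = {(name,): "single" for name in coord_names}

    def set_index(self, names, kind="single"):
        names = tuple([names] if isinstance(names, str) else names)
        # Coordinates first leave whatever index they belonged to before.
        for name in names:
            self.drop_coord(name)
        self.groups[names] = kind

    def drop_coord(self, name, rebuildable=("multiindex",)):
        for members, kind in list(self.groups.items()):
            if name not in members:
                continue
            del self.groups[members]
            rest = tuple(m for m in members if m != name)
            if len(rest) > 1 and kind in rebuildable:
                # Rebuild the same kind of index from the remaining coordinates.
                self.groups[rest] = kind
            else:
                # Fall back to one single index per remaining coordinate.
                for m in rest:
                    self.groups[(m,)] = "single"


idx = ToyIndexes(["x", "y", "z", "t"])
idx.set_index(["x", "y", "z"], kind="multiindex")
idx.drop_coord("z")   # multi-index rebuilt from x and y
print(idx.groups)     # {('t',): 'single', ('x', 'y'): 'multiindex'}
idx.drop_coord("y")   # only x remains, so it falls back to a single index
print(idx.groups)     # {('t',): 'single', ('x',): 'single'}
```

An index kind that cannot be rebuilt from fewer coordinates (left out of `rebuildable` here) simply falls back to single indexes for whatever remains.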
{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
334091075 https://github.com/pydata/xarray/issues/1603#issuecomment-334091075 https://api.github.com/repos/pydata/xarray/issues/1603 MDEyOklzc3VlQ29tbWVudDMzNDA5MTA3NQ== benbovy 4160723 2017-10-04T08:52:08Z 2017-10-04T08:52:08Z MEMBER

I think that promoting "Indexes" to a first-class concept is indeed a very good idea, at both internal and public levels, even if at the latter level it would be another concept for users (it should be already familiar for pandas users, though). IMHO the "coordinate" and "index" concepts are different enough to consider them separately.

I like the proposed repr for Dataset.indexes. I wouldn't mind if it is not included in Dataset.__repr__, considering that multi-indexes, kdtree, etc. only represent a few use cases. In too many cases it could result in a long, uninformative list of simple pandas.Index.

I have to think a bit more about the details but I like the idea.

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);