issue_comments: 442797084

html_url issue_url id node_id user created_at updated_at author_association body reactions performed_via_github_app issue
https://github.com/pydata/xarray/issues/1603#issuecomment-442797084 https://api.github.com/repos/pydata/xarray/issues/1603 442797084 MDEyOklzc3VlQ29tbWVudDQ0Mjc5NzA4NA== 4160723 2018-11-29T11:15:17Z 2018-11-29T11:15:17Z MEMBER

> we will definitely have to make some intentional deviations from the behavior of pandas

Looking at the reported issues related to multi-indexes in xarray, I have the same feeling. Simply reusing pandas.MultiIndex in xarray, where slightly different semantics are generally expected, has proven painful. It seems easier to have our own home-baked solution and deal with the differences during xarray <-> pandas conversion if needed.

If we redesign indexes so that we allow 3rd-party indexes, maybe we could support both and let users choose the one (xarray-baked or pandas-baked) that best suits their needs?

Regarding MultiIndex as part of the data schema vs an implementation detail, if we support extending indexes (and already given the different kinds of multi-coordinate indexes: MultiIndex, KDTree, etc.), then I think that it should be transparent to the user.

However, I don't really see why a multi-coordinate index should have its own variable (with tuples of values). I don't want to speak for others, but IMHO ds.sel(multi=list_of_pairs) is rather an edge case and I'm not sure we really need to support it. Using ds.sel(x=..., y=...) with DataArray objects is certainly more code to write, but this form of indexing is very powerful and it might not be a bad idea to encourage it.
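As a minimal sketch of that second form: the Dataset contents, the variable name temp and the dimension name points are made up for illustration. Selecting a list of (x, y) pairs pointwise can be written with DataArray indexers that share a new dimension:

```python
import numpy as np
import xarray as xr

# Made-up example data: a 3 x 4 grid with labelled x and y coordinates.
ds = xr.Dataset(
    {"temp": (("x", "y"), np.arange(12).reshape(3, 4))},
    coords={"x": [10, 20, 30], "y": [1, 2, 3, 4]},
)

# The (x, y) pairs one might otherwise pass as ds.sel(multi=list_of_pairs).
pairs = [(10, 1), (20, 3), (30, 4)]

# Indexers that share the same new dimension trigger pointwise
# (vectorized) selection instead of an outer product.
xs = xr.DataArray([p[0] for p in pairs], dims="points")
ys = xr.DataArray([p[1] for p in pairs], dims="points")

picked = ds.sel(x=xs, y=ys)
print(picked["temp"].values)  # [ 0  6 11]
```

The result has a single points dimension, with x and y kept as coordinates along it, so no tuple-valued variable is needed.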

If a variable for each multi-coordinate index is "just" for data schema consistency, then why not show all those indexes in a separate section of the repr? For example:

Coordinates:
  * level_1  (x) object 'a' 'a' 'b' 'b'
  * level_2  (x) int64 1 2 1 2
Multi-indexes:
    pandas.MultiIndex [level_1, level_2]

It is equally transparent, not more verbose, and it is clear that multi-indexes are not part of the coordinates (in fact there is no need for "virtual" coordinates either, nor to name the index). I don't think single indexes should be shown here as it would result in duplicated, uninformative lines.

More generally, here is how I would see indexes handled in xarray (I might be missing important aspects, though):

  • Default behavior: all 1-dimensional coordinates each have their own, single index (pandas.Index), unless explicitly stated.
  • An explicit API is used for setting new, possibly multi-coordinate indexes. Note the absence of a keyword argument below to specify the variables: this is actually more consistent with the pandas API, but it would be a breaking change and I don't know what a smooth transition could look like.
    • set_index(['x', 'y'], kind='multiindex') # xarray built-in index
    • set_index(['x', 'y'], kind='kdtree') # xarray built-in index
    • set_index('x', kind=ASingleIndexWrapperClass) # 3rd-party index
  • If a coordinate is removed from the Dataset or if its index is reset or changed:
    • If the coordinate had a single index, no problem
    • If the coordinate was part of a multi-coordinate index: a new index is built from all remaining coordinates that were also part of the original index, if it is supported. Otherwise, the original index is removed and the default behavior (single pandas.Index) is reset for all those remaining coordinates.
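
The rules above can be sketched with a toy registry in plain Python. Nothing here is xarray API: IndexRegistry, MIN_COORDS, drop_coord and the "enough coordinates remain" support check are all hypothetical names and assumptions, used only to make the fallback behavior concrete.

```python
from dataclasses import dataclass, field

# Assumption for illustration: an index kind "supports" being rebuilt
# if at least this many coordinates remain.
MIN_COORDS = {"pandas": 1, "multiindex": 2, "kdtree": 2}


@dataclass
class IndexRegistry:
    """Toy bookkeeping: maps a tuple of coordinate names to an index kind."""

    indexes: dict = field(default_factory=dict)

    def add_coord(self, name):
        # Default behavior: every 1-d coordinate gets its own single index.
        self.indexes[(name,)] = "pandas"

    def _detach(self, name):
        # Find the index covering `name`, if any, and take the coordinate out.
        key = next((k for k in self.indexes if name in k), None)
        if key is None:
            return
        kind = self.indexes.pop(key)
        remaining = tuple(c for c in key if c != name)
        if not remaining:
            return  # it was a single index: no problem
        if len(remaining) >= MIN_COORDS.get(kind, 1):
            # Rebuild the same kind of index from the remaining coordinates.
            self.indexes[remaining] = kind
        else:
            # Not supported: fall back to one default index per coordinate.
            for c in remaining:
                self.indexes[(c,)] = "pandas"

    def set_index(self, coords, kind="pandas"):
        coords = (coords,) if isinstance(coords, str) else tuple(coords)
        for c in coords:
            self._detach(c)
        self.indexes[coords] = kind

    def drop_coord(self, name):
        self._detach(name)


reg = IndexRegistry()
for name in ("x", "y", "z"):
    reg.add_coord(name)

reg.set_index(["x", "y", "z"], kind="multiindex")
reg.drop_coord("z")
after_first_drop = dict(reg.indexes)   # index rebuilt from x and y
reg.drop_coord("y")
after_second_drop = dict(reg.indexes)  # x falls back to a single default index

print(after_first_drop)
print(after_second_drop)
```

In this sketch, dropping z leaves a two-level multi-index on (x, y), while dropping y afterwards leaves too few coordinates for a multi-index, so x gets the default single index back.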
{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  262642978