
issue_comments: 379905457


html_url issue_url id node_id user created_at updated_at author_association body reactions performed_via_github_app issue
https://github.com/pydata/xarray/issues/1603#issuecomment-379905457 https://api.github.com/repos/pydata/xarray/issues/1603 379905457 MDEyOklzc3VlQ29tbWVudDM3OTkwNTQ1Nw== 1217238 2018-04-09T21:52:02Z 2018-04-11T04:34:43Z MEMBER

I've been thinking about getting started on this. Here are my current thoughts on the right design approach.


Data model

Dataset.indexes and DataArray.indexes

My current thinking is that indexes should simply be a dictionary mapping from coordinate and/or dimension names to pandas.Index objects. Mapping from label-based to integer-based indexing then becomes simply a matter of looking up the appropriate indexes for each coordinate/dimension (i.e., the keyword argument names in .sel()), and using the corresponding index(es) to transform label-based indexers into integer indexers.
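To make the lookup concrete, here is a minimal sketch using plain pandas. The `indexes` dict and the `labels_to_positions` helper are hypothetical names for illustration, not xarray API; only the pandas calls (`get_loc`, `get_indexer`, `slice_indexer`) are real.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the proposed mapping: name -> pandas.Index
indexes = {
    "x": pd.Index([10, 20, 30], name="x"),
    "time": pd.date_range("2000-01-01", periods=4, name="time"),
}

def labels_to_positions(indexes, **label_indexers):
    """Translate label-based indexers (as passed to .sel()) into integer indexers."""
    positions = {}
    for name, label in label_indexers.items():
        index = indexes[name]
        if isinstance(label, slice):
            # Label slices become positional slices
            positions[name] = index.slice_indexer(label.start, label.stop, label.step)
        elif np.ndim(label) == 0:
            # A scalar label maps to a single position
            positions[name] = index.get_loc(label)
        else:
            # A sequence of labels maps to an integer array
            positions[name] = index.get_indexer(label)
    return positions

result = labels_to_positions(indexes, x=[30, 10], time=pd.Timestamp("2000-01-03"))
sliced = labels_to_positions(indexes, x=slice(10, 20))
```

The point of the sketch is only that each keyword name selects an index, and the index does the label-to-position translation.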

If multiple coordinates are part of the same index, they should point to the same MultiIndex/KDTree object. The MultiIndex would be responsible for resolving the combined indexing operation along the coordinate dimension(s).

By default, indexes is populated with an Index/MultiIndex for each dimension, built from all indexes along that dimension. Additional indexes may be set manually, e.g., using set_index().

Indexes keyed by a dimension name are used for axis-positional indexing with .loc and for alignment with reindex/align. However, if the index is a MultiIndex with a level name matching a coordinate, then only that level will be used for indexing/alignment. In other words: the coordinate name corresponding to the indexing request takes precedence, but if it isn't found, we use all indexes along the dimension.
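A rough sketch of that precedence rule, assuming the level coordinates and the dimension all point at one shared MultiIndex object. The `indexes` layout and the `resolve` helper are hypothetical, chosen only to illustrate the lookup order:

```python
import pandas as pd

mi = pd.MultiIndex.from_arrays(
    [[2000, 2000, 2001, 2001], [1, 2, 1, 2]], names=["year", "month"]
)

# Hypothetical layout: both levels and the dimension share the same object
indexes = {"year": mi, "month": mi, "time": mi}

def resolve(indexes, name):
    """Return the index to use for an indexing request keyed by `name`."""
    index = indexes[name]
    if isinstance(index, pd.MultiIndex) and name in index.names:
        # The request names a level: use only that level
        return index.get_level_values(name)
    # Otherwise fall back to the full index along the dimension
    return index

year_only = resolve(indexes, "year")
whole = resolve(indexes, "time")
```

Here `resolve(indexes, "year")` yields just the year level, while `resolve(indexes, "time")` returns the full MultiIndex.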

Separate indexers without a MultiIndex should be prohibited

It should be impossible to express inconsistent and/or confusing states in xarray's data model. This sort of inconsistency (e.g., levels not being stored directly in Dataset.variables) is the major source of our issues with the current MultiIndex data model.

I'm particularly concerned about clearly showing the difference between coordinates that are part of a MultiIndex and coordinates that are separately indexed. I suspect we could make indexing operations nearly equivalent from a user perspective, but there would likely remain small differences that would be a source of confusion and bugs. Preserving indexes in the form in which they are created is also not really an option, because there are lots of xarray operations that would probably normalize indexes into standard forms, such as groupby, stack/unstack and to/from_pandas.

The simplest option is to prohibit one of these cases entirely, either: 1. Always group repeated indexes along a dimension into a MultiIndex, or 2. Never use pandas.MultiIndex (keep separate indexes for each coordinate).

From xarray's perspective, it would certainly be cleaner to prohibit MultiIndex. The level-order-dependent behavior of MultiIndex is not the best fit for xarray's data model, and could be challenging to keep in sync with coordinate order on xarray objects. We would need to ensure that coordinate/level order remains consistent in all operations, or at least ensure that coordinates are always printed in order of their appearance in MultiIndex levels. (We generally preserve coordinate order already, but well-behaved programs using xarray currently don't need to rely on this behavior.)

That said, always using MultiIndexes for multiple indexes along the same dimension has its own clear advantages. First, it's consistent with pandas, which makes it easier to transition data back and forth. Second, simultaneous indexing operations across multiple levels would be difficult to express efficiently without a MultiIndex. This is probably the right choice for xarray.
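As a small illustration of the trade-off, the sketch below looks up the position where year is 2001 and month is 2, first with two separate indexes (combining boolean masks over the whole dimension) and then with a single MultiIndex lookup. The data is made up for the example.

```python
import numpy as np
import pandas as pd

year = pd.Index([2000, 2000, 2001, 2001], name="year")
month = pd.Index([1, 2, 1, 2], name="month")

# Separate indexes: evaluate each condition over the full dimension and intersect
mask = (year == 2001) & (month == 2)
separate = np.flatnonzero(mask)

# Consolidated MultiIndex: one combined lookup across both levels
mi = pd.MultiIndex.from_arrays([year, month])
combined = mi.get_loc((2001, 2))
```

Both approaches find the same position; the MultiIndex resolves the combined key in a single lookup.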

We could potentially allow for non-consolidated indexes (not part of a MultiIndex) when using the advanced API (e.g., directly setting the indexes parameter). But we'll save this for later.

Functionality

Index variables

Every MultiIndex level must have a corresponding xarray.Variable object in coordinates on each Dataset/DataArray on which it appears. These objects may reference the same pandas.Index/pandas.MultiIndex object used for indexing, but must have immutable data (e.g., flags.writeable = False in NumPy). For now, I expect to reuse the existing IndexVariable class.
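For reference, this is how the NumPy writeable flag behaves on a plain array. It is a sketch of the immutability mechanism only, not of how IndexVariable stores its data:

```python
import numpy as np

values = np.array([2000, 2001, 2002])
values.flags.writeable = False  # mark the buffer read-only

try:
    values[0] = 1999
    mutated = True
except ValueError:
    # NumPy refuses to assign into a read-only array
    mutated = False
```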

Now that levels are xarray.Variable objects, there will no longer be a Variable object in Dataset._variables/DataArray._coords corresponding to a pandas.MultiIndex. However, we will continue to create a "virtual variable" upon indexing consisting of a dtype=object array of MultiIndex values, as a fallback if there is no coordinate matching a dimension name.
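The values such a virtual variable would hold can be previewed with pandas alone: a MultiIndex converts to a one-dimensional object array of tuples. The variable name `virtual` is illustrative only.

```python
import pandas as pd

mi = pd.MultiIndex.from_arrays(
    [[2000, 2000, 2001], [1, 2, 1]], names=["year", "month"]
)

# One tuple per position along the dimension, stored as Python objects
virtual = mi.values
```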

Mapping indexes into pandas

Another concern is how to map all of the new possible indexing states into pandas:

```

case 1 (one indexed variable, same name as dimension):

  • time (time)

case 2 (one indexed variable, different name from dimension):

  • year (time)

case 3 (multiple indexed variables, one has same name as dimension):

  • time (time)
  • year (time)

case 4 (multiple indexed variables, all have different names from dimension):

  • year (time)
  • month (time)
```

For consistency with current behavior, case 1 should correspond to a standard pandas.Index and case 4 should correspond to a pandas.MultiIndex. But what about the intermediate cases 2 and 3, which are currently prohibited by xarray's data model?

I think we should use the rule that all indexed variables are consolidated into a single Index in pandas. If there are multiple indexed variables (case 3 or 4), this would be a MultiIndex; otherwise (cases 1 or 2), this would be a standard Index. This has the virtue of speed and simplicity: we can simply reuse the existing Index or MultiIndex object from indexes.
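A minimal sketch of the consolidation rule, assuming the indexed variables along one dimension arrive as a dict of name to values. The `to_pandas_index` helper is hypothetical, and it builds new pandas objects for clarity rather than reusing ones stored in indexes:

```python
import pandas as pd

def to_pandas_index(indexed_vars):
    """Consolidate the indexed variables along one dimension into one pandas index."""
    if len(indexed_vars) > 1:
        # Several indexed variables: one MultiIndex with a level per variable
        return pd.MultiIndex.from_arrays(
            list(indexed_vars.values()), names=list(indexed_vars)
        )
    # A single indexed variable: a standard Index, whatever its name
    (name, values), = indexed_vars.items()
    return pd.Index(values, name=name)

single_same_name = to_pandas_index({"time": [0, 1]})
single_other_name = to_pandas_index({"year": [2000, 2001]})
multi_with_dim = to_pandas_index({"time": [0, 1], "year": [2000, 2001]})
multi_other = to_pandas_index({"year": [2000, 2000], "month": [1, 2]})
```

Under this rule the dimension name plays no part in the result: only the number of indexed variables decides between Index and MultiIndex.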

The other option would be to prohibit cases 2 and 3 (like we currently do), because we will not be able to map them into pandas and back faithfully. I think this would be a mistake, because indexes on multiple levels would be useful for xarray, even if one level corresponds to the dimension name.

Indexes for unstack

With the introduction of more flexible and optional index levels, it may not always make sense to unstack() every index coordinate. We should support optionally specifying levels to unstack, possibly with an API mirroring stack(), e.g., perhaps .unstack(dim_name=['level_0', 'level_1']) to unstack coordinates level_0 and level_1 from dimension dim_name.
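For comparison, pandas already lets you choose which levels to unstack. The sketch below shows that pandas behavior as an analogy only; it does not show the xarray signature floated above, which is still just a proposal.

```python
import pandas as pd

mi = pd.MultiIndex.from_product(
    [["a", "b"], [0, 1]], names=["level_0", "level_1"]
)
s = pd.Series(range(4), index=mi)

# Unstack only level_1: level_0 stays on the rows, level_1 moves to the columns
partial = s.unstack("level_1")
```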

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  262642978