issue_comments: 379905457

html_url	issue_url	id	node_id	user	created_at	updated_at	author_association	body	reactions	performed_via_github_app	issue
https://github.com/pydata/xarray/issues/1603#issuecomment-379905457	https://api.github.com/repos/pydata/xarray/issues/1603	379905457	MDEyOklzc3VlQ29tbWVudDM3OTkwNTQ1Nw==	1217238	2018-04-09T21:52:02Z	2018-04-11T04:34:43Z	MEMBER	I've been thinking about getting started on this. Here are my current thoughts on the right design approach. Data model `Dataset.indexes` and `DataArray.indexes` My current thinking is that `indexes` should simply be a dictionary mapping from coordinate and/or dimension names to `pandas.Index` objects. Mapping from label-based to integer-based indexing then becomes simply a matter of looking up the appropriate indexes for each coordinate/dimension (i.e., the keyword argument names in `.sel()`), and using the corresponding index(es) to transform label-based indexers into integer indexers. If multiple coordinates are part of the same index, they should point to the same `MultiIndex`/`KDTree` object. The MultiIndex would be responsible for resolving the combined indexing operation along the coordinate dimension(s). By default, `indexes` is populated with an Index/MultiIndex for each dimension, combining all indexes along that dimension. Additional indexes may be set manually, e.g., using `set_index()`. Indexes keyed by a dimension name are used for axis-positional indexing with `.loc` and for alignment with `reindex`/`align`. However, if the index is a MultiIndex with a level name matching a coordinate, then only that level will be used for indexing/alignment. In other words: the coordinate name corresponding to the indexing request takes precedence, but if it isn't found, we use all indexes along the dimension. Separate indexers without a MultiIndex should be prohibited It should be impossible to express inconsistent and/or confusing states in xarray's data model. This sort of inconsistency (e.g., levels not being stored directly in `Dataset.variables`) is the major source of our issues with the current MultiIndex data model. 
I'm particularly concerned about clearly showing the difference between coordinates that are part of a `MultiIndex` and coordinates that are separately indexed. I suspect we could make indexing operations nearly equivalent from a user perspective, but there would likely remain small differences that would be a source of confusion and bugs. Preserving indexes in the form in which they are created is also not really an option, because there are lots of xarray operations that would probably normalize indexes into standard forms, such as groupby, stack/unstack and to/from_pandas. The simplest option is to prohibit one of these cases entirely, either: 1. Always group repeated indexes along a dimension into a MultiIndex, or 2. Never use `pandas.MultiIndex` (keep separate indexes for each coordinate). From xarray's perspective, it would certainly be cleaner to prohibit MultiIndex. The level-order-dependent behavior of MultiIndex is not the best fit for xarray's data model, and could be challenging to keep in sync with coordinate order on xarray objects. We would need to ensure that coordinate/level order remains consistent in all operations, or at least ensure that coordinates are always printed in order of their appearance in MultiIndex levels. (We generally preserve coordinate order already, but well-behaved programs using xarray currently don't need to rely on this behavior.) That said, always using MultiIndexes for multiple indexes along the same dimension has its own clear advantages. First, it's consistent with pandas, which makes it easier to transition data back and forth. Second, simultaneous indexing operations across MultiIndex levels would be difficult to express efficiently without a MultiIndex. This is probably the right choice for xarray. We could potentially allow for non-consolidated indexes (not part of a MultiIndex) when using the advanced API (e.g., directly setting the `indexes` parameter). But we'll save this for later. 
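
To make option 1 a bit more concrete, here is a minimal sketch of what the `indexes` mapping and the label-to-integer step might look like when two coordinates along one dimension share a single `MultiIndex`. The helper name `labels_to_positions` and the plain-dict representation are assumptions made purely for illustration, not a proposed xarray API; only the pandas calls (`MultiIndex.from_arrays`, `MultiIndex.get_locs`, `Index.get_indexer`) are existing functionality.

```python
import pandas as pd

# Hypothetical stand-in for the proposed `indexes` mapping: two coordinates
# along the same dimension point to the *same* MultiIndex object, while a
# coordinate on another dimension has its own plain Index.
midx = pd.MultiIndex.from_arrays(
    [[2000, 2000, 2001, 2001], [1, 2, 1, 2]], names=["year", "month"]
)
indexes = {"year": midx, "month": midx, "x": pd.Index([10, 20, 30])}


def labels_to_positions(indexes, **labels):
    """Translate label-based indexers into integer positions (illustrative only).

    Assumes every requested name refers to the same index object.
    """
    names = list(labels)
    index = indexes[names[0]]
    if any(indexes[name] is not index for name in names):
        raise ValueError("requested coordinates belong to different indexes")
    if isinstance(index, pd.MultiIndex):
        # One entry per level; slice(None) keeps everything on levels
        # that were not part of the request.
        key = tuple(labels.get(level, slice(None)) for level in index.names)
        return index.get_locs(key)
    # Plain Index: look up the single requested label.
    return index.get_indexer([labels[names[0]]])


both_levels = labels_to_positions(indexes, year=2001, month=2)
one_level = labels_to_positions(indexes, month=1)
plain = labels_to_positions(indexes, x=20)
```

In this sketch, a request that names both `year` and `month` is resolved in one call on the shared MultiIndex, which is the "simultaneous indexing across levels" case discussed above; a request naming only one level falls back to a null slice on the other.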
Functionality Index variables Every MultiIndex level must have a corresponding xarray.Variable object in coordinates on each Dataset/DataArray on which they appear. These objects may reference the same `pandas.Index`/`pandas.MultiIndex` object used for indexing, but must have immutable data (e.g., `flags.writeable = False` in NumPy). For now, I expect to reuse the existing `IndexVariable` class. Now that levels are xarray.Variable objects, there will no longer be a `Variable` object in `Dataset._variables`/`DataArray._coords` corresponding to a `pandas.MultiIndex`. However, we will continue to create a "virtual variable" upon indexing consisting of a dtype=object array of MultiIndex values, as a fallback if there is no coordinate matching a dimension name. Mapping indexes into pandas Another concern is how to map all of the new possible indexing states into pandas: ``` case 1 (one indexed variable, same name as dimension): time (time) case 2 (one indexed variable, different name from dimension): year (time) case 3 (multiple indexed variables, one has same name as dimension): time (time) year (time) case 4 (multiple indexed variables, all have different names from dimension): year (time) month (time) ``` For consistency with current behavior, case 1 should correspond to a standard `pandas.Index` and case 4 should correspond to a `pandas.MultiIndex`. But what about the intermediate cases 2 and 3, which are currently prohibited by xarray's data model? I think we should use the rule that all indexed variables are consolidated into a single Index in pandas. If there are multiple indexed variables (cases 3 or 4), this would be a MultiIndex; otherwise (cases 1 or 2), this would be a standard Index. This has the virtue of speed and simplicity: we can simply reuse the existing Index or MultiIndex object from `indexes`. The other option would be to prohibit cases 2 and 3 (like we currently do), because we will not be able to map them into pandas and back faithfully. 
I think this would be a mistake, because indexes on multiple levels would be useful for xarray, even if one level corresponds to the dimension name. Indexes for unstack With the introduction of more flexible and optional index levels, it may not always make sense to `unstack()` every index coordinate. We should support optionally specifying levels to unstack, possibly with an API mirroring `stack()`, e.g., perhaps `.unstack(dim_name=['level_0', 'level_1'])` to unstack coordinates `level_0` and `level_1` from dimension `dim_name`.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }		262642978