issue_comments: 442661526

html_url	issue_url	id	node_id	user	created_at	updated_at	author_association	body	reactions	performed_via_github_app	issue
https://github.com/pydata/xarray/issues/1603#issuecomment-442661526	https://api.github.com/repos/pydata/xarray/issues/1603	442661526	MDEyOklzc3VlQ29tbWVudDQ0MjY2MTUyNg==	1217238	2018-11-29T00:42:39Z	2018-11-29T00:42:39Z	MEMBER	@max-sixty I like your schema vs. implementation breakdown. In general, I agree with you that it would be nice to have MultiIndex as an implementation detail rather than part of xarray's schema. But I'm not entirely sure that's feasible.
Let's try to list out the pros/cons. Consider a MultiIndex 'multi' with levels 'x' and 'y':
- Advantages of MultiIndex as part of the data schema:
  - There is an explicit coordinate (of tuples) corresponding to MultiIndex values, which can be returned from `ds.coords['multi']`. This is inherently not that useful compared to the separable variables, but it is a cleaner solution than creating `ds.coords['multi']` as a "virtual" variable on the fly (which we would need for backwards compatibility).
  - We don't need to do full "normalization" when multiple indexes along the same dimension are encountered, e.g., in an operation that combines two different indexes, we would simply put both on the result instead of building a MultiIndex (which would require allocating a whole new array of integer codes).
  - The nature of the MultiIndex is more transparent as part of the data model. For example, if `x` and `y` are numeric, it could make sense to use either a MultiIndex or KDTree for indexing. Explicit APIs (e.g., `set_multiindex` and `set_kdtree`) would allow users a high level of control.
  - For advanced use-cases, it is potentially easier to work around the limitations of a MultiIndex, e.g., the way that some operations require lex-sorted-ness.
- Advantages of MultiIndex as an implementation detail:
  - Simpler data model (for users). There are few good use cases for multiple indexes that aren't a MultiIndex.
  - Easier to do automatic alignment: we know that indexes will always have the same normalized form (in a MultiIndex). Otherwise, we would have to do this on the fly, or request that users explicitly set up compatible indexes.
  - More flexibility for xarray: we can potentially swap out indexing without changing the user-facing API. We might have something like a "hybrid" MultiIndex/KDTree that chooses the appropriate index based on the requested operation.
  - We don't need to create an explicit array of tuples for the MultiIndex variable (but we could still have a variable corresponding to a MultiIndex and only construct the `.data` array in a "lazy" fashion).
  - There's no need to name extraneous variables that only exist for the sake of a MultiIndex.
  - There's no need to support indexing like `ds.sel(multi=list_of_pairs)`. Indexing like `ds.sel(x=..., y=...)` solves the same use case and looks nicer. That said, dropping it would be a minor backwards compatibility break (this currently works in xarray).
P.S. I haven't made much progress on this yet, so there's definitely still time to figure out the right decision -- thanks for your engagement on this!	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }		262642978
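For readers who want to see the two selection styles contrasted in the comment side by side, here is a minimal sketch. It assumes a reasonably recent xarray and pandas; the dataset, the variable name `value`, and the coordinate values are made up for illustration, and only `set_index`, `coords`, and `sel` are real xarray calls. The proposed `set_multiindex` and `set_kdtree` APIs mentioned in the comment are not used, since the comment only floats them as ideas.

```python
import numpy as np
import xarray as xr

# Illustrative data: two ordinary coordinates, 'x' and 'y', sharing one dimension.
ds = xr.Dataset(
    {"value": ("multi", np.arange(4))},
    coords={
        "x": ("multi", ["a", "a", "b", "b"]),
        "y": ("multi", [0, 1, 0, 1]),
    },
)

# Combine them into a MultiIndex named 'multi' with levels 'x' and 'y'.
ds = ds.set_index(multi=["x", "y"])

# The MultiIndex coordinate is exposed as an array of (x, y) tuples.
tuples = ds.coords["multi"].values

# Level-based selection, the style the comment says "looks nicer".
by_levels = ds.sel(x="b", y=1)

# Tuple-based selection on the combined coordinate.
by_pairs = ds.sel(multi=[("a", 0), ("b", 1)])
```

Both calls address the same data: the level-based form picks the single element at `('b', 1)`, while the tuple-based form picks an explicit list of index entries along `multi`.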