html_url,issue_url,id,node_id,user,created_at,updated_at,author_association,body,reactions,performed_via_github_app,issue https://github.com/pydata/xarray/issues/1603#issuecomment-1259326037,https://api.github.com/repos/pydata/xarray/issues/1603,1259326037,IC_kwDOAMm_X85LD8pV,4160723,2022-09-27T10:50:36Z,2022-09-27T10:50:36Z,MEMBER,"Should we close this issue and continue the discussion in #6293? For anyone who wants to track the progress on this topic: https://github.com/pydata/xarray/projects/1 ","{""total_count"": 2, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 2, ""eyes"": 0}",,262642978 https://github.com/pydata/xarray/issues/1603#issuecomment-949494376,https://api.github.com/repos/pydata/xarray/issues/1603,949494376,IC_kwDOAMm_X844mCJo,4160723,2021-10-22T10:27:26Z,2021-10-22T10:27:26Z,MEMBER,"> well, both ""contain the origin dims"" or just ""generate another one"" have its benefit. Agreed, and both are supported by xarray actually. In case we want to keep the original dimensions like (""x"", ""y"") in the example above, it's better to use [masking](http://xarray.pydata.org/en/stable/user-guide/indexing.html#masking-with-where). This discussion is broader than the topic covered in this issue so I'd suggest you [start a new discussion](https://github.com/pydata/xarray/discussions/new) if you want to further discuss this with the xarray community. Thanks.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,262642978 https://github.com/pydata/xarray/issues/1603#issuecomment-949449312,https://api.github.com/repos/pydata/xarray/issues/1603,949449312,IC_kwDOAMm_X844l3Jg,4160723,2021-10-22T09:28:01Z,2021-10-22T09:28:01Z,MEMBER,"For such case you could already do `ds.stack(z=(""t"", ""x"")).set_index(z=""C2"").sel(z=[""a"", ""e"", ""h""])`. 
After the explicit index refactor, we could imagine a custom index that supports multi-dimensional coordinates such that you would only need to do something like ```python >>> S_res = S4.sel(C2=(""z"", [""a"", ""e"", ""h""])) >>> S_res Dimensions: (z: 3) Coordinates: * C2 (z) >> S_res = S4.sel(C2=[""a"", ""e"", ""h""]) >>> S_res Dimensions: (C2: 3) Coordinates: * C2 (C2) I guess the error is probably the best idea. Agreed. It seems very strict indeed, but it will be easier to relax this later than the other way around. There is also a (very rare?) case where the two indexed coordinates have the same labels but are named differently in the two datasets (e.g., ``station_name`` and ``sname``). In that case an error is probably better too. It would be a sort of indication that the most useful thing to do for future operations is to rename one of those coordinates first.","{""total_count"": 1, ""+1"": 1, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,262642978 https://github.com/pydata/xarray/issues/1603#issuecomment-444132393,https://api.github.com/repos/pydata/xarray/issues/1603,444132393,MDEyOklzc3VlQ29tbWVudDQ0NDEzMjM5Mw==,4160723,2018-12-04T15:06:21Z,2018-12-04T15:19:08Z,MEMBER,"> It occurs to me that for the case of ""multiple single indexes"" along the same dimension there is no good way to use them simultaneously for indexing/reindexing at the same time. Sorry for maybe asking this again but I'm a bit confused now: is there any good reason for supporting ""multiple single indexes"" along the same dimension? After all, perhaps better defaults would be to set indexes (``pandas.Index``) only for 1-d coordinates matching dimension names, as is the case now. If you want a different behavior, then you need to use ``.set_index()``, which would raise if it results in multiple single indexes along a dimension. 
We could also add a new ``indexes`` argument to the ``Dataset`` / ``DataArray`` constructors to save some typing (and avoid the creation of in-memory ``pandas.Index`` for very long coordinates if an out-of-core alternative is later supported). > da[dim_name] should return all the indexes on that dimension I think that one big source of confusion so far has been the mixing of coordinates/variables and indexes. These are really two separate concepts, and the indexes refactoring should address that IMHO. For example, I think that ``da[some_name]`` should never return indexes but only coordinates (and/or data variables for Dataset). That would be much simpler. Take for example ```python >>> da = xr.DataArray(np.random.rand(2, 2), ... dims=('one', 'two'), ... coords={'one_labels': ('one', ['a', 'b'])}) >>> da array([[ 0.536028, 0.291895], [ 0.682108, 0.926003]]) Coordinates: one_labels (one) >> da['one'] array([0, 1]) Coordinates: one_labels (one) ds.sel(multi=list_of_pairs) can probably be replaced by ds.sel(x=..., y=...), but how about reindex along MultiIndex? Indeed I haven't really thought about ``reindex`` and alignment in my suggestion above. How do you currently ``reindex`` along a multi-index dimension? Contrary to ``.sel``, ``ds.reindex(multi=list_of_pairs)`` doesn't seem to work (the list of n-length tuples being interpreted as a ~~n-dim~~ 2-d array). The only way I've found to make it work is to pass another ``pandas.MultiIndex``. Wouldn't it be rather confusing if we choose to go with our own implementation of MultiIndex for xarray instead of ``pandas.MultiIndex``? Wouldn't it be possible to easily support ``ds.reindex(x=..., y=...)`` within the new data model proposed here? > Am I right in thinking the Multi-indexes is only a helpful note to users, rather than conveying anything about how data is accessed? This is a good question. 
A related question: apart from the ``ds.sel(multi=list_of_pairs)`` and ``ds.reindex(multi=list_of_pairs)`` use cases discussed so far, are there other reasons for having a variable for a multi-index? > I think we can do much of this before adding the ability to set custom indexes, which would be cool but further from where we are, I think. I agree, although whether or not we will eventually support custom indexes might influence the design choices that we have to make now, IMO. ","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,262642978 https://github.com/pydata/xarray/issues/1603#issuecomment-442797084,https://api.github.com/repos/pydata/xarray/issues/1603,442797084,MDEyOklzc3VlQ29tbWVudDQ0Mjc5NzA4NA==,4160723,2018-11-29T11:15:17Z,2018-11-29T11:15:17Z,MEMBER,"> we will definitely have to make some intentional deviations from the behavior of pandas Looking at the reported issues related to multi-indexes in xarray, I have the same feeling. Simply reusing ``pandas.MultiIndex`` in xarray, where slightly different semantics are generally expected, has proven to be painful. It seems easier to have our own baked solution and deal with differences during xarray<-> pandas conversion if needed. If we re-design indexes so that we allow 3rd-party indexes, maybe we could support both and let users choose the one (xarray or pandas baked) that best suits their needs? Regarding MultiIndex as part of the data schema vs an implementation detail, if we support extending indexes (and already given the different kinds of multi-coordinate indexes: MultiIndex, KDTree, etc.), then I think that it should be transparent to the user. However, I don't really see why a multi-coordinate index should have its own variable (with tuples of values). I don't want to speak for others, but IMHO ``ds.sel(multi=list_of_pairs)`` is rather an edge case and I'm not sure if we really need to support it. 
Using ``ds.sel(x=..., y=...)`` with DataArray objects is certainly more code to write, but this form of indexing is very powerful and it *might not* be a bad idea to encourage it. If a variable for each multi-coordinate index is ""just"" for data schema consistency, then why not show all those indexes in a separate section of the repr? For example: ``` Coordinates: * level_1 (x) object 'a' 'a' 'b' 'b' * level_2 (x) int64 1 2 1 2 Multi-indexes: pandas.MultiIndex [level_1, level_2] ``` It is equally transparent, not more verbose, and it is clear that multi-indexes are not part of the coordinates (in fact there is no need for ""virtual"" coordinates either, nor to name the index). I don't think single indexes should be shown here as it would result in duplicated, uninformative lines. More generally, here is how I would see indexes handled in xarray (I might be missing important aspects, though): - Default behavior: all 1-dimensional coordinates each have their own, single index (``pandas.Index``), unless explicitly stated. - Explicit API is used for setting new, possibly multi-coordinate indexes. Note the absence of a keyword argument below to specify the variables. This is actually more consistent with the pandas API, but it would be a breaking change and I don't know what a smooth transition could look like. - ``set_index(['x', 'y'], kind='multiindex') # xarray built-in index`` - ``set_index(['x', 'y'], kind='kdtree') # xarray built-in index`` - ``set_index('x', kind=ASingleIndexWrapperClass) # 3rd-party index`` - If a coordinate is removed from the Dataset or if its index is reset or changed: - If the coordinate had a single index, no problem - If the coordinate was part of a multi-coordinate index: a new index is built from all remaining coordinates that were also part of the original index, if it is supported. Otherwise, the original index is removed and the default behavior (single ``pandas.Index``) is reset for all those remaining coordinates. 
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,262642978 https://github.com/pydata/xarray/issues/1603#issuecomment-334091075,https://api.github.com/repos/pydata/xarray/issues/1603,334091075,MDEyOklzc3VlQ29tbWVudDMzNDA5MTA3NQ==,4160723,2017-10-04T08:52:08Z,2017-10-04T08:52:08Z,MEMBER,"I think that promoting ""Indexes"" to a first-class concept is indeed a very good idea, at both internal and public levels, even if at the latter level it would be another concept for users (it should be already familiar for pandas users, though). IMHO the ""coordinate"" and ""index"" concepts are different enough to consider them separately. I like the proposed repr for `Dataset.indexes`. I wouldn't mind if it is not included in `Dataset.__repr__`, considering that multi-indexes, kdtree, etc. only represent a few use cases. In too many cases it could result in a long, uninformative list of simple `pandas.Index`. I have to think a bit more about the details but I like the idea. ","{""total_count"": 1, ""+1"": 1, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,262642978