html_url,issue_url,id,node_id,user,created_at,updated_at,author_association,body,reactions,performed_via_github_app,issue https://github.com/pydata/xarray/pull/7214#issuecomment-1297046405,https://api.github.com/repos/pydata/xarray/issues/7214,1297046405,IC_kwDOAMm_X85NT1uF,4160723,2022-10-31T12:54:50Z,2022-10-31T12:54:50Z,MEMBER,"Thanks for the suggestion @shoyer, in general I like it very much! ""Coordinates possibly baked by one or more indexes"" feels much more natural than ""indexes and their corresponding coordinates"". Even though indexes have been promoted as 1st class citizens in the data model, their right place should still be in the background compared to coordinates. So having a `Coordinates` object that encapsulates the indexes makes a lot of sense to me. My main concern is about the timing, as such a broader refactor might postpone some work in progress on the public API and the documentation. Ideally this shouldn't discourage users to start experimenting with custom indexes and building an ecosystem around it, as soon as possible. There might be a fast path towards your suggestion, at least regarding the public facing API (your points 1 and 4): - Keep ""private"" the constructor of `Indexes` and keep it immutable. - Add a new `IndexedCoordinates(Coordinates)` class. Unlike `DatasetCoordinates` and `DataArrayCoordinates`, it would have a public constructor and/or alternative class methods (e.g., `.from_pandas_multi_index()` suggested by @dcherian) - In general, passing any `Coordinates` object to `coords` would assign both the coordinates and the indexes. This would let us the possibility to achieve a broader (mostly internal) refactor of `Indexes` and `Coordinates` objects later without the risk of introducing too much breaking changes. Alternatively, we could just wait for that refactor to finish before implementing explicit assignment of coordinates and indexes. We already have `.set_xindex()` and `.drop_indexes()` that are relevant and we could wait before deprecating `xr.Dataset(coords={""x"": pandas_midx})`. Not sure when such big refactor will happen, though, the wait could be long.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1422543378 https://github.com/pydata/xarray/pull/7214#issuecomment-1295283938,https://api.github.com/repos/pydata/xarray/issues/7214,1295283938,IC_kwDOAMm_X85NNHbi,1217238,2022-10-28T17:49:10Z,2022-10-28T17:49:10Z,MEMBER,"> > Explicitly providing indexes is an advanced user feature. > > Agreed. However, `xr.Dataset(coords={""x"": pandas_midx})` is something that presumably a lot of users rely on (it is used extensively in Xarray's tests) and that we should really deprecate IMO. If we don't provide a convenient alternative, I expect many of those users will complain. I agree -- we should support this for backwards compatibility (even if we deprecate it). > > it's easier to explicitly manipulate indexes in the form of a dict > > While generally I also prefer handling plain `dict` objects over custom dict-like objects, here I don't see much reasons of manipulating Xarray index objects independently of their coordinate variables. `Indexes` allows keeping them tied together, and it is already returned by `.xindexes`. > > EDIT -- For more context: initially an `Indexes` object was almost equivalent to a `Frozen(obj._indexes)`. In #5692 I tried hard and struggled to keep dealing with separate dicts of indexes and indexed variables, but in the end it made things much easier to encapsulate the variables in `Indexes`, which is also used internally in different places. OK, this totally makes sense. I don't love that it is possible to express invalid states in Xarray's data model. This motivated the creation of `assert_internal_invariants` and currently mostly is a concern for Xarray's own developers, but when we exposes the `indexes` argument, it will be easier for users to make the same sort of errors. I wonder if we should consider the broader refactor of merging the `Indexes` and `Coordinates` objects, and expose the constructor as a public API. For clarity, I'll call it `CoordinatesAndIndexes` for now, but it could likely reuse the public name of `Coordinates`. This would have a number of benefits: 1. It's impossible to provide inconsistent `coords` and `indexes`, because there is no separate `indexes` argument. 2. Likewise, it is impossible to create inconsistent coordinates and indexes on an existing Xarray object. 3. All the logic for verifying consistent coords and indexes can go in one place, shared between Dataset/DataArray. (Yes, it would be annoying to refactor `Dataset` to merge in variables from `CoordinatesAndIndexes` rather than the current separate `Dataset._variables`) 4. The public API also becomes clearer: if users want default indexes, they can pass a dict of variables into `coords`. If they want to copy indexes from another object, they can pass in a `CoordinatesAndIndexes` object (either from another Xarray object or constructed directly).","{""total_count"": 2, ""+1"": 2, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1422543378 https://github.com/pydata/xarray/pull/7214#issuecomment-1295217001,https://api.github.com/repos/pydata/xarray/issues/7214,1295217001,IC_kwDOAMm_X85NM3Fp,2448579,2022-10-28T16:41:19Z,2022-10-28T16:41:36Z,MEMBER,">t would be adding another pandas.MultiIndex special case while we'd like to remove them in Xarray. True, but its neatly encapsulated in one place. And if we are going to support many Index classes that build on numpy, scipy, then support for pandas indexes seems entirely in-scope. > I'd just want to add that, from my experience with debugging multi-index issues, it is hard even for advanced users to see what's going wrong when coordinates and indexes are not consistent. Should we expose `xr.tests.assert_internal_invariants` for debugging?","{""total_count"": 1, ""+1"": 1, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1422543378 https://github.com/pydata/xarray/pull/7214#issuecomment-1294783661,https://api.github.com/repos/pydata/xarray/issues/7214,1294783661,IC_kwDOAMm_X85NLNSt,4160723,2022-10-28T09:49:02Z,2022-10-28T09:49:02Z,MEMBER,"> not necessarily do consistency checks (beyond verifying that the coordinate variables exist). I'd just want to add that, from my experience with debugging multi-index issues, it is hard even for advanced users to see what's going wrong when coordinates and indexes are not consistent.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1422543378 https://github.com/pydata/xarray/pull/7214#issuecomment-1294771427,https://api.github.com/repos/pydata/xarray/issues/7214,1294771427,IC_kwDOAMm_X85NLKTj,4160723,2022-10-28T09:38:22Z,2022-10-28T09:38:22Z,MEMBER,"> Maybe a more generic Indexes class method that could be reused by 3rd-party indexes too? E.g., via some kind of hook or entrypoint... An `Indexes` accessor? Or this is going too far? 🙂 ","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1422543378 https://github.com/pydata/xarray/pull/7214#issuecomment-1293946521,https://api.github.com/repos/pydata/xarray/issues/7214,1293946521,IC_kwDOAMm_X85NIA6Z,4160723,2022-10-27T19:04:19Z,2022-10-27T19:52:21Z,MEMBER,"> Explicitly providing indexes is an advanced user feature. Agreed. However, `xr.Dataset(coords={""x"": pandas_midx})` is something that presumably a lot of users rely on (it is used extensively in Xarray's tests) and that we should really deprecate IMO. If we don't provide a convenient alternative, I expect many of those users will complain. > it's easier to explicitly manipulate indexes in the form of a dict While generally I also prefer handling plain `dict` objects over custom dict-like objects, here I don't see much reasons of manipulating Xarray index objects independently of their coordinate variables. `Indexes` allows keeping them tied together, and it is already returned by `.xindexes`. EDIT -- For more context: initially an `Indexes` object was almost equivalent to a `Frozen(obj._indexes)`. In #5692 I tried hard and struggled to keep dealing with separate dicts of indexes and indexed variables, but in the end it made things much easier to encapsulate the variables in `Indexes`, which is also used internally in different places. ","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1422543378 https://github.com/pydata/xarray/pull/7214#issuecomment-1293909730,https://api.github.com/repos/pydata/xarray/issues/7214,1293909730,IC_kwDOAMm_X85NH37i,1217238,2022-10-27T18:28:40Z,2022-10-27T18:28:40Z,MEMBER,"> I'm thinking of only accepting one or more instances of [Indexes](https://github.com/pydata/xarray/blob/e678a1d7884a3c24dba22d41b2eef5d7fe5258e7/xarray/core/indexes.py#L1030) as indexes argument in the Dataset and DataArray constructors I would lean against this, only because it's easier to explicitly manipulate indexes in the form of a `dict` than an `xarray.Indexes` object. Explicitly providing indexes is an advanced user feature. I think it's OK to require users to do a bit more work in this case and to not necessarily do consistency checks (beyond verifying that the coordinate variables exist).","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1422543378 https://github.com/pydata/xarray/pull/7214#issuecomment-1293902008,https://api.github.com/repos/pydata/xarray/issues/7214,1293902008,IC_kwDOAMm_X85NH2C4,4160723,2022-10-27T18:21:02Z,2022-10-27T18:21:02Z,MEMBER,"> How about Indexes.from_pandas_multi_index() classmethod? Yes that would make sense. However, it would be adding another `pandas.MultiIndex` special case while we'd like to remove them in Xarray. Maybe a more generic `Indexes` class method that could be reused by 3rd-party indexes too? E.g., via some kind of hook or entrypoint... The tricky thing is that arguments would probably differ much from one index type to another. > 1. does indexes get merged with existing ._indexes? Indexes are not merged together but the new / replaced coordinate variables must be compatible with the other variables of the dataset. `Dataset.assign_indexes(indexes)` is actually implemented like this: ```python def assign_indexes(self, indexes: Indexes[Index]): ds_indexes = Dataset(indexes=indexes) return ( self # prepare drop-in index / coordinate replacement .drop_vars(indexes, errors=""ignore"") # ensure the new indexes / coordinates are compatible with the Dataset .merge( ds_indexes, compat=""minimal"", # probably not the right option? join=""override"", # fastest option? (no real effect because of `drop_vars`) combine_attrs=""no_conflicts"", ) ) ``` > 2. Can we extract enough information from Index to have xr.merge(Indexes) -> Indexes work? That is actually a good idea for https://github.com/pydata/xarray/pull/7214#issuecomment-1292089179! Not sure I would reuse `xr.merge()` for this as it would make the API messy, but why not an `xr.merge_indexes()` top-level function or an `Indexes.merge()` method?","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1422543378 https://github.com/pydata/xarray/pull/7214#issuecomment-1293774799,https://api.github.com/repos/pydata/xarray/issues/7214,1293774799,IC_kwDOAMm_X85NHW_P,2448579,2022-10-27T16:21:53Z,2022-10-27T16:21:53Z,MEMBER,"How about `Indexes.from_pandas_multi_index()` classmethod? I like the last spelling: `ds.assign_indexes(indexes)` though that raises a few questioins: 1. does `indexes` get merged with existing `._indexes`? 2. If not, how does a user merge manually? Can we extract enough information from `Index` to have `xr.merge(Indexes) -> Indexes` work?","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1422543378 https://github.com/pydata/xarray/pull/7214#issuecomment-1293531607,https://api.github.com/repos/pydata/xarray/issues/7214,1293531607,IC_kwDOAMm_X85NGbnX,4160723,2022-10-27T13:31:24Z,2022-10-27T13:42:44Z,MEMBER,"I also added an `.assign_indexes()` method that may be quite convenient. Like for the constructors, it only accepts an `Indexes` instance. ```python ds = xr.Dataset(coords={""x"": [4, 5, 6, 7]}) ds2 = xr.Dataset(coords={""x"": [1, 2, 3, 4]}) ds.assign_indexes(ds2.xindexes) # # Dimensions: (x: 4) # Coordinates: # * x (x) int64 1 2 3 4 # Data variables: # *empty* midx = pd.MultiIndex.from_product([[""a"", ""b""], [1, 2]], names=(""one"", ""two"")) indexes = wrap_pandas_multiindex(midx, ""x"") ds.assign_indexes(indexes) # # Dimensions: (x: 4) # Coordinates: # * x (x) object MultiIndex # * one (x) object 'a' 'a' 'b' 'b' # * two (x) int64 1 2 1 2 # Data variables: # *empty* ```","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1422543378 https://github.com/pydata/xarray/pull/7214#issuecomment-1293545325,https://api.github.com/repos/pydata/xarray/issues/7214,1293545325,IC_kwDOAMm_X85NGe9t,4160723,2022-10-27T13:41:50Z,2022-10-27T13:41:50Z,MEMBER,"@pydata/xarray I'd be very happy if you could share your thoughts about the examples shown in the last three comments. If you think the API looks good like that, then I will work on adding some tests and on the documentation.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1422543378 https://github.com/pydata/xarray/pull/7214#issuecomment-1292089179,https://api.github.com/repos/pydata/xarray/issues/7214,1292089179,IC_kwDOAMm_X85NA7db,4160723,2022-10-26T13:54:22Z,2022-10-26T13:54:22Z,MEMBER,"Passing multiple indexes: ```python midx1 = pd.MultiIndex.from_product([[""a"", ""b""], [1, 2]], names=(""one"", ""two"")) midx2 = pd.MultiIndex.from_product([[""c"", ""d""], [3, 4]], names=(""three"", ""four"")) indexes1 = wrap_pandas_multiindex(midx1, ""x"") indexes2 = wrap_pandas_multiindex(midx2, ""y"") indexes = Indexes( indexes=dict(**indexes1, **indexes2), variables=dict(**indexes1.variables, **indexes2.variables) ) ds = xr.Dataset(indexes=indexes) # # Dimensions: (x: 4, y: 4) # Coordinates: # * x (x) object MultiIndex # * one (x) object 'a' 'a' 'b' 'b' # * two (x) int64 1 2 1 2 # * y (y) object MultiIndex # * three (y) object 'c' 'c' 'd' 'd' # * four (y) int64 3 4 3 4 # Data variables: # *empty* ``` That's not looking super nice, but probably we can add some convenience function or `Indexes` method.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1422543378 https://github.com/pydata/xarray/pull/7214#issuecomment-1291911349,https://api.github.com/repos/pydata/xarray/issues/7214,1291911349,IC_kwDOAMm_X85NAQC1,4160723,2022-10-26T11:47:57Z,2022-10-26T12:14:23Z,MEMBER,"I implemented option 3. We can still change or revert it later if it's not the best one. A few examples: ```python import pandas as pd import xarray as xr from xarray.indexes import wrap_pandas_multiindex midx = pd.MultiIndex.from_product([[""a"", ""b""], [1, 2]], names=(""one"", ""two"")) ``` It is now possible to pass a pandas multi-index to a Dataset like this: ```python # this returns an `Indexes` object (indexes + coordinates) indexes = wrap_pandas_multiindex(midx, ""x"") ds = xr.Dataset(indexes=indexes) # # Dimensions: (x: 4) # Coordinates: # * x (x) object MultiIndex # * one (x) object 'a' 'a' 'b' 'b' # * two (x) int64 1 2 1 2 # Data variables: # *empty* ``` IMO the above should be preferred over passing it as a coordinate (should we deprecate it now?): ```python ds_deprecated = xr.Dataset(coords={""x"": midx}) ds_deprecated.identical(ds) # True # eventually this would behave like this: ds_midx_as_array = xr.Dataset(coords={""x"": midx}) # # Dimensions: (x: 4) # Coordinates: # * x (x) object ('a', 1) ('a', 2) ('b', 1) ('b', 2) # Data variables: # *empty* ``` We can pass indexes around from one Xarray object to another, e.g., ```python da = xr.DataArray([1, 2, 3, 4], dims=""x"", indexes=ds.xindexes) # # array([1, 2, 3, 4]) # Coordinates: # * x (x) object MultiIndex # * one (x) object 'a' 'a' 'b' 'b' # * two (x) int64 1 2 1 2 ``` Skip creating pandas indexes for dimension coordinates: ```python ds_noindex = xr.Dataset(coords={""x"": [0, 1, 2]}, indexes={}) # # Dimensions: (x: 3) # Coordinates: # x (x) int64 0 1 2 # Data variables: # *empty* ds_noindex.xindexes # Indexes: # *empty* ``` ","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1422543378 https://github.com/pydata/xarray/pull/7214#issuecomment-1291638319,https://api.github.com/repos/pydata/xarray/issues/7214,1291638319,IC_kwDOAMm_X85M_NYv,4160723,2022-10-26T07:52:35Z,2022-10-26T07:52:35Z,MEMBER,"> For passing multiple indexes at once we could probably expand the Indexes API, e.g., with an .update() method. Maybe with something else than `.update()` (let's keep `Indexes` an immutable collection?)","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1422543378 https://github.com/pydata/xarray/pull/7214#issuecomment-1291059643,https://api.github.com/repos/pydata/xarray/issues/7214,1291059643,IC_kwDOAMm_X85M9AG7,4160723,2022-10-25T19:50:57Z,2022-10-25T19:50:57Z,MEMBER,"Hmm I'm wondering what would be best between the options below regarding the types for the `indexes` argument: 1. `Indexes[Index]` | `Sequence[Indexes[Index] | None` 2. `Indexes[Index] | None` 3. `Mapping[Any, Index] | None` 4. Any other suggestion? Option 1 is nice for passing multiple indexes, e.g., ```python pd_midx1 = pd.MultiIndex.from_arrays(..., names=(""one"", ""two"")) pd_midx2 = pd.MultiIndex.from_arrays(..., , names=(""three"", ""four"")) indexes1 = PandasMultiIndex.from_pandas_index(pd_midx1, ""x"") indexes2 = PandasMultiIndex.from_pandas_index(pd_midx2, ""y"") ds = xr.Dataset(indexes=[indexes1, indexes2]) ``` With option 1 it feels odd passing an empty list in order to avoid creating default indexes: `ds = xr.Dataset(indexes=[])`. Not really better in this regard with option 2: `ds = xr.Dataset(indexes=Indexes())`. Option 3 is better IMO: `ds = xr.Dataset(indexes={})`. Option 3 actually works in all cases since `Indexes[Index]` is a sub-type of `Mapping[Any, Index]`. However, it is not clear from this generic type that any non-empty mapping must be an instance of `Indexes` (because the latter also contains the coordinate variables). I'm leaning towards option 3. For passing multiple indexes at once we could probably expand the `Indexes` API, e.g., with an `.update()` method. What do people think? ","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1422543378 https://github.com/pydata/xarray/pull/7214#issuecomment-1290896977,https://api.github.com/repos/pydata/xarray/issues/7214,1290896977,IC_kwDOAMm_X85M8YZR,35968931,2022-10-25T17:25:48Z,2022-10-25T17:25:48Z,MEMBER,Big moves!!,"{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1422543378