id,node_id,number,title,user,state,locked,assignee,milestone,comments,created_at,updated_at,closed_at,author_association,active_lock_reason,draft,pull_request,body,reactions,performed_via_github_app,state_reason,repo,type
2227413822,PR_kwDOAMm_X85rz7ZX,8911,Refactor swap dims,4160723,open,0,,,5,2024-04-05T08:45:49Z,2024-04-17T16:46:34Z,,MEMBER,,1,pydata/xarray/pulls/8911,"- [ ] Attempt at fixing #8646
- [ ] Tests added
- [ ] User visible changes (including notable bug fixes) are documented in `whats-new.rst`
- [ ] New functions/methods are listed in `api.rst`

I've tried re-implementing `swap_dims` here using `rename_dims`, `drop_indexes` and `set_xindex`. This fixes the example in #8646, but unfortunately it fails to handle the pandas multi-index special case (i.e., a single non-dimension coordinate wrapping a `pd.MultiIndex` that is promoted to a dimension coordinate in `swap_dims` auto-magically results in a `PandasMultiIndex` with both dimension and level coordinates).","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/8911/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,pull
2215059449,PR_kwDOAMm_X85rJr7c,8888,to_base_variable: coerce multiindex data to numpy array,4160723,open,0,,,3,2024-03-29T10:10:42Z,2024-03-29T15:54:19Z,,MEMBER,,0,pydata/xarray/pulls/8888,"- [x] Closes #8887, and probably supersedes #8809
- [x] Tests added
- [ ] User visible changes (including notable bug fixes) are documented in `whats-new.rst`
- ~~New functions/methods are listed in `api.rst`~~

@slevang this should also make your test case added in #8809 work. I haven't added it here; instead I added a basic check that should be enough.

I don't really understand why the serialization backends (zarr?) do not seem to work with the `PandasMultiIndexingAdapter.__array__()` implementation, which should normally coerce the multi-index levels into numpy arrays as needed. Anyway, I guess that coercing it early like in this PR doesn't hurt and may avoid the confusion of a non-indexed, isolated coordinate variable that still wraps a `pandas.MultiIndex`.","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/8888/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,pull
1839199929,PR_kwDOAMm_X85XUl4W,8051,Allow setting (or skipping) new indexes in open_dataset,4160723,open,0,,,9,2023-08-07T10:53:46Z,2024-02-03T19:12:48Z,,MEMBER,,0,pydata/xarray/pulls/8051,"- [x] Closes #6633
- [ ] Tests added
- [ ] User visible changes (including notable bug fixes) are documented in `whats-new.rst`
- [ ] New functions/methods are listed in `api.rst`

This PR introduces a new boolean parameter `set_indexes=True` to `xr.open_dataset()`, which may be used to skip the creation of default (pandas) indexes when opening a dataset. It currently works with the Zarr backend:

```python
import numpy as np
import xarray as xr

# example dataset (a real dataset may be much larger)
arr = np.random.random(size=1_000_000)
xr.Dataset({""x"": arr}).to_zarr(""dataset.zarr"")

xr.open_dataset(""dataset.zarr"", set_indexes=False, engine=""zarr"")
# <xarray.Dataset>
# Dimensions:  (x: 1000000)
# Coordinates:
#     x        (x) float64 ...
# Data variables:
#     *empty*

xr.open_zarr(""dataset.zarr"", set_indexes=False)
# <xarray.Dataset>
# Dimensions:  (x: 1000000)
# Coordinates:
#     x        (x) float64 ...
# Data variables:
#     *empty*
```
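Indexes skipped at open time can still be built explicitly later. A minimal sketch of that follow-up, assuming the `set_indexes` keyword proposed in this PR:

```python
import xarray as xr
from xarray.indexes import PandasIndex

# assumes the `set_indexes` keyword proposed in this PR
ds = xr.open_dataset('dataset.zarr', engine='zarr', set_indexes=False)

# build a default (pandas) index for 'x' only if/when it is needed,
# e.g. before label-based selection with .sel()
ds = ds.set_xindex('x', PandasIndex)
```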
I'll add it to the other Xarray backends as well, but I'd like to get your thoughts about the API first.

1. Do we want to add yet another keyword parameter to `xr.open_dataset()`? There are already many...
2. Do we want to add this parameter to the `BackendEntrypoint.open_dataset()` API?
    - I'm afraid we must do it if we want this parameter in `xr.open_dataset()`
    - this would also make it possible to skip the creation of custom indexes (if any) in custom IO backends
    - con: if we require `set_indexes` in the signature in addition to the `drop_variables` parameter, this is a breaking change for all existing 3rd-party backends. Or should we group `set_indexes` with the other xarray decoder kwargs? That would feel a bit odd to me, as setting indexes is different from decoding data.
3. Or should we leave this up to the backends?
    - pros: no breaking change, more flexible (3rd-party backends may want to offer more control, like choosing between custom indexes and default pandas indexes, or skipping the creation of indexes by default)
    - cons: less discoverable, consistency not enforced across 3rd-party backends (although for such an advanced use case this is probably OK), not available by default in every backend

Currently 1 and 2 are implemented in this PR, although as I write this comment I think I would prefer 3. I guess this depends on whether we prefer `open_***` vs. `xr.open_dataset(engine=""***"")`, and unless I missed something there is still no real consensus about that? (e.g., #7496).","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/8051/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,pull
1879109770,PR_kwDOAMm_X85ZbILy,8140,Deprecate passing pd.MultiIndex implicitly,4160723,open,0,,,23,2023-09-03T14:01:18Z,2023-11-15T20:15:00Z,,MEMBER,,0,pydata/xarray/pulls/8140,"- Follow-up #8094
- [x] Closes #6481
- [x] User visible changes (including notable bug fixes) are documented in `whats-new.rst`

This PR should normally raise a warning *each time* indexed coordinates are created implicitly from a `pd.MultiIndex` object.

I updated the tests to create coordinates explicitly using `Coordinates.from_pandas_multiindex()` (see the sketch at the end of this description). I also refactored some parts where a `pd.MultiIndex` could still be passed and promoted internally, with the exception of:

- `swap_dims()`: it should raise a warning! Right now the warning message is a bit confusing for this case, but instead of adding a special case we should probably deprecate the whole method, as is already suggested in a TODO comment. This method was introduced to circumvent the limitations of dimension coordinates, which isn't needed anymore (`rename_dims` and/or `set_xindex` is equivalent and less confusing).
- `xr.DataArray(pandas_obj_with_multiindex, dims=...)`: I guess it should raise a warning too?
- `da.stack(z=...).groupby(""z"")`: it shouldn't raise a warning, but this requires a (heavy?) refactoring of groupby. While building the ""grouper"" objects, `grouper.group1d` or `grouper.unique_coord` may still be built by extracting only the multi-index dimension coordinate. I'd greatly appreciate it if anyone familiar with the groupby implementation could help me with this!

@dcherian ?
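For reference, a minimal sketch of the explicit pattern the updated tests now use, next to the implicit one this PR deprecates:

```python
import pandas as pd
import xarray as xr

midx = pd.MultiIndex.from_product([['a', 'b'], [1, 2]], names=('one', 'two'))

# explicit (not deprecated): build the indexed coordinates first
coords = xr.Coordinates.from_pandas_multiindex(midx, 'x')
ds = xr.Dataset(coords=coords)

# implicit promotion of the pd.MultiIndex: deprecated by this PR (warns)
ds_implicit = xr.Dataset(coords={'x': midx})
```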
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/8140/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,pull
1865494976,PR_kwDOAMm_X85Ytlq0,8111,Alignment: allow flexible index coordinate order,4160723,open,0,,,3,2023-08-24T16:18:49Z,2023-09-28T15:58:38Z,,MEMBER,,0,pydata/xarray/pulls/8111,"- [x] Closes #7002
- [x] Tests added
- [x] User visible changes (including notable bug fixes) are documented in `whats-new.rst`

This PR relaxes some of the rules used in alignment for finding the indexes to compare or join together. Those indexes must still be of the same type and must relate to the same set of coordinates (and dimensions), but the order of the coordinates is now ignored. It is up to the index to implement the equals / join logic if it needs to care about that order.

Regarding `pandas.MultiIndex`, it seems that the level names are ignored when comparing indexes:

```python
import pandas as pd

midx = pd.MultiIndex.from_product([[""a"", ""b""], [0, 1]], names=(""one"", ""two""))
midx2 = pd.MultiIndex.from_product([[""a"", ""b""], [0, 1]], names=(""two"", ""one""))

midx.equals(midx2)  # True
```

However, in Xarray the names of the multi-index levels (and their order) matter, since each level has its own xarray coordinate. In this PR, `PandasMultiIndex.equals()` and `PandasMultiIndex.join()` thus check that the level names match.","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/8111/reactions"", ""total_count"": 1, ""+1"": 1, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,pull
1869879398,PR_kwDOAMm_X85Y8P4c,8118,Add Coordinates `set_xindex()` and `drop_indexes()` methods,4160723,open,0,,,0,2023-08-28T14:28:24Z,2023-09-19T01:53:18Z,,MEMBER,,0,pydata/xarray/pulls/8118,"- Complements #8102
- [ ] Tests added
- [ ] User visible changes (including notable bug fixes) are documented in `whats-new.rst`
- [ ] New functions/methods are listed in `api.rst`

I don't think that we need to copy most of the Dataset / DataArray API over to `Coordinates`, but I find it convenient to have some relevant methods there too. For example, building `Coordinates` from scratch (with custom indexes) before passing the whole coords + indexes bundle around:

```python
import dask.array as da
import numpy as np
import xarray as xr

# DaskIndex is a custom (third-party) index class, not part of xarray
coords = (
    xr.Coordinates(
        coords={""x"": da.arange(100_000_000), ""y"": np.arange(100)},
        indexes={},
    )
    .set_xindex(""x"", DaskIndex)
    .set_xindex(""y"", xr.indexes.PandasIndex)
)

ds = xr.Dataset(coords=coords)
# <xarray.Dataset>
# Dimensions:  (x: 100000000, y: 100)
# Coordinates:
#   * x        (x) int64 dask.array<...>
#   * y        (y) int64 0 1 2 3 4 5 6 7 8 9 10 ... 90 91 92 93 94 95 96 97 98 99
# Data variables:
#     *empty*
# Indexes:
#     x        DaskIndex
```
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/8118/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,pull
1889751633,PR_kwDOAMm_X85Z-5v1,8170,Dataset.from_dataframe: optionally keep multi-index unexpanded,4160723,open,0,,,0,2023-09-11T06:20:17Z,2023-09-11T06:20:17Z,,MEMBER,,1,pydata/xarray/pulls/8170,"- [x] Closes #8166
- [ ] Tests added
- [ ] User visible changes (including notable bug fixes) are documented in `whats-new.rst`
- [ ] New functions/methods are listed in `api.rst`

I added both the `unstack` and `dim` arguments, but we can change that (sketch below).
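A sketch of the intended usage; the keyword names below are the ones added in this PR, but their exact semantics and defaults may still change:

```python
import pandas as pd
import xarray as xr

midx = pd.MultiIndex.from_product([['a', 'b'], [0, 1]], names=('one', 'two'))
df = pd.DataFrame({'value': [1.0, 2.0, 3.0, 4.0]}, index=midx)

# current behavior: the multi-index is expanded ('unstacked')
# into a ('one', 'two') n-d grid
ds_grid = xr.Dataset.from_dataframe(df)

# with this PR: keep the multi-index as-is along a single 'z' dimension
ds_midx = xr.Dataset.from_dataframe(df, unstack=False, dim='z')
```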
- [ ] update `DataArray.from_series()`","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/8170/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,pull
1880184915,PR_kwDOAMm_X85ZespA,8143,Deprecate the multi-index dimension coordinate,4160723,open,0,,,0,2023-09-04T12:32:36Z,2023-09-04T12:32:48Z,,MEMBER,,0,pydata/xarray/pulls/8143,"- [ ] Tests added
- [ ] User visible changes (including notable bug fixes) are documented in `whats-new.rst`

This PR adds a `future_no_mindex_dim_coord=False` option that, if set to True, enables the future behavior of `PandasMultiIndex` (i.e., no dimension coordinate with tuple values is added):

```python
import xarray as xr

ds = xr.Dataset(coords={""x"": [""a"", ""b""], ""y"": [1, 2]})
ds.stack(z=[""x"", ""y""])

# current behavior:
#
# <xarray.Dataset>
# Dimensions:  (z: 4)
# Coordinates:
#   * z        (z) object MultiIndex
#   * x        (z) <U1 'a' 'a' 'b' 'b'
#   * y        (z) int64 1 2 1 2
# Data variables:
#     *empty*

# future behavior (option set to True):
#
# <xarray.Dataset>
# Dimensions:  (z: 4)
# Coordinates:
#   * x        (z) <U1 'a' 'a' 'b' 'b'
#   * y        (z) int64 1 2 1 2
# Data variables:
#     *empty*
```
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/8143/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,pull
,,8124,,4160723,open,0,,,,,,,MEMBER,,,pydata/xarray/pulls/8124,"- [ ] Closes #xxxx
- [ ] Tests added
- [ ] User visible changes (including notable bug fixes) are documented in `whats-new.rst`
- [ ] New functions/methods are listed in `api.rst`

The goal of this PR is to provide a more general solution for indexed coordinate variables, i.e., support arbitrary dimensions and/or duck arrays for those variables while at the same time preventing them from being updated in a way that would invalidate their index. This would solve problems like the one mentioned here: https://github.com/pydata/xarray/issues/1650#issuecomment-1697237429

@shoyer I've tried to implement what you suggested in https://github.com/pydata/xarray/pull/4979#discussion_r589798510. It would indeed be nice if we could eventually get rid of `IndexVariable`. It won't be easy to deprecate it until we finish the index refactor (i.e., all the methods listed in #6293), though. Also, I didn't find an easy way to refactor that class, as it has been designed too closely around a 1-d variable backed by a `pandas.Index`.

So the approach implemented in this PR is to keep using `IndexVariable` for `PandasIndex` until we can deprecate / remove it later, and for the other cases to use `Variable` with data wrapped in a custom `IndexedCoordinateArray` object.

The latter solution (wrapper) doesn't always work nicely, though. For example, several methods of `Variable` expect `self._data` to directly return a duck array (e.g., a dask array or a chunked duck array); a wrapped duck array will result in unexpected behavior there. We could probably add some checks / indirection or extend the wrapper API... but I wonder whether there is a more elegant approach?

More generally, which operations should we allow / forbid / skip for an indexed coordinate variable?

- Set array items in-place? Do not allow.
- Replace data? Do not allow.
- (Re)Chunk?
- Load lazy data?
- ... ?

(Note: we could add `Index.chunk()` and `Index.load()` methods in order to allow an Xarray index to implement custom logic for the two latter cases, e.g., converting a DaskIndex to a PandasIndex during load; see #8128.)

cc @andersy005 (some changes made here may conflict with what you are refactoring in #8075).
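To make the wrapper idea concrete, here is a minimal sketch (simplified, and not the actual implementation in this PR) of the kind of guard such a wrapper could provide:

```python
import numpy as np


class IndexedCoordinateArray:
    '''Minimal sketch of a wrapper for an indexed coordinate's data.

    A simplified illustration, not this PR's implementation: read access
    is forwarded to the wrapped duck array, while in-place updates (which
    would silently invalidate the index) are rejected.
    '''

    def __init__(self, array):
        self._array = array

    @property
    def dtype(self):
        return self._array.dtype

    @property
    def shape(self):
        return self._array.shape

    def __array__(self, dtype=None):
        # coerce to numpy on demand, like xarray's indexing adapters
        return np.asarray(self._array, dtype=dtype)

    def __getitem__(self, key):
        return self._array[key]

    def __setitem__(self, key, value):
        raise ValueError(
            'cannot modify an indexed coordinate variable in-place; '
            'drop or reset its index first'
        )
```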
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/8124/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,pull 1875631817,PR_kwDOAMm_X85ZPnjq,8128,Add Index.load() and Index.chunk() methods,4160723,open,0,,,0,2023-08-31T14:16:27Z,2023-08-31T15:49:06Z,,MEMBER,,1,pydata/xarray/pulls/8128," - [ ] Closes #xxxx - [ ] Tests added - [ ] User visible changes (including notable bug fixes) are documented in `whats-new.rst` - [ ] New functions/methods are listed in `api.rst` As mentioned in #8124, it gives more control to custom Xarray indexes on what best to do when the Dataset / DataArray `load()` and `chunk()` counterpart methods are called. `PandasIndex.load()` and `PandasIndex.chunk()` always return self (no action required). For a DaskIndex, we might want to return a PandasIndex (or another non-lazy index) from `load()` and rebuild a DaskIndex object from `chunk()` (rechunk).","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/8128/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,pull 1412901282,PR_kwDOAMm_X85A_96j,7182,add MultiPandasIndex helper class,4160723,open,0,,,2,2022-10-18T09:42:58Z,2023-08-23T16:30:28Z,,MEMBER,,1,pydata/xarray/pulls/7182," - [ ] Closes #xxxx - [ ] Tests added - [ ] User visible changes (including notable bug fixes) are documented in `whats-new.rst` - [ ] New functions/methods are listed in `api.rst` This PR adds a `xarray.indexes.MultiPandasIndex` helper class for building custom, meta-indexes that encapsulate multiple `PandasIndex` instances. Unlike `PandasMultiIndex`, the meta-index classes inheriting from this helper class may encapsulate loosely coupled (pandas) indexes, with coordinates of arbitrary dimensions (each coordinate must be 1-dimensional but an Xarray index may be created from coordinates with differing dimensions). Early prototype in this [notebook](https://notebooksharing.space/view/3d599addf8bd6b06a6acc241453da95e28c61dea4281ecd194fbe8464c9b296f#displayOptions=) TODO / TO FIX: - How to allow custom `__init__` options in subclasses be passed to all the `type(self)(new_indexes)` calls inside the `MultiPandasIndex` ""base"" class? This could be done via `**kwargs` passed through... However, mypy will certainly complain (Liskov Substitution Principle). - Is `MultiPandasIndex` a good name for this helper class?","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/7182/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,pull 1364798843,PR_kwDOAMm_X84-hLRI,7004,Rework PandasMultiIndex.sel internals,4160723,open,0,,,2,2022-09-07T14:57:29Z,2022-09-22T20:38:41Z,,MEMBER,,0,pydata/xarray/pulls/7004," - [x] Closes #6838 - [ ] Tests added - [ ] User visible changes (including notable bug fixes) are documented in `whats-new.rst` This PR hopefully improves how are handled the labels that are provided for multi-index level coordinates in `.sel()`. More specifically, slices are handled in a cleaner way and it is now allowed to provide array-like labels. 
`PandasMultiIndex.sel()` relies on the underlying `pandas.MultiIndex` methods like this:

- use ``get_loc`` when all levels are provided, each with a scalar label (no slice, no array)
    - always drops the index and returns scalar coordinates for each multi-index level
- use ``get_loc_level`` when only a subset of the levels are provided, with scalar labels only
    - may collapse one or more levels of the multi-index (dropped levels result in scalar coordinates)
    - if only one level remains: renames the dimension and the corresponding dimension coordinate
- use ``get_locs`` for all other cases
    - always keeps the multi-index and its coordinates (even if only one item or one level is selected)

This yields a predictable behavior: as soon as one of the provided labels is a slice or array-like, the multi-index and all its level coordinates are kept in the result.

Some cases are illustrated below (I compare this PR with an older release due to the errors reported in #6838):

```python
import xarray as xr
import pandas as pd

midx = pd.MultiIndex.from_product([list(""abc""), range(4)], names=(""one"", ""two""))
ds = xr.Dataset(coords={""x"": midx})
# <xarray.Dataset>
# Dimensions:  (x: 12)
# Coordinates:
#   * x        (x) object MultiIndex
#   * one      (x) object 'a' 'a' 'a' 'a' 'b' 'b' 'b' 'b' 'c' 'c' 'c' 'c'
#   * two      (x) int64 0 1 2 3 0 1 2 3 0 1 2 3
# Data variables:
#     *empty*
```

```python
ds.sel(one=""a"", two=0)

# this PR
#
# <xarray.Dataset>
# Dimensions:  ()
# Coordinates:
#     x        object ('a', 0)
#     one      <U1 'a'
#     two      int64 0
# Data variables:
#     *empty*

# v2022.3.0
#
# <xarray.Dataset>
# Dimensions:  ()
# Coordinates:
#     x        object ('a', 0)
# Data variables:
#     *empty*
```

```python
ds.sel(one=""a"")

# this PR:
#
# <xarray.Dataset>
# Dimensions:  (two: 4)
# Coordinates:
#   * two      (two) int64 0 1 2 3
#     one      <U1 'a'
# Data variables:
#     *empty*

# v2022.3.0
#
# <xarray.Dataset>
# Dimensions:  (two: 4)
# Coordinates:
#   * two      (two) int64 0 1 2 3
# Data variables:
#     *empty*
```

```python
ds.sel(one=slice(""a"", ""b""))

# this PR
#
# <xarray.Dataset>
# Dimensions:  (x: 8)
# Coordinates:
#   * x        (x) object MultiIndex
#   * one      (x) object 'a' 'a' 'a' 'a' 'b' 'b' 'b' 'b'
#   * two      (x) int64 0 1 2 3 0 1 2 3
# Data variables:
#     *empty*

# v2022.3.0
#
# <xarray.Dataset>
# Dimensions:  (two: 8)
# Coordinates:
#   * two      (two) int64 0 1 2 3 0 1 2 3
# Data variables:
#     *empty*
```

```python
ds.sel(one=""a"", two=slice(1, 1))

# this PR
#
# <xarray.Dataset>
# Dimensions:  (x: 1)
# Coordinates:
#   * x        (x) object MultiIndex
#   * one      (x) object 'a'
#   * two      (x) int64 1
# Data variables:
#     *empty*

# v2022.3.0
#
# <xarray.Dataset>
# Dimensions:  (x: 1)
# Coordinates:
#   * x        (x) MultiIndex
#   - one      (x) object 'a'
#   - two      (x) int64 1
# Data variables:
#     *empty*
```

```python
ds.sel(one=[""b"", ""c""], two=[0, 2])

# this PR
#
# <xarray.Dataset>
# Dimensions:  (x: 4)
# Coordinates:
#   * x        (x) object MultiIndex
#   * one      (x) object 'b' 'b' 'c' 'c'
#   * two      (x) int64 0 2 0 2
# Data variables:
#     *empty*

# v2022.3.0
#
# ValueError: Vectorized selection is not available along coordinate 'one' (multi-index level)
```
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/7004/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,pull