id,node_id,number,title,user,state,locked,assignee,milestone,comments,created_at,updated_at,closed_at,author_association,active_lock_reason,draft,pull_request,body,reactions,performed_via_github_app,state_reason,repo,type
1389295853,I_kwDOAMm_X85Szvjt,7099,Pass arbitrary options to sel(),4160723,open,0,,,4,2022-09-28T12:44:52Z,2024-04-30T00:44:18Z,,MEMBER,,,,"### Is your feature request related to a problem?
Currently `.sel()` accepts two options `method` and `tolerance`. These are relevant for default (pandas) indexes but not necessarily for other, custom indexes.
It would also be useful for custom indexes to expose their own selection options, e.g.,
- index query optimization like the `dualtree` flag of [sklearn.neighbors.KDTree.query](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KDTree.html#sklearn.neighbors.KDTree.query)
- k-nearest neighbors selection with the creation of a new ""k"" dimension (+ coordinate / index) with user-defined name and size.
From #3223, it would be nice if we could also pass distinct option values per index.
What would be a good API for that?
### Describe the solution you'd like
Some ideas:
A. Allow passing a tuple `(labels, options_dict)` as indexer value
```python
ds.sel(x=([0, 2], {""method"": ""nearest""}), y=3)
```
B. Expose an `options` kwarg that would accept a nested dict
```python
ds.sel(x=[0, 2], y=3, options={""x"": {""method"": ""nearest""}})
```
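To make option B slightly more concrete, here is a minimal pure-Python sketch of how a nested `options` dict could be split per indexer before each index handles its own selection. The helper name `split_sel_options` and the `defaults` argument are assumptions made for illustration; nothing like this exists in Xarray today.

```python
def split_sel_options(indexers, options=None, defaults=None):
    '''Pair each indexer with the options that apply to it.

    Hypothetical helper: returns {dim: (labels, options_for_dim)}.
    '''
    options = options or {}
    defaults = defaults or {}
    # options for a name that is not being indexed are most likely a typo
    unknown = set(options) - set(indexers)
    if unknown:
        raise ValueError(f'options given for unknown indexers: {sorted(unknown)}')
    return {
        dim: (labels, {**defaults, **options.get(dim, {})})
        for dim, labels in indexers.items()
    }


per_dim = split_sel_options(
    {'x': [0, 2], 'y': 3},
    options={'x': {'method': 'nearest'}},
)
```

Each index would then only see its own labels and options, which is also the information that option A packs into a tuple.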
Option A does not look very readable. Option B is slightly better, although the nested dictionary is not great.
Any other ideas? Some sort of context manager? Some `Index` specific API?
### Describe alternatives you've considered
The API proposed in #3223 would look great if `method` and `tolerance` were the only accepted options, but less so for arbitrary options.
### Additional context
_No response_","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/7099/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,issue
1861543091,I_kwDOAMm_X85u9OSz,8097,Documentation rendering issues (dark mode),4160723,open,0,,,2,2023-08-22T14:06:03Z,2024-02-13T02:31:10Z,,MEMBER,,,,"### What is your issue?
There are a couple of rendering issues in Xarray's documentation landing page, especially with the dark mode.
- we should display two versions of the logo in the light vs. dark mode (note: if the logo is in the svg format, it may be possible to add CSS classes so that it renders consistently with the active mode)
- same for the images in the section cards (would be nice also to display all the images with the same width / height)
- if possible, it would be nice to move the Twitter logo right next to the GitHub logo (upper right) with consistent styling.
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/8097/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,issue
667864088,MDU6SXNzdWU2Njc4NjQwODg=,4285,Awkward array backend?,4160723,open,0,,,38,2020-07-29T13:53:45Z,2023-12-30T18:47:48Z,,MEMBER,,,,"Just curious if anyone here has thoughts on this.
For more context: [Awkward](https://github.com/scikit-hep/awkward-1.0) is like numpy but for arrays of very arbitrary (dynamic) structure.
I don't know much yet about that library (I've just seen [this SciPy 2020 presentation](https://www.youtube.com/watch?v=WlnUF3LRBj4)), but now I could imagine using xarray for dealing with labelled collections of geometrical / geospatial objects like polylines or polygons.
At this stage, any integration between xarray and awkward arrays would be something highly experimental, but I think this might be an interesting case for flexible arrays (and possibly flexible indexes) mentioned in the [roadmap](http://xarray.pydata.org/en/stable/roadmap.html). There is some discussion here: https://github.com/scikit-hep/awkward-1.0/issues/27.
Does anyone see any other potential use case?
cc @pydata/xarray
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/4285/reactions"", ""total_count"": 6, ""+1"": 6, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,issue
1989356758,I_kwDOAMm_X852kyzW,8447,Improve discoverability of backend engine options,4160723,open,0,,,5,2023-11-12T11:14:56Z,2023-12-12T20:30:28Z,,MEMBER,,,,"### Is your feature request related to a problem?
Backend engine options are not easily discoverable, and we need to know them or figure them out before passing them as kwargs to `xr.open_dataset()`.
### Describe the solution you'd like
The solution is similar to the one proposed in #8002 for setting a new index.
The API could look like this:
```python
import xarray as xr
ds = xr.open_dataset(
    file_or_obj,
    engine=xr.backends.engine(""myengine"").with_options(
        option1=True,
        option2=100,
    ),
)
```
where `xr.backends.engine(""myengine"")` returns the `MyEngineBackendEntrypoint` subclass.
We would need to extend the API for `BackendEntrypoint` with a `.with_options()` factory method:
```python
class BackendEntrypoint:
    _open_dataset_options: dict[str, Any]

    @classmethod
    def with_options(cls):
        """"""This backend does not implement `with_options`.""""""
        raise NotImplementedError()
```
Such that
```python
class MyEngineBackendEntryPoint(BackendEntrypoint):
    open_dataset_parameters = (""option1"", ""option2"")

    @classmethod
    def with_options(
        cls,
        option1: bool = False,
        option2: int | None = None,
    ):
        """"""Get the backend with user-defined options.

        Parameters
        ----------
        option1 : bool, optional
            This is option1.
        option2 : int, optional
            This is option2.
        """"""
        obj = cls()
        # maybe validate the given input options
        if option2 is None:
            option2 = 1
        obj._options = {""option1"": option1, ""option2"": option2}
        return obj

    def open_dataset(
        self,
        filename_or_obj: str | os.PathLike[Any] | BufferedIOBase | AbstractDataStore,
        *,
        drop_variables: str | Iterable[str] | None = None,
        **kwargs,  # no static checker error (liskov substitution principle)
    ):
        # kwargs passed directly to open_dataset take precedence over options
        # or alternatively raise an error?
        option1 = kwargs.get(""option1"", self._options.get(""option1"", False))
        ...
```
Pros:
- Using `.with_options(...)` would seamlessly work with IDE auto-completion, static type checkers (I guess? I'm not sure how static checkers support entry-points), documentation, etc.
- There is no breaking change (`xr.open_dataset(obj, engine=...)` accepts either a string or a BackendEntryPoint subtype but not yet a BackendEntryPoint object) and this feature could be adopted progressively by existing 3rd-party backends.
Cons:
- The possible duplicated declaration of options among `open_dataset_parameters`, `.with_options()` and `.open_dataset()` does not look super nice but I don't really know how to avoid that.
### Describe alternatives you've considered
A `BackendEntryPoint.with_options()` factory is not really needed and we could just go with `BackendEntryPoint.__init__()` instead. Perhaps `with_options` looks a bit clearer and leaves room for more flexibility in `__init__` , though?
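For comparison, here is a minimal sketch of the `__init__`-based alternative, written against a stand-in base class so that it runs without Xarray. `ToyBackendEntrypoint`, `MyEngine` and the option names are hypothetical, and the real `BackendEntrypoint` may well need a different signature.

```python
class ToyBackendEntrypoint:
    '''Stand-in for xarray's BackendEntrypoint (an assumption, not the real class).'''

    def open_dataset(self, filename_or_obj, **kwargs):
        raise NotImplementedError


class MyEngine(ToyBackendEntrypoint):
    def __init__(self, option1=False, option2=None):
        # options are validated and stored once, at construction time
        self._options = {
            'option1': option1,
            'option2': 1 if option2 is None else option2,
        }

    def open_dataset(self, filename_or_obj, **kwargs):
        # kwargs passed at call time take precedence over the stored options
        resolved = {**self._options, **kwargs}
        # a real backend would open the file here; the sketch returns its inputs
        return filename_or_obj, resolved


engine = MyEngine(option1=True)
_, resolved = engine.open_dataset('data.nc', option2=100)
```

With this variant the options show up in the constructor signature, so auto-completion works the same way, but `__init__` can no longer be used for anything else.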
### Additional context
cc @jsignell https://github.com/stac-utils/pystac/issues/846#issuecomment-1405758442","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/8447/reactions"", ""total_count"": 4, ""+1"": 4, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,issue
1148021907,I_kwDOAMm_X85EbWyT,6293,Explicit indexes: next steps,4160723,open,0,,,3,2022-02-23T12:19:38Z,2023-12-01T09:34:28Z,,MEMBER,,,,"#5692 is ~~not merged yet~~ now merged ~~but~~ and we can ~~already~~ start thinking about the next steps. I’m opening this issue to list and track the remaining tasks. @pydata/xarray, do not hesitate to add a comment below if you think about something that is missing here.
## Continue the refactoring of the internals
Although in #5692 everything seems to work with the current pandas index wrappers for dimension coordinates, not all of Xarray's internals have been refactored yet to fully support (or at least be compatible with) custom indexes. Here is a list of `Dataset` / `DataArray` methods that still need to be checked / updated (this list may be incomplete):
- [ ] `as_numpy` (#8001)
- [ ] `broadcast` (#6430, #6481 )
- [ ] `drop_sel` (#6605, #7699)
- [ ] `drop_isel`
- [ ] `drop_dims`
- [ ] `drop_duplicates` (#8499)
- [ ] `transpose`
- [ ] `interpolate_na`
- [ ] `ffill`
- [ ] `bfill`
- [ ] `reduce`
- [ ] `map`
- [ ] `apply`
- [ ] `quantile`
- [ ] `rank`
- [ ] `integrate`
- [ ] `cumulative_integrate`
- [ ] `filter_by_attrs`
- [ ] `idxmin`
- [ ] `idxmax`
- [ ] `argmin`
- [ ] `argmax`
- [ ] `concat` (partially refactored, may not fully work with multi-dimension indexes)
- [ ] `polyfit`
I ended up following a common pattern in #5692 when adding explicit / flexible index support for various features (it is quite generic, though, the actual procedure may vary from one case to another and many steps may be skipped):
- Check if it’s worth adding a new method to the Xarray `Index` base class. There may be several motivations:
- Avoid handling Pandas index objects inside Dataset or DataArray methods (even if we don’t plan to fully support custom indexes for everything, it is preferable to put this logic behind the `PandasIndex` or `PandasMultiIndex` wrapper classes for clarity and also if eventually we want to make Xarray less dependent on Pandas)
- We want a specific implementation rather than relying on the `Variable`’s corresponding method for speed-up or for other reasons, e.g.,
- `IndexVariable.concat` exists to avoid unnecessary Pandas/Numpy conversions ; in #5692 `PandasIndex.concat` has the same logic and will fully replace the former if/once we get rid of `IndexVariable`
- `PandasIndex.roll` reuses `pandas.Index` indexing and `append` capabilities
- `Index` API closely follows DataArray, Dataset and Variable API (i.e., same method names) for consistency
- Within the Dataset or DataArray method, first call the `Index` API (if it exists) to create new indexes
- The `Indexes` class (i.e., the `.xindexes` property returns an instance of this class) provides convenient API for iterating through indexes (e.g., get a list of unique indexes, get all coordinates or dimensions for a given index, etc.)
- If there’s no implementation for the called `Index` API, either raise an error or fallback to calling the `Variable` API (below) depending on the case
- Create new coordinate variables for each of the new indexes using `Index.create_variables`
- It is possible to pass a dict of current coordinate variables to `Index.create_variables` ; it is used to propagate variable metadata (`dtype`, `attrs` and `encoding`)
- Not all indexes should create new coordinate variables, only those for which it is possible to reuse index data as coordinate variable data (like Pandas indexes)
- Iterate through the variables and call the `Variable` API (if it exists)
- Skip new coordinate variables created at the previous step (just reuse it)
- Propagate the indexes that are not affected by the operation and clean up all indexes, i.e., ensure consistency between indexes and coordinate variables
- There are a couple of convenience methods that have been added in #5692 for that purpose: `filter_indexes_from_coords` and `assert_no_index_corrupted`
- Replace indexes and variables, e.g., using `_replace`, `_replace_with_new_dims` or `_overwrite_indexes` methods
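The pattern above can be sketched with toy objects. This is only an illustration of the dispatch order (index API first, then coordinate variables created from the new indexes, then the variable-level fallback); `ToyIndex`, `roll_values` and `roll_dataset` are hypothetical names, variables are plain lists, and none of this mirrors Xarray's actual internals.

```python
def roll_values(values, shift):
    '''Variable-level fallback: roll a plain list of values.'''
    if not values:
        return list(values)
    k = shift % len(values)
    return values[-k:] + values[:-k] if k else list(values)


class ToyIndex:
    '''Stand-in for an Xarray Index wrapping one coordinate (hypothetical).'''

    def __init__(self, name, values):
        self.name = name
        self.values = list(values)

    def roll(self, shift):
        # index-specific implementation, named after the Dataset method
        return ToyIndex(self.name, roll_values(self.values, shift))

    def create_variables(self):
        # reuse the index data as coordinate variable data
        return {self.name: list(self.values)}


def roll_dataset(variables, indexes, shift):
    new_indexes, new_variables = {}, {}
    # 1. call the Index API first, if the index implements it
    for name, index in indexes.items():
        index_roll = getattr(index, 'roll', None)
        if index_roll is None:
            continue  # fall back to the variable-level code path below
        new_indexes[name] = index_roll(shift)
        # 2. create coordinate variables from the new index
        new_variables.update(new_indexes[name].create_variables())
    # 3. call the variable-level API, skipping variables created in step 2
    for name, values in variables.items():
        if name not in new_variables:
            new_variables[name] = roll_values(values, shift)
    return new_variables, new_indexes


new_vars, new_idx = roll_dataset(
    {'x': [0, 1, 2], 'foo': [10, 20, 30]},
    {'x': ToyIndex('x', [0, 1, 2])},
    shift=1,
)
```

The sketch leaves out the last two steps (propagating unaffected indexes and checking consistency between indexes and coordinates), which depend on Xarray internals.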
## Relax all constraints related to “dimension (index) coordinates” in Xarray
- [x] Allow multi-dimensional variables with the name matching one of its dimensions: #2233 #2405 (https://github.com/pydata/xarray/pull/2405#issuecomment-419969570)
- #7989
## Indexes repr
- [x] Add an `Indexes` section to Dataset and DataArray reprs
- #6795
- #7185
- [ ] Make the repr of `Indexes` (i.e., `.xindexes` property) consistent with the repr of `Coordinates` (`.coords` property)
- [x] Add `Index._repr_inline_` for tweaking the inline representation of each index shown in the reprs above
- #7183
## Public API for assigning and (re)setting indexes
There is no public API yet for creating and/or assigning existing indexes to Dataset and DataArray objects.
- [ ] Enable and/or document the `indexes` parameter in Dataset and DataArray constructors
- [ ] Deprecate the implicit creation of pandas multi-index wrappers (and their corresponding coordinates) from anything passed via the `data`, `data_vars` or `coords` arguments in favor of a more explicit way to pass it.
- [ ] https://github.com/pydata/xarray/issues/6633 (pass empty dictionary)
- #6392
- #7214
- #7368
- [x] Add `set_xindex` and `drop_indexes` methods
- #6849
- #6971
- Deprecate `set_index` and `reset_index`? See https://github.com/pydata/xarray/issues/4366#issuecomment-920458966
We still need to figure out how best we can (1) assign existing indexes (possibly with their coordinates) and (2) pass index build options.
## Other public API for index-based operations
To fully leverage the power and flexibility of custom indexes, we might want to update some parts of Xarray’s public API in order to allow passing arbitrary options per index. For example:
- [ ] `sel`: the current `method` and `tolerance` may not be relevant for all indexes, pass extra arguments to Scipy's [cKDTree.query](https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.cKDTree.query.html#scipy.spatial.cKDTree.query), etc. #7099
- [ ] `align`: #2217
Also:
- [ ] Make public the `Indexes` API as it provides convenient methods that might be useful for end-users
- [ ] Import the `Index` base class into Xarray’s main namespace (i.e., `xr.Index`)? Also `PandasIndex` and `PandasMultiIndex`? The latter may be useful if we deprecate `set_index(append=True)` and/or if we deprecate “unpacking” `pandas.MultiIndex` objects to coordinates when given as `coords` in the Dataset / DataArray constructors.
- [ ] Add references in docstrings (https://github.com/pydata/xarray/pull/5692#discussion_r820117354).
## Documentation
- [ ] User guide:
- [x] Update the “Terminology” section: “Index” may include custom indexes, review “Dimension coordinate” / “Non-dimension coordinate” as “Indexed coordinate” / “Non-indexed coordinate”
- [ ] Update the “Data structure” section such that it clearly mentions indexes as 1st class citizen of the Xarray data model
- [ ] Maybe update other parts of the documentation that refer to the concept of “dimension coordinate”
- [ ] API reference:
- [ ] add `Indexes` API
- [ ] add `Index` API: #6975
- [ ] Xarray internals: add a subsection on how to add custom indexes, maybe with some basic examples: #6975
- [ ] Update development roadmap section
## Index types and helper classes built in Xarray
- [ ] Since a lot of potential use-cases for custom indexes may consist in adding some extra logic on top of one or more pandas indexes along one or more dimensions (i.e., “meta-indexes”), it might be worth providing a helper `Index` abstract subclass that would basically dispatch the given arguments to the corresponding, encapsulated `PandasIndex` instances and then merge the results
- #7182
- [ ] Deprecate `PandasMultiIndex` dimension coordinate?
## 3rd party indexes
- [ ] Add custom index entrypoint / plugin system, similarly to storage backend entrypoints
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/6293/reactions"", ""total_count"": 12, ""+1"": 6, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 6, ""rocket"": 0, ""eyes"": 0}",,,13221727,issue
1890893841,I_kwDOAMm_X85wtMAR,8171,Fancy reprs,4160723,open,0,,,10,2023-09-11T16:46:43Z,2023-09-15T21:07:52Z,,MEMBER,,,,"### What is your issue?
In Xarray we already have the plain-text and html reprs, which is great.
Recently, I've tried [anywidget](https://anywidget.dev/) and I think that it has potential to overcome some of the limitations of the current repr and possibly go well beyond it.
The main advantages of anywidget:
- it is broadly compatible with jupyter-like front-ends (Jupyterlab, notebook, vscode, colab, etc.), although I haven't tested it myself on all those front-ends yet.
- it is super easy to get started: almost no project setup (build, packaging) is required before experimenting with it, although it still requires writing Javascript / HTML / CSS, etc.
I don't think we should replace the current html repr (it is still useful to have a basic, pure HTML/CSS version), but having a new widget could improve some aspects like not including the whole CSS each time an object repr is displayed, removing some HTML/CSS hacks... and actually has much more potential since we would have the whole javascript ecosystem at our fingertips (quick plots, etc.). Also bi-directional communication with Python is possible.
I'm opening this issue to brainstorm about what would be nice to have in widget-based Xarray reprs:
- fancy hover effects (e.g., highlight all variables sharing common dimensions, coordinates sharing a common index, etc.)
- more icons next to each variable reprs (attributes, array repr, quick plot? quick map?)
- ... ?
cc @pydata/xarray ","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/8171/reactions"", ""total_count"": 5, ""+1"": 3, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 2, ""eyes"": 0}",,,13221727,issue
1889195671,I_kwDOAMm_X85wmtaX,8166,Dataset.from_dataframe: deprecate expanding the multi-index,4160723,open,0,,,3,2023-09-10T15:54:31Z,2023-09-11T06:20:50Z,,MEMBER,,,,"### What is your issue?
Let's continue here the discussion about changing the behavior of Dataset.from_dataframe (see https://github.com/pydata/xarray/pull/8140#issuecomment-1712485626).
> The current behaviour of Dataset.from_dataframe where it always unstacks feels wrong to me.
> To me, it seems sensible that Dataset.from_dataframe(df) automatically creates a Dataset with PandasMultiIndex if df has a MultiIndex. The user can then use that or quite easily unstack to a dense or sparse array.
If we no longer unstack the multi-index in `Dataset.from_dataframe`, are we OK that the ""Dataset -> DataFrame -> Dataset"" round-trip will not yield the expected results unless we unstack explicitly?
```python
ds = xr.Dataset(
    {""foo"": ((""x"", ""y""), [[1, 2], [3, 4]])},
    coords={""x"": [""a"", ""b""], ""y"": [1, 2]},
)
df = ds.to_dataframe()
ds2 = xr.Dataset.from_dataframe(df, dim=""z"")
ds2.identical(ds) # False
ds2.unstack(""z"").identical(ds) # True
```
cc @max-sixty @dcherian
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/8166/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,issue
1364388790,I_kwDOAMm_X85RUuu2,7002,Custom indexes and coordinate (re)ordering,4160723,open,0,,,2,2022-09-07T09:44:12Z,2023-08-23T14:35:32Z,,MEMBER,,,,"### What is your issue?
(From https://github.com/pydata/xarray/issues/5647#issuecomment-946546464).
The current alignment logic (as refactored in #5692) requires that two compatible indexes (i.e., of the same type) must relate to one or more coordinates with matching names but also in a matching order.
For some multi-coordinate indexes like `PandasMultiIndex` this makes sense. However, for other multi-coordinate indexes (e.g., staggered grid indexes) the order of the coordinates doesn't matter much.
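As a small illustration of the ordering constraint, the sketch below builds a matching key from an index type and its coordinate names, once order-sensitive (as described above) and once order-insensitive. The function names and the toy index class are hypothetical and only mimic the idea, not Xarray's actual alignment code.

```python
class ToyStaggeredGridIndex:
    '''Placeholder index type, used only as part of the matching key.'''


def ordered_key(index_type, coord_names):
    # two indexes match only if their coordinate names come in the same order
    return (index_type, tuple(coord_names))


def unordered_key(index_type, coord_names):
    # two indexes match as soon as they cover the same set of coordinate names
    return (index_type, frozenset(coord_names))


a = ('x_center', 'x_left')
b = ('x_left', 'x_center')

ordered_match = (
    ordered_key(ToyStaggeredGridIndex, a) == ordered_key(ToyStaggeredGridIndex, b)
)
unordered_match = (
    unordered_key(ToyStaggeredGridIndex, a) == unordered_key(ToyStaggeredGridIndex, b)
)
```

With the order-sensitive key the two indexes are treated as unrelated, although for a staggered grid they could arguably be aligned.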
Possible options:
1. Setting new Xarray indexes may reorder the coordinate variables, possibly via `Index.create_variables()`, to ensure consistent order
2. Xarray indexes must implement an `Index.matching_key` abstract property in order to support re-indexing and alignment.
3. Take care of coordinate order (and maybe other things) inside `Index.join` and `Index.equals`, e.g., for `PandasMultiIndex` maybe reorder the levels beforehand.
- pros: more flexible
- cons: not great to implicitly reorder levels if it's a costly operation?
4. Find matching indexes using a two-pass approach: (1) group all indexes by dimension name and (2) check compatibility between the indexes listed in each group.","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/7002/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,issue
1812008663,I_kwDOAMm_X85sAQ7X,8002,Improve discoverability of index build options,4160723,open,0,,,2,2023-07-19T13:54:09Z,2023-07-19T17:48:51Z,,MEMBER,,,,"### Is your feature request related to a problem?
Currently `Dataset.set_xindex(coord_names, index_cls=None, **options)` allows passing index build options (if any) via the `**options` arguments. Those options are not easily discoverable, though (no auto-completion, etc.).
### Describe the solution you'd like
What about something like this?
```python
ds.set_xindex(""x"", MyCustomIndex.with_options(foo=1, bar=True))
# or
ds.set_xindex(""x"", *MyCustomIndex.with_options(foo=1, bar=True))
```
This would require adding a `.with_options()` class method that can be overridden in Index subclasses (optional):
```python
# xarray.core.indexes
class Index:
    @classmethod
    def with_options(cls) -> tuple[type[Self], dict[str, Any]]:
        return cls, {}
```
```python
# third-party code
from xarray.indexes import Index
class MyCustomIndex(Index):
    @classmethod
    def with_options(cls, foo: int = 0, bar: bool = False) -> tuple[type[Self], dict[str, Any]]:
        """"""Set a new MyCustomIndex with options.

        Parameters
        ----------
        foo : int, optional
            The foo option (default: 0).
        bar : bool, optional
            The bar option (default: False).
        """"""
        return cls, {""foo"": foo, ""bar"": bar}
```
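On the receiving side, `set_xindex` would have to accept the `(cls, options)` tuple returned by `with_options()`. A minimal sketch of that unpacking step, using stand-in classes so that it runs without Xarray; `ToyIndex` and `resolve_index_spec` are hypothetical names, and raising when both forms are mixed is just one possible policy.

```python
class ToyIndex:
    '''Stand-in for xarray's Index base class (an assumption).'''

    @classmethod
    def with_options(cls):
        return cls, {}


class MyCustomIndex(ToyIndex):
    @classmethod
    def with_options(cls, foo=0, bar=False):
        return cls, {'foo': foo, 'bar': bar}


def resolve_index_spec(spec, **options):
    '''Accept either an index class or a (class, options) tuple.'''
    if isinstance(spec, tuple):
        index_cls, tuple_options = spec
        if options:
            raise TypeError(
                'pass options either via with_options() or as keywords, not both'
            )
        return index_cls, dict(tuple_options)
    # plain class: keep the current behavior of forwarding **options
    return spec, options


index_cls, build_options = resolve_index_spec(
    MyCustomIndex.with_options(foo=1, bar=True)
)
```

This corresponds to the first call form above; the `*`-unpacking form would instead require `set_xindex` to take the options dict as a positional argument.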
Thoughts?
### Describe alternatives you've considered
Build options are also likely defined in the Index constructor, e.g.,
```python
# third-party code
from xarray.indexes import Index
class MyCustomIndex(Index):
    def __init__(self, data, foo=0, bar=False):
        ...
```
However, the Index constructor is not public API (only used internally and indirectly in Xarray when setting a new index from existing coordinates).
Any other idea?
### Additional context
_No response_","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/8002/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,issue
1151751524,I_kwDOAMm_X85EplVk,6308,xr.doctor(): diagnostics on a Dataset / DataArray ?,4160723,open,0,,,4,2022-02-26T12:10:07Z,2022-11-07T15:28:35Z,,MEMBER,,,,"### Is your feature request related to a problem?
Recently I've been reading through various issue reports here and there (GH issues and discussions, forums, etc.) and I'm wondering if it wouldn't be useful to have some function in Xarray that inspects a Dataset or DataArray and reports a bunch of diagnostics, so that the community could better help troubleshoot performance or other issues faced by users.
It's not always obvious where to look (e.g., number of chunks of a dask array, number of tasks of a dask graph, etc.) to diagnose issues, sometimes even for experienced users.
### Describe the solution you'd like
A `xr.doctor(dataset_or_dataarray)` top-level function (or `Dataset.doctor()` / `DataArray.doctor()` methods) that would perform a battery of checks and return helpful diagnostics, e.g.,
- ""Data variable ""x"" wraps a dask array that contains a lot of tasks, which may affect performance""
- ""Data variable ""x"" wraps a dask array that contains many small chunks""
- ... possibly many other diagnostics?
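To give an idea of what one such check could look like, here is a rough sketch that works on a dask-style `chunks` tuple (one tuple of block sizes per dimension) rather than on a real Dataset. The function name `diagnose_chunks` and both thresholds are arbitrary assumptions, not recommended values.

```python
from math import prod


def diagnose_chunks(name, chunks, itemsize,
                    min_chunk_bytes=1_000_000, max_chunks=10_000):
    '''Return a list of human-readable diagnostics for one chunked variable.'''
    messages = []
    # total number of blocks is the product of the block counts per dimension
    n_chunks = prod(len(sizes) for sizes in chunks)
    # size in bytes of the largest block
    largest = prod(max(sizes) for sizes in chunks) * itemsize
    if n_chunks > max_chunks:
        messages.append(
            f'Data variable {name!r} is split into {n_chunks} chunks, '
            'which may affect performance'
        )
    if n_chunks > 1 and largest < min_chunk_bytes:
        messages.append(
            f'Data variable {name!r} contains only small chunks '
            f'(largest is {largest} bytes)'
        )
    return messages


report = diagnose_chunks('x', chunks=((10,) * 100, (10,) * 100), itemsize=8)
```

A real `xr.doctor()` would presumably collect such messages across all variables and also look at the dask graph, which this sketch does not attempt.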
### Describe alternatives you've considered
None
### Additional context
_No response_","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/6308/reactions"", ""total_count"": 1, ""+1"": 1, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,issue
1325016510,I_kwDOAMm_X85O-iW-,6860,Align with join='override' may update index coordinate metadata,4160723,open,0,,,0,2022-08-01T21:45:13Z,2022-08-01T21:49:41Z,,MEMBER,,,,"### What happened?
It seems that `align(*, join=""override"")` may have affected and still may affect the metadata of index coordinate data in an incorrect way. See the MCV example below.
cf. @keewis' original https://github.com/pydata/xarray/pull/6857#discussion_r934425142.
### What did you expect to happen?
Index coordinate metadata unaffected by alignment (i.e., metadata is passed through object -> aligned object for each object), like for align with other join methods.
### Minimal Complete Verifiable Example
```Python
import xarray as xr
ds1 = xr.Dataset(coords={""x"": (""x"", [1, 2, 3], {""foo"": 1})})
ds2 = xr.Dataset(coords={""x"": (""x"", [1, 2, 3], {""bar"": 2})})
aligned1, aligned2 = xr.align(ds1, ds2, join=""override"")
aligned1.x.attrs
# v2022.03.0 -> {'foo': 1}
# v2022.06.0 -> {'foo': 1, 'bar': 2}
# PR #6857 -> {'foo': 1}
# expected -> {'foo': 1}
aligned2.x.attrs
# v2022.03.0 -> {}
# v2022.06.0 -> {'foo': 1, 'bar': 2}
# PR #6857 -> {'foo': 1, 'bar': 2}
# expected -> {'bar': 2}
aligned11, aligned22 = xr.align(ds1, ds2, join=""inner"")
aligned11.x.attrs
# {'foo': 1}
aligned22.x.attrs
# {'bar': 2}
```
### MVCE confirmation
- [X] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
- [X] Complete example — the example is self-contained, including all data and the text of any traceback.
- [X] Verifiable example — the example copy & pastes into an IPython prompt or [Binder notebook](https://mybinder.org/v2/gh/pydata/xarray/main?urlpath=lab/tree/doc/examples/blank_template.ipynb), returning the result.
- [X] New issue — a search of GitHub Issues suggests this is not a duplicate.
### Relevant log output
_No response_
### Anything else we need to know?
_No response_
### Environment
INSTALLED VERSIONS
------------------
commit: None
python: 3.9.6 | packaged by conda-forge | (default, Jul 11 2021, 03:36:15)
[Clang 11.1.0 ]
python-bits: 64
OS: Darwin
OS-release: 20.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: None
LOCALE: (None, 'UTF-8')
libhdf5: 1.12.1
libnetcdf: 4.8.1
xarray: 0.21.2.dev137+g30023a484
pandas: 1.4.0
numpy: 1.22.2
scipy: 1.7.1
netCDF4: 1.5.8
pydap: installed
h5netcdf: 0.11.0
h5py: 3.4.0
Nio: None
zarr: 2.6.1
cftime: 1.5.2
nc_time_axis: 1.2.0
PseudoNetCDF: installed
rasterio: 1.2.10
cfgrib: 0.9.8.5
iris: 3.0.4
bottleneck: 1.3.2
dask: 2022.01.1
distributed: 2022.01.1
matplotlib: 3.4.3
cartopy: 0.20.1
seaborn: 0.11.1
numbagg: 0.2.1
fsspec: 0.8.5
cupy: None
pint: 0.16.1
sparse: 0.13.0
flox: None
numpy_groupies: None
setuptools: 57.4.0
pip: 20.2.4
conda: None
pytest: 6.2.5
IPython: 7.27.0
sphinx: 3.3.1
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/6860/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,issue
1005623261,I_kwDOAMm_X8478Jfd,5812,Check explicit indexes when comparing two xarray objects,4160723,open,0,,,2,2021-09-23T16:19:32Z,2021-09-24T15:59:02Z,,MEMBER,,,,"
**Is your feature request related to a problem? Please describe.**
With the explicit index refactor, two Dataset or DataArray objects `a` and `b` may have the same variables / coordinates and attributes but different indexes.
**Describe the solution you'd like**
I'd suggest that `a.identical(b)` by default also checks for equality between `a.xindexes` and `b.xindexes`.
One drawback is when we want to check either the attributes or the indexes but not both. Should we add options like suggested in #5733 then?
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/5812/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,issue
1006335177,I_kwDOAMm_X847-3TJ,5814,Confusing assertion message when comparing datasets with differing coordinates,4160723,open,0,,,1,2021-09-24T10:50:11Z,2021-09-24T15:17:00Z,,MEMBER,,,,"
**What happened**:
When two datasets `a` and `b` have only differing coordinates, `xr.testing.assert_*` may output a confusing message that also reports differing data variables (although strictly equal/identical) sharing common dimensions with those differing coordinates. I guess it is because when comparing the data variables we compare `DataArray` objects (thus including the coordinates).
**What you expected to happen**:
An output assertion error message that shows only the differing coordinates.
**Minimal Complete Verifiable Example**:
```python
>>> import xarray as xr
>>> a = xr.Dataset(data_vars={""var"": (""x"", [10.0, 11.0])}, coords={""x"": [0, 1]})
>>> b = xr.Dataset(data_vars={""var"": (""x"", [10.0, 11.0])}, coords={""x"": [2, 3]})
>>> xr.testing.assert_equal(a, b)
```
```
AssertionError: Left and right Dataset objects are not equal
Differing coordinates:
L * x (x) int64 0 1
R * x (x) int64 2 3
Differing data variables:
L var (x) float64 10.0 11.0
R var (x) float64 10.0 11.0
```
I would rather expect:
```python
>>> xr.testing.assert_equal(a, b)
```
```
AssertionError: Left and right Dataset objects are not equal
Differing coordinates:
L * x (x) int64 0 1
R * x (x) int64 2 3
```
**Anything else we need to know?**:
**Environment**:
Output of xr.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 3.9.6 | packaged by conda-forge | (default, Jul 11 2021, 03:36:15)
[Clang 11.1.0 ]
python-bits: 64
OS: Darwin
OS-release: 20.3.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: None
LOCALE: (None, 'UTF-8')
libhdf5: 1.10.6
libnetcdf: 4.7.4
xarray: 0.19.1.dev72+ga8d84c703.d20210901
pandas: 1.3.2
numpy: 1.21.2
scipy: 1.7.1
netCDF4: 1.5.6
pydap: installed
h5netcdf: 0.8.1
h5py: 3.3.0
Nio: None
zarr: 2.6.1
cftime: 1.5.0
nc_time_axis: 1.2.0
PseudoNetCDF: installed
rasterio: 1.2.1
cfgrib: 0.9.8.5
iris: 3.0.4
bottleneck: 1.3.2
dask: 2021.01.1
distributed: 2021.01.1
matplotlib: 3.4.3
cartopy: 0.18.0
seaborn: 0.11.1
numbagg: None
fsspec: 0.8.5
cupy: None
pint: 0.16.1
sparse: 0.11.2
setuptools: 57.4.0
pip: 20.2.4
conda: None
pytest: 6.2.5
IPython: 7.27.0
sphinx: 3.3.1
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/5814/reactions"", ""total_count"": 2, ""+1"": 2, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,issue
977149831,MDU6SXNzdWU5NzcxNDk4MzE=,5732,Coordinates implicitly created when passing a DataArray as coord to Dataset constructor,4160723,open,0,,,3,2021-08-23T15:20:37Z,2021-08-24T14:18:09Z,,MEMBER,,,,"I stumbled on this while working on #5692. Is this intended behavior or unwanted side effect?
**What happened**:
Creating a new Dataset by passing a DataArray object as a coordinate also adds the DataArray's coordinates to the dataset:
```python
>>> foo = xr.DataArray([1.0, 2.0, 3.0], coords={""x"": [0, 1, 2]}, dims=""x"")
>>> ds = xr.Dataset(coords={""foo"": foo})
>>> ds
Dimensions: (x: 3)
Coordinates:
* x (x) int64 0 1 2
foo (x) float64 1.0 2.0 3.0
Data variables:
*empty*
```
**What you expected to happen**:
The behavior above seems a bit counter-intuitive to me. I would rather expect no additional coordinates auto-magically added to the dataset, i.e. only one `foo` coordinate in this example:
```python
>>> ds
Dimensions: (x: 3)
Coordinates:
foo (x) float64 1.0 2.0 3.0
Data variables:
*empty*
```
**Environment**:
Output of xr.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 3.8.6 | packaged by conda-forge | (default, Nov 27 2020, 19:17:44)
[Clang 11.0.0 ]
python-bits: 64
OS: Darwin
OS-release: 20.3.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: None
LOCALE: (None, 'UTF-8')
libhdf5: 1.10.6
libnetcdf: 4.7.4
xarray: 0.19.0
pandas: 1.1.5
numpy: 1.21.1
scipy: 1.7.0
netCDF4: 1.5.5.1
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: 1.3.1
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2021.07.2
distributed: 2021.07.2
matplotlib: 3.3.3
cartopy: 0.19.0.post1
seaborn: None
numbagg: None
pint: None
setuptools: 49.6.0.post20201009
pip: 20.3.1
conda: None
pytest: 6.1.2
IPython: 7.25.0
sphinx: 3.3.1
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/5732/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,issue
902009258,MDU6SXNzdWU5MDIwMDkyNTg=,5376,Multi-scale datasets and custom indexes,4160723,open,0,,,6,2021-05-26T08:38:00Z,2021-06-02T08:07:38Z,,MEMBER,,,,"I've been wondering if:
- multi-scale datasets are generic enough to implement some related functionality in Xarray, e.g., as new `Dataset` and/or `DataArray` method(s)
- we could leverage custom indexes for that (see the [design notes](https://github.com/pydata/xarray/blob/master/design_notes/flexible_indexes_notes.md))
I'm thinking of an API that would look like this:
```python
# lazily load a big n-d image (full resolution) as a xarray.Dataset
xyz_dataset = ...
# set a new index for the x/y/z coordinates
# (`reduction` and `pre_compute_scales` are optional and passed
# as arguments to `ImagePyramidIndex`)
xyz_dataset.set_index(
    ('x', 'y', 'z'),
    ImagePyramidIndex,
    reduction=np.mean,
    pre_compute_scales=(2, 2),
)
# get a slice (ImagePyramidIndex will be used to dynamically scale the data
# or load the right pre-computed dataset)
xyz_slice = xyz_dataset.sel_and_rescale(x=slice(...), y=slice(...), z=slice(...))
```
where `ImagePyramidIndex` is not a ""common"" index, i.e., it cannot be used directly with Xarray's `.sel()` or for data alignment. IMHO, using an index here might still make sense for such data extraction and resampling operations. We could extend the `xarray.Index` API to handle multi-scale datasets, so that `ImagePyramidIndex` could either do the scaling dynamically (maybe using a cache) or just lazily load pre-computed data, e.g., from an [NGFF](https://ngff.openmicroscopy.org/latest/) / OME-Zarr dataset... Both the implementation and functionality can be pretty flexible. Custom options may be passed through the Xarray API either when creating the index or when extracting a data slice.
A hierarchical structure of `xarray.Dataset` objects is already discussed in #4118 for multi-scale datasets, but I'm wondering if using indexes could be an alternative approach (it could also be complementary, i.e., `ImagePyramidIndex` could rely on such hierarchical structure under the hood).
I see some advantages in the index approach, although this is the perspective of a naive user who is not working with multi-scale datasets:
- it is flexible: the scaling may be done dynamically without having to store the results in a hierarchical collection with some predefined discrete levels
- we don't need to expose anything other than a simple `xarray.Dataset` + a ""black-box"" index in which we abstract away all the implementation details. The API example shown above seems more intuitive to me than having to deal directly with Dataset groups.
- Xarray will provide a plugin system for 3rd party indexes, allowing for more `ImagePyramidIndex` variants. Xarray already provides an extension mechanism (accessors) for methods like `sel_and_rescale` in the example above...
That said, I'd also see the benefits of exposing Dataset groups more transparently to users (in case those are loaded from a store that supports it).
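To make the accessor idea more concrete, here is a rough sketch of what a dynamic `sel_and_rescale` could look like without any custom index. The accessor name `pyramid`, the `factor` parameter and the use of `coarsen(...).mean()` as the reduction are all assumptions made for illustration; a real `ImagePyramidIndex` could do something quite different (caching, loading pre-computed levels, etc.).

```python
import numpy as np
import xarray as xr


@xr.register_dataset_accessor('pyramid')  # hypothetical accessor name
class PyramidAccessor:
    def __init__(self, ds):
        self._ds = ds

    def sel_and_rescale(self, factor=1, **indexers):
        # Label-based selection first, using the default indexes.
        subset = self._ds.sel(**indexers)
        if factor == 1:
            return subset
        # Downscale on the fly by block-averaging the selected dimensions.
        windows = {dim: factor for dim in indexers}
        return subset.coarsen(windows, boundary='trim').mean()


ds = xr.Dataset(
    {'image': (('y', 'x'), np.arange(64.0).reshape(8, 8))},
    coords={'x': np.arange(8), 'y': np.arange(8)},
)

# Select a 4x4 window (label slices are inclusive) and halve its resolution.
out = ds.pyramid.sel_and_rescale(factor=2, x=slice(0, 3), y=slice(0, 3))
print(out['image'].shape)
```

This only covers the dynamic-scaling case; it does not address alignment or lazily loading pre-computed scales.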
cc @thewtex @joshmoore @d-v-b","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/5376/reactions"", ""total_count"": 1, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 1}",,,13221727,issue
869721207,MDU6SXNzdWU4Njk3MjEyMDc=,5226,Attributes encoding compatibility between backends,4160723,open,0,,,1,2021-04-28T09:11:19Z,2021-04-28T15:42:42Z,,MEMBER,,,,"**What happened**:
Let's create a Zarr dataset with a ""less common"" dtype and fill value, open it with Xarray, and save the dataset as NetCDF:
```python
import xarray as xr
import zarr
g = zarr.group()
g.create('arr', shape=3, fill_value='z', dtype='Output of xr.show_versions()
INSTALLED VERSIONS
------------------
commit: None
libhdf5: None
libnetcdf: None
xarray: 0.17.0
pandas: 1.0.3
numpy: 1.18.1
scipy: 1.3.1
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: 2.8.1
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2.11.0
distributed: 2.14.0
matplotlib: 3.1.1
cartopy: None
seaborn: None
numbagg: None
pint: None
setuptools: 46.1.3.post20200325
pip: 19.2.3
conda: None
pytest: 5.4.1
IPython: 7.13.0
sphinx: None
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/5226/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,issue
733077617,MDU6SXNzdWU3MzMwNzc2MTc=,4555,Vectorized indexing (isel) of chunked data with 1D indices gives weird chunks,4160723,open,0,,,1,2020-10-30T10:55:33Z,2021-03-02T17:36:48Z,,MEMBER,,,,"
**What happened**:
Applying `.isel()` to a DataArray or Dataset with chunked data using 1-d indices (stored either in a `xarray.Variable` or in a `numpy.ndarray`) produces oddly sized chunks (i.e., many small chunks).
**What you expected to happen**:
More consistent chunk sizes.
**Minimal Complete Verifiable Example**:
Let's create a chunked DataArray:
```python
In [1]: import numpy as np
In [2]: import xarray as xr
In [3]: da = xr.DataArray(np.random.rand(100), dims='points').chunk(50)
In [4]: da
Out[4]:
<xarray.DataArray (points: 100)>
dask.array<xarray-<this-array>, shape=(100,), dtype=float64, chunksize=(50,), chunktype=numpy.ndarray>
Dimensions without coordinates: points
```
Selecting random indices results in a lot of small chunks:
```python
In [5]: indices = xr.Variable('nodes', np.random.choice(np.arange(100, dtype='int'), size=10))
In [6]: da_sel = da.isel(points=indices)
In [7]: da_sel.chunks
Out[7]: ((1, 1, 3, 1, 1, 3),)
```
What I would expect:
```python
In [8]: da.data.vindex[indices.data].chunks
Out[8]: ((10,),)
```
This works fine with 2+ dimensional indexers, e.g.,
```python
In [9]: indices_2d = xr.Variable(('x', 'y'), np.random.choice(np.arange(100), size=(10, 10)))
In [10]: da_sel_2d = da.isel(points=indices_2d)
In [11]: da_sel_2d.chunks
Out[11]: ((10,), (10,))
```
**Anything else we need to know?**:
I suspect the issue is here:
https://github.com/pydata/xarray/blob/063606b90946d869e90a6273e2e18ed24bffb052/xarray/core/variable.py#L616-L617
In the example above I think we still want vectorized indexing (i.e., call `dask.array.Array.vindex[]` instead of `dask.array.Array[]`).
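The two dask code paths mentioned above can be compared directly, outside of xarray. This is only a sketch with a fixed set of indices (chosen here for reproducibility, not taken from the example above); both paths return the same values, and the difference is only in how the result is chunked.

```python
import dask.array
import numpy as np

data = dask.array.from_array(np.arange(100.0), chunks=50)
idx = np.array([3, 97, 42, 8, 60, 15, 77, 21, 99, 0])

# dask.array.Array[]: regular fancy indexing along one axis
orthogonal = data[idx]

# dask.array.Array.vindex[]: point-wise (vectorized) indexing
pointwise = data.vindex[idx]

# Same values either way; only the chunk layout may differ.
print(orthogonal.chunks)
print(pointwise.chunks)
```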
**Environment**:
Output of xr.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 3.8.3 | packaged by conda-forge | (default, Jun 1 2020, 17:21:09)
[Clang 9.0.1 ]
python-bits: 64
OS: Darwin
OS-release: 18.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.UTF-8
libhdf5: None
libnetcdf: None
xarray: 0.16.1
pandas: 1.1.3
numpy: 1.19.1
scipy: 1.5.2
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2.19.0
distributed: 2.25.0
matplotlib: 3.3.1
cartopy: None
seaborn: None
numbagg: None
pint: None
setuptools: 47.3.1.post20200616
pip: 20.1.1
conda: None
pytest: 5.4.3
IPython: 7.16.1
sphinx: 3.2.1
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/4555/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,issue
187873247,MDU6SXNzdWUxODc4NzMyNDc=,1094,Supporting out-of-core computation/indexing for very large indexes,4160723,open,0,,,5,2016-11-08T00:56:56Z,2021-01-26T20:09:12Z,,MEMBER,,,,"(Follow-up of discussion here https://github.com/pydata/xarray/pull/1024#issuecomment-258524115).
xarray + dask.array successfully enable out-of-core computation for very large variables that don't fit in memory. One current limitation is that the indexes of a `Dataset` or `DataArray`, which rely on `pandas.Index`, are still fully loaded into memory (they will soon be loaded eagerly after #1024). In many cases this is not a problem, as the sizes of 1-dimensional indexes are usually much smaller than the sizes of n-dimensional variables or coordinates.
However, this may be problematic in some specific cases where we have to deal with very large indexes. As an example, big unstructured meshes often have coordinates (x, y, z) arranged as 1-d arrays whose length equals the number of nodes, which can be very large! (See, e.g., [ugrid conventions](http://ugrid-conventions.github.io/ugrid-conventions/)).
It would be very nice if xarray could also help with these use cases. Therefore, I'm wondering if (and how) out-of-core support can be extended to indexes and indexing.
I've briefly looked at the documentation on `dask.dataframe`, and a first naive approach I have in mind would be to allow partitioning an index into multiple, contiguous indexes. For label-based indexing, we might for example map `indexing.convert_label_indexer` to each partition and combine the returned indexers.
My knowledge of dask is very limited, though. So I've no doubt that this suggestion is very simplistic and not very efficient, or that there are better approaches. I'm also certainly missing other issues not directly related to indexing.
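To illustrate the partitioning idea, here is an in-memory sketch using plain pandas. The helper `partitioned_get_indexer` is hypothetical (it is not part of xarray, pandas or dask), and it assumes unique labels and exact matching; in a real out-of-core setting each partition would be loaded or computed lazily rather than held in a list.

```python
import numpy as np
import pandas as pd


def partitioned_get_indexer(partitions, labels):
    '''Look up labels in a list of contiguous index partitions.

    Returns integer positions into the concatenated index
    (-1 where a label is not found in any partition).
    '''
    labels = np.asarray(labels)
    result = np.full(labels.shape, -1, dtype=np.intp)
    offset = 0
    for part in partitions:
        # Positions local to this partition; -1 where the label is absent.
        local = part.get_indexer(labels)
        found = local >= 0
        # Shift local positions by the partition's start offset.
        result[found] = local[found] + offset
        offset += len(part)
    return result


full_index = pd.Index(np.arange(0, 100, 10))
partitions = [full_index[:4], full_index[4:7], full_index[7:]]

positions = partitioned_get_indexer(partitions, [70, 0, 40, 5])
print(positions)
```

Each partition lookup is independent of the others, so the loop body is the part that could in principle be mapped over partitions in parallel before combining the results.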
Any thoughts?
cc @shoyer @mrocklin
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/1094/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,issue