issues

30 rows where state = "open" and user = 4160723 sorted by updated_at descending

type 2

  • issue 18
  • pull 12

state 1

  • open · 30 ✖

repo 1

  • xarray 30
id node_id number title user state locked assignee milestone comments created_at updated_at ▲ closed_at author_association active_lock_reason draft pull_request body reactions performed_via_github_app state_reason repo type
1389295853 I_kwDOAMm_X85Szvjt 7099 Pass arbitrary options to sel() benbovy 4160723 open 0     4 2022-09-28T12:44:52Z 2024-04-30T00:44:18Z   MEMBER      

Is your feature request related to a problem?

Currently .sel() accepts two options method and tolerance. These are relevant for default (pandas) indexes but not necessarily for other, custom indexes.

It would also be useful for custom indexes to expose their own selection options, e.g.,

  • index query optimization like the dualtree flag of sklearn.neighbors.KDTree.query
  • k-nearest neighbors selection with the creation of a new "k" dimension (+ coordinate / index) with user-defined name and size.

From #3223, it would be nice if we could also pass distinct option values per index.

What would be a good API for that?

Describe the solution you'd like

Some ideas:

A. Allow passing a tuple (labels, options_dict) as indexer value

    ds.sel(x=([0, 2], {"method": "nearest"}), y=3)

B. Expose an options kwarg that would accept a nested dict

    ds.sel(x=[0, 2], y=3, options={"x": {"method": "nearest"}})

Option A does not look very readable. Option B is slightly better, although the nested dictionary is not great.
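To make option B more concrete, here is a minimal sketch of how a nested options dict could be normalized into one options dict per indexed coordinate. It is plain Python and does not touch xarray. The helper name `resolve_sel_options` and the rule that per-coordinate entries override the global method / tolerance are assumptions made for illustration, not an existing or agreed API.

```python
from typing import Any, Dict, Mapping, Optional


def resolve_sel_options(
    indexers: Mapping[str, Any],
    method: Optional[str] = None,
    tolerance: Optional[float] = None,
    options: Optional[Mapping[str, Mapping[str, Any]]] = None,
) -> Dict[str, Dict[str, Any]]:
    """Return one options dict per indexed coordinate (hypothetical helper)."""
    options = options or {}
    unknown = set(options) - set(indexers)
    if unknown:
        raise KeyError(f"options given for coordinates not in indexers: {sorted(unknown)}")

    # Global defaults apply to every coordinate unless overridden per coordinate.
    defaults: Dict[str, Any] = {}
    if method is not None:
        defaults["method"] = method
    if tolerance is not None:
        defaults["tolerance"] = tolerance

    return {name: {**defaults, **options.get(name, {})} for name in indexers}


resolved = resolve_sel_options(
    {"x": [0, 2], "y": 3},
    options={"x": {"method": "nearest"}},
)
print(resolved)
```

Each index would then only receive the options addressed to its own coordinates, which is the part of #3223 that option B tries to generalize.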

Any other ideas? Some sort of context manager? Some Index specific API?

Describe alternatives you've considered

The API proposed in #3223 would look great if method and tolerance were the only accepted options, but less so for arbitrary options.

Additional context

No response

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/7099/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
2227413822 PR_kwDOAMm_X85rz7ZX 8911 Refactor swap dims benbovy 4160723 open 0     5 2024-04-05T08:45:49Z 2024-04-17T16:46:34Z   MEMBER   1 pydata/xarray/pulls/8911
  • [ ] Attempt at fixing #8646
  • [ ] Tests added
  • [ ] User visible changes (including notable bug fixes) are documented in whats-new.rst
  • [ ] New functions/methods are listed in api.rst

I've tried here re-implementing swap_dims using rename_dims, drop_indexes and set_xindex. This fixes the example in #8646 but unfortunately this fails at handling the pandas multi-index special case (i.e., a single non-dimension coordinate wrapping a pd.MultiIndex that is promoted to a dimension coordinate in swap-dims auto-magically results in a PandasMultiIndex with both dimension and level coordinates).

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/8911/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
2215059449 PR_kwDOAMm_X85rJr7c 8888 to_base_variable: coerce multiindex data to numpy array benbovy 4160723 open 0     3 2024-03-29T10:10:42Z 2024-03-29T15:54:19Z   MEMBER   0 pydata/xarray/pulls/8888
  • [x] Closes #8887, and probably supersedes #8809
  • [x] Tests added
  • [ ] User visible changes (including notable bug fixes) are documented in whats-new.rst
  • ~~New functions/methods are listed in api.rst~~

@slevang this should also make your test case added in #8809 work. I haven't added it here; instead I added a basic check that should be enough.

I don't really understand why the serialization backends (zarr?) do not seem to work with the PandasMultiIndexingAdapter.__array__() implementation, which should normally coerce the multi-index levels into numpy arrays as needed. Anyway, I guess that coercing it early like in this PR doesn't hurt and may avoid the confusion of a non-indexed, isolated coordinate variable that still wraps a pandas.MultiIndex.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/8888/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
1861543091 I_kwDOAMm_X85u9OSz 8097 Documentation rendering issues (dark mode) benbovy 4160723 open 0     2 2023-08-22T14:06:03Z 2024-02-13T02:31:10Z   MEMBER      

What is your issue?

There are a couple of rendering issues in Xarray's documentation landing page, especially with the dark mode.

  • we should display two versions of the logo in the light vs. dark mode (note: if the logo is in the svg format, it may be possible to add CSS classes so that it renders consistently with the active mode)
  • same for the images in the section cards (would be nice also to display all the images with the same width / height)
  • if possible, it would be nice to move the twitter logo just next to the github logo (upper right) with consistent styling.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/8097/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
1839199929 PR_kwDOAMm_X85XUl4W 8051 Allow setting (or skipping) new indexes in open_dataset benbovy 4160723 open 0     9 2023-08-07T10:53:46Z 2024-02-03T19:12:48Z   MEMBER   0 pydata/xarray/pulls/8051
  • [x] Closes #6633
  • [ ] Tests added
  • [ ] User visible changes (including notable bug fixes) are documented in whats-new.rst
  • [ ] New functions/methods are listed in api.rst

This PR introduces a new boolean parameter set_indexes=True to xr.open_dataset(), which may be used to skip the creation of default (pandas) indexes when opening a dataset.

Currently works with the Zarr backend:

```python
import numpy as np
import xarray as xr

# example dataset (real dataset may be much larger)
arr = np.random.random(size=1_000_000)
xr.Dataset({"x": arr}).to_zarr("dataset.zarr")

xr.open_dataset("dataset.zarr", set_indexes=False, engine="zarr")
# <xarray.Dataset>
# Dimensions:  (x: 1000000)
# Coordinates:
#     x        (x) float64 ...
# Data variables:
#     *empty*

xr.open_zarr("dataset.zarr", set_indexes=False)
# <xarray.Dataset>
# Dimensions:  (x: 1000000)
# Coordinates:
#     x        (x) float64 ...
# Data variables:
#     *empty*
```

I'll add it to the other Xarray backends as well, but I'd like to get your thoughts about the API first.

  1. Do we want to add yet another keyword parameter to xr.open_dataset()? There are already many...
  2. Do we want to add this parameter to the BackendEntrypoint.open_dataset() API?
    • I'm afraid we must do it if we want this parameter in xr.open_dataset()
    • this would also make it possible to skip the creation of custom indexes (if any) in custom IO backends
    • con: if we require set_indexes in the signature in addition to the drop_variables parameter, this is a breaking change for all existing 3rd-party backends. Or should we group set_indexes with the other xarray decoder kwargs? This would feel a bit odd to me as setting indexes is different from decoding data.
  3. Or should we leave this up to the backends?
    • pros: no breaking change, more flexible (3rd party backends may want to offer more control like choosing between custom indexes and default pandas indexes or skipping the creation of indexes by default)
    • cons: less discoverable, consistency is not enforced across 3rd party backends (although for such an advanced case this is probably OK), not available by default in every backend.

Currently 1 and 2 are implemented in this PR, although as I write this comment I think that I would prefer 3. I guess this depends on whether we prefer open_*** vs. xr.open_dataset(engine="***") and unless I missed something there is still no real consensus about that? (e.g., #7496).
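The breaking-change concern about adding the parameter to BackendEntrypoint.open_dataset() can be illustrated with a small, xarray-free sketch. It assumes a dispatcher that forwards `set_indexes` only to backends whose `open_dataset` signature declares it (or accepts `**kwargs`). The classes `LegacyBackend` and `IndexAwareBackend` and the function `open_with_backend` are hypothetical and are not part of this PR.

```python
import inspect


class LegacyBackend:
    """Stand-in for an existing 3rd-party backend without `set_indexes`."""

    def open_dataset(self, filename, *, drop_variables=None):
        return {"filename": filename, "indexes": "default"}


class IndexAwareBackend:
    """Stand-in for a backend that has adopted the new parameter."""

    def open_dataset(self, filename, *, drop_variables=None, set_indexes=True):
        return {"filename": filename, "indexes": "default" if set_indexes else "none"}


def accepts_set_indexes(backend) -> bool:
    """Check whether the backend's open_dataset can receive `set_indexes`."""
    params = inspect.signature(backend.open_dataset).parameters
    has_var_kwargs = any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
    )
    return "set_indexes" in params or has_var_kwargs


def open_with_backend(backend, filename, *, set_indexes=True):
    if accepts_set_indexes(backend):
        return backend.open_dataset(filename, set_indexes=set_indexes)
    if not set_indexes:
        # The user asked for something this backend cannot honor.
        raise TypeError(
            f"{type(backend).__name__} does not support set_indexes=False"
        )
    return backend.open_dataset(filename)
```

With such a check, older backends would keep working with the default value and fail loudly only when a user explicitly asks to skip index creation.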

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/8051/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
667864088 MDU6SXNzdWU2Njc4NjQwODg= 4285 Awkward array backend? benbovy 4160723 open 0     38 2020-07-29T13:53:45Z 2023-12-30T18:47:48Z   MEMBER      

Just curious if anyone here has thoughts on this.

For more context: Awkward is like numpy but for arrays of very arbitrary (dynamic) structure.

I don't know much yet about that library (I've just seen this SciPy 2020 presentation), but now I could imagine using xarray for dealing with labelled collections of geometrical / geospatial objects like polylines or polygons.

At this stage, any integration between xarray and awkward arrays would be something highly experimental, but I think this might be an interesting case for flexible arrays (and possibly flexible indexes) mentioned in the roadmap. There is some discussion here: https://github.com/scikit-hep/awkward-1.0/issues/27.

Does anyone see any other potential use case?

cc @pydata/xarray

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/4285/reactions",
    "total_count": 6,
    "+1": 6,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
1989356758 I_kwDOAMm_X852kyzW 8447 Improve discoverability of backend engine options benbovy 4160723 open 0     5 2023-11-12T11:14:56Z 2023-12-12T20:30:28Z   MEMBER      

Is your feature request related to a problem?

Backend engine options are not easily discoverable, and we need to know them or figure them out before passing them as kwargs to xr.open_dataset().

Describe the solution you'd like

The solution is similar to the one proposed in #8002 for setting a new index.

The API could look like this:

```python
import xarray as xr

ds = xr.open_dataset(
    file_or_obj,
    engine=xr.backends.engine("myengine").with_options(
        option1=True,
        option2=100,
    ),
)
```

where xr.backends.engine("myengine") returns the MyEngineBackendEntrypoint subclass.

We would need to extend the API for BackendEntrypoint with a .with_options() factory method:

```python
class BackendEntrypoint:
    _open_dataset_options: dict[str, Any]

    @classmethod
    def with_options(cls):
        """This backend does not implement `with_options`."""
        raise NotImplementedError()
```

Such that

```python
class MyEngineBackendEntryPoint(BackendEntrypoint):
    open_dataset_parameters = ("option1", "option2")

    @classmethod
    def with_options(
        cls,
        option1: bool = False,
        option2: int | None = None,
    ):
        """Get the backend with user-defined options.

        Parameters
        ----------
        option1 : bool, optional
            This is option1.
        option2 : int, optional
            This is option2.

        """
        obj = cls()

        # maybe validate the given input options
        if option2 is None:
            option2 = 1

        obj._options = {"option1": option1, "option2": option2}

        return obj

    def open_dataset(
        self,
        filename_or_obj: str | os.PathLike[Any] | BufferedIOBase | AbstractDataStore,
        *,
        drop_variables: str | Iterable[str] | None = None,
        **kwargs,    # no static checker error (liskov substitution principle)
    ):
        # kwargs passed directly to open_dataset take precedence to options
        # or alternatively raise an error?
        option1 = kwargs.get("option1", self._options.get("option1", False))

        ...
```
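A stripped-down, runnable version of the same idea, without any dependency on xarray, might look as follows. The class names `ToyBackendEntrypoint` and `ToyEngine` are hypothetical stand-ins, and the precedence rule (keyword arguments passed to `open_dataset` win over options set via `with_options`) is only one of the two behaviors considered in the code comment above.

```python
from typing import Any, Dict, Optional


class ToyBackendEntrypoint:
    """Hypothetical stand-in for `BackendEntrypoint`."""

    def __init__(self) -> None:
        self._options: Dict[str, Any] = {}

    @classmethod
    def with_options(cls, **options: Any) -> "ToyBackendEntrypoint":
        raise NotImplementedError(f"{cls.__name__} does not implement `with_options`")


class ToyEngine(ToyBackendEntrypoint):
    @classmethod
    def with_options(
        cls, option1: bool = False, option2: Optional[int] = None
    ) -> "ToyEngine":
        obj = cls()
        # Fill in the default for option2 when it is not given.
        obj._options = {
            "option1": option1,
            "option2": 1 if option2 is None else option2,
        }
        return obj

    def open_dataset(self, filename: str, **kwargs: Any) -> Dict[str, Any]:
        # Explicit keyword arguments take precedence over stored options,
        # which in turn take precedence over the built-in defaults.
        merged = {"option1": False, "option2": 1, **self._options, **kwargs}
        return {"filename": filename, **merged}


engine = ToyEngine.with_options(option1=True, option2=100)
print(engine.open_dataset("data.bin"))
print(engine.open_dataset("data.bin", option2=5))
```

Because `with_options` has an explicit signature, editors and documentation tools can list `option1` and `option2` without knowing anything about how the engine is registered.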

Pros:

  • Using .with_options(...) would seamlessly work with IDE auto-completion, static type checkers (I guess? I'm not sure how static checkers support entry-points), documentation, etc.
  • There is no breaking change (xr.open_dataset(obj, engine=...) accepts either a string or a BackendEntryPoint subtype but not yet a BackendEntryPoint object) and this feature could be adopted progressively by existing 3rd-party backends.

Cons:

  • The possible duplicated declaration of options among open_dataset_parameters, .with_options() and .open_dataset() does not look super nice but I don't really know how to avoid that.

Describe alternatives you've considered

A BackendEntryPoint.with_options() factory is not really needed and we could just go with BackendEntryPoint.__init__() instead. Perhaps with_options looks a bit clearer and leaves room for more flexibility in __init__ , though?

Additional context

cc @jsignell https://github.com/stac-utils/pystac/issues/846#issuecomment-1405758442

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/8447/reactions",
    "total_count": 4,
    "+1": 4,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
1148021907 I_kwDOAMm_X85EbWyT 6293 Explicit indexes: next steps benbovy 4160723 open 0     3 2022-02-23T12:19:38Z 2023-12-01T09:34:28Z   MEMBER      

#5692 is ~~not merged yet~~ now merged ~~but~~ and we can ~~already~~ start thinking about the next steps. I’m opening this issue to list and track the remaining tasks. @pydata/xarray, do not hesitate to add a comment below if you think of something that is missing here.

Continue the refactoring of the internals

Although in #5692 everything seems to work with the current pandas index wrappers for dimension coordinates, not all of Xarray's internals have been refactored yet to fully support (or at least be compatible with) custom indexes. Here is a list of Dataset / DataArray methods that still need to be checked / updated (this list may be incomplete):

  • [ ] as_numpy (#8001)
  • [ ] broadcast (#6430, #6481 )
  • [ ] drop_sel (#6605, #7699)
  • [ ] drop_isel
  • [ ] drop_dims
  • [ ] drop_duplicates (#8499)
  • [ ] transpose
  • [ ] interpolate_na
  • [ ] ffill
  • [ ] bfill
  • [ ] reduce
  • [ ] map
  • [ ] apply
  • [ ] quantile
  • [ ] rank
  • [ ] integrate
  • [ ] cumulative_integrate
  • [ ] filter_by_attrs
  • [ ] idxmin
  • [ ] idxmax
  • [ ] argmin
  • [ ] argmax
  • [ ] concat (partially refactored, may not fully work with multi-dimension indexes)
  • [ ] polyfit

I ended up following a common pattern in #5692 when adding explicit / flexible index support for various features (it is quite generic, though, the actual procedure may vary from one case to another and many steps may be skipped):

  • Check if it’s worth adding a new method to the Xarray Index base class. There may be several motivations:
    • Avoid handling Pandas index objects inside Dataset or DataArray methods (even if we don’t plan to fully support custom indexes for everything, it is preferable to put this logic behind the PandasIndex or PandasMultiIndex wrapper classes for clarity and also if eventually we want to make Xarray less dependent on Pandas)
    • We want a specific implementation rather than relying on the Variable’s corresponding method for speed-up or for other reasons, e.g.,
      • IndexVariable.concat exists to avoid unnecessary Pandas/Numpy conversions ; in #5692 PandasIndex.concat has the same logic and will fully replace the former if/once we get rid of IndexVariable
      • PandasIndex.roll reuses pandas.Index indexing and append capabilities
  • Index API closely follows DataArray, Dataset and Variable API (i.e., same method names) for consistency
  • Within the Dataset or DataArray method, first call the Index API (if it exists) to create new indexes
    • The Indexes class (i.e., the .xindexes property returns an instance of this class) provides convenient API for iterating through indexes (e.g., get a list of unique indexes, get all coordinates or dimensions for a given index, etc.)
    • If there’s no implementation for the called Index API, either raise an error or fallback to calling the Variable API (below) depending on the case
  • Create new coordinate variables for each of the new indexes using Index.create_variables
    • It is possible to pass a dict of current coordinate variables to Index.create_variables ; it is used to propagate variable metadata (dtype, attrs and encoding)
    • Not all indexes should create new coordinate variables, only those for which it is possible to reuse index data as coordinate variable data (like Pandas indexes)
  • Iterate through the variables and call the Variable API (if it exists)
    • Skip new coordinate variables created at the previous step (just reuse it)
  • Propagate the indexes that are not affected by the operation and clean up all indexes, i.e., ensure consistency between indexes and coordinate variables
    • There are a couple of convenient methods that were added in #5692 for that purpose: filter_indexes_from_coords and assert_no_index_corrupted
  • Replace indexes and variables, e.g., using _replace, _replace_with_new_dims or _overwrite_indexes methods
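The pattern above can be sketched with toy classes that only mimic the control flow: call the index method first, and fall back to the variable-level operation when the index does not implement it. Everything here (`ToyIndex`, `ToyListIndex`, `roll_variable`, `roll_coordinate`) is hypothetical and far simpler than Xarray's actual Index and Variable classes.

```python
from typing import List, Optional, Tuple


class ToyIndex:
    """Base class: optional operations raise NotImplementedError by default."""

    def roll(self, shift: int) -> "ToyIndex":
        raise NotImplementedError(f"{type(self).__name__} cannot be rolled")


class ToyListIndex(ToyIndex):
    """Index whose data can be reused directly as coordinate data."""

    def __init__(self, labels: list) -> None:
        self.labels = list(labels)

    def roll(self, shift: int) -> "ToyListIndex":
        return ToyListIndex(roll_variable(self.labels, shift))

    def create_variable(self) -> list:
        return list(self.labels)


def roll_variable(values: list, shift: int) -> list:
    """Variable-level fallback: cyclically shift the values to the right."""
    n = len(values)
    k = shift % n if n else 0
    return values[n - k:] + values[:n - k]


def roll_coordinate(
    values: List, index: Optional[ToyIndex], shift: int
) -> Tuple[List, Optional[ToyIndex]]:
    """Return (new_values, new_index); the index is dropped if it cannot roll."""
    if index is not None:
        try:
            new_index = index.roll(shift)
        except NotImplementedError:
            pass  # fall through to the variable-level operation below
        else:
            # Reuse the new index data as the new coordinate data.
            return new_index.create_variable(), new_index
    return roll_variable(values, shift), None
```

In this toy version the fallback silently drops the index; depending on the operation, raising an error instead may be the safer choice, as noted in the list above.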

Relax all constraints related to “dimension (index) coordinates” in Xarray

  • [x] Allow multi-dimensional variables with the name matching one of its dimensions: #2233 #2405 (https://github.com/pydata/xarray/pull/2405#issuecomment-419969570)
  • #7989

Indexes repr

  • [x] Add an Indexes section to Dataset and DataArray reprs
  • #6795
  • #7185

  • [ ] Make the repr of Indexes (i.e., .xindexes property) consistent with the repr of Coordinates (.coords property)
  • [x] Add Index._repr_inline_ for tweaking the inline representation of each index shown in the reprs above
  • #7183

Public API for assigning and (re)setting indexes

There is no public API yet for creating and/or assigning existing indexes to Dataset and DataArray objects.

  • [ ] Enable and/or document the indexes parameter in Dataset and DataArray constructors
    • [ ] Deprecate the implicit creation of pandas multi-index wrappers (and their corresponding coordinates) from anything passed via the data, data_vars or coords arguments in favor of a more explicit way to pass it.
    • [ ] https://github.com/pydata/xarray/issues/6633 (pass empty dictionary)
    • #6392
    • #7214
    • #7368

  • [x] Add set_xindex and drop_indexes methods
    • #6849
    • #6971
    • Deprecate set_index and reset_index? See https://github.com/pydata/xarray/issues/4366#issuecomment-920458966

We still need to figure out how best we can (1) assign existing indexes (possibly with their coordinates) and (2) pass index build options.

Other public API for index-based operations

To fully leverage the power and flexibility of custom indexes, we might want to update some parts of Xarray’s public API in order to allow passing arbitrary options per index. For example:

  • [ ] sel: the current method and tolerance may not be relevant for all indexes, pass extra arguments to Scipy's cKDTree.query, etc. #7099
  • [ ] align: #2217

Also:

  • [ ] Make public the Indexes API as it provides convenient methods that might be useful for end-users
  • [ ] Import the Index base class into Xarray’s main namespace (i.e., xr.Index)? Also PandasIndex and PandasMultiIndex? The latter may be useful if we deprecate set_index(append=True) and/or if we deprecate “unpacking” pandas.MultiIndex objects to coordinates when given as coords in the Dataset / DataArray constructors.
  • [ ] Add references in docstrings (https://github.com/pydata/xarray/pull/5692#discussion_r820117354).

Documentation

  • [ ] User guide:
    • [x] Update the “Terminology” section: “Index” may include custom indexes, review “Dimension coordinate” / “Non-dimension coordinate” as “Indexed coordinate” / “Non-indexed coordinate”
    • [ ] Update the “Data structure” section such that it clearly mentions indexes as 1st class citizen of the Xarray data model
    • [ ] Maybe update other parts of the documentation that refer to the concept of “dimension coordinate”
  • [ ] API reference:
    • [ ] add Indexes API
    • [ ] add Index API: #6975
  • [ ] Xarray internals: add a subsection on how to add custom indexes, maybe with some basic examples: #6975
  • [ ] Update development roadmap section

Index types and helper classes built in Xarray

  • [ ] Since a lot of potential use-cases for custom indexes may consist in adding some extra logic on top of one or more pandas indexes along one or more dimensions (i.e., “meta-indexes”), it might be worth providing a helper Index abstract subclass that would basically dispatch the given arguments to the corresponding, encapsulated PandasIndex instances and then merge the results
  • #7182
  • [ ] Deprecate PandasMultiIndex dimension coordinate?

3rd party indexes

  • [ ] Add custom index entrypoint / plugin system, similarly to storage backend entrypoints
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/6293/reactions",
    "total_count": 12,
    "+1": 6,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 6,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
1879109770 PR_kwDOAMm_X85ZbILy 8140 Deprecate passing pd.MultiIndex implicitly benbovy 4160723 open 0     23 2023-09-03T14:01:18Z 2023-11-15T20:15:00Z   MEMBER   0 pydata/xarray/pulls/8140
  • Follow-up #8094
  • [x] Closes #6481
  • [x] User visible changes (including notable bug fixes) are documented in whats-new.rst

This PR should normally raise a warning each time indexed coordinates are created implicitly from a pd.MultiIndex object.

I updated the tests to create coordinates explicitly using Coordinates.from_pandas_multiindex().

I also refactored some parts where a pd.MultiIndex could still be passed and promoted internally, with the exception of:

  • swap_dims(): it should raise a warning! Right now the warning message is a bit confusing for this case, but instead of adding a special case we should probably deprecate the whole method? As it is suggested as a TODO comment... This method was to circumvent the limitations of dimension coordinates, which isn't needed anymore (rename_dims and/or set_xindex is equivalent and less confusing).
  • xr.DataArray(pandas_obj_with_multiindex, dims=...): I guess it should raise a warning too?
  • da.stack(z=...).groupby("z"): it shouldn't raise a warning, but this requires a (heavy?) refactoring of groupby. During building the "grouper" objects, grouper.group1d or grouper.unique_coord may still be built by extracting only the multi-index dimension coordinate. I'd greatly appreciate it if anyone familiar with the groupby implementation could help me with this! @dcherian ?
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/8140/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
1865494976 PR_kwDOAMm_X85Ytlq0 8111 Alignment: allow flexible index coordinate order benbovy 4160723 open 0     3 2023-08-24T16:18:49Z 2023-09-28T15:58:38Z   MEMBER   0 pydata/xarray/pulls/8111
  • [x] Closes #7002
  • [x] Tests added
  • [x] User visible changes (including notable bug fixes) are documented in whats-new.rst

This PR relaxes some of the rules used in alignment for finding the indexes to compare or join together. Those indexes must still be of the same type and must relate to the same set of coordinates (and dimensions), but the order of coordinates is now ignored.

It is up to the index to implement the equal / join logic if it needs to care about that order.
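As a rough illustration of the relaxed rule (not the actual implementation in this PR), indexes can be matched on a key made of the index type plus unordered sets of coordinate names and dimensions. The function `matching_key` and the `ToyIndex` class are hypothetical, and treating the dimensions as an unordered set is an assumption of this sketch.

```python
from typing import Hashable, Iterable, Tuple


class ToyIndex:
    """Placeholder for any index type; only its class matters here."""


def matching_key(
    index: object, coord_names: Iterable[Hashable], dims: Iterable[Hashable]
) -> Tuple[type, frozenset, frozenset]:
    """Build a key that ignores the order of coordinates and dimensions."""
    return (type(index), frozenset(coord_names), frozenset(dims))


key_a = matching_key(ToyIndex(), ["lat", "lon"], ["y", "x"])
key_b = matching_key(ToyIndex(), ["lon", "lat"], ["x", "y"])
key_c = matching_key(ToyIndex(), ["lat"], ["y"])

print(key_a == key_b)  # same type and same sets: candidates for comparison
print(key_a == key_c)  # different coordinate sets: never compared
```

Two indexes with equal keys are only candidates for comparison or joining; whether they are actually equal is still decided by the index itself.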

Regarding pandas.MultiIndex, it seems that the level names are ignored when comparing indexes:

```python
import pandas as pd

midx = pd.MultiIndex.from_product([["a", "b"], [0, 1]], names=("one", "two"))
midx2 = pd.MultiIndex.from_product([["a", "b"], [0, 1]], names=("two", "one"))

midx.equals(midx2)  # True
```

However, in Xarray the names of the multi-index levels (and their order) matter since each level has its own xarray coordinate. In this PR, PandasMultiIndex.equals() and PandasMultiIndex.join() thus check that the level names match.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/8111/reactions",
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
1869879398 PR_kwDOAMm_X85Y8P4c 8118 Add Coordinates `set_xindex()` and `drop_indexes()` methods benbovy 4160723 open 0     0 2023-08-28T14:28:24Z 2023-09-19T01:53:18Z   MEMBER   0 pydata/xarray/pulls/8118
  • Complements #8102
  • [ ] Tests added
  • [ ] User visible changes (including notable bug fixes) are documented in whats-new.rst
  • [ ] New functions/methods are listed in api.rst

I don't think that we need to copy most API from Dataset / DataArray to Coordinates, but I find it convenient to have some relevant methods there too. For example, building Coordinates from scratch (with custom indexes) before passing the whole coords + indexes bundle around:

```python
import dask.array as da
import numpy as np
import xarray as xr

# DaskIndex is a custom index class that is not defined in this snippet
coords = (
    xr.Coordinates(
        coords={"x": da.arange(100_000_000), "y": np.arange(100)},
        indexes={},
    )
    .set_xindex("x", DaskIndex)
    .set_xindex("y", xr.indexes.PandasIndex)
)

ds = xr.Dataset(coords=coords)

# <xarray.Dataset>
# Dimensions:  (x: 100000000, y: 100)
# Coordinates:
#   * x        (x) int64 dask.array<chunksize=(16777216,), meta=np.ndarray>
#   * y        (y) int64 0 1 2 3 4 5 6 7 8 9 10 ... 90 91 92 93 94 95 96 97 98 99
# Data variables:
#     *empty*
# Indexes:
#     x        DaskIndex
```

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/8118/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
1890893841 I_kwDOAMm_X85wtMAR 8171 Fancy reprs benbovy 4160723 open 0     10 2023-09-11T16:46:43Z 2023-09-15T21:07:52Z   MEMBER      

What is your issue?

In Xarray we already have the plain-text and html reprs, which is great.

Recently, I've tried anywidget and I think that it has potential to overcome some of the limitations of the current repr and possibly go well beyond it.

The main advantages of anywidget:

  • it is broadly compatible with jupyter-like front-ends (Jupyterlab, notebook, vscode, colab, etc.), although I haven't tested it myself on all those front-ends yet.
  • it is super easy to get started: almost no project setup (build, packaging) is required before experimenting with it, although it still requires writing Javascript / HTML / CSS, etc.

I don't think we should replace the current html repr (it is still useful to have a basic, pure HTML/CSS version), but having a new widget could improve some aspects like not including the whole CSS each time an object repr is displayed, removing some HTML/CSS hacks... and actually has much more potential since we would have the whole javascript ecosystem at our fingertips (quick plots, etc.). Also bi-directional communication with Python is possible.

I'm opening this issue to brainstorm about what would be nice to have in widget-based Xarray reprs:

  • fancy hover effects (e.g., highlight all variables sharing common dimensions, coordinates sharing a common index, etc.)
  • more icons next to each variable reprs (attributes, array repr, quick plot? quick map?)
  • ... ?

cc @pydata/xarray

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/8171/reactions",
    "total_count": 5,
    "+1": 3,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 2,
    "eyes": 0
}
    xarray 13221727 issue
1889195671 I_kwDOAMm_X85wmtaX 8166 Dataset.from_dataframe: deprecate expanding the multi-index benbovy 4160723 open 0     3 2023-09-10T15:54:31Z 2023-09-11T06:20:50Z   MEMBER      

What is your issue?

Let's continue here the discussion about changing the behavior of Dataset.from_dataframe (see https://github.com/pydata/xarray/pull/8140#issuecomment-1712485626).

The current behaviour of Dataset.from_dataframe where it always unstacks feels wrong to me. To me, it seems sensible that Dataset.from_dataframe(df) automatically creates a Dataset with PandasMultiIndex if df has a MultiIndex. The user can then use that or quite easily unstack to a dense or sparse array.

If we no longer unstack the multi-index in Dataset.from_dataframe, are we OK that the "Dataset -> DataFrame -> Dataset" round-trip will not yield expected results unless we unstack explicitly?

```python
ds = xr.Dataset(
    {"foo": (("x", "y"), [[1, 2], [3, 4]])},
    coords={"x": ["a", "b"], "y": [1, 2]},
)

df = ds.to_dataframe()
ds2 = xr.Dataset.from_dataframe(df, dim="z")

ds2.identical(ds)  # False

ds2.unstack("z").identical(ds)  # True
```

cc @max-sixty @dcherian

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/8166/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
1889751633 PR_kwDOAMm_X85Z-5v1 8170 Dataset.from_dataframe: optionally keep multi-index unexpanded benbovy 4160723 open 0     0 2023-09-11T06:20:17Z 2023-09-11T06:20:17Z   MEMBER   1 pydata/xarray/pulls/8170
  • [x] Closes #8166
  • [ ] Tests added
  • [ ] User visible changes (including notable bug fixes) are documented in whats-new.rst
  • [ ] New functions/methods are listed in api.rst

I added both the unstack and dim arguments but we can change that.

  • [ ] update DataArray.from_series()
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/8170/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
1880184915 PR_kwDOAMm_X85ZespA 8143 Deprecate the multi-index dimension coordinate benbovy 4160723 open 0     0 2023-09-04T12:32:36Z 2023-09-04T12:32:48Z   MEMBER   0 pydata/xarray/pulls/8143
  • [ ] Tests added
  • [ ] User visible changes (including notable bug fixes) are documented in whats-new.rst

This PR adds a future_no_mindex_dim_coord=False option that, if set to True, enables the future behavior of PandasMultiIndex (i.e., no added dimension coordinate with tuple values):

```python
import xarray as xr

ds = xr.Dataset(coords={"x": ["a", "b"], "y": [1, 2]})

ds.stack(z=["x", "y"])
# <xarray.Dataset>
# Dimensions:  (z: 4)
# Coordinates:
#   * z        (z) object MultiIndex
#   * x        (z) <U1 'a' 'a' 'b' 'b'
#   * y        (z) int64 1 2 1 2
# Data variables:
#     *empty*

with xr.set_options(future_no_mindex_dim_coord=True):
    ds.stack(z=["x", "y"])
# <xarray.Dataset>
# Dimensions:  (z: 4)
# Coordinates:
#   * x        (z) <U1 'a' 'a' 'b' 'b'
#   * y        (z) int64 1 2 1 2
# Dimensions without coordinates: z
# Data variables:
#     *empty*
```

There are a few other things that we'll need to adapt or deprecate:

  • Dropping the multi-index dimension coordinate de facto allows having several multi-indexes along the same dimension. Normally stack should already take this into account, but there may be other places where this is not yet supported or where we should raise an explicit error.
  • Deprecate Dataset.reorder_levels: the API is not compatible with the absence of a dimension coordinate and with several multi-indexes along the same dimension. I think it is OK to deprecate such an edge case, which alternatively could be handled by extracting the pandas index, updating it and then re-assigning it to the dataset with assign_coords(xr.Coordinates.from_pandas_multiindex(...))
  • The text-based repr: in the example above, Dimensions without coordinates: z doesn't make much sense
  • ... ?

I started updating the tests, although this will be much easier once #8140 is merged. This is something that we could also easily split into multiple PRs. It is probably OK if some features are (temporarily) breaking badly when setting future_no_mindex_dim_coord=True.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/8143/reactions",
    "total_count": 1,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 1,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
1874412700 PR_kwDOAMm_X85ZLe24 8124 More flexible index variables benbovy 4160723 open 0     0 2023-08-30T21:45:12Z 2023-08-31T16:02:20Z   MEMBER   1 pydata/xarray/pulls/8124
  • [ ] Closes #xxxx
  • [ ] Tests added
  • [ ] User visible changes (including notable bug fixes) are documented in whats-new.rst
  • [ ] New functions/methods are listed in api.rst

The goal of this PR is to provide a more general solution to indexed coordinate variables, i.e., to support arbitrary dimensions and/or duck arrays for those variables while at the same time preventing them from being updated in a way that would invalidate their index.

This would solve problems like the one mentioned here: https://github.com/pydata/xarray/issues/1650#issuecomment-1697237429

@shoyer I've tried to implement what you have suggested in https://github.com/pydata/xarray/pull/4979#discussion_r589798510. It would be nice indeed if eventually we could get rid of IndexVariable. It won't be easy to deprecate it until we finish the index refactor (i.e., all methods listed in #6293), though. Also, I didn't find an easy way to refactor that class as it has been designed too closely around a 1-d variable backed by a pandas.Index.

So the approach implemented in this PR is to keep using IndexVariable for PandasIndex until we can deprecate / remove it later, and for the other cases use Variable with data wrapped in a custom IndexedCoordinateArray object.

The latter solution (wrapper) doesn't always work nicely, though. For example, several methods of Variable expect that self._data directly returns a duck array (e.g., a dask array or a chunked duck array). A wrapped duck array will result in unexpected behavior there. We could probably add some checks / indirection or extend the wrapper API... But I wonder if there wouldn't be a more elegant approach?

More generally, which operations should we allow / forbid / skip for an indexed coordinate variable?

  • Set array items in-place? Do not allow.
  • Replace data? Do not allow.
  • (Re)Chunk?
  • Load lazy data?
  • ... ?

(Note: we could add Index.chunk() and Index.load() methods in order to allow an Xarray index to implement custom logic for the latter two cases, e.g., converting a DaskIndex to a PandasIndex during load; see #8128.)
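To make the "do not allow" cases above concrete, here is a minimal sketch of a read-only wrapper. The class name `IndexedCoordinateArraySketch` and its error message are hypothetical, and the wrapper is a simplified illustration rather than the `IndexedCoordinateArray` implemented in this PR: reads are forwarded to the wrapped data, while in-place item assignment is refused.

```python
class IndexedCoordinateArraySketch:
    """Read-only wrapper around the data backing an indexed coordinate.

    Hypothetical illustration only; not the class implemented in the PR.
    """

    def __init__(self, data):
        self._data = data

    def __getitem__(self, key):
        # Reading is always allowed and is forwarded to the wrapped data.
        return self._data[key]

    def __len__(self):
        return len(self._data)

    def __setitem__(self, key, value):
        # An in-place update could leave the index out of sync with the data.
        raise TypeError(
            "cannot modify an indexed coordinate in place; replace the index instead"
        )


coord = IndexedCoordinateArraySketch([10, 20, 30])
print(coord[1])
try:
    coord[1] = 99
except TypeError as err:
    print(err)
```

The sketch also shows the drawback discussed above: code that expects the raw duck array has to reach through the wrapper, which is where the extra checks or indirection would be needed.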

cc @andersy005 (some changes made here may conflict with what you are refactoring in #8075).

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/8124/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
1875631817 PR_kwDOAMm_X85ZPnjq 8128 Add Index.load() and Index.chunk() methods benbovy 4160723 open 0     0 2023-08-31T14:16:27Z 2023-08-31T15:49:06Z   MEMBER   1 pydata/xarray/pulls/8128
  • [ ] Closes #xxxx
  • [ ] Tests added
  • [ ] User visible changes (including notable bug fixes) are documented in whats-new.rst
  • [ ] New functions/methods are listed in api.rst

As mentioned in #8124, this gives custom Xarray indexes more control over what to do when the counterpart Dataset / DataArray load() and chunk() methods are called.

PandasIndex.load() and PandasIndex.chunk() always return self (no action required).

For a DaskIndex, we might want to return a PandasIndex (or another non-lazy index) from load() and rebuild a DaskIndex object from chunk() (rechunk).
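The behavior described above could look roughly like the following sketch. Both classes are hypothetical stand-ins written in plain Python (they are not xarray's `PandasIndex` or any existing `DaskIndex`), and the `chunk_size` argument is an assumption made for illustration.

```python
class PandasIndexSketch:
    """Stand-in for an in-memory index (hypothetical, not xarray's PandasIndex)."""

    def __init__(self, labels):
        self.labels = list(labels)

    def load(self):
        # Already in memory: no action required.
        return self

    def chunk(self, chunk_size):
        # An in-memory index ignores chunking requests.
        return self


class LazyIndexSketch:
    """Stand-in for a lazy, chunked index such as a DaskIndex (hypothetical)."""

    def __init__(self, chunks):
        self.chunks = [list(c) for c in chunks]

    def load(self):
        # Materialize all chunks into a single in-memory index.
        return PandasIndexSketch(label for chunk in self.chunks for label in chunk)

    def chunk(self, chunk_size):
        # Rebuild a lazy index with the requested chunk size (rechunk).
        flat = [label for chunk in self.chunks for label in chunk]
        return LazyIndexSketch(
            flat[i:i + chunk_size] for i in range(0, len(flat), chunk_size)
        )


lazy = LazyIndexSketch([[0, 1, 2], [3, 4, 5]])
loaded = lazy.load()
rechunked = lazy.chunk(2)
print(type(loaded).__name__, loaded.labels)
print(rechunked.chunks)
```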

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/8128/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
1412901282 PR_kwDOAMm_X85A_96j 7182 add MultiPandasIndex helper class benbovy 4160723 open 0     2 2022-10-18T09:42:58Z 2023-08-23T16:30:28Z   MEMBER   1 pydata/xarray/pulls/7182
  • [ ] Closes #xxxx
  • [ ] Tests added
  • [ ] User visible changes (including notable bug fixes) are documented in whats-new.rst
  • [ ] New functions/methods are listed in api.rst

This PR adds a xarray.indexes.MultiPandasIndex helper class for building custom, meta-indexes that encapsulate multiple PandasIndex instances. Unlike PandasMultiIndex, the meta-index classes inheriting from this helper class may encapsulate loosely coupled (pandas) indexes, with coordinates of arbitrary dimensions (each coordinate must be 1-dimensional but an Xarray index may be created from coordinates with differing dimensions).

Early prototype in this notebook

TODO / TO FIX:

  • How to allow custom __init__ options in subclasses to be passed to all the type(self)(new_indexes) calls inside the MultiPandasIndex "base" class? This could be done via **kwargs passed through... However, mypy will certainly complain (Liskov Substitution Principle).
  • Is MultiPandasIndex a good name for this helper class?
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/7182/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
1364388790 I_kwDOAMm_X85RUuu2 7002 Custom indexes and coordinate (re)ordering benbovy 4160723 open 0     2 2022-09-07T09:44:12Z 2023-08-23T14:35:32Z   MEMBER      

What is your issue?

(From https://github.com/pydata/xarray/issues/5647#issuecomment-946546464).

The current alignment logic (as refactored in #5692) requires that two compatible indexes (i.e., of the same type) must relate to one or more coordinates with matching names but also in a matching order.

For some multi-coordinate indexes like PandasMultiIndex this makes sense. However, for other multi-coordinate indexes (e.g., staggered grid indexes) the order of the coordinates doesn't matter much.

Possible options:

  1. Setting new Xarray indexes may reorder the coordinate variables, possibly via Index.create_variables(), to ensure consistent order
  2. Xarray indexes must implement an Index.matching_key abstract property in order to support re-indexing and alignment.
  3. Take care of coordinate order (and maybe other things) inside Index.join and Index.equals, e.g., for PandasMultiIndex maybe reorder the levels beforehand.
    • pros: more flexible
    • cons: not great to implicitly reorder levels if it's a costly operation?
  4. Find matching indexes using a two-pass approach: (1) group all indexes by dimension name and (2) check compatibility between the indexes listed in each group.
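As a rough illustration of option 4, the sketch below groups index descriptions by dimension name and then checks compatibility within each group. `IndexInfo`, `match_indexes` and the `order_matters` argument are hypothetical names introduced here, and each index is assumed to relate to a single dimension, which is a simplification.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexInfo:
    """Hypothetical summary of an index: type name, dimension, coordinate names."""
    kind: str
    dim: str
    coord_names: tuple


def match_indexes(left, right, order_matters=()):
    """Pair up compatible indexes from two objects.

    Pass 1 groups indexes by dimension name; pass 2 checks compatibility
    within each group. Coordinate order is compared only for the index
    kinds listed in ``order_matters``.
    """
    groups = defaultdict(lambda: ([], []))
    for idx in left:
        groups[idx.dim][0].append(idx)
    for idx in right:
        groups[idx.dim][1].append(idx)

    matches = []
    for lefts, rights in groups.values():
        for a in lefts:
            for b in rights:
                if a.kind != b.kind:
                    continue
                if a.kind in order_matters:
                    same = a.coord_names == b.coord_names
                else:
                    same = set(a.coord_names) == set(b.coord_names)
                if same:
                    matches.append((a, b))
    return matches


left = [IndexInfo("StaggeredGrid", "x", ("x_center", "x_edge"))]
right = [IndexInfo("StaggeredGrid", "x", ("x_edge", "x_center"))]
print(len(match_indexes(left, right)))
print(len(match_indexes(left, right, order_matters={"StaggeredGrid"})))
```

With the order ignored, the two staggered-grid indexes match; when the same kind is declared order-sensitive, they do not.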
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/7002/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
1812008663 I_kwDOAMm_X85sAQ7X 8002 Improve discoverability of index build options benbovy 4160723 open 0     2 2023-07-19T13:54:09Z 2023-07-19T17:48:51Z   MEMBER      

Is your feature request related to a problem?

Currently Dataset.set_xindex(coord_names, index_cls=None, **options) allows passing index build options (if any) via the **options arguments. Those options are not easily discoverable, though (no auto-completion, etc.).

Describe the solution you'd like

What about something like this?

```python
ds.set_xindex("x", MyCustomIndex.with_options(foo=1, bar=True))

# or

ds.set_xindex("x", *MyCustomIndex.with_options(foo=1, bar=True))
```

This would require adding a .with_options() class method that can be overridden in Index subclasses (optional):

```python
# xarray.core.indexes

class Index:
    @classmethod
    def with_options(cls) -> tuple[type[Self], dict[str, Any]]:
        return cls, {}
```

```python
# third-party code

from xarray.indexes import Index

class MyCustomIndex(Index):

    @classmethod
    def with_options(cls, foo: int = 0, bar: bool = False) -> tuple[type[Self], dict[str, Any]]:
        """Set a new MyCustomIndex with options.

        Parameters
        ----------
        foo : int, optional
            The foo option (default: 0).
        bar : bool, optional
            The bar option (default: False).
        """
        return cls, {"foo": foo, "bar": bar}
```
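To see how the unpacking variant might fit together end to end, here is a self-contained toy version. The `Index` class and the `set_xindex` function below are simplified stand-ins defined only for this sketch (they are not xarray's), and the way options reach the constructor is an assumption.

```python
from typing import Any, Optional


class Index:
    """Toy base class standing in for xarray's Index (not the real one)."""

    def __init__(self, coord_name: str, **options: Any):
        self.coord_name = coord_name
        self.options = options

    @classmethod
    def with_options(cls) -> tuple[type, dict[str, Any]]:
        return cls, {}


class MyCustomIndex(Index):
    @classmethod
    def with_options(cls, foo: int = 0, bar: bool = False) -> tuple[type, dict[str, Any]]:
        # Explicit keyword arguments make the options discoverable
        # (auto-completion, signature help, docstring).
        return cls, {"foo": foo, "bar": bar}


def set_xindex(coord_name: str, index_cls: type,
               options: Optional[dict[str, Any]] = None) -> Index:
    """Hypothetical stand-in for Dataset.set_xindex, reduced to index creation."""
    return index_cls(coord_name, **(options or {}))


index = set_xindex("x", *MyCustomIndex.with_options(foo=1, bar=True))
print(type(index).__name__, index.options)
```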

Thoughts?

Describe alternatives you've considered

Build options are also likely defined in the Index constructor, e.g.,

```python
# third-party code

from xarray.indexes import Index

class MyCustomIndex(Index):

    def __init__(self, data, foo=0, bar=False):
        ...
```

However, the Index constructor is not public API (only used internally and indirectly in Xarray when setting a new index from existing coordinates).

Any other idea?

Additional context

No response

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/8002/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
1151751524 I_kwDOAMm_X85EplVk 6308 xr.doctor(): diagnostics on a Dataset / DataArray ? benbovy 4160723 open 0     4 2022-02-26T12:10:07Z 2022-11-07T15:28:35Z   MEMBER      

Is your feature request related to a problem?

Recently I've been reading through various issue reports here and there (GH issues and discussions, forums, etc.) and I'm wondering if it wouldn't be useful to have some function in Xarray that inspects a Dataset or DataArray and reports a bunch of diagnostics, so that the community could better help troubleshoot performance or other issues faced by users.

It's not always obvious where to look (e.g., number of chunks of a dask array, number of tasks of a dask graph, etc.) to diagnose issues, sometimes even for experienced users.

Describe the solution you'd like

An xr.doctor(dataset_or_dataarray) top-level function (or Dataset.doctor() / DataArray.doctor() methods) that would perform a battery of checks and return helpful diagnostics, e.g.,

  • "Data variable "x" wraps a dask array that contains a lot of tasks, which may affect performance"
  • "Data variable "x" wraps a dask array that contains many small chunks"
  • ... possibly many other diagnostics?
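One such check could be sketched as below. The function name `doctor_chunks` and both thresholds are arbitrary assumptions; the `chunks` argument is a tuple of per-dimension chunk sizes, in the form exposed by the `.chunks` attribute of a dask array, so the sketch itself runs without dask.

```python
from math import prod


def doctor_chunks(name, chunks, itemsize, min_chunk_bytes=1_000_000, max_chunks=10_000):
    """Return diagnostic messages for one chunked variable.

    ``chunks`` holds one tuple of chunk sizes per dimension. The default
    thresholds are arbitrary and only serve the illustration.
    """
    messages = []
    n_chunks = prod(len(dim_chunks) for dim_chunks in chunks)
    if n_chunks == 0:
        return messages
    n_elements = prod(sum(dim_chunks) for dim_chunks in chunks)
    mean_chunk_bytes = n_elements * itemsize / n_chunks

    if n_chunks > max_chunks:
        messages.append(
            f'Data variable "{name}" is split into {n_chunks} chunks, '
            "which may affect performance"
        )
    if n_chunks > 1 and mean_chunk_bytes < min_chunk_bytes:
        messages.append(
            f'Data variable "{name}" contains many small chunks '
            f"(about {mean_chunk_bytes:.0f} bytes each)"
        )
    return messages


for message in doctor_chunks("x", chunks=((1,) * 1000, (10,) * 50), itemsize=8):
    print(message)
```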

Describe alternatives you've considered

None

Additional context

No response

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/6308/reactions",
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
1364798843 PR_kwDOAMm_X84-hLRI 7004 Rework PandasMultiIndex.sel internals benbovy 4160723 open 0     2 2022-09-07T14:57:29Z 2022-09-22T20:38:41Z   MEMBER   0 pydata/xarray/pulls/7004
  • [x] Closes #6838
  • [ ] Tests added
  • [ ] User visible changes (including notable bug fixes) are documented in whats-new.rst

This PR hopefully improves how labels provided for multi-index level coordinates are handled in .sel().

More specifically, slices are handled in a cleaner way and it is now allowed to provide array-like labels.

PandasMultiIndex.sel() relies on the underlying pandas.MultiIndex methods like this:

  • use get_loc when every level is provided with a scalar label (no slice, no array)
    • always drops the index and returns scalar coordinates for each multi-index level
  • use get_loc_level when only a subset of levels are provided with scalar labels only
    • may collapse one or more levels of the multi-index (dropped levels result in scalar coordinates)
    • if only one level remains: renames the dimension and the corresponding dimension coordinate
  • use get_locs for all other cases
    • always keeps the multi-index and its coordinates (even if only one item or one level is selected)

This yields a predictable behavior: as soon as one of the provided labels is a slice or array-like, the multi-index and all its level coordinates are kept in the result.
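The dispatch rules listed above can be summarized in a small sketch. The helper names `is_scalar_label` and `choose_lookup` are hypothetical and this is not the code from the PR; it only returns the name of the `pandas.MultiIndex` method that the rules would select. Treating tuples as array-like is a simplification made here.

```python
def is_scalar_label(label):
    """True for a single label; False for slices and array-like labels.

    Simplification: lists, tuples, sets and objects exposing ``__array__``
    are all treated as array-like.
    """
    if isinstance(label, (slice, list, tuple, set)):
        return False
    return not hasattr(label, "__array__")


def choose_lookup(level_names, labels):
    """Return the name of the pandas.MultiIndex method the rules would select."""
    all_scalars = all(is_scalar_label(value) for value in labels.values())
    if all_scalars and set(labels) == set(level_names):
        return "get_loc"
    if all_scalars:
        return "get_loc_level"
    return "get_locs"


levels = ("one", "two")
print(choose_lookup(levels, {"one": "a", "two": 0}))
print(choose_lookup(levels, {"one": "a"}))
print(choose_lookup(levels, {"one": "a", "two": slice(1, 1)}))
print(choose_lookup(levels, {"one": ["b", "c"], "two": [0, 2]}))
```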

Some cases illustrated below (I compare this PR with an older release due to the errors reported in #6838):

```python
import xarray as xr
import pandas as pd

midx = pd.MultiIndex.from_product([list("abc"), range(4)], names=("one", "two"))
ds = xr.Dataset(coords={"x": midx})
# <xarray.Dataset>
# Dimensions:  (x: 12)
# Coordinates:
#   * x        (x) object MultiIndex
#   * one      (x) object 'a' 'a' 'a' 'a' 'b' 'b' 'b' 'b' 'c' 'c' 'c' 'c'
#   * two      (x) int64 0 1 2 3 0 1 2 3 0 1 2 3
# Data variables:
#     *empty*
```

```python
ds.sel(one="a", two=0)

# this PR
#
# <xarray.Dataset>
# Dimensions:  ()
# Coordinates:
#     x        object ('a', 0)
#     one      <U1 'a'
#     two      int64 0
# Data variables:
#     *empty*

# v2022.3.0
#
# <xarray.Dataset>
# Dimensions:  ()
# Coordinates:
#     x        object ('a', 0)
# Data variables:
#     *empty*
```

```python
ds.sel(one="a")

# this PR
#
# <xarray.Dataset>
# Dimensions:  (two: 4)
# Coordinates:
#   * two      (two) int64 0 1 2 3
#     one      <U1 'a'
# Data variables:
#     *empty*

# v2022.3.0
#
# <xarray.Dataset>
# Dimensions:  (two: 4)
# Coordinates:
#   * two      (two) int64 0 1 2 3
# Data variables:
#     *empty*
```

```python
ds.sel(one=slice("a", "b"))

# this PR
#
# <xarray.Dataset>
# Dimensions:  (x: 8)
# Coordinates:
#   * x        (x) object MultiIndex
#   * one      (x) object 'a' 'a' 'a' 'a' 'b' 'b' 'b' 'b'
#   * two      (x) int64 0 1 2 3 0 1 2 3
# Data variables:
#     *empty*

# v2022.3.0
#
# <xarray.Dataset>
# Dimensions:  (two: 8)
# Coordinates:
#   * two      (two) int64 0 1 2 3 0 1 2 3
# Data variables:
#     *empty*
```

```python
ds.sel(one="a", two=slice(1, 1))

# this PR
#
# <xarray.Dataset>
# Dimensions:  (x: 1)
# Coordinates:
#   * x        (x) object MultiIndex
#   * one      (x) object 'a'
#   * two      (x) int64 1
# Data variables:
#     *empty*

# v2022.3.0
#
# <xarray.Dataset>
# Dimensions:  (x: 1)
# Coordinates:
#   * x        (x) MultiIndex
#   - one      (x) object 'a'
#   - two      (x) int64 1
# Data variables:
#     *empty*
```

```python
ds.sel(one=["b", "c"], two=[0, 2])

# this PR
#
# <xarray.Dataset>
# Dimensions:  (x: 4)
# Coordinates:
#   * x        (x) object MultiIndex
#   * one      (x) object 'b' 'b' 'c' 'c'
#   * two      (x) int64 0 2 0 2
# Data variables:
#     *empty*

# v2022.3.0
#
# ValueError: Vectorized selection is not available along coordinate 'one' (multi-index level)
```

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/7004/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
1325016510 I_kwDOAMm_X85O-iW- 6860 Align with join='override' may update index coordinate metadata benbovy 4160723 open 0     0 2022-08-01T21:45:13Z 2022-08-01T21:49:41Z   MEMBER      

What happened?

It seems that align(*, join="override") may have altered, and may still alter, the metadata of index coordinates incorrectly. See the MCVE below.

cf. @keewis' original comment: https://github.com/pydata/xarray/pull/6857#discussion_r934425142.

What did you expect to happen?

Index coordinate metadata unaffected by alignment (i.e., metadata is passed through object -> aligned object for each object), like for align with other join methods.

Minimal Complete Verifiable Example

```python
import xarray as xr

ds1 = xr.Dataset(coords={"x": ("x", [1, 2, 3], {"foo": 1})})
ds2 = xr.Dataset(coords={"x": ("x", [1, 2, 3], {"bar": 2})})

aligned1, aligned2 = xr.align(ds1, ds2, join="override")

aligned1.x.attrs
# v2022.03.0 -> {'foo': 1}
# v2022.06.0 -> {'foo': 1, 'bar': 2}
# PR #6857   -> {'foo': 1}
# expected   -> {'foo': 1}

aligned2.x.attrs
# v2022.03.0 -> {}
# v2022.06.0 -> {'foo': 1, 'bar': 2}
# PR #6857   -> {'foo': 1, 'bar': 2}
# expected   -> {'bar': 2}

aligned11, aligned22 = xr.align(ds1, ds2, join="inner")

aligned11.x.attrs
# {'foo': 1}

aligned22.x.attrs
# {'bar': 2}
```

MVCE confirmation

  • [X] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • [X] Complete example — the example is self-contained, including all data and the text of any traceback.
  • [X] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • [X] New issue — a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

No response

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS ------------------ commit: None python: 3.9.6 | packaged by conda-forge | (default, Jul 11 2021, 03:36:15) [Clang 11.1.0 ] python-bits: 64 OS: Darwin OS-release: 20.6.0 machine: x86_64 processor: i386 byteorder: little LC_ALL: None LANG: None LOCALE: (None, 'UTF-8') libhdf5: 1.12.1 libnetcdf: 4.8.1 xarray: 0.21.2.dev137+g30023a484 pandas: 1.4.0 numpy: 1.22.2 scipy: 1.7.1 netCDF4: 1.5.8 pydap: installed h5netcdf: 0.11.0 h5py: 3.4.0 Nio: None zarr: 2.6.1 cftime: 1.5.2 nc_time_axis: 1.2.0 PseudoNetCDF: installed rasterio: 1.2.10 cfgrib: 0.9.8.5 iris: 3.0.4 bottleneck: 1.3.2 dask: 2022.01.1 distributed: 2022.01.1 matplotlib: 3.4.3 cartopy: 0.20.1 seaborn: 0.11.1 numbagg: 0.2.1 fsspec: 0.8.5 cupy: None pint: 0.16.1 sparse: 0.13.0 flox: None numpy_groupies: None setuptools: 57.4.0 pip: 20.2.4 conda: None pytest: 6.2.5 IPython: 7.27.0 sphinx: 3.3.1
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/6860/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
1005623261 I_kwDOAMm_X8478Jfd 5812 Check explicit indexes when comparing two xarray objects benbovy 4160723 open 0     2 2021-09-23T16:19:32Z 2021-09-24T15:59:02Z   MEMBER      

Is your feature request related to a problem? Please describe.

With the explicit index refactor, two Dataset or DataArray objects a and b may have the same variables / coordinates and attributes but different indexes.

Describe the solution you'd like

I'd suggest that a.identical(b) by default also checks for equality between a.xindexes and b.xindexes.

One drawback is when we want to check either the attributes or the indexes but not both. Should we add options like suggested in #5733 then?
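A minimal sketch of such a check, written as a standalone helper, could look like this. The function name `identical_with_indexes` and its `check_indexes` option are hypothetical (they are not existing xarray API); the helper relies only on `identical()`, the `xindexes` mapping and `Index.equals()`, and it additionally requires matching index types, which is an assumption.

```python
def identical_with_indexes(a, b, check_indexes=True):
    """Like ``a.identical(b)``, optionally also comparing ``xindexes``.

    ``check_indexes`` is a hypothetical option, not an existing xarray argument.
    """
    if not a.identical(b):
        return False
    if not check_indexes:
        return True
    # Both objects must have indexes for the same coordinate names.
    if set(a.xindexes) != set(b.xindexes):
        return False
    # Each pair of indexes must have the same type and compare equal.
    return all(
        type(a.xindexes[name]) is type(b.xindexes[name])
        and a.xindexes[name].equals(b.xindexes[name])
        for name in a.xindexes
    )
```

Keeping the index comparison behind a flag is one way to address the drawback mentioned above, since callers could then check attributes and indexes independently.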

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/5812/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
1006335177 I_kwDOAMm_X847-3TJ 5814 Confusing assertion message when comparing datasets with differing coordinates benbovy 4160723 open 0     1 2021-09-24T10:50:11Z 2021-09-24T15:17:00Z   MEMBER      

What happened: When two datasets a and b have only differing coordinates, xr.testing.assert_* may output a confusing message that also reports differing data variables (although strictly equal/identical) sharing common dimensions with those differing coordinates. I guess it is because when comparing the data variables we compare DataArray objects (thus including the coordinates).

What you expected to happen: An output assertion error message that shows only the differing coordinates.

Minimal Complete Verifiable Example:

```python
>>> import xarray as xr
>>> a = xr.Dataset(data_vars={"var": ("x", [10.0, 11.0])}, coords={"x": [0, 1]})
>>> b = xr.Dataset(data_vars={"var": ("x", [10.0, 11.0])}, coords={"x": [2, 3]})
>>> xr.testing.assert_equal(a, b)
AssertionError: Left and right Dataset objects are not equal

Differing coordinates:
L * x        (x) int64 0 1
R * x        (x) int64 2 3
Differing data variables:
L   var      (x) float64 10.0 11.0
R   var      (x) float64 10.0 11.0
```

I would rather expect:

```python
>>> xr.testing.assert_equal(a, b)
AssertionError: Left and right Dataset objects are not equal

Differing coordinates:
L * x        (x) int64 0 1
R * x        (x) int64 2 3
```

Anything else we need to know?:

Environment:

Output of <tt>xr.show_versions()</tt> INSTALLED VERSIONS ------------------ commit: None python: 3.9.6 | packaged by conda-forge | (default, Jul 11 2021, 03:36:15) [Clang 11.1.0 ] python-bits: 64 OS: Darwin OS-release: 20.3.0 machine: x86_64 processor: i386 byteorder: little LC_ALL: None LANG: None LOCALE: (None, 'UTF-8') libhdf5: 1.10.6 libnetcdf: 4.7.4 xarray: 0.19.1.dev72+ga8d84c703.d20210901 pandas: 1.3.2 numpy: 1.21.2 scipy: 1.7.1 netCDF4: 1.5.6 pydap: installed h5netcdf: 0.8.1 h5py: 3.3.0 Nio: None zarr: 2.6.1 cftime: 1.5.0 nc_time_axis: 1.2.0 PseudoNetCDF: installed rasterio: 1.2.1 cfgrib: 0.9.8.5 iris: 3.0.4 bottleneck: 1.3.2 dask: 2021.01.1 distributed: 2021.01.1 matplotlib: 3.4.3 cartopy: 0.18.0 seaborn: 0.11.1 numbagg: None fsspec: 0.8.5 cupy: None pint: 0.16.1 sparse: 0.11.2 setuptools: 57.4.0 pip: 20.2.4 conda: None pytest: 6.2.5 IPython: 7.27.0 sphinx: 3.3.1
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/5814/reactions",
    "total_count": 2,
    "+1": 2,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
977149831 MDU6SXNzdWU5NzcxNDk4MzE= 5732 Coordinates implicitly created when passing a DataArray as coord to Dataset constructor benbovy 4160723 open 0     3 2021-08-23T15:20:37Z 2021-08-24T14:18:09Z   MEMBER      

I stumbled on this while working on #5692. Is this intended behavior or unwanted side effect?

What happened:

Creating a new Dataset by passing a DataArray object as a coordinate also adds the DataArray's coordinates to the dataset:

```python
>>> foo = xr.DataArray([1.0, 2.0, 3.0], coords={"x": [0, 1, 2]}, dims="x")
>>> ds = xr.Dataset(coords={"foo": foo})
>>> ds
<xarray.Dataset>
Dimensions:  (x: 3)
Coordinates:
  * x        (x) int64 0 1 2
    foo      (x) float64 1.0 2.0 3.0
Data variables:
    *empty*
```

What you expected to happen:

The behavior above seems a bit counter-intuitive to me. I would rather expect no additional coordinates auto-magically added to the dataset, i.e. only one foo coordinate in this example:

```python
>>> ds
<xarray.Dataset>
Dimensions:  (x: 3)
Coordinates:
    foo      (x) float64 1.0 2.0 3.0
Data variables:
    *empty*
```

Environment:

Output of <tt>xr.show_versions()</tt> INSTALLED VERSIONS ------------------ commit: None python: 3.8.6 | packaged by conda-forge | (default, Nov 27 2020, 19:17:44) [Clang 11.0.0 ] python-bits: 64 OS: Darwin OS-release: 20.3.0 machine: x86_64 processor: i386 byteorder: little LC_ALL: None LANG: None LOCALE: (None, 'UTF-8') libhdf5: 1.10.6 libnetcdf: 4.7.4 xarray: 0.19.0 pandas: 1.1.5 numpy: 1.21.1 scipy: 1.7.0 netCDF4: 1.5.5.1 pydap: None h5netcdf: None h5py: None Nio: None zarr: None cftime: 1.3.1 nc_time_axis: None PseudoNetCDF: None rasterio: None cfgrib: None iris: None bottleneck: None dask: 2021.07.2 distributed: 2021.07.2 matplotlib: 3.3.3 cartopy: 0.19.0.post1 seaborn: None numbagg: None pint: None setuptools: 49.6.0.post20201009 pip: 20.3.1 conda: None pytest: 6.1.2 IPython: 7.25.0 sphinx: 3.3.1
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/5732/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
902009258 MDU6SXNzdWU5MDIwMDkyNTg= 5376 Multi-scale datasets and custom indexes benbovy 4160723 open 0     6 2021-05-26T08:38:00Z 2021-06-02T08:07:38Z   MEMBER      

I've been wondering if:

  • multi-scale datasets are generic enough to implement some related functionality in Xarray, e.g., as new Dataset and/or DataArray method(s)
  • we could leverage custom indexes for that (see the design notes)

I'm thinking of an API that would look like this:

```python
# lazily load a big n-d image (full resolution) as a xarray.Dataset
xyz_dataset = ...

# set a new index for the x/y/z coordinates
# (reduction and pre_compute_scales are optional and passed
# as arguments to ImagePyramidIndex)
xyz_dataset.set_index(
    ('x', 'y', 'z'),
    ImagePyramidIndex,
    reduction=np.mean,
    pre_compute_scales=(2, 2),
)

# get a slice (ImagePyramidIndex will be used to dynamically scale the data
# or load the right pre-computed dataset)
xyz_slice = xyz_dataset.sel_and_rescale(x=slice(...), y=slice(...), z=slice(...))
```

where ImagePyramidIndex is not a "common" index, i.e., it cannot be used directly with Xarray's .sel() nor for data alignment. Using an index here might still make sense for such data extraction and resampling operation IMHO. We could extend the xarray.Index API to handle multi-scale datasets, so that ImagePyramidIndex could either do the scaling dynamically (maybe using a cache) or just lazily load pre-computed data, e.g., from a NGFF / OME-Zarr dataset... Both the implementation and functionality can be pretty flexible. Custom options may be passed through the Xarray API either when creating the index or when extracting a data slice.

A hierarchical structure of xarray.Dataset objects is already discussed in #4118 for multi-scale datasets, but I'm wondering if using indexes could be an alternative approach (it could also be complementary, i.e., ImagePyramidIndex could rely on such hierarchical structure under the hood).

I'd see some advantages of the index approach, although this is the perspective from a naive user who is not working with multi-scale datasets:

  • it is flexible: the scaling may be done dynamically without having to store the results in a hierarchical collection with some predefined discrete levels
  • we don't need to expose anything other than a simple xarray.Dataset + a "black-box" index in which we abstract away all the implementation details. The API example shown above seems more intuitive to me than having to deal directly with Dataset groups.
  • Xarray will provide a plugin system for 3rd party indexes, allowing for more ImagePyramidIndex variants. Xarray already provides an extension mechanism (accessors) for methods like sel_and_rescale in the example above...

That said, I'd also see the benefits of exposing Dataset groups more transparently to users (in case those are loaded from a store that supports it).

cc @thewtex @joshmoore @d-v-b

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/5376/reactions",
    "total_count": 1,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 1
}
    xarray 13221727 issue
869721207 MDU6SXNzdWU4Njk3MjEyMDc= 5226 Attributes encoding compatibility between backends benbovy 4160723 open 0     1 2021-04-28T09:11:19Z 2021-04-28T15:42:42Z   MEMBER      

What happened:

Let's create a Zarr dataset with some "less common" dtype and fill value, open it with Xarray and save the dataset as NetCDF:

```python
import xarray as xr
import zarr

g = zarr.group()
g.create('arr', shape=3, fill_value='z', dtype='<U1')
g['arr'].attrs['_ARRAY_DIMENSIONS'] = ('dim_1')

# -- without masking fill values
ds = xr.open_zarr(g.store, mask_and_scale=False)

ds.arr.attrs   # returns {'_FillValue': 'z'}

# error: netCDF4 does not yet support setting a fill value for variable-length strings
ds.to_netcdf('test.nc')

# -- with masking fill values
ds2 = xr.open_zarr(g.store, mask_and_scale=True)

# returns a dict that includes item '_FillValue': 'z'
ds2.arr.encoding

# same error as above
ds2.to_netcdf('out2.nc')
```

What you expected to happen:

Seamless conversion (read/write) from one backend to another. Is there anything we could do to improve the case shown here above, and maybe other cases like the one described in #5223?

Environment:

Output of <tt>xr.show_versions()</tt> INSTALLED VERSIONS ------------------ commit: None libhdf5: None libnetcdf: None xarray: 0.17.0 pandas: 1.0.3 numpy: 1.18.1 scipy: 1.3.1 netCDF4: None pydap: None h5netcdf: None h5py: None Nio: None zarr: 2.8.1 cftime: None nc_time_axis: None PseudoNetCDF: None rasterio: None cfgrib: None iris: None bottleneck: None dask: 2.11.0 distributed: 2.14.0 matplotlib: 3.1.1 cartopy: None seaborn: None numbagg: None pint: None setuptools: 46.1.3.post20200325 pip: 19.2.3 conda: None pytest: 5.4.1 IPython: 7.13.0 sphinx: None
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/5226/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
733077617 MDU6SXNzdWU3MzMwNzc2MTc= 4555 Vectorized indexing (isel) of chunked data with 1D indices gives weird chunks benbovy 4160723 open 0     1 2020-10-30T10:55:33Z 2021-03-02T17:36:48Z   MEMBER      

What happened:

Applying .isel() on a DataArray or Dataset with chunked data using 1-d indices (either stored in an xarray.Variable or a numpy.ndarray) gives weird chunks (i.e., a lot of chunks with small sizes).

What you expected to happen:

More consistent chunk sizes.

Minimal Complete Verifiable Example:

Let's create a chunked DataArray

```python
In [1]: import numpy as np

In [2]: import xarray as xr

In [3]: da = xr.DataArray(np.random.rand(100), dims='points').chunk(50)

In [4]: da
Out[4]:
<xarray.DataArray (points: 100)>
dask.array<xarray-<this-array>, shape=(100,), dtype=float64, chunksize=(50,), chunktype=numpy.ndarray>
Dimensions without coordinates: points
```

Selecting random indices results in a lot of small chunks

```python
In [5]: indices = xr.Variable('nodes', np.random.choice(np.arange(100, dtype='int'), size=10))

In [6]: da_sel = da.isel(points=indices)

In [7]: da_sel.chunks
Out[7]: ((1, 1, 3, 1, 1, 3),)
```

What I would expect

python In [8]: da.data.vindex[indices.data].chunks Out[8]: ((10,),)

This works fine with 2+ dimensional indexers, e.g.,

```python
In [9]: indices_2d = xr.Variable(('x', 'y'), np.random.choice(np.arange(100), size=(10, 10)))

In [10]: da_sel_2d = da.isel(points=indices_2d)

In [11]: da_sel_2d.chunks
Out[11]: ((10,), (10,))
```

Anything else we need to know?:

I suspect the issue is here:

https://github.com/pydata/xarray/blob/063606b90946d869e90a6273e2e18ed24bffb052/xarray/core/variable.py#L616-L617

In the example above I think we still want vectorized indexing (i.e., call dask.array.Array.vindex[] instead of dask.array.Array[]).

Environment:

Output of <tt>xr.show_versions()</tt> INSTALLED VERSIONS ------------------ commit: None python: 3.8.3 | packaged by conda-forge | (default, Jun 1 2020, 17:21:09) [Clang 9.0.1 ] python-bits: 64 OS: Darwin OS-release: 18.7.0 machine: x86_64 processor: i386 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: None.UTF-8 libhdf5: None libnetcdf: None xarray: 0.16.1 pandas: 1.1.3 numpy: 1.19.1 scipy: 1.5.2 netCDF4: None pydap: None h5netcdf: None h5py: None Nio: None zarr: None cftime: None nc_time_axis: None PseudoNetCDF: None rasterio: None cfgrib: None iris: None bottleneck: None dask: 2.19.0 distributed: 2.25.0 matplotlib: 3.3.1 cartopy: None seaborn: None numbagg: None pint: None setuptools: 47.3.1.post20200616 pip: 20.1.1 conda: None pytest: 5.4.3 IPython: 7.16.1 sphinx: 3.2.1
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/4555/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
187873247 MDU6SXNzdWUxODc4NzMyNDc= 1094 Supporting out-of-core computation/indexing for very large indexes benbovy 4160723 open 0     5 2016-11-08T00:56:56Z 2021-01-26T20:09:12Z   MEMBER      

(Follow-up of discussion here https://github.com/pydata/xarray/pull/1024#issuecomment-258524115).

xarray + dask.array successfully enable out-of-core computation for very large variables that don't fit in memory. One current limitation is that the indexes of a Dataset or DataArray, which rely on pandas.Index, are still fully loaded into memory (they will soon be loaded eagerly, after #1024). In many cases this is not a problem, as the sizes of 1-dimensional indexes are usually much smaller than the sizes of n-dimensional variables or coordinates.

However, this may be problematic in some specific cases where we have to deal with very large indexes. As an example, big unstructured meshes often have coordinates (x, y, z) arranged as 1-d arrays of length that equals the number of nodes, which can be very large!! (See, e.g., ugrid conventions).

It would be very nice if xarray could also help for these use cases. Therefore I'm wondering if (and how) out-of-core support can be extended to indexes and indexing.

I've briefly looked at the documentation on dask.dataframe, and a first naive approach I have in mind would be to allow partitioning an index into multiple, contiguous indexes. For label-based indexing, we might for example map indexing.convert_label_indexer to each partition and combine the returned indexers.
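The naive approach described above could be sketched as follows. `PartitionedIndex` is a hypothetical class written in plain Python; it keeps all partitions in memory and assumes sorted, unique labels, whereas a real out-of-core version would load each partition lazily (for example through dask). It does not use `indexing.convert_label_indexer`; a per-partition binary search stands in for it.

```python
from bisect import bisect_left


class PartitionedIndex:
    """Sorted, unique labels split into contiguous partitions (hypothetical sketch)."""

    def __init__(self, labels, partition_size):
        labels = sorted(labels)
        starts = range(0, len(labels), partition_size)
        self.partitions = [labels[i:i + partition_size] for i in starts]
        # Global position of the first element of each partition.
        self.offsets = list(starts)

    @staticmethod
    def _lookup_partition(partition, offset, wanted):
        """Map the wanted labels found in one partition to global positions."""
        found = {}
        for label in wanted:
            pos = bisect_left(partition, label)
            if pos < len(partition) and partition[pos] == label:
                found[label] = offset + pos
        return found

    def get_indexer(self, wanted):
        """Look up labels in every partition and combine the partial results."""
        combined = {}
        for partition, offset in zip(self.partitions, self.offsets):
            combined.update(self._lookup_partition(partition, offset, wanted))
        missing = [label for label in wanted if label not in combined]
        if missing:
            raise KeyError(missing)
        return [combined[label] for label in wanted]


index = PartitionedIndex(range(0, 100, 10), partition_size=4)
print(index.get_indexer([30, 0, 90]))
```

Each partition lookup is independent of the others, so the per-partition step is the part that could be mapped in parallel before the results are combined.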

My knowledge of dask is very limited, though. So I've no doubt that this suggestion is very simplistic and not very efficient, or that there are better approaches. I'm also certainly missing other issues not directly related to indexing.

Any thoughts?

cc @shoyer @mrocklin

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1094/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue


CREATE TABLE [issues] (
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [number] INTEGER,
   [title] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [state] TEXT,
   [locked] INTEGER,
   [assignee] INTEGER REFERENCES [users]([id]),
   [milestone] INTEGER REFERENCES [milestones]([id]),
   [comments] INTEGER,
   [created_at] TEXT,
   [updated_at] TEXT,
   [closed_at] TEXT,
   [author_association] TEXT,
   [active_lock_reason] TEXT,
   [draft] INTEGER,
   [pull_request] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [state_reason] TEXT,
   [repo] INTEGER REFERENCES [repos]([id]),
   [type] TEXT
);
CREATE INDEX [idx_issues_repo]
    ON [issues] ([repo]);
CREATE INDEX [idx_issues_milestone]
    ON [issues] ([milestone]);
CREATE INDEX [idx_issues_assignee]
    ON [issues] ([assignee]);
CREATE INDEX [idx_issues_user]
    ON [issues] ([user]);