issues: 1175329407

Issue #6392: Pass indexes to the Dataset and DataArray constructors

  • id: 1175329407
  • node_id: I_kwDOAMm_X85GDhp_
  • user: 4160723
  • state: closed
  • locked: 0
  • comments: 6
  • created_at: 2022-03-21T12:41:51Z
  • updated_at: 2023-07-21T20:40:05Z
  • closed_at: 2023-07-21T20:40:04Z
  • author_association: MEMBER

Is your feature request related to a problem?

This is part of #6293 (explicit indexes next steps).

Describe the solution you'd like

A Mapping[Hashable, Index] would probably be the most obvious (optional) value type accepted for the indexes argument of the Dataset and DataArray constructors.

pros:

  • consistent with the xindexes property

cons:

  • need to be careful with what is passed as coords and indexes
  • multi-indexes: redundancy and order matters (e.g., pandas multi-index levels)

An example with a pandas multi-index

Currently a pandas multi-index may be passed directly as one (dimension) coordinate; it is then "unpacked" into one dimension (tuple values) coordinate and one or more level coordinates. I would suggest deprecating this behavior in favor of a more explicit (although more verbose) way to pass an existing pandas multi-index:

```python
import pandas as pd
import xarray as xr

pd_idx = pd.MultiIndex.from_product([["a", "b"], [1, 2]], names=("foo", "bar"))
idx = xr.PandasMultiIndex(pd_idx, "x")

indexes = {"x": idx, "foo": idx, "bar": idx}
coords = idx.create_variables()

ds = xr.Dataset(coords=coords, indexes=indexes)
```

The cases below should raise an error:

```python
ds = xr.Dataset(indexes=indexes)
# ValueError: missing coordinate(s) for index(es): 'x', 'foo', 'bar'

ds = xr.Dataset(
    coords=coords,
    indexes={"x": idx, "foo": idx},
)
# ValueError: missing index(es) for coordinate(s): 'bar'

ds = xr.Dataset(
    coords={"x": coords["x"], "foo": [0, 1, 2, 3], "bar": coords["bar"]},
    indexes=indexes,
)
# ValueError: conflict between coordinate(s) and index(es): 'foo'

ds = xr.Dataset(
    coords=coords,
    indexes={"x": idx, "foo": idx, "bar": xr.PandasIndex([0, 1, 2], "y")},
)
# ValueError: conflict between coordinate(s) and index(es): 'bar'
```
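The error cases above boil down to three checks: every index key needs a coordinate, every coordinate produced by a passed index needs its own index entry, and each coordinate must be the variable that its index actually produces. Below is a minimal sketch of those checks in plain Python, with no xarray dependency. `ToyIndex`, `_fmt` and `check_coords_indexes` are hypothetical names invented for illustration, and comparing variables by identity is an assumption, not how xarray does it.

```python
from typing import Any, Hashable, Mapping


class ToyIndex:
    """Hypothetical stand-in for an index object (not the xarray class)."""

    def __init__(self, variables: Mapping[Hashable, Any]):
        # Coordinate variables owned by this index, keyed by coordinate name.
        self._variables = dict(variables)

    def create_variables(self) -> dict:
        # Return the same variable objects each time, so identity can be compared.
        return dict(self._variables)


def _fmt(names) -> str:
    return ", ".join(repr(name) for name in names)


def check_coords_indexes(coords: Mapping, indexes: Mapping) -> None:
    """Raise ValueError if `coords` and `indexes` disagree (illustrative only)."""
    missing_coords = [name for name in indexes if name not in coords]
    if missing_coords:
        raise ValueError(f"missing coordinate(s) for index(es): {_fmt(missing_coords)}")

    missing_indexes, conflicts = [], []
    for name, idx in indexes.items():
        expected = idx.create_variables()
        # The coordinate must be the very variable that this index produces.
        if name not in expected or coords[name] is not expected[name]:
            conflicts.append(name)
            continue
        # Every coordinate of a multi-coordinate index needs its own entry.
        for other in expected:
            if other in coords and other not in indexes and other not in missing_indexes:
                missing_indexes.append(other)

    if missing_indexes:
        raise ValueError(f"missing index(es) for coordinate(s): {_fmt(missing_indexes)}")
    if conflicts:
        raise ValueError(f"conflict between coordinate(s) and index(es): {_fmt(conflicts)}")


idx = ToyIndex({
    "x": [("a", 1), ("a", 2), ("b", 1), ("b", 2)],
    "foo": ["a", "a", "b", "b"],
    "bar": [1, 2, 1, 2],
})
coords = idx.create_variables()
indexes = {"x": idx, "foo": idx, "bar": idx}
check_coords_indexes(coords, indexes)  # consistent, so nothing is raised
```

The sketch deliberately leaves coordinates without any index untouched, since whether that case should raise is an open question in this proposal.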

Should we raise an error or simply ignore the index in the case below?

```python
ds = xr.Dataset(coords=coords)
# ValueError: missing index(es) for coordinate(s): 'x', 'foo', 'bar'
# or
# create unindexed coordinates 'foo' and 'bar' and a 'x' coordinate with a single pandas index
```

Should we silently reorder the coordinates and/or indexes when the levels are not passed in the right order? It seems odd to require that mapping elements be passed in a given order.

```python
ds = xr.Dataset(coords=coords, indexes={"bar": idx, "x": idx, "foo": idx})
list(ds.xindexes.keys())
# ["x", "foo", "bar"]
```
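One way the silent reordering could work is sketched below in plain Python. It assumes each index can report its coordinate names in canonical order through a `coord_names` attribute, which is an assumption of this sketch; `ToyIndex` and `reorder_indexes` are hypothetical names.

```python
class ToyIndex:
    """Hypothetical stand-in for an index; `coord_names` is an assumed attribute."""

    def __init__(self, coord_names):
        self.coord_names = tuple(coord_names)


def reorder_indexes(indexes):
    """Return a new dict in which each index's coordinates follow the index's own order."""
    ordered = {}
    for name, idx in indexes.items():
        if name in ordered:
            continue
        # On first sight of an index, emit all of its coordinates in canonical order.
        for coord_name in idx.coord_names:
            if indexes.get(coord_name) is idx:
                ordered[coord_name] = idx
        # A key the index does not list is kept where it was first seen.
        ordered.setdefault(name, idx)
    return ordered


idx = ToyIndex(("x", "foo", "bar"))
reordered = reorder_indexes({"bar": idx, "x": idx, "foo": idx})
print(list(reordered))  # ['x', 'foo', 'bar']
```

Distinct indexes keep their relative order of first appearance; only the coordinates belonging to one index are regrouped.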

How to generalize to any (custom) index?

With the case of multi-index, it is pretty easy to check whether the coordinates and indexes are consistent because we ensure consistent pd_idx.names vs. coordinate names and because idx.get_variables() returns Xarray IndexVariable objects where variable data wraps the pandas multi-index.

However, this may not be easy for other indexes. Some Xarray custom indexes (like a KD-Tree index) likely won't return anything from .get_variables() as they don't support wrapping internal data as coordinate data. Right now there's nothing in the Xarray Index base class that could help checking consistency between indexes vs. coordinates for any kind of index.

How could we solve this?

  • A. add a .coords property to the Xarray Index base class, that returns a dict[Hashable, IndexVariable].

    • Ambiguous when an Index is created directly, i.e., like above xr.PandasMultiIndex(pd_idx, "x"). Should .coords return None, or the coordinates returned by the last .get_variables() call?
    • What if different sets of coordinates refer to a common index (e.g., after copying the coordinate variables, etc.)?
  • B. add a .coord_names property to the Xarray Index base class that returns tuple[Hashable, ...], and add a private attribute to IndexVariable that returns the index object (or return it via a very lightweight IndexAdapter base class used to wrap variable data).

    • Index.get_variables(variables) would by default return shallow copies of the input variables with a reference to the index object.
    • If that's necessary, we could also store the coordinate dimensions in coord_names, i.e., using tuple[tuple[Hashable, tuple[Hashable, ...]], ...].

I think I prefer the second option.
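To make option B more concrete, here is a rough sketch in plain Python. `ToyVariable`, `ToyIndex` and `is_consistent` are hypothetical stand-ins rather than xarray classes; the private `_index` attribute and the `coord_names` property follow the description above, and everything else is an assumption.

```python
from __future__ import annotations

import copy
from typing import Any, Hashable, Mapping


class ToyVariable:
    """Hypothetical coordinate variable carrying a private back-reference to its index."""

    def __init__(self, dims, data):
        self.dims = tuple(dims)
        self.data = data
        self._index = None  # set by ToyIndex.get_variables()


class ToyIndex:
    """Hypothetical index base class exposing `coord_names` (option B)."""

    def __init__(self, coord_names):
        self._coord_names = tuple(coord_names)

    @property
    def coord_names(self) -> tuple[Hashable, ...]:
        return self._coord_names

    def get_variables(self, variables: Mapping[Hashable, ToyVariable]) -> dict:
        # Default behavior: shallow copies of the inputs, each pointing back to this index.
        result = {}
        for name in self.coord_names:
            var = copy.copy(variables[name])  # the underlying data is shared, not copied
            var._index = self
            result[name] = var
        return result


def is_consistent(coords: Mapping[Hashable, Any], indexes: Mapping[Hashable, ToyIndex]) -> bool:
    """Check that every indexed coordinate is known to, and points back at, its index."""
    for name, idx in indexes.items():
        if name not in idx.coord_names:
            return False
        if getattr(coords.get(name), "_index", None) is not idx:
            return False
    return True


raw = {"x": ToyVariable(("x",), [0, 1]), "lat": ToyVariable(("x",), [10.0, 20.0])}
tree = ToyIndex(("x", "lat"))  # e.g. an index that cannot wrap its data as coordinates
coords = tree.get_variables(raw)
print(is_consistent(coords, {"x": tree, "lat": tree}))  # True
print(is_consistent(raw, {"x": tree, "lat": tree}))     # False: no back-reference
```

With this layout the consistency check never needs to inspect the coordinate data, which is what would make it usable for indexes that do not wrap their internal data as coordinate variables.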

Describe alternatives you've considered

Also allow passing index types (and build options) via indexes

I.e., Mapping[Hashable, Index | Type[Index] | tuple[Type[Index], Mapping[Any, Any]]], so that new indexes can be created from the passed coordinates at DataArray or Dataset creation.

pros:

  • Flexible.

cons:

  • This is complicated. Constructing the Dataset / DataArray (with default indexes) first then calling .set_index is probably better.
  • Hard to deal with multi-index (redundancy of build option, etc.)
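For illustration, resolving the three accepted value forms could look roughly like the sketch below. `ToyIndex` and `resolve_index` are hypothetical names, and the constructor signature (coordinate data plus keyword build options) is an assumption of this sketch.

```python
class ToyIndex:
    """Hypothetical index built from coordinate data plus optional build options."""

    def __init__(self, data, **options):
        self.data = data
        self.options = options


def resolve_index(spec, coord_data):
    """Turn an index instance, an index type, or a (type, options) pair into an index."""
    if isinstance(spec, ToyIndex):
        return spec  # already built: use it as is
    if isinstance(spec, type) and issubclass(spec, ToyIndex):
        return spec(coord_data)  # build with default options
    if isinstance(spec, tuple) and len(spec) == 2:
        index_type, options = spec
        if isinstance(index_type, type) and issubclass(index_type, ToyIndex):
            return index_type(coord_data, **options)
    raise TypeError(f"unsupported index specification: {spec!r}")


existing = ToyIndex([1, 2, 3])
from_instance = resolve_index(existing, [1, 2, 3])
from_type = resolve_index(ToyIndex, [1, 2, 3])
from_pair = resolve_index((ToyIndex, {"leafsize": 16}), [1, 2, 3])
```

Even in this toy form the branching hints at the complexity noted above, and it does not address how a multi-index spec repeated under several keys would be de-duplicated.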

Pass multi-indexes once, grouped by coordinate names

I.e., indexes keys accept tuples: Mapping[Hashable | tuple[Hashable, ...], Index]

pros:

  • No redundancy and easier to check consistency between indexes vs. coordinates

cons:

  • Not consistent with the .xindexes property
  • Complicated when eventually using tuples for coordinate names?
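A small sketch of how grouped keys could be expanded into the per-coordinate mapping that .xindexes exposes. `expand_grouped_indexes` is a hypothetical helper, and plain `object()` instances stand in for indexes.

```python
def expand_grouped_indexes(grouped):
    """Expand tuple keys so that each coordinate name maps to its index."""
    flat = {}
    for key, idx in grouped.items():
        # A tuple key is read as a group of coordinate names. This is exactly the
        # ambiguity listed above if tuples are ever allowed as coordinate names.
        names = key if isinstance(key, tuple) else (key,)
        for name in names:
            if name in flat:
                raise ValueError(f"coordinate {name!r} is assigned to more than one index")
            flat[name] = idx
    return flat


multi, single = object(), object()  # placeholders for index objects
flat = expand_grouped_indexes({("x", "foo", "bar"): multi, "y": single})
print(list(flat))  # ['x', 'foo', 'bar', 'y']
```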

Additional context

No response

  • reactions: 0 total (https://api.github.com/repos/pydata/xarray/issues/6392/reactions)
  • state_reason: completed
  • repo: 13221727
  • type: issue
