home / github

Menu
  • GraphQL API
  • Search all tables

issue_comments

Table actions
  • GraphQL API for issue_comments

11 rows where author_association = "MEMBER", issue = 1422543378 and user = 4160723 sorted by updated_at descending

✎ View and edit SQL

This data as json, CSV (advanced)

Suggested facets: created_at (date), updated_at (date)

user 1

  • benbovy · 11 ✖

issue 1

  • Pass indexes directly to the DataArray and Dataset constructors · 11 ✖

author_association 1

  • MEMBER · 11 ✖
id html_url issue_url node_id user created_at updated_at ▲ author_association body reactions performed_via_github_app issue
1297046405 https://github.com/pydata/xarray/pull/7214#issuecomment-1297046405 https://api.github.com/repos/pydata/xarray/issues/7214 IC_kwDOAMm_X85NT1uF benbovy 4160723 2022-10-31T12:54:50Z 2022-10-31T12:54:50Z MEMBER

Thanks for the suggestion @shoyer, in general I like it very much! "Coordinates possibly baked by one or more indexes" feels much more natural than "indexes and their corresponding coordinates". Even though indexes have been promoted as 1st class citizens in the data model, their right place should still be in the background compared to coordinates. So having a Coordinates object that encapsulates the indexes makes a lot of sense to me.

My main concern is about the timing, as such a broader refactor might postpone some work in progress on the public API and the documentation. Ideally this shouldn't discourage users to start experimenting with custom indexes and building an ecosystem around it, as soon as possible.

There might be a fast path towards your suggestion, at least regarding the public facing API (your points 1 and 4):

  • Keep "private" the constructor of Indexes and keep it immutable.
  • Add a new IndexedCoordinates(Coordinates) class. Unlike DatasetCoordinates and DataArrayCoordinates, it would have a public constructor and/or alternative class methods (e.g., .from_pandas_multi_index() suggested by @dcherian)
  • In general, passing any Coordinates object to coords would assign both the coordinates and the indexes.

This would let us the possibility to achieve a broader (mostly internal) refactor of Indexes and Coordinates objects later without the risk of introducing too much breaking changes.

Alternatively, we could just wait for that refactor to finish before implementing explicit assignment of coordinates and indexes. We already have .set_xindex() and .drop_indexes() that are relevant and we could wait before deprecating xr.Dataset(coords={"x": pandas_midx}). Not sure when such big refactor will happen, though, the wait could be long.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Pass indexes directly to the DataArray and Dataset constructors 1422543378
1294783661 https://github.com/pydata/xarray/pull/7214#issuecomment-1294783661 https://api.github.com/repos/pydata/xarray/issues/7214 IC_kwDOAMm_X85NLNSt benbovy 4160723 2022-10-28T09:49:02Z 2022-10-28T09:49:02Z MEMBER

not necessarily do consistency checks (beyond verifying that the coordinate variables exist).

I'd just want to add that, from my experience with debugging multi-index issues, it is hard even for advanced users to see what's going wrong when coordinates and indexes are not consistent.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Pass indexes directly to the DataArray and Dataset constructors 1422543378
1294771427 https://github.com/pydata/xarray/pull/7214#issuecomment-1294771427 https://api.github.com/repos/pydata/xarray/issues/7214 IC_kwDOAMm_X85NLKTj benbovy 4160723 2022-10-28T09:38:22Z 2022-10-28T09:38:22Z MEMBER

Maybe a more generic Indexes class method that could be reused by 3rd-party indexes too? E.g., via some kind of hook or entrypoint...

An Indexes accessor? Or this is going too far? 🙂

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Pass indexes directly to the DataArray and Dataset constructors 1422543378
1293946521 https://github.com/pydata/xarray/pull/7214#issuecomment-1293946521 https://api.github.com/repos/pydata/xarray/issues/7214 IC_kwDOAMm_X85NIA6Z benbovy 4160723 2022-10-27T19:04:19Z 2022-10-27T19:52:21Z MEMBER

Explicitly providing indexes is an advanced user feature.

Agreed. However, xr.Dataset(coords={"x": pandas_midx}) is something that presumably a lot of users rely on (it is used extensively in Xarray's tests) and that we should really deprecate IMO. If we don't provide a convenient alternative, I expect many of those users will complain.

it's easier to explicitly manipulate indexes in the form of a dict

While generally I also prefer handling plain dict objects over custom dict-like objects, here I don't see much reasons of manipulating Xarray index objects independently of their coordinate variables. Indexes allows keeping them tied together, and it is already returned by .xindexes.

EDIT -- For more context: initially an Indexes object was almost equivalent to a Frozen(obj._indexes). In #5692 I tried hard and struggled to keep dealing with separate dicts of indexes and indexed variables, but in the end it made things much easier to encapsulate the variables in Indexes, which is also used internally in different places.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Pass indexes directly to the DataArray and Dataset constructors 1422543378
1293902008 https://github.com/pydata/xarray/pull/7214#issuecomment-1293902008 https://api.github.com/repos/pydata/xarray/issues/7214 IC_kwDOAMm_X85NH2C4 benbovy 4160723 2022-10-27T18:21:02Z 2022-10-27T18:21:02Z MEMBER

How about Indexes.from_pandas_multi_index() classmethod?

Yes that would make sense. However, it would be adding another pandas.MultiIndex special case while we'd like to remove them in Xarray. Maybe a more generic Indexes class method that could be reused by 3rd-party indexes too? E.g., via some kind of hook or entrypoint... The tricky thing is that arguments would probably differ much from one index type to another.

  1. does indexes get merged with existing ._indexes?

Indexes are not merged together but the new / replaced coordinate variables must be compatible with the other variables of the dataset. Dataset.assign_indexes(indexes) is actually implemented like this:

python def assign_indexes(self, indexes: Indexes[Index]): ds_indexes = Dataset(indexes=indexes) return ( self # prepare drop-in index / coordinate replacement .drop_vars(indexes, errors="ignore") # ensure the new indexes / coordinates are compatible with the Dataset .merge( ds_indexes, compat="minimal", # probably not the right option? join="override", # fastest option? (no real effect because of `drop_vars`) combine_attrs="no_conflicts", ) )

  1. Can we extract enough information from Index to have xr.merge(Indexes) -> Indexes work?

That is actually a good idea for https://github.com/pydata/xarray/pull/7214#issuecomment-1292089179! Not sure I would reuse xr.merge() for this as it would make the API messy, but why not an xr.merge_indexes() top-level function or an Indexes.merge() method?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Pass indexes directly to the DataArray and Dataset constructors 1422543378
1293531607 https://github.com/pydata/xarray/pull/7214#issuecomment-1293531607 https://api.github.com/repos/pydata/xarray/issues/7214 IC_kwDOAMm_X85NGbnX benbovy 4160723 2022-10-27T13:31:24Z 2022-10-27T13:42:44Z MEMBER

I also added an .assign_indexes() method that may be quite convenient. Like for the constructors, it only accepts an Indexes instance.

```python ds = xr.Dataset(coords={"x": [4, 5, 6, 7]}) ds2 = xr.Dataset(coords={"x": [1, 2, 3, 4]})

ds.assign_indexes(ds2.xindexes)

<xarray.Dataset>

Dimensions: (x: 4)

Coordinates:

* x (x) int64 1 2 3 4

Data variables:

empty

midx = pd.MultiIndex.from_product([["a", "b"], [1, 2]], names=("one", "two")) indexes = wrap_pandas_multiindex(midx, "x")

ds.assign_indexes(indexes)

<xarray.Dataset>

Dimensions: (x: 4)

Coordinates:

* x (x) object MultiIndex

* one (x) object 'a' 'a' 'b' 'b'

* two (x) int64 1 2 1 2

Data variables:

empty

```

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Pass indexes directly to the DataArray and Dataset constructors 1422543378
1293545325 https://github.com/pydata/xarray/pull/7214#issuecomment-1293545325 https://api.github.com/repos/pydata/xarray/issues/7214 IC_kwDOAMm_X85NGe9t benbovy 4160723 2022-10-27T13:41:50Z 2022-10-27T13:41:50Z MEMBER

@pydata/xarray I'd be very happy if you could share your thoughts about the examples shown in the last three comments. If you think the API looks good like that, then I will work on adding some tests and on the documentation.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Pass indexes directly to the DataArray and Dataset constructors 1422543378
1292089179 https://github.com/pydata/xarray/pull/7214#issuecomment-1292089179 https://api.github.com/repos/pydata/xarray/issues/7214 IC_kwDOAMm_X85NA7db benbovy 4160723 2022-10-26T13:54:22Z 2022-10-26T13:54:22Z MEMBER

Passing multiple indexes:

```python midx1 = pd.MultiIndex.from_product([["a", "b"], [1, 2]], names=("one", "two")) midx2 = pd.MultiIndex.from_product([["c", "d"], [3, 4]], names=("three", "four"))

indexes1 = wrap_pandas_multiindex(midx1, "x") indexes2 = wrap_pandas_multiindex(midx2, "y")

indexes = Indexes( indexes=dict(indexes1, indexes2), variables=dict(indexes1.variables, indexes2.variables) )

ds = xr.Dataset(indexes=indexes)

<xarray.Dataset>

Dimensions: (x: 4, y: 4)

Coordinates:

* x (x) object MultiIndex

* one (x) object 'a' 'a' 'b' 'b'

* two (x) int64 1 2 1 2

* y (y) object MultiIndex

* three (y) object 'c' 'c' 'd' 'd'

* four (y) int64 3 4 3 4

Data variables:

empty

```

That's not looking super nice, but probably we can add some convenience function or Indexes method.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Pass indexes directly to the DataArray and Dataset constructors 1422543378
1291911349 https://github.com/pydata/xarray/pull/7214#issuecomment-1291911349 https://api.github.com/repos/pydata/xarray/issues/7214 IC_kwDOAMm_X85NAQC1 benbovy 4160723 2022-10-26T11:47:57Z 2022-10-26T12:14:23Z MEMBER

I implemented option 3. We can still change or revert it later if it's not the best one.

A few examples:

```python import pandas as pd import xarray as xr from xarray.indexes import wrap_pandas_multiindex

midx = pd.MultiIndex.from_product([["a", "b"], [1, 2]], names=("one", "two")) ```

It is now possible to pass a pandas multi-index to a Dataset like this:

```python

this returns an Indexes object (indexes + coordinates)

indexes = wrap_pandas_multiindex(midx, "x")

ds = xr.Dataset(indexes=indexes)

<xarray.Dataset>

Dimensions: (x: 4)

Coordinates:

* x (x) object MultiIndex

* one (x) object 'a' 'a' 'b' 'b'

* two (x) int64 1 2 1 2

Data variables:

empty

```

IMO the above should be preferred over passing it as a coordinate (should we deprecate it now?):

```python ds_deprecated = xr.Dataset(coords={"x": midx})

ds_deprecated.identical(ds)

True

eventually this would behave like this:

ds_midx_as_array = xr.Dataset(coords={"x": midx})

<xarray.Dataset>

Dimensions: (x: 4)

Coordinates:

* x (x) object ('a', 1) ('a', 2) ('b', 1) ('b', 2)

Data variables:

empty

```

We can pass indexes around from one Xarray object to another, e.g.,

```python da = xr.DataArray([1, 2, 3, 4], dims="x", indexes=ds.xindexes)

<xarray.DataArray (x: 4)>

array([1, 2, 3, 4])

Coordinates:

* x (x) object MultiIndex

* one (x) object 'a' 'a' 'b' 'b'

* two (x) int64 1 2 1 2

```

Skip creating pandas indexes for dimension coordinates:

```python ds_noindex = xr.Dataset(coords={"x": [0, 1, 2]}, indexes={})

<xarray.Dataset>

Dimensions: (x: 3)

Coordinates:

x (x) int64 0 1 2

Data variables:

empty

ds_noindex.xindexes

Indexes:

empty

```

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Pass indexes directly to the DataArray and Dataset constructors 1422543378
1291638319 https://github.com/pydata/xarray/pull/7214#issuecomment-1291638319 https://api.github.com/repos/pydata/xarray/issues/7214 IC_kwDOAMm_X85M_NYv benbovy 4160723 2022-10-26T07:52:35Z 2022-10-26T07:52:35Z MEMBER

For passing multiple indexes at once we could probably expand the Indexes API, e.g., with an .update() method.

Maybe with something else than .update() (let's keep Indexes an immutable collection?)

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Pass indexes directly to the DataArray and Dataset constructors 1422543378
1291059643 https://github.com/pydata/xarray/pull/7214#issuecomment-1291059643 https://api.github.com/repos/pydata/xarray/issues/7214 IC_kwDOAMm_X85M9AG7 benbovy 4160723 2022-10-25T19:50:57Z 2022-10-25T19:50:57Z MEMBER

Hmm I'm wondering what would be best between the options below regarding the types for the indexes argument:

  1. Indexes[Index] | Sequence[Indexes[Index] | None
  2. Indexes[Index] | None
  3. Mapping[Any, Index] | None
  4. Any other suggestion?

Option 1 is nice for passing multiple indexes, e.g.,

```python pd_midx1 = pd.MultiIndex.from_arrays(..., names=("one", "two")) pd_midx2 = pd.MultiIndex.from_arrays(..., , names=("three", "four"))

indexes1 = PandasMultiIndex.from_pandas_index(pd_midx1, "x") indexes2 = PandasMultiIndex.from_pandas_index(pd_midx2, "y")

ds = xr.Dataset(indexes=[indexes1, indexes2]) ```

With option 1 it feels odd passing an empty list in order to avoid creating default indexes: ds = xr.Dataset(indexes=[]). Not really better in this regard with option 2: ds = xr.Dataset(indexes=Indexes()). Option 3 is better IMO: ds = xr.Dataset(indexes={}).

Option 3 actually works in all cases since Indexes[Index] is a sub-type of Mapping[Any, Index]. However, it is not clear from this generic type that any non-empty mapping must be an instance of Indexes (because the latter also contains the coordinate variables).

I'm leaning towards option 3. For passing multiple indexes at once we could probably expand the Indexes API, e.g., with an .update() method.

What do people think?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Pass indexes directly to the DataArray and Dataset constructors 1422543378

Advanced export

JSON shape: default, array, newline-delimited, object

CSV options:

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);
Powered by Datasette · Queries took 32.155ms · About: xarray-datasette