home / github

Menu
  • Search all tables
  • GraphQL API

issue_comments

Table actions
  • GraphQL API for issue_comments

16 rows where issue = 1485037066 sorted by updated_at descending

✎ View and edit SQL

This data as json, CSV (advanced)

Suggested facets: created_at (date), updated_at (date)

user 3

  • benbovy 12
  • shoyer 3
  • headtr1ck 1

author_association 2

  • MEMBER 15
  • COLLABORATOR 1

issue 1

  • Expose "Coordinates" as part of Xarray's public API · 16 ✖
id html_url issue_url node_id user created_at updated_at ▲ author_association body reactions performed_via_github_app issue
1464927490 https://github.com/pydata/xarray/pull/7368#issuecomment-1464927490 https://api.github.com/repos/pydata/xarray/issues/7368 IC_kwDOAMm_X85XUQUC headtr1ck 43316012 2023-03-11T14:50:02Z 2023-03-11T14:50:24Z COLLABORATOR

DataAlignable = TypeVar("DataAlignable", bound=DataWithCoords | Coordinates) -> doesn't work since we cannot mix DataWithCoords and Coordinates when aligning each object (input type = output type)

I think thats exactly correct. The whold idea of using a TypeVar for the inputs means that all inputs must have the same type.

Consider the following example: ```python class A: ... class B(A): ... class C: ...

T = TypeVar("T", bound=Union[A, C]) def f(*x: T) -> T: ...

f(A(), A()) # OK f(A(), B()) # OK f(C(), C()) # OK f(A(), C()) # not ok, because the common type of A and C is object ``` Also: sorry for the late reply.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Expose "Coordinates" as part of Xarray's public API 1485037066
1382070832 https://github.com/pydata/xarray/pull/7368#issuecomment-1382070832 https://api.github.com/repos/pydata/xarray/issues/7368 IC_kwDOAMm_X85SYLow benbovy 4160723 2023-01-13T16:13:16Z 2023-01-13T16:13:16Z MEMBER

Thanks for the review @shoyer. I addressed your comments.

Everything seems OK except a rather annoying mypy error that I'm struggling with:

The DataAlignable type variable should now encompass both DataWithCoords and Coordinates, since in this PR we add alignment support for the latter. I somewhat naively tried the options below without success:

  • DataAlignable = TypeVar("DataAlignable", bound=DataWithCoords | Coordinates) -> doesn't work since we cannot mix DataWithCoords and Coordinates when aligning each object (input type = output type)
  • DataAlignable = TypeVar("DataAlignable", bound=DataWithCoords, Coordinates) -> doesn't work with subclasses
  • DataAlignable = TypeVar("DataAlignable", Dataset, DataArray, Coordinates) -> doesn't work with generic types T_Dataset, etc.?
  • I even tried using a Protocol

@headtr1ck @Illviljan any idea?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Expose "Coordinates" as part of Xarray's public API 1485037066
1359003371 https://github.com/pydata/xarray/pull/7368#issuecomment-1359003371 https://api.github.com/repos/pydata/xarray/issues/7368 IC_kwDOAMm_X85RAL7r benbovy 4160723 2022-12-20T08:34:06Z 2022-12-20T08:34:06Z MEMBER

I'm wondering if instead of Coordinates.from_pandas_multiindex() we might want to provide a more generic constructor available as an extension point? For example:

Coordinates.from_index(index_obj: Any, *, factory=None, **kwargs=None)

factory could be guessed from the type of index_obj. Xarray would support by default the pandas.MultiIndex and pandas.Index types. Like for IO backends, we could provide a CoordinatesFactoryEntrypoint so that it could support other index types.

One downside is that specific (mandatory?) options like dim for a pandas (multi-)index are not directly visible.

Would it be useful or is it overkill?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Expose "Coordinates" as part of Xarray's public API 1485037066
1352874809 https://github.com/pydata/xarray/pull/7368#issuecomment-1352874809 https://api.github.com/repos/pydata/xarray/issues/7368 IC_kwDOAMm_X85Qozs5 benbovy 4160723 2022-12-15T10:42:59Z 2022-12-15T10:42:59Z MEMBER

OK this is now ready for review (cc @shoyer).

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Expose "Coordinates" as part of Xarray's public API 1485037066
1352818155 https://github.com/pydata/xarray/pull/7368#issuecomment-1352818155 https://api.github.com/repos/pydata/xarray/issues/7368 IC_kwDOAMm_X85Qol3r benbovy 4160723 2022-12-15T09:59:03Z 2022-12-15T09:59:03Z MEMBER

Maybe there's some way to optimize that? I don't know if we can completely avoid it with the solution implemented in this PR, though. Promoting Coordinates is pretty clean and future proof IMO (assuming that we'll further refactor Coordinates to actually store variables and indexes, i.e., not as a proxy anymore). Is the (minor? temporary?) regression in performance acceptable and can we just leave it like that for now?

Fixed in 193dad3 (with some reasonable special case added in merge_core).

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Expose "Coordinates" as part of Xarray's public API 1485037066
1352310432 https://github.com/pydata/xarray/pull/7368#issuecomment-1352310432 https://api.github.com/repos/pydata/xarray/issues/7368 IC_kwDOAMm_X85Qmp6g benbovy 4160723 2022-12-14T22:33:23Z 2022-12-15T01:08:41Z MEMBER

I did some profiling to find the cause of the decrease in performance reported in the benchmarks (dataset creation). In summary, this is explained by a Coordinates object (built from the coords mapping) that is now included in objects to align when merging data vars and coordinates. Previously all non DataArray objects in the coords mapping were excluded from alignment (in deep_align). The introduced overhead comes from a call to Coordinates._reindex_callback(), which (I think?) should do no more than shallow copies and/or xarray wrapping stuff. In the benchmark report this is only marked as significant when creating small datasets (1.5-2x slower), and it becomes insignificant for datasets with more data variables.

Maybe there's some way to optimize that? I don't know if we can completely avoid it with the solution implemented in this PR, though. Promoting Coordinates is pretty clean and future proof IMO (assuming that we'll further refactor Coordinates to actually store variables and indexes, i.e., not as a proxy anymore). Is the (minor? temporary?) regression in performance acceptable and can we just leave it like that for now?

More details about the new workflow implemented in this PR when creating a new Dataset:

  • if Dataset's coords argument is a "simple" mapping, it is first internally converted into a Coordinates object, with the creation of default indexes for dimension coordinates
  • if one or more DataArray objects are given in coords, their coordinates (variables + indexes) are extracted and merged with the other input coordinates
  • see the implementation in xarray.core.coordinates.create_coords_with_default_indexes
  • otherwise, just reuse the Coordinates object passed as coords
  • coordinates are then merged with data variables
  • the Coordinates object is aligned with every other "alignable" object found in data_vars
  • coordinate indexes (if any) are passed explicitly to align so they are used in priority
  • explicitly using a Coordinates object skips the creation of default indexes during merging (in collect_variables_and_indexes())
{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Expose "Coordinates" as part of Xarray's public API 1485037066
1349321538 https://github.com/pydata/xarray/pull/7368#issuecomment-1349321538 https://api.github.com/repos/pydata/xarray/issues/7368 IC_kwDOAMm_X85QbQNC benbovy 4160723 2022-12-13T18:03:17Z 2022-12-13T18:03:17Z MEMBER

I think this is ready for review!

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Expose "Coordinates" as part of Xarray's public API 1485037066
1347327518 https://github.com/pydata/xarray/pull/7368#issuecomment-1347327518 https://api.github.com/repos/pydata/xarray/issues/7368 IC_kwDOAMm_X85QTpYe benbovy 4160723 2022-12-12T21:05:56Z 2022-12-12T21:05:56Z MEMBER

In order to skip creating default indexes when passing a Coordinates object, I first tried a small refactor but in the end I found that the cleanest way to do it was to support alignment for Coordinates. I think it makes sense now that Coordinates is part of Xarray's public API as a "stand-alone" container like Dataset and DataArray.

The "no default index with Coordinates" behavior should be consistent Xarray-wise, i.e., for DataArray / Dataset constructors and also assign_coords, update, etc.

Sorry this PR is getting big, but hopefully this is almost ready (still a few tests to fix or to add).

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Expose "Coordinates" as part of Xarray's public API 1485037066
1346344694 https://github.com/pydata/xarray/pull/7368#issuecomment-1346344694 https://api.github.com/repos/pydata/xarray/issues/7368 IC_kwDOAMm_X85QP5b2 benbovy 4160723 2022-12-12T11:55:10Z 2022-12-12T11:55:10Z MEMBER

My suggestion would be: coords passed as a dict: create default indexes coords passed as IndexedCoordinates: do not create defaults

So if we already have some coordinate data as a dict but don't want any default index, we would need to do this:

python ds = xr.Dataset(coords=xr.Coordinates(my_coord_dict))

instead of this:

python ds = xr.Dataset(coords=my_coord_dict)

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Expose "Coordinates" as part of Xarray's public API 1485037066
1346091151 https://github.com/pydata/xarray/pull/7368#issuecomment-1346091151 https://api.github.com/repos/pydata/xarray/issues/7368 IC_kwDOAMm_X85QO7iP benbovy 4160723 2022-12-12T08:36:09Z 2022-12-12T08:36:09Z MEMBER

Thanks @shoyer, I've been thinking about similar short/long term plans although so far I haven't figured out how to implement your point 3. I'll give it another try.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Expose "Coordinates" as part of Xarray's public API 1485037066
1345646743 https://github.com/pydata/xarray/pull/7368#issuecomment-1345646743 https://api.github.com/repos/pydata/xarray/issues/7368 IC_kwDOAMm_X85QNPCX shoyer 1217238 2022-12-11T20:17:15Z 2022-12-11T20:17:15Z MEMBER

I'm actually trying to merge IndexedCoordinates with Coordinates but I'm stuck: the latter is abstract and I don't really see how I could refactor it together with DatasetCoordinates and DataArrayCoordinates

Coordinates is abstract because in the (current) Xarray data model, it doesn't actually store any data -- coordinates are stored in private attributes of the original Dataset (._variables and ._coord_names) or DataArray (._coords). So Coordinates needs to serve as a proxy for the data.

In the long term, I think we should refactor Dataset/DataArray to actually store data (coordinate variables, indexes and dimension sizes) on Coordinates, but that's a bigger refactor.

For now, it's worth noting that the current Coordinates class isn't actually exposed in Xarray's public API, just the DatasetCoordinates and DataArrayCoordinates classes (and not even their constructors). So the intermdiate step I would try is: 1. Rename the current Coordinates baseclass to AbstractCoordinates. 2. Rename your IndexedCoordinates class to Coordinates. Expose it in the public API. Make sure it can handle DatasetCoordinates and DataArrayCoordinates in the constructor. 3. Maybe: use some Python magic to make DatasetCorodinates/DataArrayCoordinates subclasses of the new Coordinates. Or maybe make them actual subclassses, overriding many of the methods (including the constructor).

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Expose "Coordinates" as part of Xarray's public API 1485037066
1345314909 https://github.com/pydata/xarray/pull/7368#issuecomment-1345314909 https://api.github.com/repos/pydata/xarray/issues/7368 IC_kwDOAMm_X85QL-Bd benbovy 4160723 2022-12-10T16:59:44Z 2022-12-10T16:59:44Z MEMBER

Long term, do you think it would make sense to merge together Indexes, Coordinates and IndexedCoordinates? They are sort of all containers for the same thing.

Yes I think so.

I'm actually trying to merge IndexedCoordinates with Coordinates but I'm stuck: the latter is abstract and I don't really see how I could refactor it together with DatasetCoordinates and DataArrayCoordinates. Do you have any idea on how best to proceed?

Ideally, I'd see Coordinates be exposed in Xarray's main namespace with at least the two following constructors:

```python class Coordinates:

def __init__(
    self,
    coords: Mapping[Any, Any] | None = None,
    indexes: Mapping[Any, Index] | None = None,
):
    # Similar to Dataset.__init__ but without the need
    # to merge coords and data vars...
    # Probably ok to allow more flexibility / less safety here?
    ...

@classmethod
from_pandas_multiindex(cls, index: pd.MultiIndex, dim: str):
    ...

```

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Expose "Coordinates" as part of Xarray's public API 1485037066
1344968954 https://github.com/pydata/xarray/pull/7368#issuecomment-1344968954 https://api.github.com/repos/pydata/xarray/issues/7368 IC_kwDOAMm_X85QKpj6 shoyer 1217238 2022-12-10T01:37:35Z 2022-12-10T01:37:35Z MEMBER

Long term, do you think it would make sense to merge together Indexes, Coordinates and IndexedCoordinates? They are sort of all containers for the same thing.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Expose "Coordinates" as part of Xarray's public API 1485037066
1344944917 https://github.com/pydata/xarray/pull/7368#issuecomment-1344944917 https://api.github.com/repos/pydata/xarray/issues/7368 IC_kwDOAMm_X85QKjsV shoyer 1217238 2022-12-10T00:31:46Z 2022-12-10T00:31:46Z MEMBER

what do you think about the approach proposed here? I'd like to check that with you before going further with tests, docs, etc.

Generally this looks great to me!

  • How to avoid building any default index? It seems silly to add or use the indexes argument just for that purpose? We could address that later.

My suggestion would be:

  • coords passed as a dict: create default indexes
  • coords passed as IndexedCoordinates: do not create defaults

Alternatively to an IndexedCoordinates subclass I was wondering if we could reuse the Coordinates base class?

Yes, this makes more sense to me!

What if the Indexes class was a facade based on IndexedCoordinates instead of the other way around?

Yes, I also agree! This makes more sense.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Expose "Coordinates" as part of Xarray's public API 1485037066
1344046801 https://github.com/pydata/xarray/pull/7368#issuecomment-1344046801 https://api.github.com/repos/pydata/xarray/issues/7368 IC_kwDOAMm_X85QHIbR benbovy 4160723 2022-12-09T09:13:24Z 2022-12-09T09:16:35Z MEMBER

I added IndexedCoordinates.merge_coords so that it is easier to combine different coordinates to pass to a new Dataset / DataArray, e.g.,

```python midx = pd.MultiIndex.from_product([["a", "b"], [1, 2]], names=("one", "two"))

coords = xr.IndexedCoordinates.from_pandas_multiindex(midx, "x")

coords = coords.merge_coords({"y": [0, 1, 2]})

Coordinates:

* x (x) object MultiIndex

* one (x) object 'a' 'a' 'b' 'b'

* two (x) int64 1 2 1 2

* y (y) int64 0 1 2

ds = xr.Dataset(coords=coords)

<xarray.Dataset>

Dimensions: (x: 4)

Coordinates:

* x (x) object MultiIndex

* one (x) object 'a' 'a' 'b' 'b'

* two (x) int64 1 2 1 2

* y (y) int64 0 1 2

Data variables:

empty

```

IndexedCoordinates.merge_coords is very much like Coordinates.merge except that it returns a new Coordinates object instead of a Dataset.

Or should we just use merge? It would require that:

  • Coordinates.merge accepts Mapping[Any, Any] for its other argument. Only changing the type hint is enough here since the implementation already accepts any input passed to Dataset.
  • When a Dataset is passed as coords argument to a new Dataset and DataArray, both variables and indexes should be extracted. It is already the case for Dataset but I think it only works for PandasIndex and PandasMultiIndex (default indexes & backwards compatibility).
{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Expose "Coordinates" as part of Xarray's public API 1485037066
1344004727 https://github.com/pydata/xarray/pull/7368#issuecomment-1344004727 https://api.github.com/repos/pydata/xarray/issues/7368 IC_kwDOAMm_X85QG-J3 benbovy 4160723 2022-12-09T08:32:28Z 2022-12-09T09:14:17Z MEMBER

IndexedCoordinates and Indexes have a lot of overlap. At some point we might consider merging the two classes, like @shoyer suggests in https://github.com/pydata/xarray/pull/7214#issuecomment-1295283938. The main difference is that one is a mapping of coordinates and the other is a mapping of indexes. IndexedCoordinates is mostly reusing Indexes and Dataset under the hood, it is only a facade.

Alternatively to an IndexedCoordinates subclass I was wondering if we could reuse the Coordinates base class? There's some benefit of providing a subclass:

  • besides specific constructors like .from_pandas_multiindex() it has a generic __init__ for advanced use cases. Not sure it is a good idea to add this constructor to the base class?
  • unlike Coordinates, IndexedCoordinates is immutable.

What if the Indexes class was a facade based on IndexedCoordinates instead of the other way around? It would probably make more sense but it would also be a bigger refactor. I've chosen the easy way :).

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Expose "Coordinates" as part of Xarray's public API 1485037066

Advanced export

JSON shape: default, array, newline-delimited, object

CSV options:

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);
Powered by Datasette · Queries took 11.861ms · About: xarray-datasette