html_url,issue_url,id,node_id,user,created_at,updated_at,author_association,body,reactions,performed_via_github_app,issue https://github.com/pydata/xarray/pull/7368#issuecomment-1382070832,https://api.github.com/repos/pydata/xarray/issues/7368,1382070832,IC_kwDOAMm_X85SYLow,4160723,2023-01-13T16:13:16Z,2023-01-13T16:13:16Z,MEMBER,"Thanks for the review @shoyer. I addressed your comments. Everything seems OK except a rather annoying mypy error that I'm struggling with: The `DataAlignable` type variable should now encompass both `DataWithCoords` and `Coordinates`, since in this PR we add alignment support for the latter. I somewhat naively tried the options below without success: - `DataAlignable = TypeVar(""DataAlignable"", bound=DataWithCoords | Coordinates)` -> doesn't work since we cannot mix DataWithCoords and Coordinates when aligning each object (input type = output type) - `DataAlignable = TypeVar(""DataAlignable"", bound=DataWithCoords, Coordinates)` -> doesn't work with subclasses - `DataAlignable = TypeVar(""DataAlignable"", Dataset, DataArray, Coordinates)` -> doesn't work with generic types `T_Dataset`, etc.? - I even tried using a Protocol @headtr1ck @Illviljan any idea? ","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1485037066 https://github.com/pydata/xarray/pull/7368#issuecomment-1359003371,https://api.github.com/repos/pydata/xarray/issues/7368,1359003371,IC_kwDOAMm_X85RAL7r,4160723,2022-12-20T08:34:06Z,2022-12-20T08:34:06Z,MEMBER,"I'm wondering if instead of `Coordinates.from_pandas_multiindex()` we might want to provide a more generic constructor available as an extension point? For example: `Coordinates.from_index(index_obj: Any, *, factory=None, **kwargs=None)` `factory` could be guessed from the type of `index_obj`. Xarray would support by default the `pandas.MultiIndex` and `pandas.Index` types. Like for IO backends, we could provide a `CoordinatesFactoryEntrypoint` so that it could support other index types. One downside is that specific (mandatory?) options like `dim` for a pandas (multi-)index are not directly visible. Would it be useful or is it overkill?","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1485037066 https://github.com/pydata/xarray/pull/7368#issuecomment-1352874809,https://api.github.com/repos/pydata/xarray/issues/7368,1352874809,IC_kwDOAMm_X85Qozs5,4160723,2022-12-15T10:42:59Z,2022-12-15T10:42:59Z,MEMBER,OK this is now ready for review (cc @shoyer).,"{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1485037066 https://github.com/pydata/xarray/pull/7368#issuecomment-1352818155,https://api.github.com/repos/pydata/xarray/issues/7368,1352818155,IC_kwDOAMm_X85Qol3r,4160723,2022-12-15T09:59:03Z,2022-12-15T09:59:03Z,MEMBER,"> Maybe there's some way to optimize that? I don't know if we can completely avoid it with the solution implemented in this PR, though. Promoting Coordinates is pretty clean and future proof IMO (assuming that we'll further refactor Coordinates to actually store variables and indexes, i.e., not as a proxy anymore). Is the (minor? temporary?) regression in performance acceptable and can we just leave it like that for now? Fixed in [193dad3](https://github.com/pydata/xarray/pull/7368/commits/193dad3393565b6c007c0eb0a2d47b5ade874571) (with some reasonable special case added in `merge_core`).","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1485037066 https://github.com/pydata/xarray/pull/7368#issuecomment-1352310432,https://api.github.com/repos/pydata/xarray/issues/7368,1352310432,IC_kwDOAMm_X85Qmp6g,4160723,2022-12-14T22:33:23Z,2022-12-15T01:08:41Z,MEMBER,"I did some profiling to find the cause of the decrease in performance reported in the benchmarks (dataset creation). In summary, this is explained by a `Coordinates` object (built from the `coords` mapping) that is now included in objects to align when merging data vars and coordinates. Previously all non DataArray objects in the `coords` mapping were excluded from alignment (in `deep_align`). The introduced overhead comes from a call to `Coordinates._reindex_callback()`, which (I think?) should do no more than shallow copies and/or xarray wrapping stuff. In the benchmark report this is only marked as significant when creating small datasets (1.5-2x slower), and it becomes insignificant for datasets with more data variables. Maybe there's some way to optimize that? I don't know if we can completely avoid it with the solution implemented in this PR, though. Promoting `Coordinates` is pretty clean and future proof IMO (assuming that we'll further refactor `Coordinates` to actually store variables and indexes, i.e., not as a proxy anymore). Is the (minor? temporary?) regression in performance acceptable and can we just leave it like that for now? More details about the new workflow implemented in this PR when creating a new Dataset: - if Dataset's `coords` argument is a ""simple"" mapping, it is first internally converted into a `Coordinates` object, with the creation of default indexes for dimension coordinates - if one or more DataArray objects are given in `coords`, their coordinates (variables + indexes) are extracted and merged with the other input coordinates - see the implementation in `xarray.core.coordinates.create_coords_with_default_indexes` - otherwise, just reuse the `Coordinates` object passed as `coords` - coordinates are then merged with data variables - the `Coordinates` object is aligned with every other ""alignable"" object found in `data_vars` - coordinate indexes (if any) are passed explicitly to `align` so they are used in priority - explicitly using a `Coordinates` object skips the creation of default indexes during merging (in `collect_variables_and_indexes()`) ","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1485037066 https://github.com/pydata/xarray/pull/7368#issuecomment-1349321538,https://api.github.com/repos/pydata/xarray/issues/7368,1349321538,IC_kwDOAMm_X85QbQNC,4160723,2022-12-13T18:03:17Z,2022-12-13T18:03:17Z,MEMBER,I think this is ready for review!,"{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1485037066 https://github.com/pydata/xarray/pull/7368#issuecomment-1347327518,https://api.github.com/repos/pydata/xarray/issues/7368,1347327518,IC_kwDOAMm_X85QTpYe,4160723,2022-12-12T21:05:56Z,2022-12-12T21:05:56Z,MEMBER,"In order to skip creating default indexes when passing a `Coordinates` object, I first tried a small refactor but in the end I found that the cleanest way to do it was to support alignment for `Coordinates`. I think it makes sense now that Coordinates is part of Xarray's public API as a ""stand-alone"" container like Dataset and DataArray. The ""no default index with Coordinates"" behavior should be consistent Xarray-wise, i.e., for DataArray / Dataset constructors and also `assign_coords`, `update`, etc. Sorry this PR is getting big, but hopefully this is almost ready (still a few tests to fix or to add). ","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1485037066 https://github.com/pydata/xarray/pull/7368#issuecomment-1346344694,https://api.github.com/repos/pydata/xarray/issues/7368,1346344694,IC_kwDOAMm_X85QP5b2,4160723,2022-12-12T11:55:10Z,2022-12-12T11:55:10Z,MEMBER,"> My suggestion would be: coords passed as a dict: create default indexes coords passed as IndexedCoordinates: do not create defaults So if we already have some coordinate data as a dict but don't want any default index, we would need to do this: ```python ds = xr.Dataset(coords=xr.Coordinates(my_coord_dict)) ``` instead of this: ```python ds = xr.Dataset(coords=my_coord_dict) ``` ","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1485037066 https://github.com/pydata/xarray/pull/7368#issuecomment-1346091151,https://api.github.com/repos/pydata/xarray/issues/7368,1346091151,IC_kwDOAMm_X85QO7iP,4160723,2022-12-12T08:36:09Z,2022-12-12T08:36:09Z,MEMBER,"Thanks @shoyer, I've been thinking about similar short/long term plans although so far I haven't figured out how to implement your point 3. I'll give it another try.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1485037066 https://github.com/pydata/xarray/pull/7368#issuecomment-1345314909,https://api.github.com/repos/pydata/xarray/issues/7368,1345314909,IC_kwDOAMm_X85QL-Bd,4160723,2022-12-10T16:59:44Z,2022-12-10T16:59:44Z,MEMBER,"> Long term, do you think it would make sense to merge together Indexes, Coordinates and IndexedCoordinates? They are sort of all containers for the same thing. Yes I think so. I'm actually trying to merge `IndexedCoordinates` with `Coordinates` but I'm stuck: the latter is abstract and I don't really see how I could refactor it together with `DatasetCoordinates` and `DataArrayCoordinates`. Do you have any idea on how best to proceed? Ideally, I'd see `Coordinates` be exposed in Xarray's main namespace with at least the two following constructors: ```python class Coordinates: def __init__( self, coords: Mapping[Any, Any] | None = None, indexes: Mapping[Any, Index] | None = None, ): # Similar to Dataset.__init__ but without the need # to merge coords and data vars... # Probably ok to allow more flexibility / less safety here? ... @classmethod from_pandas_multiindex(cls, index: pd.MultiIndex, dim: str): ... ```","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1485037066 https://github.com/pydata/xarray/pull/7368#issuecomment-1344046801,https://api.github.com/repos/pydata/xarray/issues/7368,1344046801,IC_kwDOAMm_X85QHIbR,4160723,2022-12-09T09:13:24Z,2022-12-09T09:16:35Z,MEMBER,"I added `IndexedCoordinates.merge_coords` so that it is easier to combine different coordinates to pass to a new Dataset / DataArray, e.g., ```python midx = pd.MultiIndex.from_product([[""a"", ""b""], [1, 2]], names=(""one"", ""two"")) coords = xr.IndexedCoordinates.from_pandas_multiindex(midx, ""x"") coords = coords.merge_coords({""y"": [0, 1, 2]}) # Coordinates: # * x (x) object MultiIndex # * one (x) object 'a' 'a' 'b' 'b' # * two (x) int64 1 2 1 2 # * y (y) int64 0 1 2 ds = xr.Dataset(coords=coords) # # Dimensions: (x: 4) # Coordinates: # * x (x) object MultiIndex # * one (x) object 'a' 'a' 'b' 'b' # * two (x) int64 1 2 1 2 # * y (y) int64 0 1 2 # Data variables: # *empty* ``` `IndexedCoordinates.merge_coords` is very much like `Coordinates.merge` except that it returns a new Coordinates object instead of a Dataset. Or should we just use `merge`? It would require that: - `Coordinates.merge` accepts `Mapping[Any, Any]` for its `other` argument. Only changing the type hint is enough here since the implementation already accepts any input passed to Dataset. - When a Dataset is passed as `coords` argument to a new Dataset and DataArray, both variables and indexes should be extracted. It is already the case for Dataset but I think it only works for PandasIndex and PandasMultiIndex (default indexes & backwards compatibility).","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1485037066 https://github.com/pydata/xarray/pull/7368#issuecomment-1344004727,https://api.github.com/repos/pydata/xarray/issues/7368,1344004727,IC_kwDOAMm_X85QG-J3,4160723,2022-12-09T08:32:28Z,2022-12-09T09:14:17Z,MEMBER,"`IndexedCoordinates` and `Indexes` have a lot of overlap. At some point we might consider merging the two classes, like @shoyer suggests in https://github.com/pydata/xarray/pull/7214#issuecomment-1295283938. The main difference is that one is a mapping of coordinates and the other is a mapping of indexes. `IndexedCoordinates` is mostly reusing `Indexes` and `Dataset` under the hood, it is only a facade. Alternatively to an `IndexedCoordinates` subclass I was wondering if we could reuse the `Coordinates` base class? There's some benefit of providing a subclass: - besides specific constructors like `.from_pandas_multiindex()` it has a generic `__init__` for advanced use cases. Not sure it is a good idea to add this constructor to the base class? - unlike Coordinates, IndexedCoordinates is immutable. What if the `Indexes` class was a facade based on `IndexedCoordinates` instead of the other way around? It would probably make more sense but it would also be a bigger refactor. I've chosen the easy way :). ","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1485037066