html_url,issue_url,id,node_id,user,created_at,updated_at,author_association,body,reactions,performed_via_github_app,issue https://github.com/pydata/xarray/issues/5647#issuecomment-1239162673,https://api.github.com/repos/pydata/xarray/issues/5647,1239162673,IC_kwDOAMm_X85J3B8x,4160723,2022-09-07T09:47:13Z,2022-09-07T09:47:13Z,MEMBER,"I think we can close this issue, which has mostly been addressed in #5692. I opened #7002 for the potential problem of index alignment vs. coordinate ordering.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,955936490 https://github.com/pydata/xarray/issues/5647#issuecomment-949618104,https://api.github.com/repos/pydata/xarray/issues/5647,949618104,IC_kwDOAMm_X844mgW4,4160723,2021-10-22T13:11:01Z,2021-10-22T13:11:01Z,MEMBER,"@shoyer I'm now reviewing the merging logic. I works mostly fine, with some minor concerns: - `MergeElement` (variable, index) tuples are not grouped per matching (unique) index, so index equality will be checked more than once for multi-coordinate indexes. Not a big deal I think, unless for large multi-indexes with many levels. A proper fix for this would probably require some heavier refactoring, though. Is it ok to leave this as is for now? - I'm not sure how we should handle the case where `foo` is the name of both an un-indexed dimension and a coordinate with other dimension(s). Let's take this example: ```python >>> from xarray.tests.test_dataset import create_test_multiindex >>> data = create_test_multiindex() >>> data Dimensions: (x: 4) Coordinates: * x (x) MultiIndex - level_1 (x) object 'a' 'a' 'b' 'b' - level_2 (x) int64 1 2 1 2 Data variables: *empty* ``` ```python >>> other = xr.Dataset({""z"": (""level_1"", [0, 1])}) >>> merged = data.merge(other) ValueError: conflicting level / dimension names. level_1 already exists as a level name. ``` Do we raise a `ValueError` here because otherwise we don't know what to return from `merged[""level_1""]` (i.e., the multi-index virtual coordinate or a default dimension coordinate)? Can we allow this case now that multi-index level coordinates are not virtual anymore? Note: the following example does not raise any error in xarray: ```python >>> data = xr.Dataset(coords={""x"": [0, 1, 2, 3], ""level_1"": (""x"", [""a"", ""a"", ""b"", ""b""])}) >>> other = xr.Dataset({""z"": (""level_1"", [0, 1])}) >>> merged = data.merge(other) >>> merged Dimensions: (x: 4, level_1: 2) Coordinates: * x (x) int64 0 1 2 3 level_1 (x) If we like, we could probably even add static type information on reindex_like, to require the same index type Yes, I did that for `Index.join()` and `Index.reindex_like()` in #5692. ","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,955936490 https://github.com/pydata/xarray/issues/5647#issuecomment-945931002,https://api.github.com/repos/pydata/xarray/issues/5647,945931002,IC_kwDOAMm_X844YcL6,1217238,2021-10-18T16:07:06Z,2021-10-18T16:07:06Z,MEMBER,"> For alignment, I think we could directly call `idx.reindex_like(other)` for each index in each object to align where `other` corresponds to the aligned index that matches `idx`. Sounds good to me! If we like, we could probably even add static type information on `reindex_like`, to require the same index type: ```python T = TypeVar(""T"", bound=""Index"") ... def reindex_like(self: T, other: T) -> DimIntIndexers: ... ```","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,955936490 https://github.com/pydata/xarray/issues/5647#issuecomment-945928870,https://api.github.com/repos/pydata/xarray/issues/5647,945928870,IC_kwDOAMm_X844Ybqm,1217238,2021-10-18T16:04:52Z,2021-10-18T16:04:52Z,MEMBER,"> 2\. `dim` has a multi-index, `labels` is a `pd.MultiIndex` and level names exactly match: no real issue either in this case, but shouldn't we discourage this implicit behavior in the mid/long term and instead encourage using `reindex_like` for such more advanced use case? I agree, this doesn't make sense in the long term because the ""name"" of the MultiIndex is no longer necessary: it's just the same index that happens to index each of the levels. Let's preserve it (for now) for backwards compat, but in the long term the ideal is certainly either (a) using `reindex_like` or (b) using `reindex` with separate kwargs for each level > 3\. `dim` has a multi-index and `labels` is (cast to) a single pandas index (or the other way around): this is currently possible in Xarray but it seems silly? After re-indexing, all data along `dim` is filled with `fill_value`... Would it be fine to instead raise an error _now_? Would it really break any user case? If part of Xarray's API currently doesn't raise an error but instead returns all NaNs, and this case can be detected based on static _type_ information (e.g., shapes, dtypes, dimension names, variable name, index types), then I agree that the best user experience is almost certainly to raise an error instead. NaNs in numerical computing are essentially a mechanism for ""runtime error handling"" that can arise from values in array computation. If something can be identified based on type information, that is a better user experience. > 4\. `dim` has a multi-index, `labels` is a `pd.MultiIndex` and multi-index level names don't match: same problem than for case 3. > > Cases 3 and 4 are a big obstacle for me right now. I really don't know how we can still support those special cases without deeply re-thinking the problem. If they could be considered as a bug, then the new implementation would already raise an nice error message :-). I think we can consider these edge cases bugs and fix them :)","{""total_count"": 1, ""+1"": 1, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,955936490 https://github.com/pydata/xarray/issues/5647#issuecomment-945592260,https://api.github.com/repos/pydata/xarray/issues/5647,945592260,IC_kwDOAMm_X844XJfE,4160723,2021-10-18T09:43:18Z,2021-10-18T10:47:26Z,MEMBER,"Ok, I'm now hitting another obstacle while working on reindex. So one general approach for both alignment and re-indexing that is ""pretty straightforward"" to implement with the new Xarray index data model is: (1) find matching indexes based on their corresponding coordinate/dimension names and index type, and (2) call `idx.join(other)` and/or `idx.reindex_like(other)` where `other` is another index object of the same type than `idx`. This is what I've done so far in #5692. Relaxing any of the constraints in (1) would be much more complicated to implement. We would need to do some sort of mapping from dimension labels to all involved (multi-/meta-)indexes, then check for conflicts in dimension indexers returned from multiple indexes, possibly handle/remove multi-index coordinates (or convert back to non-indexed coordinates), etc. One problem with `Dataset.reindex` and `DataArray.reindex` is that we can pass any `{dim: labels}` as indexers. How should we adapt the API for flexible indexes? How to ensure backwards compatibility? The current possible cases are: 1. Both `dim` and `labels` correspond (or may be cast) to single pandas indexes: there's no issue in this case with the constraints stated in (1) above. 2. `dim` has a multi-index, `labels` is a `pd.MultiIndex` and level names exactly match: no real issue either in this case, but shouldn't we discourage this implicit behavior in the mid/long term and instead encourage using `reindex_like` for such more advanced use case? 3. `dim` has a multi-index and `labels` is (cast to) a single pandas index (or the other way around): this is currently possible in Xarray but it seems silly? After re-indexing, all data along `dim` is filled with `fill_value`... Would it be fine to instead raise an error *now*? Would it really break any user case? 4. `dim` has a multi-index, `labels` is a `pd.MultiIndex` and multi-index level names don't match: same problem than for case 3. Cases 3 and 4 are a big obstacle for me right now. I really don't know how we can still support those special cases without deeply re-thinking the problem. If they could be considered as a bug, then the new implementation would already raise an nice error message :-).","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,955936490 https://github.com/pydata/xarray/issues/5647#issuecomment-944273383,https://api.github.com/repos/pydata/xarray/issues/5647,944273383,IC_kwDOAMm_X844SHfn,4160723,2021-10-15T12:53:45Z,2021-10-15T12:57:38Z,MEMBER,"I've now re-implemented most of the alignment logic in #5692, using a dedicated class so that all intermediate steps are implemented in a clearer and more maintainable way. One remaining task is to get the re-indexing logic right. Rather than relying on an `Index.dim_indexers` property, I feel that it'd be better to have two methods: ```python DimIntIndexers = Dict[Hashable, Any] class Index: def reindex(self, dim_labels: Mapping[Hashable, Any]) -> DimIntIndexers: ... def reindex_like(self, other: ""Index"") -> DimIntIndexers: ... ``` For alignment, I think we could directly call `idx.reindex_like(other)` for each index in each object to align where `other` corresponds to the aligned index that matches `idx`.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,955936490 https://github.com/pydata/xarray/issues/5647#issuecomment-937865982,https://api.github.com/repos/pydata/xarray/issues/5647,937865982,IC_kwDOAMm_X8435rL-,4160723,2021-10-07T14:48:02Z,2021-10-07T14:48:02Z,MEMBER,"@shoyer I'm looking more deeply into this. I think it will be impossible to avoid a heavy refactoring in `core/alignment.py` if we want to support ""meta-indexes"" and handle non-dimension indexed coordinates, so I'd like to check with you about how best we can tackle this problem. I'm thinking about the following approach: 1. Instead of looking at the dimensions of each object to align, look at their indexes. - matching indexes must all correspond to the same set of coordinate names *and* dimension names (should we also check the index type?) - if coordinate names vs. indexes match only partially, then raise an error 2. For each of the matched indexes, check for index equality and perform join if that's not the case (or just pick the matching index that has been explicitly passed to `align` via its `indexes` argument) - For convenience, we could expose an `Index.join(self, other, how)` method instead of having two separate `Index.union()` and `Index.intersection()` methods. `Index.join()` might also return index coordinate variables that we can assign to the aligned objects - if the indexes do not provide an implementation for `.equals`, check for coordinate equality - if they do not provide an implementation for `.join`, raise an error or ignore it? Probably better to ignore it, i.e., do not create a joined index in the aligned objects and align the coordinates just as regular variables (this might be taken care by another index)... 3. Add an `Index.dim_indexers(self) -> Dict[Hashable, Any]` property that returns label indexers for each dimension involved in the index, and which will be used to conform the objects to the joined indexes. - this is probably where we could [make sure to not cast str coords to object for pandas indexes](https://github.com/pydata/xarray/blob/5499949b5937277dcd599a182201d0e2fc5e818e/xarray/core/alignment.py#L327) 4. Merge `dim_indexers` returned by all joined indexes and raise if there's any conflict (i.e., two distinct indexes should not return indexers along a common dimension). 5. Check the size of unlabelled dimensions (also against the sizes found in the merged `dim_indexers` for matching dimensions) 6. Reindex the objects Does that sounds right to you? I'd really appreciate any guidance on this before going further as I'm worried about missing something important or another more straightforward approach. ","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,955936490 https://github.com/pydata/xarray/issues/5647#issuecomment-889483293,https://api.github.com/repos/pydata/xarray/issues/5647,889483293,IC_kwDOAMm_X841BHAd,4160723,2021-07-29T21:49:02Z,2021-07-29T21:49:02Z,MEMBER,"> We should raise an error in this cases (and/or suggest setting a MultiIndex). It should also be OK if not every index implements alignment, in which case they should raise an error if coordinates do not match exactly. Agreed. > I don't think we should try to support alignment with multi-dimensional (non-orthogonal) inside align(). Instead we can just require the coordinates corresponding to such indexes to match exactly (or within a tolerance). I could see some examples (e.g., union or intersection of staggered grids) where it could still be useful to have a (meta-)index that implements alignment, though. Actually, while working on #5636 I was thinking more about how to perform alignment based on `xarray.Index` objects given that the concept of an `xarray.Index` should be clearly distinct from the concept of a `Variable`. We could maybe add an `xarray.Index.sizes` property similarly to `Variable.sizes`. Alternatively, we could allow `xarray.Index.union()` and `xarray.Index.intersection()` returning both the joined index and the new `Variable` objects created from the latter index. I'd lean towards the 2nd option, which is consistent with the class method constructors added in #5636. Not sure the 1st option is a good idea and it doesn't solve all the issues mentioned in my comment above. ","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,955936490 https://github.com/pydata/xarray/issues/5647#issuecomment-889288761,https://api.github.com/repos/pydata/xarray/issues/5647,889288761,IC_kwDOAMm_X841AXg5,1217238,2021-07-29T16:26:01Z,2021-07-29T16:26:01Z,MEMBER,"Conceptually, `align` should ensure that all indexes in the provided arguments match. This is potentially ambiguous for cases with multiple (non-MultiIndex) indexes, if the result of aligning the separate indexes does not match, e.g., if we have: - Index for `x` along dimension `x`: `[1, 2, 3]` vs `[1, 2]` - Index for `y` along dimension `x`: `[1, 2, 3]` vs `[2, 3]` We should raise an error in this cases (and/or suggest setting a MultiIndex). It should also be OK if not every index implements alignment, in which case they should raise an error if coordinates do not match exactly. With regards to your concern: > This currently works well since a pd.Index can be directly treated as a 1-d array but this won’t be always the case anymore with custom indexes. I don't think we should try to support alignment with multi-dimensional (non-orthogonal) inside `align()`. Instead we can just require the coordinates correspondign to such indexes to match exactly (or within a tolerance). I can see supporting alignment between two DataArrays that are indexed by a KDTreeIndex over multiple coordintaes along a single dimension in the same way tthat we support alignment with MultiIndex indexes currently. But if your indexes correpsond to multi-dimensional arrays (rather than just multiple coordinates), joining indexes together is a much messier operation, one that may not be possible without thinking carefully about interpolation/regridding. In many cases it may not be possible to retain the multi-dimensional nature of the indexes in the result (e.g., the union of two partially overlapping grids). Since the desired behavior is not clear, it is better to force the user to make a choice, either by stacking the dimensions in multi-dimensional indexes into 1D (like a MultiIndex) or by calling a specialized method for interpolation/regridding.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,955936490