home / github

Menu
  • Search all tables
  • GraphQL API

issue_comments

Table actions
  • GraphQL API for issue_comments

12 rows where issue = 955936490 sorted by updated_at descending

✎ View and edit SQL

This data as json, CSV (advanced)

Suggested facets: reactions, created_at (date), updated_at (date)

user 2

  • benbovy 9
  • shoyer 3

issue 1

  • Flexible indexes: review the implementation of alignment and merge · 12 ✖

author_association 1

  • MEMBER 12
id html_url issue_url node_id user created_at updated_at ▲ author_association body reactions performed_via_github_app issue
1239162673 https://github.com/pydata/xarray/issues/5647#issuecomment-1239162673 https://api.github.com/repos/pydata/xarray/issues/5647 IC_kwDOAMm_X85J3B8x benbovy 4160723 2022-09-07T09:47:13Z 2022-09-07T09:47:13Z MEMBER

I think we can close this issue, which has mostly been addressed in #5692. I opened #7002 for the potential problem of index alignment vs. coordinate ordering.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Flexible indexes: review the implementation of alignment and merge 955936490
949618104 https://github.com/pydata/xarray/issues/5647#issuecomment-949618104 https://api.github.com/repos/pydata/xarray/issues/5647 IC_kwDOAMm_X844mgW4 benbovy 4160723 2021-10-22T13:11:01Z 2021-10-22T13:11:01Z MEMBER

@shoyer I'm now reviewing the merging logic. I works mostly fine, with some minor concerns:

  • MergeElement (variable, index) tuples are not grouped per matching (unique) index, so index equality will be checked more than once for multi-coordinate indexes. Not a big deal I think, unless for large multi-indexes with many levels. A proper fix for this would probably require some heavier refactoring, though. Is it ok to leave this as is for now?

  • I'm not sure how we should handle the case where foo is the name of both an un-indexed dimension and a coordinate with other dimension(s).

Let's take this example:

```python

from xarray.tests.test_dataset import create_test_multiindex data = create_test_multiindex() data <xarray.Dataset> Dimensions: (x: 4) Coordinates: * x (x) MultiIndex - level_1 (x) object 'a' 'a' 'b' 'b' - level_2 (x) int64 1 2 1 2 Data variables: empty python other = xr.Dataset({"z": ("level_1", [0, 1])}) merged = data.merge(other) ValueError: conflicting level / dimension names. level_1 already exists as a level name. ```

Do we raise a ValueError here because otherwise we don't know what to return from merged["level_1"] (i.e., the multi-index virtual coordinate or a default dimension coordinate)? Can we allow this case now that multi-index level coordinates are not virtual anymore?

Note: the following example does not raise any error in xarray:

```python

data = xr.Dataset(coords={"x": [0, 1, 2, 3], "level_1": ("x", ["a", "a", "b", "b"])}) other = xr.Dataset({"z": ("level_1", [0, 1])}) merged = data.merge(other) merged <xarray.Dataset> Dimensions: (x: 4, level_1: 2) Coordinates: * x (x) int64 0 1 2 3 level_1 (x) <U1 'a' 'a' 'b' 'b' Data variables: z (level_1) int64 0 1 ```

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Flexible indexes: review the implementation of alignment and merge 955936490
946883858 https://github.com/pydata/xarray/issues/5647#issuecomment-946883858 https://api.github.com/repos/pydata/xarray/issues/5647 IC_kwDOAMm_X844cE0S benbovy 4160723 2021-10-19T16:16:24Z 2021-10-19T16:16:24Z MEMBER

Other possible solutions (from discussion with @shoyer):

  1. Take care of coordinate order (and maybe other things) inside Index.join and Index.equals, e.g., for PandasMultiIndex maybe reorder the levels beforehand.
    • pros: more flexible
    • cons: not great to implicitly reorder levels if it's a costly operation?
  2. Find matching indexes using a two-passes approach: (1) group all indexes by dimension name and (2) check compatibility between the indexes listed in each group.
{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Flexible indexes: review the implementation of alignment and merge 955936490
946546464 https://github.com/pydata/xarray/issues/5647#issuecomment-946546464 https://api.github.com/repos/pydata/xarray/issues/5647 IC_kwDOAMm_X844aycg benbovy 4160723 2021-10-19T09:47:04Z 2021-10-19T09:47:04Z MEMBER

I've used the following key type to find matching indexes:

python CoordNamesAndDims = FrozenSet[Tuple[Hashable, Tuple[Hashable, ...]]] MatchingIndexKey = Tuple[CoordNamesAndDims, Type[Index]]

where the order of coordinates doesn't matter. For PandasMultiIndex this is wrong, we should use a Tuple instead of a FrozenSet (assuming that the coordinate names are properly ordered in obj.xindexes), but I'm wondering if using a Tuple wouldn't be too restrictive in some other cases.

Are there potential custom indexes where the order of coordinates doesn't matter? Maybe a good example is a meta-index for staggered grids where the cell center coordinate and the cell edges coordinates might be given in any order.

Possible solutions to address this:

  1. Xarray indexes may reorder the coordinate variables, possibly via Index.create_variables(), to ensure consistent order
  2. Xarray indexes must implement the Index.matching_key abstract property in order to support re-indexing and alignment.

Option 2 is more flexible but option 1 might be enough. Option 1 may also be great for clearer indexes and coordinates sections in Xarray objects repr.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Flexible indexes: review the implementation of alignment and merge 955936490
945933099 https://github.com/pydata/xarray/issues/5647#issuecomment-945933099 https://api.github.com/repos/pydata/xarray/issues/5647 IC_kwDOAMm_X844Ycsr benbovy 4160723 2021-10-18T16:09:11Z 2021-10-18T16:09:11Z MEMBER

If we like, we could probably even add static type information on reindex_like, to require the same index type

Yes, I did that for Index.join() and Index.reindex_like() in #5692.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Flexible indexes: review the implementation of alignment and merge 955936490
945931002 https://github.com/pydata/xarray/issues/5647#issuecomment-945931002 https://api.github.com/repos/pydata/xarray/issues/5647 IC_kwDOAMm_X844YcL6 shoyer 1217238 2021-10-18T16:07:06Z 2021-10-18T16:07:06Z MEMBER

For alignment, I think we could directly call idx.reindex_like(other) for each index in each object to align where other corresponds to the aligned index that matches idx.

Sounds good to me!

If we like, we could probably even add static type information on reindex_like, to require the same index type: ```python T = TypeVar("T", bound="Index")

... def reindex_like(self: T, other: T) -> DimIntIndexers: ... ```

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Flexible indexes: review the implementation of alignment and merge 955936490
945928870 https://github.com/pydata/xarray/issues/5647#issuecomment-945928870 https://api.github.com/repos/pydata/xarray/issues/5647 IC_kwDOAMm_X844Ybqm shoyer 1217238 2021-10-18T16:04:52Z 2021-10-18T16:04:52Z MEMBER

2. dim has a multi-index, labels is a pd.MultiIndex and level names exactly match: no real issue either in this case, but shouldn't we discourage this implicit behavior in the mid/long term and instead encourage using reindex_like for such more advanced use case?

I agree, this doesn't make sense in the long term because the "name" of the MultiIndex is no longer necessary: it's just the same index that happens to index each of the levels.

Let's preserve it (for now) for backwards compat, but in the long term the ideal is certainly either (a) using reindex_like or (b) using reindex with separate kwargs for each level

3. dim has a multi-index and labels is (cast to) a single pandas index (or the other way around): this is currently possible in Xarray but it seems silly? After re-indexing, all data along dim is filled with fill_value... Would it be fine to instead raise an error now? Would it really break any user case?

If part of Xarray's API currently doesn't raise an error but instead returns all NaNs, and this case can be detected based on static type information (e.g., shapes, dtypes, dimension names, variable name, index types), then I agree that the best user experience is almost certainly to raise an error instead.

NaNs in numerical computing are essentially a mechanism for "runtime error handling" that can arise from values in array computation. If something can be identified based on type information, that is a better user experience.

4. dim has a multi-index, labels is a pd.MultiIndex and multi-index level names don't match: same problem than for case 3.

Cases 3 and 4 are a big obstacle for me right now. I really don't know how we can still support those special cases without deeply re-thinking the problem. If they could be considered as a bug, then the new implementation would already raise an nice error message :-).

I think we can consider these edge cases bugs and fix them :)

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Flexible indexes: review the implementation of alignment and merge 955936490
945592260 https://github.com/pydata/xarray/issues/5647#issuecomment-945592260 https://api.github.com/repos/pydata/xarray/issues/5647 IC_kwDOAMm_X844XJfE benbovy 4160723 2021-10-18T09:43:18Z 2021-10-18T10:47:26Z MEMBER

Ok, I'm now hitting another obstacle while working on reindex.

So one general approach for both alignment and re-indexing that is "pretty straightforward" to implement with the new Xarray index data model is: (1) find matching indexes based on their corresponding coordinate/dimension names and index type, and (2) call idx.join(other) and/or idx.reindex_like(other) where other is another index object of the same type than idx. This is what I've done so far in #5692.

Relaxing any of the constraints in (1) would be much more complicated to implement. We would need to do some sort of mapping from dimension labels to all involved (multi-/meta-)indexes, then check for conflicts in dimension indexers returned from multiple indexes, possibly handle/remove multi-index coordinates (or convert back to non-indexed coordinates), etc.

One problem with Dataset.reindex and DataArray.reindex is that we can pass any {dim: labels} as indexers. How should we adapt the API for flexible indexes? How to ensure backwards compatibility? The current possible cases are:

  1. Both dim and labels correspond (or may be cast) to single pandas indexes: there's no issue in this case with the constraints stated in (1) above.

  2. dim has a multi-index, labels is a pd.MultiIndex and level names exactly match: no real issue either in this case, but shouldn't we discourage this implicit behavior in the mid/long term and instead encourage using reindex_like for such more advanced use case?

  3. dim has a multi-index and labels is (cast to) a single pandas index (or the other way around): this is currently possible in Xarray but it seems silly? After re-indexing, all data along dim is filled with fill_value... Would it be fine to instead raise an error now? Would it really break any user case?

  4. dim has a multi-index, labels is a pd.MultiIndex and multi-index level names don't match: same problem than for case 3.

Cases 3 and 4 are a big obstacle for me right now. I really don't know how we can still support those special cases without deeply re-thinking the problem. If they could be considered as a bug, then the new implementation would already raise an nice error message :-).

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Flexible indexes: review the implementation of alignment and merge 955936490
944273383 https://github.com/pydata/xarray/issues/5647#issuecomment-944273383 https://api.github.com/repos/pydata/xarray/issues/5647 IC_kwDOAMm_X844SHfn benbovy 4160723 2021-10-15T12:53:45Z 2021-10-15T12:57:38Z MEMBER

I've now re-implemented most of the alignment logic in #5692, using a dedicated class so that all intermediate steps are implemented in a clearer and more maintainable way.

One remaining task is to get the re-indexing logic right. Rather than relying on an Index.dim_indexers property, I feel that it'd be better to have two methods:

```python DimIntIndexers = Dict[Hashable, Any]

class Index: def reindex(self, dim_labels: Mapping[Hashable, Any]) -> DimIntIndexers: ...

def reindex_like(self, other: "Index") -> DimIntIndexers:
    ...

```

For alignment, I think we could directly call idx.reindex_like(other) for each index in each object to align where other corresponds to the aligned index that matches idx.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Flexible indexes: review the implementation of alignment and merge 955936490
937865982 https://github.com/pydata/xarray/issues/5647#issuecomment-937865982 https://api.github.com/repos/pydata/xarray/issues/5647 IC_kwDOAMm_X8435rL- benbovy 4160723 2021-10-07T14:48:02Z 2021-10-07T14:48:02Z MEMBER

@shoyer I'm looking more deeply into this. I think it will be impossible to avoid a heavy refactoring in core/alignment.py if we want to support "meta-indexes" and handle non-dimension indexed coordinates, so I'd like to check with you about how best we can tackle this problem.

I'm thinking about the following approach:

  1. Instead of looking at the dimensions of each object to align, look at their indexes.

    • matching indexes must all correspond to the same set of coordinate names and dimension names (should we also check the index type?)
    • if coordinate names vs. indexes match only partially, then raise an error
  2. For each of the matched indexes, check for index equality and perform join if that's not the case (or just pick the matching index that has been explicitly passed to align via its indexes argument)

    • For convenience, we could expose an Index.join(self, other, how) method instead of having two separate Index.union() and Index.intersection() methods. Index.join() might also return index coordinate variables that we can assign to the aligned objects
    • if the indexes do not provide an implementation for .equals, check for coordinate equality
    • if they do not provide an implementation for .join, raise an error or ignore it? Probably better to ignore it, i.e., do not create a joined index in the aligned objects and align the coordinates just as regular variables (this might be taken care by another index)...
  3. Add an Index.dim_indexers(self) -> Dict[Hashable, Any] property that returns label indexers for each dimension involved in the index, and which will be used to conform the objects to the joined indexes.

    • this is probably where we could make sure to not cast str coords to object for pandas indexes
  4. Merge dim_indexers returned by all joined indexes and raise if there's any conflict (i.e., two distinct indexes should not return indexers along a common dimension).

  5. Check the size of unlabelled dimensions (also against the sizes found in the merged dim_indexers for matching dimensions)

  6. Reindex the objects

Does that sounds right to you? I'd really appreciate any guidance on this before going further as I'm worried about missing something important or another more straightforward approach.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Flexible indexes: review the implementation of alignment and merge 955936490
889483293 https://github.com/pydata/xarray/issues/5647#issuecomment-889483293 https://api.github.com/repos/pydata/xarray/issues/5647 IC_kwDOAMm_X841BHAd benbovy 4160723 2021-07-29T21:49:02Z 2021-07-29T21:49:02Z MEMBER

We should raise an error in this cases (and/or suggest setting a MultiIndex). It should also be OK if not every index implements alignment, in which case they should raise an error if coordinates do not match exactly.

Agreed.

I don't think we should try to support alignment with multi-dimensional (non-orthogonal) inside align(). Instead we can just require the coordinates corresponding to such indexes to match exactly (or within a tolerance).

I could see some examples (e.g., union or intersection of staggered grids) where it could still be useful to have a (meta-)index that implements alignment, though.

Actually, while working on #5636 I was thinking more about how to perform alignment based on xarray.Index objects given that the concept of an xarray.Index should be clearly distinct from the concept of a Variable.

We could maybe add an xarray.Index.sizes property similarly to Variable.sizes.

Alternatively, we could allow xarray.Index.union() and xarray.Index.intersection() returning both the joined index and the new Variable objects created from the latter index.

I'd lean towards the 2nd option, which is consistent with the class method constructors added in #5636. Not sure the 1st option is a good idea and it doesn't solve all the issues mentioned in my comment above.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Flexible indexes: review the implementation of alignment and merge 955936490
889288761 https://github.com/pydata/xarray/issues/5647#issuecomment-889288761 https://api.github.com/repos/pydata/xarray/issues/5647 IC_kwDOAMm_X841AXg5 shoyer 1217238 2021-07-29T16:26:01Z 2021-07-29T16:26:01Z MEMBER

Conceptually, align should ensure that all indexes in the provided arguments match.

This is potentially ambiguous for cases with multiple (non-MultiIndex) indexes, if the result of aligning the separate indexes does not match, e.g., if we have: - Index for x along dimension x: [1, 2, 3] vs [1, 2] - Index for y along dimension x: [1, 2, 3] vs [2, 3]

We should raise an error in this cases (and/or suggest setting a MultiIndex). It should also be OK if not every index implements alignment, in which case they should raise an error if coordinates do not match exactly.

With regards to your concern:

This currently works well since a pd.Index can be directly treated as a 1-d array but this won’t be always the case anymore with custom indexes.

I don't think we should try to support alignment with multi-dimensional (non-orthogonal) inside align(). Instead we can just require the coordinates correspondign to such indexes to match exactly (or within a tolerance). I can see supporting alignment between two DataArrays that are indexed by a KDTreeIndex over multiple coordintaes along a single dimension in the same way tthat we support alignment with MultiIndex indexes currently.

But if your indexes correpsond to multi-dimensional arrays (rather than just multiple coordinates), joining indexes together is a much messier operation, one that may not be possible without thinking carefully about interpolation/regridding. In many cases it may not be possible to retain the multi-dimensional nature of the indexes in the result (e.g., the union of two partially overlapping grids). Since the desired behavior is not clear, it is better to force the user to make a choice, either by stacking the dimensions in multi-dimensional indexes into 1D (like a MultiIndex) or by calling a specialized method for interpolation/regridding.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Flexible indexes: review the implementation of alignment and merge 955936490

Advanced export

JSON shape: default, array, newline-delimited, object

CSV options:

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);
Powered by Datasette · Queries took 1761.146ms · About: xarray-datasette