home / github

Menu
  • Search all tables
  • GraphQL API

issue_comments

Table actions
  • GraphQL API for issue_comments

4 rows where issue = 1372035441 and user = 4160723 sorted by updated_at descending

✎ View and edit SQL

This data as json, CSV (advanced)

Suggested facets: reactions, created_at (date), updated_at (date)

user 1

  • benbovy · 4 ✖

issue 1

  • Periodic Boundary Index · 4 ✖

author_association 1

  • MEMBER 4
id html_url issue_url node_id user created_at updated_at ▲ author_association body reactions performed_via_github_app issue
1249218631 https://github.com/pydata/xarray/issues/7031#issuecomment-1249218631 https://api.github.com/repos/pydata/xarray/issues/7031 IC_kwDOAMm_X85KdZBH benbovy 4160723 2022-09-16T10:50:10Z 2022-09-16T10:50:10Z MEMBER

In general, it seems like most (nearly all?) 1D indexing use cases can be handled by encapsulating a PandasIndex (see also https://github.com/dcherian/crsindex). So perhaps we should just recommend that and add a lot more comments to PandasIndex to make it easier to build on.

I've created a MultiPandasIndex helper class for that purpose: notebook.

I've extracted the boilerplate from @dcherian's CRSIndex and I've implemented the remaining Index API. It raised a couple of issues, notably for Index.isel which should probably return a dict[Hashable, Index] instead of Index | None (the latter is not flexible enough, i.e., when the dimensions of the meta-index are partially dropped in the selection).

The other thing I will think about is whether anything special needs to happen for 2D+ periodicity. I suspect that for integers you could just use independent 1D indexes along each dim but for slicing across the "dateline" it might get messy in higher dimensions...

Yeah I guess it will work well with independent PeriodicBoundaryIndex instances (possibly grouped in a MultiPandasIndex) for gridded data.

For multi-dimension coordinates with periodic boundaries this would probably be best handled by more specific indexes, e.g., xoak's s2point index that supports periodicity for lat/lon data (I think).

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Periodic Boundary Index 1372035441
1247731386 https://github.com/pydata/xarray/issues/7031#issuecomment-1247731386 https://api.github.com/repos/pydata/xarray/issues/7031 IC_kwDOAMm_X85KXt66 benbovy 4160723 2022-09-15T08:03:50Z 2022-09-15T08:03:50Z MEMBER

Great @TomNicholas!

To avoid copying the body of PandasIndex.sel, couldn't you "just" do something like this?

```python class PeriodicBoundaryIndex(PandasIndex): """ An index representing any 1D periodic numberline.

Implementation subclasses a normal xarray PandasIndex object but intercepts indexer queries.
"""
period: float

def __init__(self, *args, period=360, **kwargs):
    super().__init__(*args, **kwargs)
    self.period = period

@classmethod
def from_variables(self, variables, options):
    obj = super().from_variables(variables, options={})
    obj.period = options.get("period", obj.period)
    return obj

def _wrap_periodically(self, label_value):
    return self.index.min() + (label_value - self.index.max()) % self.period

def sel(
    self, labels: dict[Any, Any], method=None, tolerance=None
) -> IndexSelResult:
    """Remaps labels outside of the indexes' range back to integer indices inside the range."""

    assert len(labels) == 1
    coord_name, label = next(iter(labels.items()))

    if isinstance(label, slice):
        wrapped_label = slice(
            self._wrap_periodically(label.start),
            self._wrap_periodically(label.stop),
        )
    else:
        wrapped_label = self._wrap_periodically(label)

    return super().sel({coord_name: wrapped_label})

```

Note: I also added period as an option, which is supported in #6971 but not yet well documented. Another way to pass options is via coordinate attributes, like in this FunctionalIndex example.

It should work in most cases I think:

```python lon_coord = xr.DataArray(data=np.linspace(-180, 180, 19), dims="lon") da = xr.DataArray(data=np.random.randn(19), dims="lon", coords={"lon": lon_coord})

note the period set here

world = da.drop_indexes("lon").set_xindex("lon", index_cls=PeriodicBoundaryIndex, period=360) ```

```python world.sel(lon=200, method="nearest")

<xarray.DataArray ()>

array(-0.86583185)

Coordinates:

lon float64 -160.0

world.sel(lon=[200, 200], method="nearest")

<xarray.DataArray (lon: 2)>

array([-0.86583185, -0.86583185])

Coordinates:

* lon (lon) float64 -160.0 -160.0

world.sel(lon=slice(180, 200), method="nearest")

<xarray.DataArray (lon: 2)>

array([-1.59829997, -0.86583185])

Coordinates:

* lon (lon) float64 -180.0 -160.0

```

There's likely more things to do for slices as you point out. I don't think either that it's possible to pass two slices to isel. Not sure how this could be handled, but probably the easiest is to raise for cases like world.sel(lon=slice(170, 190)).

If we really need more flexibility in sel without copying the whole body of PandasIndex.sel, we could indeed refactor PandasIndex to allow more customization in subclasses. We must be careful, though, as it may be harder to make changes without possibly breaking 3rd-party stuff.

Or like you suggest we could define some _pre_process / _post_process hooks. It's not obvious where to call those hooks, though. Before or after converting from/to Variable or DataArray? Before or after checking for slices? array or scalar? The ideal place may change from one index to another.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Periodic Boundary Index 1372035441
1246944137 https://github.com/pydata/xarray/issues/7031#issuecomment-1246944137 https://api.github.com/repos/pydata/xarray/issues/7031 IC_kwDOAMm_X85KUtuJ benbovy 4160723 2022-09-14T15:30:59Z 2022-09-14T16:31:29Z MEMBER

My understanding from reading the docs was that every Dataset.meth calls the corresponding Index.meth.

Yes that's indeed what I've written in #6975 and I realize now that this is confusing, especially for isel.

So Dataset.sel calls Index.sel, but can also sometimes call Dataset.isel. But Dataset.isel does not call Index.isel, nor Index.sel.

So we can describe the implementation of Dataset.sel() as a two-step procedure:

  1. remap the input dictionary {coord_name: label_values} to a dictionary {dimension_name: int_positions}.

    • This is done via dispatching the input dictionary and calling Index.sel() for each of the relevant indexes found in Dataset.xindexes, and then merging all the returned results into a single output dictionary.
  2. pass the the dictionary {dimension_name: int_positions} to Dataset.isel().

    • Dataset.isel() will dispatch this input dictionary and call Variable.isel() for each variable in Dataset.variables and Index.isel() for each unique index in Dataset.xindexes.

This omits a few implementation details (special cases for multi-index), but that's basically how it works.

I think it would help if such "how label-based selection works in Xarray" high-level description was added somewhere in the "Xarray internals" documentation, along with other "how it works" sections for, e.g., alignment.

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Periodic Boundary Index 1372035441
1246398771 https://github.com/pydata/xarray/issues/7031#issuecomment-1246398771 https://api.github.com/repos/pydata/xarray/issues/7031 IC_kwDOAMm_X85KSokz benbovy 4160723 2022-09-14T08:08:26Z 2022-09-14T08:08:26Z MEMBER

tl;dr: Xarray Index currently supports implementing periodic indexing for label-based indexing but not for location-based (integer) indexing.

There's a big difference now between isel and sel:

  • Dataset.isel() accepts dimension names only
  • Dataset.sel() accepts coordinate names (actually, it falls back to isel when giving dimension names with no coordinate, and I'm wondering if we shouldn't deprecate that?)

Index.isel() is convenient when the underlying index structure can be itself sliced (like pandas.Index objects), so that users don't need to do ds.isel(...).set_xindex(...) every time to explicitly rebuild an index after slicing the Dataset. For a kd-tree structure that may not be possible, i.e., KDTreeIndex.isel() would likely return None causing the index to be dropped in the result, so there would be no way around doing ds.isel(...).set_xindex(...).

Most coordinate and data variables are still sliced via Variable.isel(), which doesn't involve any index. That's why you get an IndexError in your example. (side note: the "index" / "indexing" terminology used everywhere, for both label and integer selection, is quite confusing but I'm not sure how this could be improved).

If we want to support periodic indexing with isel, we would have to implement that in Xarray itself. Alternatively, it might be possible to add some API in Index so that in the case of a periodic index it would return indxr % length from indxr, which Xarray will then pass to Variable.isel(). I'm not sure the latter is a good idea, though. Indexes may work with arbitrary coordinates and dimensions, which would make things too complex (handling conflicts, etc.). Also, I don't know if there's other potential use cases besides periodic indexing?

@TomNicholas your experiment makes it clear that the documentation on this part (#6975) should be improved. Thanks!

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Periodic Boundary Index 1372035441

Advanced export

JSON shape: default, array, newline-delimited, object

CSV options:

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);
Powered by Datasette · Queries took 45.093ms · About: xarray-datasette