
issue_comments


4 rows where user = 18488, sorted by updated_at descending



id: 748491929
html_url: https://github.com/pydata/xarray/issues/1553#issuecomment-748491929
issue_url: https://api.github.com/repos/pydata/xarray/issues/1553
node_id: MDEyOklzc3VlQ29tbWVudDc0ODQ5MTkyOQ==
user: batterseapower (18488)
created_at: 2020-12-19T16:00:00Z
updated_at: 2020-12-19T16:00:00Z
author_association: NONE

For the case of a simple vectorized reindex, you can work around the lack of a multi-dimensional DataArray.reindex by falling back on isel, as follows:

```python
import functools

import numpy as np
import xarray as xr


def reindex_vectorized(da, indexers, method=None, tolerance=None, dim=None, fill_value=None):
    # Reindex does not presently support vectorized lookups: https://github.com/pydata/xarray/issues/1553
    # Sel does (e.g. https://github.com/pydata/xarray/issues/4630) but can't handle missing keys

    if dim is None:
        dim = 'dim_0'

    if fill_value is None:
        fill_value = {'i': np.nan, 'f': np.nan}[da.dtype.kind]
    dtype = np.result_type(fill_value, da.dtype)

    # method/tolerance may be a single value for all dims or a per-dimension dict
    if method is None:
        method = {}
    elif not isinstance(method, dict):
        method = {d: method for d in da.dims}

    if tolerance is None:
        tolerance = {}
    elif not isinstance(tolerance, dict):
        tolerance = {d: tolerance for d in da.dims}

    # Translate each label lookup into positional indices; get_indexer returns
    # -1 for labels it cannot resolve, which we track via `masks`
    ixs = {}
    masks = []
    any_empty = False
    for index_dim, index in indexers.items():
        ix = da.indexes[index_dim].get_indexer(
            index, method=method.get(index_dim), tolerance=tolerance.get(index_dim)
        )
        ixs[index_dim] = xr.DataArray(np.fmax(0, ix), dims=[dim])
        masks.append(ix >= 0)
        any_empty = any_empty or (len(da.indexes[index_dim]) == 0)

    mask = functools.reduce(lambda x, y: x & y, masks)

    if any_empty and len(mask):
        # Unfortunately can't just isel with `ixs` in this special case, because
        # we'd go out of bounds accessing index 0 of an empty dimension
        new_coords = {
            name: coord
            for name, coord in da.coords.items()
            # XXX: to match the other case we should really include coords with name in ixs too, but it's fiddly
            if name not in ixs
        }
        new_dims = [name for name in da.dims if name not in ixs] + [dim]
        result = xr.DataArray(
            data=np.broadcast_to(
                fill_value,
                tuple(n for name, n in da.sizes.items() if name not in ixs) + (len(mask),)
            ),
            coords=new_coords, dims=new_dims,
            name=da.name, attrs=da.attrs
        )
    else:
        result = da[ixs]

        if not mask.all():
            # Some labels were missing: upcast if necessary, then fill them
            result = result.astype(dtype, copy=False)
            result[{dim: ~mask}] = fill_value

    return result
```

Example:

```python
sensor_data = xr.DataArray(
    np.arange(6).reshape((3, 2)),
    coords=[
        ('time', [0, 2, 3]),
        ('sensor', ['A', 'C']),
    ],
)

reindex_vectorized(sensor_data, {
    'sensor': ['A', 'A', 'A', 'B', 'C'],
    'time': [0, 1, 2, 0, 0],
}, method={'time': 'ffill'})
# => [0, 0, 2, nan, 1]

reindex_vectorized(xr.DataArray(coords=[
    ('sensor', []),
    ('time', [0, 2]),
]), {
    'sensor': ['A', 'A', 'A', 'B', 'C'],
    'time': [0, 1, 2, 0, 0],
}, method={'time': 'ffill'})
# => [nan, nan, nan, nan, nan]
```

reactions:
{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
issue: Multidimensional reindex (254927382)
id: 748486801
html_url: https://github.com/pydata/xarray/issues/4714#issuecomment-748486801
issue_url: https://api.github.com/repos/pydata/xarray/issues/4714
node_id: MDEyOklzc3VlQ29tbWVudDc0ODQ4NjgwMQ==
user: batterseapower (18488)
created_at: 2020-12-19T15:13:36Z
updated_at: 2020-12-19T15:14:59Z
author_association: NONE

Thanks for the response. I think reindex would need to be changed as well, because this code:

```python
sensor_data.reindex({
    'time': [1],
    'sensor': ['A', 'B'],
}, method='ffill')
```

is not equivalent to this code:

```python
sensor_data.reindex({
    'time': [1],
    'sensor': ['A', 'B'],
}).ffill(dim='time').ffill(dim='sensor')
```
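To make the difference concrete, here is a minimal sketch (assuming the same `sensor_data` as elsewhere in this thread: times `[0, 2, 3]`, sensors `['A', 'C']`, values 0..5); the commented values are what I would expect, not verbatim xarray output:

```python
import numpy as np
import xarray as xr

sensor_data = xr.DataArray(
    np.arange(6).reshape((3, 2)),
    coords=[('time', [0, 2, 3]), ('sensor', ['A', 'C'])],
)

# reindex(method='ffill') resolves time 1 against the *original* index, so it
# back-fills from time 0; and because the method applies to every reindexed
# dimension, the missing sensor 'B' is also filled from sensor 'A':
sensor_data.reindex({'time': [1], 'sensor': ['A', 'B']}, method='ffill')
# values: [[0, 0]]

# Without a method, reindex keeps exact matches only, so the time-1 row is
# all-NaN before ffill ever runs, and ffill then has nothing to fill from:
sensor_data.reindex({'time': [1], 'sensor': ['A', 'B']}).ffill(dim='time').ffill(dim='sensor')
# values: [[nan, nan]]
```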

So if I understand your to_dataset idea correctly, you are proposing:

```python
ds = sensor_data.to_dataset(dim='sensor')
xr.concat([
    ds[sensor].sel({'time': time}, method='ffill', drop=True)
    for sensor, time in zip(['A', 'A', 'A', 'B', 'C'], [0, 1, 2, 0, 0])
], dim='sample')
```

I guess this works, but it's a bit cumbersome and unlikely to be fast. There must be something I'm not understanding here, as I'm not familiar with all the nuances of the xarray API.

Your idea of reindex followed by sel is an interesting one, but it does something slightly different from what I was asking for: it does not fail if one of the sensors in the query list is missing, but rather inserts a NaN. I suppose you could fix this with an extra check afterwards, assuming that your original pre-reindex data contained no NaNs.
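Concretely, the extra check could look something like this rough sketch (continuing from the snippet above, with hypothetical `sensors`/`times` query lists, and assuming the original data contains no NaNs):

```python
sensors = ['A', 'A', 'A', 'B', 'C']  # hypothetical query
times = [0, 1, 2, 0, 0]

# Reindex the sensor axis exactly, so missing sensors become all-NaN columns,
# then do the vectorized lookup (ffill resolves the inexact time labels; the
# padded sensor labels all match exactly, so ffill is a no-op on that axis):
padded = sensor_data.reindex(sensor=np.unique(sensors))
result = padded.sel(
    {
        'sensor': xr.DataArray(sensors, dims='sample'),
        'time': xr.DataArray(times, dims='sample'),
    },
    method='ffill',
)

# The extra check: any NaN in the result must come from a missing sensor
if result.isnull().any():
    raise KeyError('query refers to sensors missing from the data')
```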

In general min(S*N, T*N) could be much larger than S*T, so for big queries it's quite possible that you wouldn't have enough space to allocate the intermediate even if you could fit hundreds of copies of the original S*T matrix. Using a dask cluster would of course make this situation less likely, but it seems better to avoid all this copying (even on a beefy cluster), if only for performance reasons.
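To put rough numbers on that (made-up sizes, purely for scale):

```python
S, T, N = 1_000, 10_000, 5_000_000  # hypothetical sensor/time/query-pair counts

original = S * T                    # 1e7 elements: the data you hold anyway
intermediate = min(S * N, T * N)    # 5e9 elements: the two-step approach's temporary
result = N                          # 5e6 elements: all a vectorized sel would need

print(intermediate // original)     # 500  -> hundreds of copies of the original
print(intermediate // result)       # 1000 -> three orders of magnitude over the result
```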

reactions:
{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
issue: Allow sel's method and tolerance to vary per-dimension (771382653)
id: 748479287
html_url: https://github.com/pydata/xarray/issues/4714#issuecomment-748479287
issue_url: https://api.github.com/repos/pydata/xarray/issues/4714
node_id: MDEyOklzc3VlQ29tbWVudDc0ODQ3OTI4Nw==
user: batterseapower (18488)
created_at: 2020-12-19T14:06:36Z
updated_at: 2020-12-19T14:06:36Z
author_association: NONE

Thanks for the suggestion. One issue with this alternative is that it creates a potentially large intermediate object.

If you have T times and S sensors, and want to sample them at N (time, sensor) pairs, then the intermediate object with your approach has size T*N (if you index sensors first) or S*N (if you index time first). If you could index both dimensions in one sel call, you would only need to allocate memory for the result, of size N, which is considerably better.
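In other words, the call shape being asked for is something like the sketch below. A dict-valued `method` is the change this issue proposes, not something xarray's sel supports today:

```python
# Hypothetical API: per-dimension method (exact match on 'sensor', ffill on 'time').
# Only the N-element result would ever need to be allocated.
result = sensor_data.sel(
    {
        'sensor': xr.DataArray(['A', 'A', 'A', 'B', 'C'], dims='sample'),
        'time': xr.DataArray([0, 1, 2, 0, 0], dims='sample'),
    },
    method={'time': 'ffill'},  # proposed: dict mapping dimension -> method
)
```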

reactions:
{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
issue: Allow sel's method and tolerance to vary per-dimension (771382653)
id: 748477889
html_url: https://github.com/pydata/xarray/issues/4714#issuecomment-748477889
issue_url: https://api.github.com/repos/pydata/xarray/issues/4714
node_id: MDEyOklzc3VlQ29tbWVudDc0ODQ3Nzg4OQ==
user: batterseapower (18488)
created_at: 2020-12-19T13:53:53Z
updated_at: 2020-12-19T13:53:53Z
author_association: NONE

I guess it would also make sense to have this in reindex if you did decide to add it.

reactions:
{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
issue: Allow sel's method and tolerance to vary per-dimension (771382653)


Table schema:

```sql
CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);
```