issue_comments

6 rows where issue = 771382653 (Allow sel's method and tolerance to vary per-dimension), sorted by updated_at descending

keewis (MEMBER) · 2020-12-19T16:58:20Z · https://github.com/pydata/xarray/issues/4714#issuecomment-748498256

> I think reindex would need to be changed

That's true; I only tried the special case where the data that would be used to do the forward fill is included in the result.

> I guess this works but it's a bit cumbersome

Yeah, to_dataset is probably not the right tool for pointwise indexing.

> it does not fail if one of the sensors in the query list is missing

If I understand correctly, you would like to index with arbitrary values for time, but would like an error for missing values of sensor. Unfortunately, I don't think that is possible using a single call to sel. Instead, you could set the fill_value parameter of reindex to some other value (for example, -np.inf) and then drop those values after the pointwise indexing.
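
A minimal sketch of that workaround, assuming a hypothetical toy sensor_data (the data, sensor names, and query values below are invented for illustration):

```python
import numpy as np
import xarray as xr

# Hypothetical toy data: 3 times x 2 sensors; sensor "C" does not exist.
sensor_data = xr.DataArray(
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    coords={'time': [0, 1, 2], 'sensor': ['A', 'B']},
    dims=['time', 'sensor'],
)

# Reindex with a sentinel fill value instead of the default NaN ...
filled = sensor_data.reindex(sensor=['A', 'B', 'C'], fill_value=-np.inf)

# ... do the pointwise indexing ...
picked = filled.sel(
    {'sensor': xr.DataArray(['A', 'A', 'A', 'B', 'C'], dims=['sample']),
     'time': xr.DataArray([0, 1, 2, 0, 0], dims=['sample'])},
    method='ffill',
)

# ... then drop the samples that hit the sentinel (the missing sensor "C").
result = picked.where(picked != -np.inf, drop=True)
print(result.values)  # [1. 3. 5. 2.]
```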

batterseapower (NONE) · 2020-12-19T15:13:36Z (edited 2020-12-19T15:14:59Z) · https://github.com/pydata/xarray/issues/4714#issuecomment-748486801

Thanks for the response. I think reindex would need to be changed as well, because this code:

```python
sensor_data.reindex({'time': [1], 'sensor': ['A', 'B']}, method='ffill')
```

is not equivalent to this code:

```python
sensor_data.reindex({'time': [1], 'sensor': ['A', 'B']}).ffill(dim='time').ffill(dim='sensor')
```
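
To make the non-equivalence concrete, here is a hypothetical toy example (data and values invented for illustration): with method='ffill' the fill happens during reindexing, while the chained ffill calls only see the already-reindexed result, which no longer contains the rows to fill from.

```python
import xarray as xr

# Toy data: observations at times 0 and 2, so time=1 must be forward-filled.
sensor_data = xr.DataArray(
    [[1.0, 2.0], [3.0, 4.0]],
    coords={'time': [0, 2], 'sensor': ['A', 'B']},
    dims=['time', 'sensor'],
)

# ffill during reindexing: time=1 is filled from the time=0 row.
a = sensor_data.reindex({'time': [1], 'sensor': ['A', 'B']}, method='ffill')
print(a.values)  # [[1. 2.]]

# ffill after reindexing: the time=0 row is already gone, so there is
# nothing left to fill from and time=1 stays NaN.
b = (sensor_data.reindex({'time': [1], 'sensor': ['A', 'B']})
     .ffill(dim='time').ffill(dim='sensor'))
print(b.values)  # [[nan nan]]
```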

So if I understand your to_dataset idea correctly, you are proposing:

```python
ds = sensor_data.to_dataset(dim='sensor')
xr.concat(
    [
        ds[sensor].sel({'time': time}, method='ffill', drop=True)
        for sensor, time in zip(['A', 'A', 'A', 'B', 'C'], [0, 1, 2, 0, 0])
    ],
    dim='sample',
)
```

I guess this works, but it's a bit cumbersome and unlikely to be fast. I think there must be something I'm not understanding here; I'm not familiar with all the nuances of the xarray API.

Your idea of reindex followed by sel is an interesting one, but it does something slightly different from what I was asking for: it does not fail if one of the sensors in the query list is missing, but rather inserts a NaN. I suppose you could fix this with an extra check afterwards, assuming that your original pre-reindex data contained no NaNs.

In general, min(S*N, T*N) could be much larger than S*T, so for big queries it's quite possible that you wouldn't have enough space to allocate the intermediate even if you could fit hundreds of copies of the original S*T matrix. Using a dask cluster would make this situation less likely, of course, but it seems better to avoid all this copying (even on a beefy cluster), if only for performance reasons.
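
As a rough, purely hypothetical illustration of that scaling argument:

```python
# All sizes invented, just to put numbers on the argument above.
T, S, N = 10_000, 100, 10_000_000   # times, sensors, (time, sensor) query pairs

original = S * T                    # 1e6 elements in the source matrix
intermediate = min(S * N, T * N)    # 1e9 elements: 1000x the source matrix
result = N                          # 1e7 elements: all a fused sel would need
```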

keewis (MEMBER) · 2020-12-19T14:41:47Z · https://github.com/pydata/xarray/issues/4714#issuecomment-748483357

reindex does not have to be changed, since we can just call e.g. ffill with the dim parameter for this to work:

```python
arr.reindex(...).ffill(dim="dim")
```

This really depends on how you intend to use the result of the indexing. For example, if you don't really need the big matrix, you could just convert the DataArray to a Dataset where the sensor dimension becomes the names of the variables (using to_dataset(dim="sensor"), or construct it that way in the first place). If you do need the matrix, this might be slightly better (you still end up allocating a T * (S + n) array):

```python
arr.reindex(sensor=["A", "B", "C"]).sel({"sensor": ..., "time": ...}, method="ffill")
```

but if you really care about the memory allocated at once, you might be better off using dask:

```python
arr.chunk({"time": 100}).reindex(...).sel(...)
```

If none of that is an option, I guess we might be able to add a method_kwargs parameter (not sure if there is a better option, though).

batterseapower (NONE) · 2020-12-19T14:06:36Z · https://github.com/pydata/xarray/issues/4714#issuecomment-748479287

Thanks for the suggestion. One issue with this alternative is that it creates a potentially large intermediate object.

If you have T times and S sensors, and want to sample them at N (time, sensor) pairs, then the intermediate object in your approach has size T*N (if you index sensors first) or S*N (if you index time first). If both dimensions could be indexed in one sel call, we would only need to allocate memory for the result, of size N, which is considerably better.
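
For concreteness, a hypothetical check of where that intermediate shows up in the two-call approach (suggested in the comment below); all sizes are toy values:

```python
import numpy as np
import xarray as xr

T, S, N = 1000, 3, 5  # toy sizes
sensor_data = xr.DataArray(
    np.zeros((T, S)),
    coords={'time': np.arange(T), 'sensor': ['A', 'B', 'C']},
    dims=['time', 'sensor'],
)

# The first call indexes sensors only, keeping the full time axis.
step1 = sensor_data.sel(
    {'sensor': xr.DataArray(['A', 'A', 'A', 'B', 'C'], dims=['sample'])}
)
print(step1.shape)  # (1000, 5): the T*N intermediate

# The second call collapses to the N requested samples.
step2 = step1.sel(
    {'time': xr.DataArray([0, 1, 2, 0, 0], dims=['sample'])}, method='ffill'
)
print(step2.shape)  # (5,)
```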

mathause (MEMBER) · 2020-12-19T13:55:07Z · https://github.com/pydata/xarray/issues/4714#issuecomment-748478029

Could you split it into two calls, or does this not do what you want?

```python
sensor_data.sel(
    {'sensor': xr.DataArray(['A', 'A', 'A', 'B', 'C'], dims=['sample'])}
).sel(
    {'time': xr.DataArray([0, 1, 2, 0, 0], dims=['sample'])}, method='ffill'
)
```
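
A self-contained, hypothetical run of this suggestion (toy data invented for illustration), showing how the second call forward-fills along time:

```python
import xarray as xr

# Toy data: two observation times, three sensors.
sensor_data = xr.DataArray(
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
    coords={'time': [0, 10], 'sensor': ['A', 'B', 'C']},
    dims=['time', 'sensor'],
)

out = sensor_data.sel(
    {'sensor': xr.DataArray(['A', 'A', 'B'], dims=['sample'])}
).sel(
    {'time': xr.DataArray([0, 5, 10], dims=['sample'])}, method='ffill'
)
print(out.values)  # [1. 1. 5.]: time=5 forward-fills from time=0
```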

batterseapower (NONE) · 2020-12-19T13:53:53Z · https://github.com/pydata/xarray/issues/4714#issuecomment-748477889

I guess it would also make sense to have this in reindex if you did decide to add it.


Table schema:

```sql
CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);
```