home / github

Menu
  • GraphQL API
  • Search all tables

issue_comments

Table actions
  • GraphQL API for issue_comments

5 rows where issue = 842940980 and user = 1217238 sorted by updated_at descending

✎ View and edit SQL

This data as json, CSV (advanced)

user 1

  • shoyer · 5 ✖

issue 1

  • Add drop duplicates · 5 ✖

author_association 1

  • MEMBER 5
id html_url issue_url node_id user created_at updated_at ▲ author_association body reactions performed_via_github_app issue
824501790 https://github.com/pydata/xarray/pull/5089#issuecomment-824501790 https://api.github.com/repos/pydata/xarray/issues/5089 MDEyOklzc3VlQ29tbWVudDgyNDUwMTc5MA== shoyer 1217238 2021-04-22T02:58:53Z 2021-04-22T02:58:53Z MEMBER

A couple thoughts on strategy here:

  1. Let's consider starting with a minimal set of functionality (e.g., only drop duplicates in a single variable and/or along only one dimension). This is easier to merge and provides a good foundation for implementing the remaining features in follow-on PRs.
  2. It might be useful to start from the foundation of implementing multi-dimensional indexing with a boolean array (https://github.com/pydata/xarray/issues/1887). Then drop_duplicates() (and also unique()) could just be a layer on top of that, passing in a boolean index of "non-duplicate" entries.
{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Add drop duplicates 842940980
822096265 https://github.com/pydata/xarray/pull/5089#issuecomment-822096265 https://api.github.com/repos/pydata/xarray/issues/5089 MDEyOklzc3VlQ29tbWVudDgyMjA5NjI2NQ== shoyer 1217238 2021-04-19T00:29:17Z 2021-04-19T00:29:17Z MEMBER

I agree with @shoyer that we could do it in a single isel in the basic case. One option is to have a fast path for non-dim coords only, and call isel once with those.

Yes correct. I am not feeling well at the moment so I probably won't get to this today, but feel free to make commits!

I hope you feel well soon here! There is no time pressure from our end on this.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Add drop duplicates 842940980
822092468 https://github.com/pydata/xarray/pull/5089#issuecomment-822092468 https://api.github.com/repos/pydata/xarray/issues/5089 MDEyOklzc3VlQ29tbWVudDgyMjA5MjQ2OA== shoyer 1217238 2021-04-19T00:12:20Z 2021-04-19T00:12:20Z MEMBER

@max-sixty is there a case where you don't think we could do a single isel? I'd love to do the single isel() call if possible, because that should have the best performance by far.

I guess this may come down to the desired behavior for multiple arguments, e.g., drop_duplicates(['lat', 'lon'])? I'm not certain that this case is well defined in this PR (it certainly needs more tests!).

I think we could make this work via the axis argument to np.unique, although the lack of support for object arrays could be problematic for us, since we put strings in object arrays.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Add drop duplicates 842940980
821939594 https://github.com/pydata/xarray/pull/5089#issuecomment-821939594 https://api.github.com/repos/pydata/xarray/issues/5089 MDEyOklzc3VlQ29tbWVudDgyMTkzOTU5NA== shoyer 1217238 2021-04-18T05:58:49Z 2021-04-18T05:58:49Z MEMBER

This looks great, but I wonder if we could simplify the implementation? For example, could we get away with only doing a single isel() for selecting the positions corresponding to unique values, rather than the current loop? .stack() can also be expensive relative to indexing.

This might require using a different routine to find the unique positions the current calls to duplicated() on a pandas.Index. I think we could construct the necessary indices even for multi-dimensional arrays using np.unique with return_index=True and np.unravel_index.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Add drop duplicates 842940980
813168052 https://github.com/pydata/xarray/pull/5089#issuecomment-813168052 https://api.github.com/repos/pydata/xarray/issues/5089 MDEyOklzc3VlQ29tbWVudDgxMzE2ODA1Mg== shoyer 1217238 2021-04-05T04:00:54Z 2021-04-05T04:05:16Z MEMBER

From an API perspective, I think the name drop_duplicates() would be fine. I would guess that handling arbitrary variables in a Dataset would not be any harder than handling only coordinates?

One thing that is a little puzzling to me is how deduplicating across multiple dimensions is handled. It looks like this function preserves existing dimensions, but inserts NA is the arrays would be ragged? This seems a little strange to me. I think it could make more sense to "flatten" all dimensions in the contained variables into a new dimension when dropping duplicates.

This would require specifying the name for the new dimension(s), but perhaps that could work by switching to the de-duplicated variable name? For example, ds.drop_duplicates('valid') on the example in the PR description would result in a "valid" coordinate/dimension of length 3. The original 'init' and 'tau' dimensions could be preserved as coordinates, e.g., python ds = xr.DataArray( [[1, 2, 3], [4, 5, 6]], coords={"init": [0, 1], "tau": [1, 2, 3]}, dims=["init", "tau"], ).to_dataset(name="test") ds.coords["valid"] = (("init", "tau"), np.array([[8, 6, 6], [7, 7, 7]])) result = ds.drop_duplicates('valid') would result in: ```

result <xarray.Dataset> Dimensions: (valid: 3) Coordinates: init (valid) int64 0 0 1 tau (valid) int64 1 2 1 * valid (valid) int64 8 6 7 Data variables: test (valid) int64 1 2 4 `` i.e., the exact same thing that would be obtained by indexing with the positions of the de-duplicated values:ds.isel(init=('valid', [0, 0, 1]), tau=('valid', [0, 1, 0]))`.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Add drop duplicates 842940980

Advanced export

JSON shape: default, array, newline-delimited, object

CSV options:

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);
Powered by Datasette · Queries took 5616.274ms · About: xarray-datasette