issue_comments

20 rows where issue = 842940980 sorted by updated_at descending

user 4

  • ahuang11 8
  • max-sixty 6
  • shoyer 5
  • pep8speaks 1

author_association 3

  • MEMBER 11
  • CONTRIBUTOR 8
  • NONE 1

issue 1

  • Add drop duplicates · 20
id html_url issue_url node_id user created_at updated_at ▲ author_association body reactions performed_via_github_app issue
830508669 https://github.com/pydata/xarray/pull/5089#issuecomment-830508669 https://api.github.com/repos/pydata/xarray/issues/5089 MDEyOklzc3VlQ29tbWVudDgzMDUwODY2OQ== ahuang11 15331990 2021-05-01T03:25:47Z 2021-05-01T03:25:47Z CONTRIBUTOR

I failed to commit properly so see https://github.com/pydata/xarray/pull/5239 where I only do drop duplicates for dims

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Add drop duplicates 842940980
809046639 https://github.com/pydata/xarray/pull/5089#issuecomment-809046639 https://api.github.com/repos/pydata/xarray/issues/5089 MDEyOklzc3VlQ29tbWVudDgwOTA0NjYzOQ== pep8speaks 24736507 2021-03-29T03:51:19Z 2021-05-01T03:07:57Z NONE

Hello @ahuang11! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

  • In the file xarray/core/dataset.py:

      • Line 7260:29: F821 undefined name '_get_func_args'
      • Line 7261:43: F821 undefined name '_initialize_curvefit_params'
      • Line 7326:1: E305 expected 2 blank lines after class or function definition, found 1

Comment last updated at 2021-05-01 03:07:57 UTC
{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Add drop duplicates 842940980
830274959 https://github.com/pydata/xarray/pull/5089#issuecomment-830274959 https://api.github.com/repos/pydata/xarray/issues/5089 MDEyOklzc3VlQ29tbWVudDgzMDI3NDk1OQ== ahuang11 15331990 2021-04-30T18:16:59Z 2021-04-30T18:17:21Z CONTRIBUTOR

I can take a look this weekend. If narrow, we could simply roll back to this commit, make minor adjustments, and merge: https://github.com/pydata/xarray/pull/5089/commits/28aa96ab13db72bfa6ad8b156c2c720b49ec9a04

But I personally prefer full so it'd be nice if we could come to a consensus on how to handle it~

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Add drop duplicates 842940980
830237579 https://github.com/pydata/xarray/pull/5089#issuecomment-830237579 https://api.github.com/repos/pydata/xarray/issues/5089 MDEyOklzc3VlQ29tbWVudDgzMDIzNzU3OQ== max-sixty 5635139 2021-04-30T17:12:02Z 2021-04-30T17:12:02Z MEMBER

This is great work and it would be good to get this in for the upcoming release https://github.com/pydata/xarray/issues/5232.

I think there are two paths:
  1. Narrow: merge the functionality which works along 1D dimensioned coords
  2. Full: ensure we're at consensus on how we handle >1D coords

I would mildly vote for narrow. While I would also vote to merge it as-is, I think it's not a huge task to move wide onto a new branch.

@ahuang11 what are your thoughts?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Add drop duplicates 842940980
824501790 https://github.com/pydata/xarray/pull/5089#issuecomment-824501790 https://api.github.com/repos/pydata/xarray/issues/5089 MDEyOklzc3VlQ29tbWVudDgyNDUwMTc5MA== shoyer 1217238 2021-04-22T02:58:53Z 2021-04-22T02:58:53Z MEMBER

A couple thoughts on strategy here:

  1. Let's consider starting with a minimal set of functionality (e.g., only drop duplicates in a single variable and/or along only one dimension). This is easier to merge and provides a good foundation for implementing the remaining features in follow-on PRs.
  2. It might be useful to start from the foundation of implementing multi-dimensional indexing with a boolean array (https://github.com/pydata/xarray/issues/1887). Then drop_duplicates() (and also unique()) could just be a layer on top of that, passing in a boolean index of "non-duplicate" entries.
{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Add drop duplicates 842940980
822098673 https://github.com/pydata/xarray/pull/5089#issuecomment-822098673 https://api.github.com/repos/pydata/xarray/issues/5089 MDEyOklzc3VlQ29tbWVudDgyMjA5ODY3Mw== max-sixty 5635139 2021-04-19T00:41:47Z 2021-04-19T00:41:47Z MEMBER

> @max-sixty is there a case where you don't think we could do a single isel? I'd love to do the single isel() call if possible, because that should have the best performance by far.

IIUC there are two broad cases here:
  • where every supplied coord is a dimensioned coord — it's very simple, just isel non-duplicates for each dimension*
  • where there's a non-dimensioned coord with ndim > 1, then it requires stacking; e.g. the example above. Is there a different way of doing this?

```python
In [12]: da
Out[12]:
<xarray.DataArray (init: 2, tau: 3)>
array([[1, 2, 3],
       [4, 5, 6]])
Coordinates:
  * init     (init) int64 0 1
  * tau      (tau) int64 1 2 3
    valid    (init, tau) int64 8 6 6 7 7 7

In [13]: da.drop_duplicate_coords("valid")
Out[13]:
<xarray.DataArray (valid: 3)>
array([1, 2, 4])
Coordinates:
  * valid    (valid) int64 8 6 7
    init     (valid) int64 0 0 1
    tau      (valid) int64 1 2 1
```

* very close to this is a 1D non-dimensioned coord, in which case we can either turn it into a dimensioned coord or retain the existing dimensioned coords — I think probably the former if we allow the stacking case, for the sake of consistency.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Add drop duplicates 842940980
822096265 https://github.com/pydata/xarray/pull/5089#issuecomment-822096265 https://api.github.com/repos/pydata/xarray/issues/5089 MDEyOklzc3VlQ29tbWVudDgyMjA5NjI2NQ== shoyer 1217238 2021-04-19T00:29:17Z 2021-04-19T00:29:17Z MEMBER

> I agree with @shoyer that we could do it in a single isel in the basic case. One option is to have a fast path for non-dim coords only, and call isel once with those.
>
> Yes correct. I am not feeling well at the moment so I probably won't get to this today, but feel free to make commits!

I hope you feel well soon here! There is no time pressure from our end on this.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Add drop duplicates 842940980
822094033 https://github.com/pydata/xarray/pull/5089#issuecomment-822094033 https://api.github.com/repos/pydata/xarray/issues/5089 MDEyOklzc3VlQ29tbWVudDgyMjA5NDAzMw== ahuang11 15331990 2021-04-19T00:19:17Z 2021-04-19T00:19:17Z CONTRIBUTOR

> @ahuang11 IIUC, this is only using .stack where it needs to actually stack the array, is that correct? So if a list of dims is passed (rather than non-dim coords), it's not stacking?
>
> I agree with @shoyer that we could do it in a single isel in the basic case. One option is to have a fast path for non-dim coords only, and call isel once with those.

Yes correct. I am not feeling well at the moment so I probably won't get to this today, but feel free to make commits!

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Add drop duplicates 842940980
822092468 https://github.com/pydata/xarray/pull/5089#issuecomment-822092468 https://api.github.com/repos/pydata/xarray/issues/5089 MDEyOklzc3VlQ29tbWVudDgyMjA5MjQ2OA== shoyer 1217238 2021-04-19T00:12:20Z 2021-04-19T00:12:20Z MEMBER

@max-sixty is there a case where you don't think we could do a single isel? I'd love to do the single isel() call if possible, because that should have the best performance by far.

I guess this may come down to the desired behavior for multiple arguments, e.g., drop_duplicates(['lat', 'lon'])? I'm not certain that this case is well defined in this PR (it certainly needs more tests!).

I think we could make this work via the axis argument to np.unique, although the lack of support for object arrays could be problematic for us, since we put strings in object arrays.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Add drop duplicates 842940980
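A rough illustration of the `np.unique` axis idea mentioned in the comment above, using made-up lat/lon data rather than anything from this PR; the semantics for multiple arguments are exactly what is still under discussion here, so this is a sketch of the mechanism only:

```python
import numpy as np

# Treat each (lat, lon) pair as one row; np.unique(..., axis=0) finds duplicate
# rows, and return_index gives the first occurrence of each unique row.
lat = np.array([0, 1, 2, 2, 3])
lon = np.array([0, 1, 3, 3, 4])
pairs = np.column_stack([lat, lon])

_, first_idx = np.unique(pairs, axis=0, return_index=True)
keep = np.sort(first_idx)  # positions to keep, in original order ("first" semantics)
print(keep)  # -> [0 1 2 4]

# Caveat raised above: np.unique with axis= does not support object-dtype
# arrays, which is what xarray uses for string coordinates.
```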
822089198 https://github.com/pydata/xarray/pull/5089#issuecomment-822089198 https://api.github.com/repos/pydata/xarray/issues/5089 MDEyOklzc3VlQ29tbWVudDgyMjA4OTE5OA== max-sixty 5635139 2021-04-18T23:57:20Z 2021-04-18T23:57:20Z MEMBER

@ahuang11 IIUC, this is only using .stack where it needs to actually stack the array, is that correct? So if a list of dims is passed (rather than non-dim coords), it's not stacking?

I agree with @shoyer that we could do it in a single isel in the basic case. One option is to have a fast path for non-dim coords only, and call isel once with those.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Add drop duplicates 842940980
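For concreteness, a minimal sketch of the "call isel once" fast path described in the comment above, assuming only dimensioned coords are passed; `drop_duplicate_dim_coords` is a hypothetical helper, not code from this PR:

```python
import xarray as xr

def drop_duplicate_dim_coords(obj, dims, keep="first"):
    # Build one boolean "non-duplicate" mask per dimensioned coord using
    # pandas.Index.duplicated, then apply them all in a single isel() call.
    indexers = {
        dim: ~obj.get_index(dim).duplicated(keep=keep)
        for dim in dims
    }
    return obj.isel(indexers)
```

Applied to the lat/lon example further down this thread, `drop_duplicate_dim_coords(da, ["lat", "lon"])` would keep the first occurrence of each duplicated index value along each dimension.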
821939594 https://github.com/pydata/xarray/pull/5089#issuecomment-821939594 https://api.github.com/repos/pydata/xarray/issues/5089 MDEyOklzc3VlQ29tbWVudDgyMTkzOTU5NA== shoyer 1217238 2021-04-18T05:58:49Z 2021-04-18T05:58:49Z MEMBER

This looks great, but I wonder if we could simplify the implementation? For example, could we get away with only doing a single isel() for selecting the positions corresponding to unique values, rather than the current loop? .stack() can also be expensive relative to indexing.

This might require using a different routine to find the unique positions than the current calls to duplicated() on a pandas.Index. I think we could construct the necessary indices even for multi-dimensional arrays using np.unique with return_index=True and np.unravel_index.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Add drop duplicates 842940980
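A minimal sketch of the `np.unique(..., return_index=True)` plus `np.unravel_index` routine suggested in the comment above, applied to the 2-D "valid" coord used in this thread's examples; an illustration only, not the PR's implementation:

```python
import numpy as np
import xarray as xr

da = xr.DataArray(
    [[1, 2, 3], [4, 5, 6]],
    coords={"init": [0, 1], "tau": [1, 2, 3]},
    dims=["init", "tau"],
)
da.coords["valid"] = (("init", "tau"), np.array([[8, 6, 6], [7, 7, 7]]))

values = da.coords["valid"].values
# Flat positions of the first occurrence of each unique value, kept in original
# order ("first" semantics), then unraveled back into (init, tau) indices.
_, flat_idx = np.unique(values, return_index=True)
flat_idx = np.sort(flat_idx)
init_idx, tau_idx = np.unravel_index(flat_idx, values.shape)

# One vectorized isel() call along a new "valid" dimension, with no .stack().
result = da.isel(
    init=xr.DataArray(init_idx, dims="valid"),
    tau=xr.DataArray(tau_idx, dims="valid"),
)
print(result.values)  # -> [1 2 4]
```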
821902582 https://github.com/pydata/xarray/pull/5089#issuecomment-821902582 https://api.github.com/repos/pydata/xarray/issues/5089 MDEyOklzc3VlQ29tbWVudDgyMTkwMjU4Mg== max-sixty 5635139 2021-04-17T23:37:07Z 2021-04-17T23:37:07Z MEMBER

Hi @ahuang11 — forgive the delay. We discussed this with the team on our call and think it would be a welcome addition, so thank you for contributing.

I took another look through the tests and the behavior looks ideal when dimensioned coords are passed:

```python
In [6]: da
Out[6]:
<xarray.DataArray (lat: 5, lon: 5)>
array([[ 0,  0,  0,  0,  0],
       [ 0,  1,  2,  3,  4],
       [ 0,  2,  4,  6,  8],
       [ 0,  3,  6,  9, 12],
       [ 0,  4,  8, 12, 16]])
Coordinates:
  * lat      (lat) int64 0 1 2 2 3
  * lon      (lon) int64 0 1 3 3 4

In [7]: result = da.drop_duplicate_coords(["lat", "lon"], keep='first')

In [8]: result
Out[8]:
<xarray.DataArray (lat: 4, lon: 4)>
array([[ 0,  0,  0,  0],
       [ 0,  1,  2,  4],
       [ 0,  2,  4,  8],
       [ 0,  4,  8, 16]])
Coordinates:
  * lat      (lat) int64 0 1 2 3
  * lon      (lon) int64 0 1 3 4
```

And I think this is also the best we can do for non-dimensioned coords. A couple of things I'd call out:
  a. The array is stacked for any non-dim coord with more than one dimension
  b. The supplied coord becomes the new dimensioned coord

e.g. Stacking:

```python
In [12]: da
Out[12]:
<xarray.DataArray (init: 2, tau: 3)>
array([[1, 2, 3],
       [4, 5, 6]])
Coordinates:
  * init     (init) int64 0 1
  * tau      (tau) int64 1 2 3
    valid    (init, tau) int64 8 6 6 7 7 7

In [13]: da.drop_duplicate_coords("valid")
Out[13]:
<xarray.DataArray (valid: 3)>
array([1, 2, 4])
Coordinates:
  * valid    (valid) int64 8 6 7
    init     (valid) int64 0 0 1
    tau      (valid) int64 1 2 1
```

Changing the dimensions: zeta becomes the new dimension, replacing tau:

```python
In [16]: (
    ...:     da
    ...:     .assign_coords(dict(zeta=(('tau'), [4, 4, 6])))
    ...:     .drop_duplicate_coords('zeta')
    ...: )
Out[16]:
<xarray.DataArray (init: 2, zeta: 2)>
array([[1, 3],
       [4, 6]])
Coordinates:
  * init     (init) int64 0 1
    valid    (init, zeta) int64 8 6 7 7
  * zeta     (zeta) int64 4 6
    tau      (zeta) int64 1 3
```

One peculiarity — though I think a necessary one — is that the order matters in some cases:

```python
In [17]: (
    ...:     da
    ...:     .assign_coords(dict(zeta=(('tau'), [4, 4, 6])))
    ...:     .drop_duplicate_coords(['zeta', 'valid'])
    ...: )
Out[17]:
<xarray.DataArray (valid: 3)>
array([1, 3, 4])
Coordinates:
  * valid    (valid) int64 8 6 7
    tau      (valid) int64 1 3 1
    init     (valid) int64 0 0 1
    zeta     (valid) int64 4 6 4

In [18]: (
    ...:     da
    ...:     .assign_coords(dict(zeta=(('tau'), [4, 4, 6])))
    ...:     .drop_duplicate_coords(['valid', 'zeta'])
    ...: )
Out[18]:
<xarray.DataArray (zeta: 1)>
array([1])
Coordinates:
  * zeta     (zeta) int64 4
    init     (zeta) int64 0
    tau      (zeta) int64 1
    valid    (zeta) int64 8
```

Unless anyone has any more thoughts, let's plan to merge this over the next few days. Thanks again @ahuang11 !

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Add drop duplicates 842940980
813782823 https://github.com/pydata/xarray/pull/5089#issuecomment-813782823 https://api.github.com/repos/pydata/xarray/issues/5089 MDEyOklzc3VlQ29tbWVudDgxMzc4MjgyMw== ahuang11 15331990 2021-04-06T02:48:00Z 2021-04-06T02:48:00Z CONTRIBUTOR

Not sure if there's a more elegant way of implementing this.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Add drop duplicates 842940980
813179511 https://github.com/pydata/xarray/pull/5089#issuecomment-813179511 https://api.github.com/repos/pydata/xarray/issues/5089 MDEyOklzc3VlQ29tbWVudDgxMzE3OTUxMQ== ahuang11 15331990 2021-04-05T04:48:09Z 2021-04-05T04:48:09Z CONTRIBUTOR

Oh I just saw the edits with keeping the dims. I guess that would work.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Add drop duplicates 842940980
813169922 https://github.com/pydata/xarray/pull/5089#issuecomment-813169922 https://api.github.com/repos/pydata/xarray/issues/5089 MDEyOklzc3VlQ29tbWVudDgxMzE2OTkyMg== ahuang11 15331990 2021-04-05T04:09:26Z 2021-04-05T04:09:26Z CONTRIBUTOR

I prefer dropping duplicate values to be under the unique() PR; maybe it could be renamed drop_duplicate_values().

Also I think preserving existing dimensions is more powerful than flattening the dimensions.

On Sun, Apr 4, 2021, 11:01 PM Stephan Hoyer @.***> wrote:

> From an API perspective, I think the name drop_duplicates() would be fine. I would guess that handling arbitrary variables in a Dataset would not be any harder than handling only coordinates?
>
> One thing that is a little puzzling to me is how deduplicating across multiple dimensions is handled. It looks like this function preserves existing dimensions, but inserts NA if the arrays would be ragged? This seems a little strange to me. I think it could make more sense to "flatten" all dimensions in the contained variables into a new dimension when dropping duplicates.
>
> This would require specifying the name for the new dimension(s), but perhaps that could work by switching to the de-duplicated variable name? For example, ds.drop_duplicates('valid') on the example in the PR description would result in a "valid" coordinate/dimension of length 3.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/pydata/xarray/pull/5089#issuecomment-813168052, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADU7FFWCT2NXOR2AYNLGVQDTHEYYFANCNFSM4Z6ZAMUA .

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Add drop duplicates 842940980
813168052 https://github.com/pydata/xarray/pull/5089#issuecomment-813168052 https://api.github.com/repos/pydata/xarray/issues/5089 MDEyOklzc3VlQ29tbWVudDgxMzE2ODA1Mg== shoyer 1217238 2021-04-05T04:00:54Z 2021-04-05T04:05:16Z MEMBER

From an API perspective, I think the name drop_duplicates() would be fine. I would guess that handling arbitrary variables in a Dataset would not be any harder than handling only coordinates?

One thing that is a little puzzling to me is how deduplicating across multiple dimensions is handled. It looks like this function preserves existing dimensions, but inserts NA if the arrays would be ragged? This seems a little strange to me. I think it could make more sense to "flatten" all dimensions in the contained variables into a new dimension when dropping duplicates.

This would require specifying the name for the new dimension(s), but perhaps that could work by switching to the de-duplicated variable name? For example, ds.drop_duplicates('valid') on the example in the PR description would result in a "valid" coordinate/dimension of length 3. The original 'init' and 'tau' dimensions could be preserved as coordinates, e.g.,

```python
ds = xr.DataArray(
    [[1, 2, 3], [4, 5, 6]],
    coords={"init": [0, 1], "tau": [1, 2, 3]},
    dims=["init", "tau"],
).to_dataset(name="test")
ds.coords["valid"] = (("init", "tau"), np.array([[8, 6, 6], [7, 7, 7]]))
result = ds.drop_duplicates('valid')
```

would result in:

```
result
<xarray.Dataset>
Dimensions:  (valid: 3)
Coordinates:
    init     (valid) int64 0 0 1
    tau      (valid) int64 1 2 1
  * valid    (valid) int64 8 6 7
Data variables:
    test     (valid) int64 1 2 4
```

i.e., the exact same thing that would be obtained by indexing with the positions of the de-duplicated values: `ds.isel(init=('valid', [0, 0, 1]), tau=('valid', [0, 1, 0]))`.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Add drop duplicates 842940980
813109553 https://github.com/pydata/xarray/pull/5089#issuecomment-813109553 https://api.github.com/repos/pydata/xarray/issues/5089 MDEyOklzc3VlQ29tbWVudDgxMzEwOTU1Mw== max-sixty 5635139 2021-04-04T22:35:15Z 2021-04-04T22:35:15Z MEMBER

If we don't hear anything, let's add this to the top of the list for the next dev call in ten days

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Add drop duplicates 842940980
809814710 https://github.com/pydata/xarray/pull/5089#issuecomment-809814710 https://api.github.com/repos/pydata/xarray/issues/5089 MDEyOklzc3VlQ29tbWVudDgwOTgxNDcxMA== ahuang11 15331990 2021-03-30T00:27:20Z 2021-04-04T22:26:02Z CONTRIBUTOR

> Thanks for the PR @ahuang11 !
>
> I think the method could be really useful. Does anyone else have thoughts?
>
> One important decision is whether this should operate on dimensioned coords or all coords (or even any array?). My guess would be that we could start with dimensioned coords, given those are the most likely use case, and we could extend to non-dimensioned coords later.
>
> (here's a glossary as the terms can get confusing: http://xarray.pydata.org/en/stable/terminology.html)

~~Let's start with just dims for now.~~

Okay, since I had some time, I decided to do coords too.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Add drop duplicates 842940980
811203549 https://github.com/pydata/xarray/pull/5089#issuecomment-811203549 https://api.github.com/repos/pydata/xarray/issues/5089 MDEyOklzc3VlQ29tbWVudDgxMTIwMzU0OQ== max-sixty 5635139 2021-03-31T16:23:22Z 2021-03-31T16:23:22Z MEMBER

@pydata/xarray we didn't get to this on the call today — two questions from @mathause:
  • should we have dims=None default to all dims? Or are we gradually transitioning to dims=... for all dims?
  • Is drop_duplicates a good name? Or should it explicitly refer to dropping duplicates on the index?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Add drop duplicates 842940980
809822634 https://github.com/pydata/xarray/pull/5089#issuecomment-809822634 https://api.github.com/repos/pydata/xarray/issues/5089 MDEyOklzc3VlQ29tbWVudDgwOTgyMjYzNA== ahuang11 15331990 2021-03-30T00:52:18Z 2021-03-30T00:52:18Z CONTRIBUTOR

Not sure how to fix this:

```
xarray/core/dataset.py:7111: error: Keywords must be strings
Found 1 error in 1 file (checked 138 source files)
```

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Add drop duplicates 842940980
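For context on the mypy error quoted above: "Keywords must be strings" is what mypy reports when a mapping whose key type is not `str` (e.g. `Hashable`, which xarray commonly uses for dimension names) is unpacked with `**`. The code at dataset.py:7111 isn't shown here, so the toy reproduction and workaround below are assumptions about the likely cause, not the actual fix used in the PR:

```python
from typing import Hashable, Mapping

def func(**kwargs: int) -> None:
    ...

indexers: Mapping[Hashable, int] = {"x": 1}

func(**indexers)  # mypy: error: Keywords must be strings

# Common workarounds: pass the mapping positionally to an API that accepts it
# (e.g. obj.isel(indexers) rather than obj.isel(**indexers)), or convert the
# keys to str explicitly before unpacking:
func(**{str(k): v for k, v in indexers.items()})
```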

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);