html_url,issue_url,id,node_id,user,created_at,updated_at,author_association,body,reactions,performed_via_github_app,issue
https://github.com/pydata/xarray/pull/5089#issuecomment-830237579,https://api.github.com/repos/pydata/xarray/issues/5089,830237579,MDEyOklzc3VlQ29tbWVudDgzMDIzNzU3OQ==,5635139,2021-04-30T17:12:02Z,2021-04-30T17:12:02Z,MEMBER,"This is great work and it would be good to get this in for the upcoming release https://github.com/pydata/xarray/issues/5232.
I think there are two paths:
1. Narrow: merge the functionality which works along 1D dimensioned coords
2. Full: Ensure we're at consensus on how we handle >1D coords
I would mildly vote for narrow. While I would also vote to merge it as-is, I think it's not a huge task to move the full version onto a new branch.
@ahuang11 what are your thoughts?","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,842940980
https://github.com/pydata/xarray/pull/5089#issuecomment-824501790,https://api.github.com/repos/pydata/xarray/issues/5089,824501790,MDEyOklzc3VlQ29tbWVudDgyNDUwMTc5MA==,1217238,2021-04-22T02:58:53Z,2021-04-22T02:58:53Z,MEMBER,"A couple thoughts on strategy here:
1. Let's consider starting with a minimal set of functionality (e.g., only drop duplicates in a single variable and/or along only one dimension). This is easier to merge and provides a good foundation for implementing the remaining features in follow-on PRs.
2. It might be useful to start from the foundation of implementing multi-dimensional indexing with a boolean array (https://github.com/pydata/xarray/issues/1887). Then `drop_duplicates()` (and also `unique()`) could just be a layer on top of that, passing in a boolean index of ""non-duplicate"" entries.","{""total_count"": 1, ""+1"": 1, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,842940980
https://github.com/pydata/xarray/pull/5089#issuecomment-822098673,https://api.github.com/repos/pydata/xarray/issues/5089,822098673,MDEyOklzc3VlQ29tbWVudDgyMjA5ODY3Mw==,5635139,2021-04-19T00:41:47Z,2021-04-19T00:41:47Z,MEMBER,"> @max-sixty is there a case where you don't think we could do a single `isel`? I'd love to do the single `isel()` call if possible, because that should have the best performance by far.
IIUC there are two broad cases here
- where every supplied coord is a dimensioned coord, which is very simple: just `isel` the non-duplicates for each dimension*
- where there's a non-dimensioned coord with ndim > 1, then it requires stacking; e.g. the example above. Is there a different way of doing this?
```python
In [12]: da
Out[12]:
array([[1, 2, 3],
       [4, 5, 6]])
Coordinates:
  * init     (init) int64 0 1
  * tau      (tau) int64 1 2 3
    valid    (init, tau) int64 8 6 6 7 7 7
In [13]: da.drop_duplicate_coords(""valid"")
Out[13]:
array([1, 2, 4])
Coordinates:
  * valid    (valid) int64 8 6 7
    init     (valid) int64 0 0 1
    tau      (valid) int64 1 2 1
```
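One possible alternative to stacking, sketched with NumPy only and not taken from the PR: find the first occurrence of each value in the flattened coord with `np.unique(..., return_index=True)`, then map those flat positions back to per-dimension positions with `np.unravel_index`. The variable names are illustrative.
```python
import numpy as np

# The 2D non-dimensioned coord from the example above, with dims (init, tau).
valid = np.array([[8, 6, 6], [7, 7, 7]])

# np.unique flattens its input; return_index gives the flat position of the
# first occurrence of each unique value (ordered by sorted value).
_, first_flat = np.unique(valid, return_index=True)

# Sort the positions so the result follows the original row-major order.
first_flat = np.sort(first_flat)

# Convert flat positions back into one integer indexer per dimension.
init_idx, tau_idx = np.unravel_index(first_flat, valid.shape)

print(init_idx.tolist(), tau_idx.tolist())  # [0, 0, 1] [0, 1, 0]
print(valid[init_idx, tau_idx].tolist())    # [8, 6, 7]
```
These two integer arrays are the positions one would pass as pointwise indexers along a shared new dimension, so the lookup itself does not need `.stack()`.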
\* very close to this is a 1D non-dimensioned coord, in which case we can either turn it into a dimensioned coord or retain the existing dimensioned coords — I think probably the former if we allow the stacking case, for the sake of consistency.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,842940980
https://github.com/pydata/xarray/pull/5089#issuecomment-822096265,https://api.github.com/repos/pydata/xarray/issues/5089,822096265,MDEyOklzc3VlQ29tbWVudDgyMjA5NjI2NQ==,1217238,2021-04-19T00:29:17Z,2021-04-19T00:29:17Z,MEMBER,"> > I agree with @shoyer that we could do it in a single `isel` in the basic case. One option is to have a fast path for non-dim coords only, and call isel once with those.
>
> Yes correct. I am not feeling well at the moment so I probably won't get to this today, but feel free to make commits!
I hope you feel better soon! There is no time pressure from our end on this.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,842940980
https://github.com/pydata/xarray/pull/5089#issuecomment-822092468,https://api.github.com/repos/pydata/xarray/issues/5089,822092468,MDEyOklzc3VlQ29tbWVudDgyMjA5MjQ2OA==,1217238,2021-04-19T00:12:20Z,2021-04-19T00:12:20Z,MEMBER,"@max-sixty is there a case where you don't think we could do a single `isel`? I'd love to do the single `isel()` call if possible, because that should have the best performance by far.
I guess this may come down to the desired behavior for multiple arguments, e.g., `drop_duplicates(['lat', 'lon'])`? I'm not certain that this case is well defined in this PR (it certainly needs more tests!).
I think we could make this work via the `axis` argument to `np.unique`, although the lack of support for object arrays could be problematic for us, since we put strings in object arrays.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,842940980
https://github.com/pydata/xarray/pull/5089#issuecomment-822089198,https://api.github.com/repos/pydata/xarray/issues/5089,822089198,MDEyOklzc3VlQ29tbWVudDgyMjA4OTE5OA==,5635139,2021-04-18T23:57:20Z,2021-04-18T23:57:20Z,MEMBER,"@ahuang11 IIUC, this is only using `.stack` where it needs to actually stack the array, is that correct? So if a list of dims is passed (rather than non-dim coords), then it's not stacking.
I agree with @shoyer that we could do it in a single `isel` in the basic case. One option is to have a fast path for non-dim coords only, and call isel once with those.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,842940980
https://github.com/pydata/xarray/pull/5089#issuecomment-821939594,https://api.github.com/repos/pydata/xarray/issues/5089,821939594,MDEyOklzc3VlQ29tbWVudDgyMTkzOTU5NA==,1217238,2021-04-18T05:58:49Z,2021-04-18T05:58:49Z,MEMBER,"This looks great, but I wonder if we could simplify the implementation? For example, could we get away with only doing a single isel() for selecting the positions corresponding to unique values, rather than the current loop? `.stack()` can also be expensive relative to indexing.
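As a rough illustration of the single-selection idea (a NumPy-only sketch on made-up arrays, not the PR's code; `first_occurrence_positions` is a hypothetical helper), the positions to keep can be computed once per dimension and then applied in one orthogonal indexing step:
```python
import numpy as np


def first_occurrence_positions(values):
    # Hypothetical helper: sorted positions of the first occurrence of each value.
    _, first = np.unique(values, return_index=True)
    return np.sort(first)


# Made-up 1D dimension coordinates containing duplicates, plus 2D data.
lat = np.array([0, 1, 2, 2, 3])
lon = np.array([0, 1, 3, 3, 4])
data = np.arange(5)[:, None] * np.arange(5)[None, :]

keep_lat = first_occurrence_positions(lat)
keep_lon = first_occurrence_positions(lon)

# One orthogonal selection over both dimensions instead of a per-dimension loop.
result = data[np.ix_(keep_lat, keep_lon)]
print(result.tolist())
```
In xarray, the same position arrays could be passed to one call such as `da.isel(lat=keep_lat, lon=keep_lon)`, which also indexes each dimension orthogonally.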
This might require using a different routine to find the unique positions than the current calls to `duplicated()` on a `pandas.Index`. I think we could construct the necessary indices even for multi-dimensional arrays using `np.unique` with `return_index=True` and `np.unravel_index`.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,842940980
https://github.com/pydata/xarray/pull/5089#issuecomment-821902582,https://api.github.com/repos/pydata/xarray/issues/5089,821902582,MDEyOklzc3VlQ29tbWVudDgyMTkwMjU4Mg==,5635139,2021-04-17T23:37:07Z,2021-04-17T23:37:07Z,MEMBER,"Hi @ahuang11 — forgive the delay. We discussed this with the team on our call and think it would be a welcome addition, so thank you for contributing.
I took another look through the tests and the behavior looks ideal when dimensioned coords are passed:
```python
In [6]: da
Out[6]:
array([[ 0,  0,  0,  0,  0],
       [ 0,  1,  2,  3,  4],
       [ 0,  2,  4,  6,  8],
       [ 0,  3,  6,  9, 12],
       [ 0,  4,  8, 12, 16]])
Coordinates:
  * lat      (lat) int64 0 1 2 2 3
  * lon      (lon) int64 0 1 3 3 4
In [7]: result = da.drop_duplicate_coords([""lat"", ""lon""], keep='first')
In [8]: result
Out[8]:
array([[ 0,  0,  0,  0],
       [ 0,  1,  2,  4],
       [ 0,  2,  4,  8],
       [ 0,  4,  8, 16]])
Coordinates:
  * lat      (lat) int64 0 1 2 3
  * lon      (lon) int64 0 1 3 4
```
And I _think_ this is also the best we can do for non-dimensioned coords. One thing I'd call out is that:
a. The array is stacked for any non-dim coord with more than one dimension
b. The supplied coord becomes the new dimensioned coord
e.g. Stacking:
```python
In [12]: da
Out[12]:
array([[1, 2, 3],
       [4, 5, 6]])
Coordinates:
  * init     (init) int64 0 1
  * tau      (tau) int64 1 2 3
    valid    (init, tau) int64 8 6 6 7 7 7
In [13]: da.drop_duplicate_coords(""valid"")
Out[13]:
array([1, 2, 4])
Coordinates:
  * valid    (valid) int64 8 6 7
    init     (valid) int64 0 0 1
    tau      (valid) int64 1 2 1
```
Changing the dimensions: `zeta` becoming the new dimension, from `tau`:
```python
In [16]: (
    ...:     da
    ...:     .assign_coords(dict(zeta=(('tau'),[4,4,6])))
    ...:     .drop_duplicate_coords('zeta')
    ...: )
Out[16]:
array([[1, 3],
       [4, 6]])
Coordinates:
  * init     (init) int64 0 1
    valid    (init, zeta) int64 8 6 7 7
  * zeta     (zeta) int64 4 6
    tau      (zeta) int64 1 3
```
One peculiarity — though I think a necessary one — is that the order matters in some cases:
```python
In [17]: (
    ...:     da
    ...:     .assign_coords(dict(zeta=(('tau'),[4,4,6])))
    ...:     .drop_duplicate_coords(['zeta','valid'])
    ...: )
Out[17]:
array([1, 3, 4])
Coordinates:
  * valid    (valid) int64 8 6 7
    tau      (valid) int64 1 3 1
    init     (valid) int64 0 0 1
    zeta     (valid) int64 4 6 4
In [18]: (
    ...:     da
    ...:     .assign_coords(dict(zeta=(('tau'),[4,4,6])))
    ...:     .drop_duplicate_coords(['valid','zeta'])
    ...: )
Out[18]:
array([1])
Coordinates:
  * zeta     (zeta) int64 4
    init     (zeta) int64 0
    tau      (zeta) int64 1
    valid    (zeta) int64 8
```
Unless anyone has any more thoughts, let's plan to merge this over the next few days. Thanks again @ahuang11 !","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,842940980
https://github.com/pydata/xarray/pull/5089#issuecomment-813168052,https://api.github.com/repos/pydata/xarray/issues/5089,813168052,MDEyOklzc3VlQ29tbWVudDgxMzE2ODA1Mg==,1217238,2021-04-05T04:00:54Z,2021-04-05T04:05:16Z,MEMBER,"From an API perspective, I think the name `drop_duplicates()` would be fine. I would guess that handling arbitrary variables in a Dataset would not be any harder than handling only coordinates?
One thing that is a little puzzling to me is how deduplicating across multiple dimensions is handled. It looks like this function preserves existing dimensions, but inserts NA if the arrays would be ragged? This seems a little strange to me. I think it could make more sense to ""flatten"" all dimensions in the contained variables into a new dimension when dropping duplicates.
This would require specifying the name for the new dimension(s), but perhaps that could work by switching to the de-duplicated variable name? For example, `ds.drop_duplicates('valid')` on the example in the PR description would result in a ""valid"" coordinate/dimension of length 3. The original `'init'` and `'tau'` dimensions could be preserved as coordinates, e.g.,
```python
import numpy as np
import xarray as xr
ds = xr.DataArray(
    [[1, 2, 3], [4, 5, 6]],
    coords={""init"": [0, 1], ""tau"": [1, 2, 3]},
    dims=[""init"", ""tau""],
).to_dataset(name=""test"")
ds.coords[""valid""] = ((""init"", ""tau""), np.array([[8, 6, 6], [7, 7, 7]]))
result = ds.drop_duplicates('valid')  # proposed behavior under discussion
```
would result in:
```
>>> result
Dimensions:  (valid: 3)
Coordinates:
    init     (valid) int64 0 0 1
    tau      (valid) int64 1 2 1
  * valid    (valid) int64 8 6 7
Data variables:
    test     (valid) int64 1 2 4
```
i.e., the exact same thing that would be obtained by indexing with the positions of the de-duplicated values: `ds.isel(init=('valid', [0, 0, 1]), tau=('valid', [0, 1, 0]))`.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,842940980
https://github.com/pydata/xarray/pull/5089#issuecomment-813109553,https://api.github.com/repos/pydata/xarray/issues/5089,813109553,MDEyOklzc3VlQ29tbWVudDgxMzEwOTU1Mw==,5635139,2021-04-04T22:35:15Z,2021-04-04T22:35:15Z,MEMBER,"If we don't hear anything, let's add this to the top of the list for the next dev call in ten days","{""total_count"": 1, ""+1"": 1, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,842940980
https://github.com/pydata/xarray/pull/5089#issuecomment-811203549,https://api.github.com/repos/pydata/xarray/issues/5089,811203549,MDEyOklzc3VlQ29tbWVudDgxMTIwMzU0OQ==,5635139,2021-03-31T16:23:22Z,2021-03-31T16:23:22Z,MEMBER,"@pydata/xarray we didn't get to this on the call today — two questions from @mathause :
- should we have `dims=None` default to all dims? Or are we gradually transitioning to `dims=...` for all dims?
- Is `drop_duplicates` a good name? Or should it explicitly refer to dropping duplicates on the _index_?","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,842940980