issue_comments

60 rows where issue = 241578773 sorted by updated_at descending

user 3

  • fujiisoup 33
  • shoyer 25
  • jhamman 2

issue 1

  • WIP: indexing with broadcasting · 60 ✖

author_association 1

  • MEMBER 60
id html_url issue_url node_id user created_at updated_at ▲ author_association body reactions performed_via_github_app issue
337969983 https://github.com/pydata/xarray/pull/1473#issuecomment-337969983 https://api.github.com/repos/pydata/xarray/issues/1473 MDEyOklzc3VlQ29tbWVudDMzNzk2OTk4Mw== shoyer 1217238 2017-10-19T16:53:28Z 2017-10-19T16:53:28Z MEMBER

I closed this intentionally since I think there is a good chance GitHub won't let you open a new PR otherwise.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: indexing with broadcasting 241578773
337969765 https://github.com/pydata/xarray/pull/1473#issuecomment-337969765 https://api.github.com/repos/pydata/xarray/issues/1473 MDEyOklzc3VlQ29tbWVudDMzNzk2OTc2NQ== shoyer 1217238 2017-10-19T16:52:44Z 2017-10-19T16:52:44Z MEMBER

@fujiisoup Can you open a new pull request with this branch? I'd like to give you credit on GitHub for this (since you did most of the work), but I think if I merge this with "Squash and Merge" everything will get credited to me.

You can also try doing your own rebase to clean up the history into fewer commits if you like (or I could "squash and merge" locally in git), but I think the new PR would do a better job of preserving history for anyone who wants to look at this later.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: indexing with broadcasting 241578773
337965865 https://github.com/pydata/xarray/pull/1473#issuecomment-337965865 https://api.github.com/repos/pydata/xarray/issues/1473 MDEyOklzc3VlQ29tbWVudDMzNzk2NTg2NQ== jhamman 2443309 2017-10-19T16:38:35Z 2017-10-19T16:38:35Z MEMBER

LGTM.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: indexing with broadcasting 241578773
337805095 https://github.com/pydata/xarray/pull/1473#issuecomment-337805095 https://api.github.com/repos/pydata/xarray/issues/1473 MDEyOklzc3VlQ29tbWVudDMzNzgwNTA5NQ== fujiisoup 6815844 2017-10-19T05:40:57Z 2017-10-19T05:40:57Z MEMBER

I'm happy with this :)

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: indexing with broadcasting 241578773
337771698 https://github.com/pydata/xarray/pull/1473#issuecomment-337771698 https://api.github.com/repos/pydata/xarray/issues/1473 MDEyOklzc3VlQ29tbWVudDMzNzc3MTY5OA== shoyer 1217238 2017-10-19T01:16:48Z 2017-10-19T01:16:48Z MEMBER

I think this is ready to go in. @jhamman @fujiisoup any reason to wait?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: indexing with broadcasting 241578773
335319183 https://github.com/pydata/xarray/pull/1473#issuecomment-335319183 https://api.github.com/repos/pydata/xarray/issues/1473 MDEyOklzc3VlQ29tbWVudDMzNTMxOTE4Mw== fujiisoup 6815844 2017-10-09T23:45:55Z 2017-10-10T12:12:04Z MEMBER

@shoyer, thanks for your review.

If we try to make behavior "intuitive" for 80% of use-cases, it only makes the remaining 20% more baffling and error prone.

OK. That makes sense to me too. Merging your PR.

I would be OK with adding a warning that there are still a few unresolved edge cases involving MultiIndex.

Actually, vectorized label-indexing currently fails almost entirely with MultiIndex. I can think of the following cases where appropriate error messages are required:

```python
In [1]: import xarray as xr
   ...: import pandas as pd
   ...:
   ...: midx = pd.MultiIndex.from_tuples(
   ...:     [(1, 'a'), (2, 'b'), (3, 'c')],
   ...:     names=['x0', 'x1'])
   ...: da = xr.DataArray([0, 1, 2], dims=['x'],
   ...:                   coords={'x': midx})
   ...: da
   ...:
Out[1]:
<xarray.DataArray (x: 3)>
array([0, 1, 2])
Coordinates:
  * x        (x) MultiIndex
  - x0       (x) int64 1 2 3
  - x1       (x) object 'a' 'b' 'c'
```

  • da.sel(x=[(1, 'a'), (2, 'b')])
  • da.sel(x0='a')

work as expected,

  • da.sel(x0=[1, 2])

fails without an appropriate error message, and

  • da.sel(x=xr.DataArray([np.array(midx[:2]), np.array(midx[-2:])], dims=['y', 'z']))

silently destroys the MultiIndex structure.

I will add better Exceptions later today.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: indexing with broadcasting 241578773
335281238 https://github.com/pydata/xarray/pull/1473#issuecomment-335281238 https://api.github.com/repos/pydata/xarray/issues/1473 MDEyOklzc3VlQ29tbWVudDMzNTI4MTIzOA== shoyer 1217238 2017-10-09T20:46:13Z 2017-10-09T20:46:13Z MEMBER

I will take a look at the multi-index issues. I suspect that many of these will be hard to resolve until we complete the refactor making indexes an explicit part of the data model (https://github.com/pydata/xarray/issues/1603). It is really tricky to make things work reliably when MultiIndex levels are supposed to work like coordinates but use an entirely different mechanism. I would be OK with adding a warning that there are still a few unresolved edge cases involving MultiIndex.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: indexing with broadcasting 241578773
335279657 https://github.com/pydata/xarray/pull/1473#issuecomment-335279657 https://api.github.com/repos/pydata/xarray/issues/1473 MDEyOklzc3VlQ29tbWVudDMzNTI3OTY1Nw== shoyer 1217238 2017-10-09T20:42:53Z 2017-10-09T20:42:53Z MEMBER

@fujiisoup Thanks again for all your hard work on this, and sorry for my slow response. I've made another PR with tweaks to your logic for conflicting coordinates: https://github.com/fujiisoup/xarray/pull/5

Mostly, my PR is about simplifying the logic by removing the special-case workarounds you added that check object identity (things like this_arr is self._variables[k] and v.variable is cv.variable). My concern is that nothing else in xarray relies on these types of identity checks, so adding these additional rules will make the logic harder to understand and rely on programmatically. If we try to make behavior "intuitive" for 80% of use-cases, it only makes the remaining 20% more baffling and error prone. This is similar to the problem with dimension re-ordering and NumPy's mixed integer/slice indexing. So I would rather leave these out for now at the cost of making the API slightly more cumbersome. As I think we discussed previously, it is easier to relax error conditions in the future than to add new errors.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: indexing with broadcasting 241578773
334024556 https://github.com/pydata/xarray/pull/1473#issuecomment-334024556 https://api.github.com/repos/pydata/xarray/issues/1473 MDEyOklzc3VlQ29tbWVudDMzNDAyNDU1Ng== fujiisoup 6815844 2017-10-04T01:20:13Z 2017-10-04T01:20:13Z MEMBER

@jhamman Thanks for the review (and sorry for my late reply). I made some modifications.

@shoyer Do you have further comments about coordinate conflicts?

Limitations of the current implementation are:

  • Coordinate conflict handling and attachment related to reindex are still off. I think that should go in another PR.
  • I could not solve the 2nd issue of this comment. In your example,

    ```python
    mda.sel(x=xr.DataArray(mda.indexes['x'][:3], dims='x'))
    ```

    works as expected, but

    ```python
    mda.sel(x=xr.DataArray(mda.indexes['x'][:3], dims='z'))
    ```

    will attach coordinate z.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: indexing with broadcasting 241578773
331920149 https://github.com/pydata/xarray/pull/1473#issuecomment-331920149 https://api.github.com/repos/pydata/xarray/issues/1473 MDEyOklzc3VlQ29tbWVudDMzMTkyMDE0OQ== fujiisoup 6815844 2017-09-25T15:35:03Z 2017-09-25T15:35:22Z MEMBER

I think it's ready. I appreciate any further comments.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: indexing with broadcasting 241578773
331909557 https://github.com/pydata/xarray/pull/1473#issuecomment-331909557 https://api.github.com/repos/pydata/xarray/issues/1473 MDEyOklzc3VlQ29tbWVudDMzMTkwOTU1Nw== jhamman 2443309 2017-09-25T15:01:14Z 2017-09-25T15:01:14Z MEMBER

@shoyer and @fujiisoup - is this ready for a final review?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: indexing with broadcasting 241578773
331480985 https://github.com/pydata/xarray/pull/1473#issuecomment-331480985 https://api.github.com/repos/pydata/xarray/issues/1473 MDEyOklzc3VlQ29tbWVudDMzMTQ4MDk4NQ== fujiisoup 6815844 2017-09-22T15:32:58Z 2017-09-22T15:32:58Z MEMBER

I think it would be better to update the reindex method in another PR, as this PR is already large. So I ported this suggestion to #1553.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: indexing with broadcasting 241578773
330168943 https://github.com/pydata/xarray/pull/1473#issuecomment-330168943 https://api.github.com/repos/pydata/xarray/issues/1473 MDEyOklzc3VlQ29tbWVudDMzMDE2ODk0Mw== fujiisoup 6815844 2017-09-18T09:27:38Z 2017-09-18T09:27:38Z MEMBER

Another case that might be confusing:

```python
import numpy as np
import xarray as xr

da = xr.DataArray(np.random.randn(3), dims=['x'], coords={'x': ['a', 'b', 'c']})
index_ds = xr.Dataset({}, coords={'x': [0, 1]})

# this results in a coordinate conflict
da.isel(x=index_ds['x'])
```

As index_ds['x'] has itself as a coordinate, this results in a coordinate conflict error, but it is clear that the user does not want to attach it as a coordinate. I think in such a case, the indexer's coordinate should be silently dropped.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: indexing with broadcasting 241578773
330133651 https://github.com/pydata/xarray/pull/1473#issuecomment-330133651 https://api.github.com/repos/pydata/xarray/issues/1473 MDEyOklzc3VlQ29tbWVudDMzMDEzMzY1MQ== fujiisoup 6815844 2017-09-18T05:49:26Z 2017-09-18T05:49:26Z MEMBER

For the second issue pointed out in this comment, I noticed it is due to the somewhat irregular xr.DataArray construction behavior with MultiIndex,

```python
In [1]: import numpy as np
   ...: import xarray as xr
   ...: import pandas as pd
   ...:
   ...: midx = pd.MultiIndex.from_product([list('abc'), [0, 1]],
   ...:                                   names=('one', 'two'))
   ...: # midx is automatically converted to a coordinate
   ...: xr.DataArray(midx[:3], dims='z')
   ...:
Out[1]:
<xarray.DataArray (z: 3)>
array([('a', 0), ('a', 1), ('b', 0)], dtype=object)
Coordinates:
  * z        (z) MultiIndex
  - one      (z) object 'a' 'a' 'b'
  - two      (z) int64 0 1 0

In [2]: # If a coordinate is explicitly specified, midx will be a data
   ...: xr.DataArray(midx[:3], dims='z', coords={'z': [0, 1, 2]})
   ...:
Out[2]:
<xarray.DataArray (z: 3)>
array([('a', 0), ('a', 1), ('b', 0)], dtype=object)
Coordinates:
  * z        (z) int64 0 1 2
```

I added tests for this case.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: indexing with broadcasting 241578773
330048656 https://github.com/pydata/xarray/pull/1473#issuecomment-330048656 https://api.github.com/repos/pydata/xarray/issues/1473 MDEyOklzc3VlQ29tbWVudDMzMDA0ODY1Ng== fujiisoup 6815844 2017-09-17T14:08:28Z 2017-09-17T14:08:35Z MEMBER

Xarray doesn't check names for other functionality

OK. I adopt isinstance(k, _ThisArray) rather than the name comparison.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: indexing with broadcasting 241578773
330023143 https://github.com/pydata/xarray/pull/1473#issuecomment-330023143 https://api.github.com/repos/pydata/xarray/issues/1473 MDEyOklzc3VlQ29tbWVudDMzMDAyMzE0Mw== shoyer 1217238 2017-09-17T05:52:35Z 2017-09-17T05:52:35Z MEMBER

If k == self.name, drop the conflicted coordinate silently.

I appreciate the goal here, but this makes me a little nervous. Xarray doesn't check names for other functionality, besides deciding how to propagate names and cases where names are used to indicate how to convert a DataArray into a Dataset. So users aren't used to checking names to understand how code works.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: indexing with broadcasting 241578773
330022824 https://github.com/pydata/xarray/pull/1473#issuecomment-330022824 https://api.github.com/repos/pydata/xarray/issues/1473 MDEyOklzc3VlQ29tbWVudDMzMDAyMjgyNA== fujiisoup 6815844 2017-09-17T05:42:01Z 2017-09-17T05:42:01Z MEMBER

How about the following case?

```python
target = Dataset({}, coords={'x': np.arange(3)})
indexer = DataArray([0, 1], dims=['x'], coords={'x': [2, 4]})
actual = target['x'].isel(x=indexer)
```

Based on the above criteria, it will raise an IndexError, but I feel it should not raise an error, as it is clear which one should take precedence.

However,

```python
target.isel(x=indexer)
```

should raise an error.

I would like to add an additional rule to take care of the first case, which might be valid only for DataArray:

  • If k == self.name, drop the conflicted coordinate silently.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: indexing with broadcasting 241578773
330018569 https://github.com/pydata/xarray/pull/1473#issuecomment-330018569 https://api.github.com/repos/pydata/xarray/issues/1473 MDEyOklzc3VlQ29tbWVudDMzMDAxODU2OQ== fujiisoup 6815844 2017-09-17T04:17:28Z 2017-09-17T04:17:28Z MEMBER

Sorry for my late reply, and thanks for the information.

It occurs to me now that we actually have a pre-existing merge feature

It would be great if we could use pre-existing criteria (and maybe the logic as well). I will look more deeply into xarray's merge logic.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: indexing with broadcasting 241578773
329366409 https://github.com/pydata/xarray/pull/1473#issuecomment-329366409 https://api.github.com/repos/pydata/xarray/issues/1473 MDEyOklzc3VlQ29tbWVudDMyOTM2NjQwOQ== shoyer 1217238 2017-09-14T04:16:22Z 2017-09-14T04:16:22Z MEMBER

It occurs to me now that we actually have a pre-existing merge feature (priority_vars) that allows us to handle merges with some variables taking precedence. This feature is currently used for cases like ds.coords['x'] = data_array when data_array already has a (potentially conflicting) coordinate 'x'.

The rule we could use would be:

  • For .sel()/.loc[], indexed coordinates from the indexed object take precedence in the result ([obj.coords[k] for k in kwargs] for obj.sel(**kwargs)). Conflicts with indexed coordinates on indexing objects are silently ignored.
  • For reindex(), indexing coordinates take precedence in the result ([kwargs[k] for k in kwargs] for obj.reindex(**kwargs)). Conflicts with indexed coordinates on the indexed object are silently ignored.
  • For isel()/[], neither set of indexed coordinates takes precedence.

We would use this together with the normal rule for dimension/non-dimension coordinates:

  • Conflicts between dimension coordinates (except for precedence) result in an error.
  • Conflicts between non-dimension coordinates result in silently dropping the conflicting variable.
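The proposed precedence rules can be sketched in plain Python. This is only a hypothetical illustration and not xarray's merge code: the function merge_indexing_coords and its arguments are invented here, coordinates are modeled as plain lists that have already been indexed, and the priority_vars argument merely mimics the role of the priority_vars feature mentioned above.

```python
def merge_indexing_coords(obj_coords, indexer_coords, dim_names, priority_vars=None):
    """Hypothetical sketch of the proposed rules (not xarray's implementation).

    obj_coords, indexer_coords: dict mapping coordinate name -> list of values.
    dim_names: set of names that are dimension coordinates.
    priority_vars: dict of coordinates that win any conflict.
    """
    priority_vars = dict(priority_vars or {})
    merged = dict(obj_coords)
    for name, values in indexer_coords.items():
        if name in priority_vars:
            continue  # precedence is applied at the end
        if name not in merged:
            merged[name] = values  # no conflict: attach the indexer's coordinate
        elif merged[name] != values:
            if name in dim_names:
                raise IndexError(f"dimension coordinate {name!r} conflicts")
            del merged[name]  # non-dimension conflict: drop silently
    merged.update(priority_vars)
    return merged


obj = {"x": [1, 2], "label": ["a", "b"]}
idx = {"x": [3, 4], "label": ["c", "d"], "extra": [0, 1]}

# sel-like: the indexed object's coordinate takes precedence for "x"
sel_like = merge_indexing_coords(obj, idx, {"x"}, priority_vars={"x": obj["x"]})
# reindex-like: the indexer's coordinate takes precedence for "x"
reindex_like = merge_indexing_coords(obj, idx, {"x"}, priority_vars={"x": idx["x"]})
# isel-like: no precedence, so the conflicting dimension coordinate is an error
try:
    merge_indexing_coords(obj, idx, {"x"})
    isel_like_raised = False
except IndexError:
    isel_like_raised = True
```

In all three cases the conflicting non-dimension coordinate "label" is dropped and the non-conflicting "extra" is attached; only the treatment of "x" differs.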

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: indexing with broadcasting 241578773
328508165 https://github.com/pydata/xarray/pull/1473#issuecomment-328508165 https://api.github.com/repos/pydata/xarray/issues/1473 MDEyOklzc3VlQ29tbWVudDMyODUwODE2NQ== fujiisoup 6815844 2017-09-11T12:02:25Z 2017-09-11T12:02:25Z MEMBER

To be honest, it's still not clear to me which is the right choice.

It's not yet clear to me either. I think, in such a case, we should choose the simplest rule so that we can explain it easily, and we could add more rules later if necessary.

I think requiring that the indexer's coordinates do not conflict is the simplest rule. The other extreme would be to ignore the indexer's coordinates entirely, but I prefer the former.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: indexing with broadcasting 241578773
328438707 https://github.com/pydata/xarray/pull/1473#issuecomment-328438707 https://api.github.com/repos/pydata/xarray/issues/1473 MDEyOklzc3VlQ29tbWVudDMyODQzODcwNw== shoyer 1217238 2017-09-11T07:14:55Z 2017-09-11T07:14:55Z MEMBER

I thought we had agreed to simply neglect the coordinate conflict (comment).

To be honest, it's still not clear to me which is the right choice.

Some considerations:

  • Coordinates are likely to differ by only a small amount in some practical settings, e.g., when using method='nearest'. It will be annoying to need to ensure coordinate alignment in such cases. For example, ds.reindex_like(other, method='nearest') would no longer work.
  • Dropping coordinates is not too difficult, but is somewhat annoying, because it requires users to look up a new method (e.g., reset_index()). Even for me, I had to do a little bit of experimentation to pick the right method. reset_index() does not have a default of resetting all indexes, which makes this slightly more annoying still (this would not be hard to fix).
  • There are situations where silently ignoring a conflict could result in silently corrupted results. This seems most likely to me with boolean or integer (isel()) indexing, where the indexer could have entries in the wrong order. However, this is unlikely with label-based indexing (sel or reindex), because the labels are already (redundantly) specified in the indexer values.

One possible resolution is to require exactly matching dimension coordinates only for isel() but not sel. However, this could be tricky to implement (sel is written in terms of isel) and could also be surprising to users, who expect sel() and isel() to work exactly the same except for expecting coordinates vs integer positions.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: indexing with broadcasting 241578773
328403754 https://github.com/pydata/xarray/pull/1473#issuecomment-328403754 https://api.github.com/repos/pydata/xarray/issues/1473 MDEyOklzc3VlQ29tbWVudDMyODQwMzc1NA== fujiisoup 6815844 2017-09-11T03:04:05Z 2017-09-11T03:04:05Z MEMBER

Thank you for the review.

I thought we agreed that these cases should raise an error, i.e., to require exact alignment?

I thought we had agreed to simply neglect the coordinate conflict (comment).

Yes, but I now agree that raising an IndexError is clearer for users. I will revert it.

The interaction with MultiIndex indexing seems to be somewhat off. Compare:

I forgot to consider MultiIndex. I will fix it.

(Maybe later this week.)

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: indexing with broadcasting 241578773
328401215 https://github.com/pydata/xarray/pull/1473#issuecomment-328401215 https://api.github.com/repos/pydata/xarray/issues/1473 MDEyOklzc3VlQ29tbWVudDMyODQwMTIxNQ== shoyer 1217238 2017-09-11T02:42:26Z 2017-09-11T02:42:26Z MEMBER

Two issues I encountered when testing this locally:

  1. Conflicting coordinates on indexers seem to be ignored:

     ```
     In [4]: ds = xr.Dataset({'bar': ('x', [1, 2])}, {'x': [1, 2]})

     In [5]: ds.isel(x=xr.DataArray([0, 1], [('x', [3, 4])]))
     Out[5]:
     <xarray.Dataset>
     Dimensions:  (x: 2)
     Coordinates:
       * x        (x) int64 1 2
     Data variables:
         bar      (x) int64 1 2
     ```

     I thought we agreed that these cases should raise an error, i.e., to require exact alignment? It's one thing to drop non-dimension coordinates, but dimension coordinates should not be ignored.

  2. The interaction with MultiIndex indexing seems to be somewhat off. Compare:

     ```
     In [15]: midx = pd.MultiIndex.from_product([list('abc'), [0, 1]],
         ...:                                   names=('one', 'two'))
         ...: mda = xr.DataArray(np.random.rand(6, 3),
         ...:                    [('x', midx), ('y', range(3))])
         ...:

     # The multi-index remains with the name "x"
     In [16]: mda.isel(x=xr.DataArray(np.arange(3), dims='z'))
     Out[16]:
     <xarray.DataArray (z: 3, y: 3)>
     array([[ 0.990021,  0.371052,  0.996406],
            [ 0.384432,  0.605875,  0.361161],
            [ 0.367431,  0.339736,  0.816142]])
     Coordinates:
         x        (z) object ('a', 0) ('a', 1) ('b', 0)
       * y        (y) int64 0 1 2
     Dimensions without coordinates: z

     # the multi-index is now called "z"
     In [17]: mda.sel(x=xr.DataArray(mda.indexes['x'][:3], dims='z'))
     Out[17]:
     <xarray.DataArray (z: 3, y: 3)>
     array([[ 0.990021,  0.371052,  0.996406],
            [ 0.384432,  0.605875,  0.361161],
            [ 0.367431,  0.339736,  0.816142]])
     Coordinates:
         x        (z) object ('a', 0) ('a', 1) ('b', 0)
       * y        (y) int64 0 1 2
       * z        (z) MultiIndex
       - one      (z) object 'a' 'a' 'b'
       - two      (z) int64 0 1 0
     ```
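The first case above can be turned into a small check. This sketch assumes an xarray release that includes the behavior this pull request settled on, where a dimension coordinate on the indexer that disagrees with the indexed object raises an IndexError; the names conflicting and matching are invented for the example, and other versions may behave differently.

```python
import xarray as xr

ds = xr.Dataset({"bar": ("x", [1, 2])}, coords={"x": [1, 2]})

# Indexer whose dimension coordinate disagrees with ds.x
conflicting = xr.DataArray([0, 1], dims="x", coords={"x": [3, 4]})
try:
    ds.isel(x=conflicting)
    raised = False
except IndexError:
    raised = True

# Indexer whose dimension coordinate agrees with the selected values of ds.x
matching = xr.DataArray([0, 1], dims="x", coords={"x": [1, 2]})
ok = ds.isel(x=matching)
```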

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: indexing with broadcasting 241578773
328396835 https://github.com/pydata/xarray/pull/1473#issuecomment-328396835 https://api.github.com/repos/pydata/xarray/issues/1473 MDEyOklzc3VlQ29tbWVudDMyODM5NjgzNQ== shoyer 1217238 2017-09-11T02:04:42Z 2017-09-11T02:04:42Z MEMBER

Just sent out a bunch of doc edits: https://github.com/fujiisoup/xarray/pull/4

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: indexing with broadcasting 241578773
327802435 https://github.com/pydata/xarray/pull/1473#issuecomment-327802435 https://api.github.com/repos/pydata/xarray/issues/1473 MDEyOklzc3VlQ29tbWVudDMyNzgwMjQzNQ== fujiisoup 6815844 2017-09-07T13:40:30Z 2017-09-07T13:40:30Z MEMBER

Added, with some code clean-ups.

(Although I prepared tests for it), I agree that boolean indexing with a different dimension name is a rare use case.

But I personally think this rule adds additional complexity. By analogy with np.ndarray indexing,

```python
da.values[(da.y > -1).values]
```

looks like just a mis-coding of

```python
da.values[:, (da.y > -1).values]
```

and this error may be the user's responsibility.

I think we should recommend da.isel(y=(da.y>-1)) instead.
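The NumPy analogy can be made concrete with a small array. This is a minimal sketch with invented names (arr, y_mask); a square array is used so that both forms run, which is exactly the situation in which the mistake goes unnoticed.

```python
import numpy as np

arr = np.arange(9).reshape(3, 3)          # axes play the roles of (x, y)
y_mask = np.array([True, False, True])    # a mask meant for the second axis

rows = arr[y_mask]      # applies the mask to the FIRST axis
cols = arr[:, y_mask]   # applies the mask to the second axis, as intended
```

Here rows keeps the first and last rows, while cols keeps the first and last columns. With a non-square array, arr[y_mask] would raise an IndexError because the mask length would not match the first axis.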

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: indexing with broadcasting 241578773
327670299 https://github.com/pydata/xarray/pull/1473#issuecomment-327670299 https://api.github.com/repos/pydata/xarray/issues/1473 MDEyOklzc3VlQ29tbWVudDMyNzY3MDI5OQ== shoyer 1217238 2017-09-07T03:02:18Z 2017-09-07T03:02:18Z MEMBER

Thinking about boolean indexing again. I think we should possibly only allow unlabeled boolean arrays or boolean arrays defined along the dimension they are indexing.

My concern is that otherwise, we will rule out the possibility of making data_array[boolean_key] equivalent to data_array.where(boolean_key, drop=True). For example, consider the current behavior with your branch:

```
In [29]: da = xr.DataArray(np.arange(100).reshape(10, 10), dims=['x', 'y'])

In [30]: da[da.x > -1]
Out[30]:
<xarray.DataArray (x: 10, y: 10)>
array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
       [20, 21, 22, 23, 24, 25, 26, 27, 28, 29],
       [30, 31, 32, 33, 34, 35, 36, 37, 38, 39],
       [40, 41, 42, 43, 44, 45, 46, 47, 48, 49],
       [50, 51, 52, 53, 54, 55, 56, 57, 58, 59],
       [60, 61, 62, 63, 64, 65, 66, 67, 68, 69],
       [70, 71, 72, 73, 74, 75, 76, 77, 78, 79],
       [80, 81, 82, 83, 84, 85, 86, 87, 88, 89],
       [90, 91, 92, 93, 94, 95, 96, 97, 98, 99]])
Dimensions without coordinates: x, y

In [31]: da[da.y > -1]
Out[31]:
<xarray.DataArray (y: 10)>
array([ 0, 11, 22, 33, 44, 55, 66, 77, 88, 99])
Dimensions without coordinates: y
```

The only way these can be guaranteed to be consistent with where(drop=True) is if we only allow the first indexing argument to index along the first dimension (outer/orthogonal indexing style).

I can see some potential use for boolean indexing with a different dimension name, but I suspect it would be pretty rare. There is also an easy workaround (using an integer indexer instead, which is also arguably clearer).
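For the case that stays allowed, a boolean array defined along the dimension it indexes, the equivalence with where(drop=True) can be checked directly. This is a sketch that assumes a released xarray in which both calls are supported; the array and mask are made up for illustration.

```python
import numpy as np
import xarray as xr

da = xr.DataArray(
    np.arange(12).reshape(3, 4),
    dims=["x", "y"],
    coords={"x": [0, 1, 2], "y": [0, 1, 2, 3]},
)
mask = da.y > 1  # boolean array defined along the dimension it indexes

by_isel = da.isel(y=mask)             # explicit about which dimension is indexed
by_where = da.where(mask, drop=True)  # masks, then drops all-False labels
```

Both results keep only y = 2 and y = 3 and hold the same values, although where() may promote the dtype to float because it has to be able to represent NaN.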

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: indexing with broadcasting 241578773
327194050 https://github.com/pydata/xarray/pull/1473#issuecomment-327194050 https://api.github.com/repos/pydata/xarray/issues/1473 MDEyOklzc3VlQ29tbWVudDMyNzE5NDA1MA== fujiisoup 6815844 2017-09-05T14:31:28Z 2017-09-05T14:31:28Z MEMBER

Thank you for the careful review. I updated most of the parts you pointed out, but not all. I will finish it up tomorrow.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: indexing with broadcasting 241578773
326857965 https://github.com/pydata/xarray/pull/1473#issuecomment-326857965 https://api.github.com/repos/pydata/xarray/issues/1473 MDEyOklzc3VlQ29tbWVudDMyNjg1Nzk2NQ== fujiisoup 6815844 2017-09-04T03:24:49Z 2017-09-04T03:24:49Z MEMBER

So for now, let's raise a FutureWarning

OK. Done.

if supplying a DataArray with array.coords[dim].values != array.values.

I think the condition is something like array.dims != (dim, ), where in the future version we will consider the dimension of indexers.
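A check along these lines could look like the following. This is only a sketch of the condition discussed here, not the code that went into xarray: the helper name warn_if_indexer_dims_differ and the warning text are invented for illustration.

```python
import warnings


def warn_if_indexer_dims_differ(dim, indexer_dims):
    """Hypothetical helper: warn when an indexer is not 1D along `dim`."""
    if tuple(indexer_dims) != (dim,):
        warnings.warn(
            f"Indexer for {dim!r} has dimensions {tuple(indexer_dims)}; "
            "reindex currently ignores them, but this may change in the future.",
            FutureWarning,
            stacklevel=2,
        )
        return True
    return False


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    same = warn_if_indexer_dims_differ("x", ("x",))       # no warning
    different = warn_if_indexer_dims_differ("x", ("z",))  # FutureWarning
```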

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: indexing with broadcasting 241578773
326818830 https://github.com/pydata/xarray/pull/1473#issuecomment-326818830 https://api.github.com/repos/pydata/xarray/issues/1473 MDEyOklzc3VlQ29tbWVudDMyNjgxODgzMA== shoyer 1217238 2017-09-03T17:28:46Z 2017-09-03T17:28:46Z MEMBER

I considered the rasm data, where the original object stays on the logical coordinates x and y. If we have conversion DataArrays, such as a table of x and y values as a function of target coordinates lat and lon, then the coordinate projection from (x, y) to (lat, lon) can be done by .sel(x=x, y=y, method='nearest'). This might be a kind of multi-dimensional reindex?

In such a use case, it would be better for sel (or multi-dimensional reindex) to return NaN than to raise an error.

I agree, this feels closer to a use-case for multi-dimensional reindex rather than sel.

Let's recall the use cases for these methods:

  • sel is for selecting data on its existing coordinates
  • reindex is for imposing new coordinates on data

So one possible way to define multi-dimensional reindexing would be as follows:

  • Given reindex arguments of the form dim=array where array is a 1D unlabeled array/list, convert them into DataArray(array, [(dim, array)]).
  • Do multi-dimensional indexing with broadcasting like sel, but fill in NaN for missing values (we could allow for customizing this with a fill_value argument).
  • Join coordinates like for sel, but coordinates from the indexers take precedence over coordinates from the object being indexed.

In practice, multi-dimensional reindex and sel are very similar if there is no overlap between coordinates on the indexer/indexed objects.

I also would like to try that, but it might be a bit tough and it would be better to do after the next release.

:+1:

So for now, let's raise a FutureWarning if supplying a DataArray with array.coords[dim].values != array.values.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: indexing with broadcasting 241578773
326778667 https://github.com/pydata/xarray/pull/1473#issuecomment-326778667 https://api.github.com/repos/pydata/xarray/issues/1473 MDEyOklzc3VlQ29tbWVudDMyNjc3ODY2Nw== fujiisoup 6815844 2017-09-03T01:30:49Z 2017-09-03T02:00:46Z MEMBER

for inexact indexing (e.g., method='nearest'), the result of reindex copies the index from the indexers, whereas the result of sel copies the index from the object being indexed

Yes, this is another difference, but if the indexer of sel has a coordinate, the behavior becomes closer to reindex.

I don't know quite what it would mean to reindex with a multi-dimensional indexer

I thought of this when working on the power-user example:

It would be really nice to also have a power-user example of pointwise indexing with 2D indexers and nearest-neighbor lookups, e.g., to switch to another coordinate projection. Something like ds.sel(latitude=latitude_grid, longitude=longitude_grid, method='nearest', tolerance=0.1).

As such an example, I considered the rasm data, where the original object stays on the logical coordinates x and y. If we have conversion DataArrays, such as a table of x and y values as a function of target coordinates lat and lon, then the coordinate projection from (x, y) to (lat, lon) can be done by .sel(x=x, y=y, method='nearest'). This might be a kind of multi-dimensional reindex?
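A toy version of this projection is sketched below, with a small made-up array standing in for the rasm data. It assumes an xarray release that includes vectorized indexing; x_of_latlon and y_of_latlon are hypothetical conversion tables giving, for each (lat, lon) point, the x and y value to look up.

```python
import numpy as np
import xarray as xr

da = xr.DataArray(
    np.arange(12).reshape(3, 4),
    dims=["x", "y"],
    coords={"x": [0, 1, 2], "y": [0, 1, 2, 3]},
)

# Conversion tables defined on the target (lat, lon) grid
x_of_latlon = xr.DataArray([[0.1, 1.9], [1.1, 0.2]], dims=["lat", "lon"])
y_of_latlon = xr.DataArray([[2.9, 0.1], [1.2, 2.2]], dims=["lat", "lon"])

# Pointwise nearest-neighbor lookup; the result takes the indexers' dimensions
projected = da.sel(x=x_of_latlon, y=y_of_latlon, method="nearest")
```

The result has dimensions (lat, lon), and each element is the value of da at the nearest (x, y) grid point. With sel, a point with no match (for example when a tolerance is given) raises an error instead of producing NaN, which is the limitation discussed here.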

In such a use case, it would be better for sel (or multi-dimensional reindex) to return NaN than to raise an error.

From a practical perspective, writing a version of vectorized indexing that fills in NaN could be non-trivial.

I agree. I also would like to try that, but it might be a bit tough and it would be better to do after the next release.

Maybe I need to switch to a much easier task for this example. Do you have any suggestions?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: indexing with broadcasting 241578773
326776669 https://github.com/pydata/xarray/pull/1473#issuecomment-326776669 https://api.github.com/repos/pydata/xarray/issues/1473 MDEyOklzc3VlQ29tbWVudDMyNjc3NjY2OQ== shoyer 1217238 2017-09-03T00:27:05Z 2017-09-03T00:27:05Z MEMBER

API question. .sel(x=[0.0, 1.0], method='nearest', tolerance=0.1) should work exactly the same as .reindex(x=[0.0, 1.0], method='nearest', tolerance=0.1)?

There are two key differences between sel and reindex:

  • reindex inserts NaN when there is not a match whereas sel raises an error
  • for inexact indexing (e.g., method='nearest'), the result of reindex copies the index from the indexers, whereas the result of sel copies the index from the object being indexed
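Both differences can be seen on a small array. This is a minimal sketch with made-up data, assuming a released xarray with the behavior described in this thread.

```python
import numpy as np
import xarray as xr

da = xr.DataArray([10.0, 20.0, 30.0], dims="x", coords={"x": [0.0, 1.0, 2.0]})

# Difference 1: a label with no match
reindexed = da.reindex(x=[0.0, 5.0])  # fills the missing label with NaN
try:
    da.sel(x=[0.0, 5.0])              # raises instead
    sel_raised = False
except KeyError:
    sel_raised = True

# Difference 2: which index the result carries for inexact matches
near_sel = da.sel(x=[0.1, 1.9], method="nearest")          # index from da
near_reindex = da.reindex(x=[0.1, 1.9], method="nearest")  # index from the indexer
```

Both nearest-neighbor results contain the same data; only the x coordinate differs, holding the matched labels for sel and the requested labels for reindex.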

My preference is to make sel work as reindex currently does and to gradually deprecate the reindex method, because the differences between these two methods are now very small.

I'm not sure this is desirable, because it's nice to have a way to do indexing that is guaranteed not to introduce missing values.

Currently, reindex only supports indexing with 1D arguments, and the values of those arguments are taken to be the new index coordinates. I don't know quite what it would mean to reindex with a multi-dimensional indexer -- I guess the result would gain multi-dimensional coordinate indexes? Also, when reindexing like ds.reindex(x=indexer), which coordinates take precedence on the result for x -- indexer.coords['x'] or indexer.values?

I do think there is a valid concern about consistency between sel() and reindex(). Right now, coordinates and dimensions on arguments to reindex are entirely ignored. If we are ever going to allow reindexing with multi-dimensional arguments (and broadcasting), we should consider raising an error or warning now when passed indexers with inconsistent dimensions/coordinates.

From a practical perspective, writing a version of vectorized indexing that fills in NaN could be non-trivial. To enable this under the hood, I think we would need a version of ndarray.__getitem__ that uses a sentinel value (e.g., -1) to fill in NaN instead of doing indexing. I guess this could probably be done with a combination of NumPy's advanced indexing plus a mask.
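The sentinel-plus-mask idea in the last paragraph can be sketched with plain NumPy. This is only an illustration under assumptions: take_with_fill is a hypothetical helper name, it handles a single 1-D array, and it is not how xarray implements anything.

```python
import numpy as np


def take_with_fill(values, indexer, fill_value=np.nan):
    """Index a 1-D array, writing fill_value wherever indexer == -1.

    Hypothetical helper for illustration; assumes `values` is non-empty.
    """
    values = np.asarray(values, dtype=float)  # float so that NaN is representable
    indexer = np.asarray(indexer)
    missing = indexer == -1
    # Swap the sentinel for a valid position so advanced indexing succeeds.
    safe = np.where(missing, 0, indexer)
    result = values[safe]  # advanced indexing returns a copy
    result[missing] = fill_value
    return result


out = take_with_fill([10, 20, 30], [[2, -1], [0, 1]])
```

The indexer may have any shape; the result takes the indexer's shape, with NaN at the masked positions.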

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: indexing with broadcasting 241578773
326753554 https://github.com/pydata/xarray/pull/1473#issuecomment-326753554 https://api.github.com/repos/pydata/xarray/issues/1473 MDEyOklzc3VlQ29tbWVudDMyNjc1MzU1NA== fujiisoup 6815844 2017-09-02T16:10:59Z 2017-09-02T16:10:59Z MEMBER

API question.

Should .sel(x=[0.0, 1.0], method='nearest', tolerance=0.1) work exactly the same as .reindex(x=[0.0, 1.0], method='nearest', tolerance=0.1)? Currently, the .sel method raises KeyError if there is no corresponding value in x, whereas reindex returns np.nan if there is no matching value.

My preference is to make sel work as reindex currently does and to gradually deprecate the reindex method, because the difference between these two methods is now very small.
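To make the difference concrete, here is a small NumPy-only sketch of the two behaviours for nearest-neighbour lookup with a tolerance. The function names (nearest_positions, sel_like, reindex_like) are hypothetical and only mimic the behaviour described above; they are not xarray code.

```python
import numpy as np


def nearest_positions(index, targets, tolerance):
    """Position of the nearest index value per target, or -1 if too far away."""
    index = np.asarray(index, dtype=float)
    targets = np.asarray(targets, dtype=float)
    # Pairwise distances, shape (len(index), len(targets)).
    distance = np.abs(index[:, np.newaxis] - targets[np.newaxis, :])
    positions = distance.argmin(axis=0)
    too_far = distance.min(axis=0) > tolerance
    return np.where(too_far, -1, positions)


def sel_like(index, values, targets, tolerance):
    """Raise if any target has no match (the sel behaviour described above)."""
    positions = nearest_positions(index, targets, tolerance)
    if (positions == -1).any():
        raise KeyError("not all values found in index")
    return np.asarray(values)[positions]


def reindex_like(index, values, targets, tolerance):
    """Fill NaN where a target has no match (the reindex behaviour)."""
    positions = nearest_positions(index, targets, tolerance)
    result = np.asarray(values, dtype=float)[positions]
    result[positions == -1] = np.nan
    return result


filled = reindex_like([0.0, 0.5, 2.0], [1, 2, 3], [0.0, 1.0], tolerance=0.1)
```

With the same arguments, sel_like raises KeyError because 1.0 has no neighbour within the tolerance.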

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: indexing with broadcasting 241578773
326367356 https://github.com/pydata/xarray/pull/1473#issuecomment-326367356 https://api.github.com/repos/pydata/xarray/issues/1473 MDEyOklzc3VlQ29tbWVudDMyNjM2NzM1Ng== shoyer 1217238 2017-08-31T17:32:02Z 2017-08-31T17:32:02Z MEMBER

I agree that this is pretty close! I will do another review shortly when I have time.

This is a really exciting feature and I am super excited to get it into v0.10 -- thanks again for your hard work on it! This goes a long way to completing xarray's labeled data model.

The consensus over in #1535 seems to be that we can go ahead without a deprecation warning.

Also: I agree that it would be great if others can test this, but we should also definitely make a release candidate for 0.10 to help iron out any bugs before the final release.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: indexing with broadcasting 241578773
326334294 https://github.com/pydata/xarray/pull/1473#issuecomment-326334294 https://api.github.com/repos/pydata/xarray/issues/1473 MDEyOklzc3VlQ29tbWVudDMyNjMzNDI5NA== fujiisoup 6815844 2017-08-31T15:36:49Z 2017-08-31T15:39:24Z MEMBER
  • [x] Closes #1444, #1436
  • [x] Tests added / passed
  • [x] Passes git diff master | flake8 --diff
  • [x] Fully documented, including whats-new.rst for all changes and api.rst for new API

I think I am getting close.

See the docs for details, but the essential change of this PR is that indexing ([], .loc[], .sel(), .isel()) now takes the indexers' dimensions into account. By passing xr.DataArray objects as indexers, we can realize many types of advanced indexing that were previously done by the special methods isel_points, sel_points, and reindex. (isel_points and sel_points are deprecated by this PR.)

If indexers have no named dimension (e.g. np.ndarray, integer, slice), then indexing behaves exactly the same way as in the current version. So this change should be compatible with almost all existing code.

All the existing tests now pass, and I added as many test cases as I could think of. However, I would like to ask members to use this branch for their daily work and make sure there is no inconvenience, because indexing is very fundamental and a single bug would affect every user significantly.

Any comments or thoughts are welcome.

(I largely refactored indexing.rst according to this change. I would also appreciate it very much if anyone could point out confusing or unnatural sentences.)

I am looking forward to seeing it in v0.10 :)

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: indexing with broadcasting 241578773
325141121 https://github.com/pydata/xarray/pull/1473#issuecomment-325141121 https://api.github.com/repos/pydata/xarray/issues/1473 MDEyOklzc3VlQ29tbWVudDMyNTE0MTEyMQ== fujiisoup 6815844 2017-08-26T15:59:18Z 2017-08-26T16:06:11Z MEMBER

I still think I would prefer including all coordinates from indexers unless there is already an existing coordinate of the same name.

OK. I agree. It might be the best. I will update the code.

But I can't yet imagine all the cases that are incompatible with the existing code. I am just wondering if we could bring in such a sudden change without warning.

Do we need an API change warning period?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: indexing with broadcasting 241578773
325066185 https://github.com/pydata/xarray/pull/1473#issuecomment-325066185 https://api.github.com/repos/pydata/xarray/issues/1473 MDEyOklzc3VlQ29tbWVudDMyNTA2NjE4NQ== shoyer 1217238 2017-08-26T00:52:03Z 2017-08-26T00:52:03Z MEMBER

Thinking about this more: I still think I would prefer including all coordinates from indexers unless there is already an existing coordinate of the same name.

Reasons why I like this rule:
1. It's simple and easy to explain, without any special cases.
2. If the coordinates are descriptive of the indexers, I think they're almost certainly still descriptive of the indexed results.
3. Users seem to be happier with keeping around metadata (e.g., attrs) even in cases where it may be slightly outdated than needing to propagate it manually. One reason may be that you only need to drop irrelevant metadata once from results, but keeping it around through operations that drop it requires updating each operation.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: indexing with broadcasting 241578773
324960486 https://github.com/pydata/xarray/pull/1473#issuecomment-324960486 https://api.github.com/repos/pydata/xarray/issues/1473 MDEyOklzc3VlQ29tbWVudDMyNDk2MDQ4Ng== shoyer 1217238 2017-08-25T15:50:22Z 2017-08-25T15:50:22Z MEMBER

- If ind.dims == (k, ) (the indexing DataArray has the same dimension as the dimension being indexed along), we neglect ind.coords[k].
- If ind.dims != (k, ) and ind.dims not in da.dims, then we attach a new coordinate ind.coords[ind.dims].
- If ind.dims != (k, ) and ind.dims in da.dims, then raise an Error.

I like these rules.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: indexing with broadcasting 241578773
321213528 https://github.com/pydata/xarray/pull/1473#issuecomment-321213528 https://api.github.com/repos/pydata/xarray/issues/1473 MDEyOklzc3VlQ29tbWVudDMyMTIxMzUyOA== fujiisoup 6815844 2017-08-09T10:10:58Z 2017-08-09T11:10:14Z MEMBER

@shoyer

Overwrite da['x'] by ind['x'] seems bad.

Agreed. I changed the code so that it does not overwrite the coordinate.

My inclination would be that it's OK to add coordinates from the indices as long as they aren't conflicting.

My current implementation does this, but I am still worried that even in this situation there may be similar unexpected behavior. For example, if an unexpected coordinate was attached by a previous indexing operation, the coordinate to be attached now would be silently neglected.

I think we may need a careful API decision. I am currently thinking (assuming indexers = {k: ind}):
+ If ind.dims == (k, ) (the indexing DataArray has the same dimension as the dimension being indexed along), we neglect ind.coords[k].
+ If ind.dims != (k, ) and ind.dims not in da.dims, then we attach a new coordinate ind.coords[ind.dims].
+ If ind.dims != (k, ) and ind.dims in da.dims, then raise an Error.
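The three rules can be written down as a small pure-Python sketch. The helper coords_to_attach and its arguments are hypothetical, the choice of IndexError is an assumption (the rules only say "raise an Error"), and "ind.dims in da.dims" is read here as "any dimension of ind already exists on da".

```python
def coords_to_attach(da_dims, k, ind_dims, ind_coords):
    """Return the coordinates of indexer `ind` to attach when indexing along k.

    Hypothetical helper for illustration only:
    da_dims    -- dimension names of the indexed object
    k          -- dimension being indexed along
    ind_dims   -- dimension names of the indexer
    ind_coords -- mapping of coordinate name to values on the indexer
    """
    ind_dims = tuple(ind_dims)
    # Rule 1: the indexer lives on the indexed dimension, so ignore its coords.
    if ind_dims == (k,):
        return {}
    # Rule 3: the indexer uses a dimension that already exists on the object.
    clashing = [d for d in ind_dims if d in da_dims]
    if clashing:
        raise IndexError("indexer dimensions %r conflict with the indexed "
                         "object" % (clashing,))
    # Rule 2: new dimensions, so carry over their dimension coordinates.
    return {d: ind_coords[d] for d in ind_dims if d in ind_coords}


same_dim = coords_to_attach(('x', 'y'), 'x', ('x',), {'x': [0.1, 0.2]})
new_dim = coords_to_attach(('x', 'y'), 'x', ('a',), {'a': [0.1, 0.2]})
```

Here same_dim is empty (rule 1) and new_dim carries the 'a' coordinate (rule 2); passing ind_dims=('y',) would raise (rule 3).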

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: indexing with broadcasting 241578773
321148108 https://github.com/pydata/xarray/pull/1473#issuecomment-321148108 https://api.github.com/repos/pydata/xarray/issues/1473 MDEyOklzc3VlQ29tbWVudDMyMTE0ODEwOA== shoyer 1217238 2017-08-09T04:16:03Z 2017-08-09T04:16:03Z MEMBER

@fujiisoup you're asking good questions about how to handle coordinates.

I don't have a lot of time to think about this right now, but really briefly:
- Overwrite da['x'] by ind['x'] seems bad. This seems contrary to how indexing is supposed to work. I don't think we ever want to override/change coordinates in the indexed object.
- My inclination would be that it's OK to add coordinates from the indices as long as they aren't conflicting (with either other indices or the indexed object). This will cause some increase in the number of coordinates but I don't think this would be too bad.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: indexing with broadcasting 241578773
321119003 https://github.com/pydata/xarray/pull/1473#issuecomment-321119003 https://api.github.com/repos/pydata/xarray/issues/1473 MDEyOklzc3VlQ29tbWVudDMyMTExOTAwMw== fujiisoup 6815844 2017-08-09T00:32:46Z 2017-08-09T00:32:46Z MEMBER

Another case I am wondering about is when the indexing DataArray has a coordinate with the same name but different values:

```python
In [1]: import numpy as np
   ...: import xarray as xr
   ...:
   ...: da = xr.DataArray(np.arange(3 * 2).reshape(3, 2), dims=['x', 'y'],
   ...:                   coords={'x': [0, 1, 2], 'y': ['a', 'b']})
   ...: da  # indexed DataArray
   ...:
Out[1]:
<xarray.DataArray (x: 3, y: 2)>
array([[0, 1],
       [2, 3],
       [4, 5]])
Coordinates:
  * x        (x) int64 0 1 2
  * y        (y) <U1 'a' 'b'

In [2]: ind = xr.DataArray([2, 1], dims=['x'], coords={'x': [0.1, 0.2]})
   ...: ind  # indexing DataArray. This also has 'x'
   ...:
Out[2]:
<xarray.DataArray (x: 2)>
array([2, 1])
Coordinates:
  * x        (x) float64 0.1 0.2

In [3]: da.isel(x=ind.variable)
   ...:
Out[3]:
<xarray.DataArray (x: 2, y: 2)>
array([[4, 5],
       [2, 3]])
Coordinates:
  * x        (x) int64 2 1
  * y        (y) <U1 'a' 'b'

In [4]: da.isel(x=ind)  # Overwrite da['x'] by ind['x']
   ...:
Out[4]:
<xarray.DataArray (x: 2, y: 2)>
array([[4, 5],
       [2, 3]])
Coordinates:
  * x        (x) float64 0.1 0.2
  * y        (y) <U1 'a' 'b'
```

Currently, the original coordinate is overwritten by the indexer's coordinate, but this may cause an unintentional change of the coordinate values. Maybe we should keep the original coordinate in such a case or raise an Exception?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: indexing with broadcasting 241578773
321115367 https://github.com/pydata/xarray/pull/1473#issuecomment-321115367 https://api.github.com/repos/pydata/xarray/issues/1473 MDEyOklzc3VlQ29tbWVudDMyMTExNTM2Nw== fujiisoup 6815844 2017-08-09T00:06:10Z 2017-08-09T00:06:10Z MEMBER

I am wondering what indexing by a DataArray should look like, in particular if the indexing DataArray has coordinates.

In my current implementation, it behaves as follows:

```python
In [1]: import numpy as np
   ...: import xarray as xr
   ...:
   ...: da = xr.DataArray(np.arange(3 * 2).reshape(3, 2), dims=['x', 'y'],
   ...:                   coords={'x': [0, 1, 2], 'y': ['a', 'b']})
   ...: da
   ...:
Out[1]:
<xarray.DataArray (x: 3, y: 2)>
array([[0, 1],
       [2, 3],
       [4, 5]])
Coordinates:
  * x        (x) int64 0 1 2
  * y        (y) <U1 'a' 'b'

In [2]: ind = xr.DataArray([2, 1], dims=['a'],
   ...:                    coords={'a': [0.1, 0.2], 'time': (('a', ), [10, 20])})
   ...: ind  # 'a': coordinate, 'time': non-dimension coordinate
   ...:
Out[2]:
<xarray.DataArray (a: 2)>
array([2, 1])
Coordinates:
  * a        (a) float64 0.1 0.2
    time     (a) int64 10 20

In [3]: da.isel(x=ind)
Out[3]:
<xarray.DataArray (a: 2, y: 2)>
array([[4, 5],
       [2, 3]])
Coordinates:
    x        (a) int64 2 1
  * y        (y) <U1 'a' 'b'
  * a        (a) float64 0.1 0.2
```

I think we should keep the (indexed version of the) original coordinate (da['x']) even after the indexing. We may also need ind['a'] as a new coordinate. Can we ignore the non-dimension coordinate in the indexer (ind['time'])?

I am slightly worried that after repeated indexing the number of coordinates may unintentionally increase.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: indexing with broadcasting 241578773
320493296 https://github.com/pydata/xarray/pull/1473#issuecomment-320493296 https://api.github.com/repos/pydata/xarray/issues/1473 MDEyOklzc3VlQ29tbWVudDMyMDQ5MzI5Ng== fujiisoup 6815844 2017-08-06T08:28:14Z 2017-08-06T08:28:14Z MEMBER

@shoyer Thanks for your help! I will take a look.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: indexing with broadcasting 241578773
320491761 https://github.com/pydata/xarray/pull/1473#issuecomment-320491761 https://api.github.com/repos/pydata/xarray/issues/1473 MDEyOklzc3VlQ29tbWVudDMyMDQ5MTc2MQ== shoyer 1217238 2017-08-06T07:51:30Z 2017-08-06T07:51:30Z MEMBER

I opened another PR to your branch for consolidating the logic between dask and numpy vindex: https://github.com/fujiisoup/xarray/pull/3

You will need to install dask master to run the full test suite, but it appears to be working! The logic is slightly tricky because even with vindex we need to reorder the sliced dimensions sometimes.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: indexing with broadcasting 241578773
319113096 https://github.com/pydata/xarray/pull/1473#issuecomment-319113096 https://api.github.com/repos/pydata/xarray/issues/1473 MDEyOklzc3VlQ29tbWVudDMxOTExMzA5Ng== shoyer 1217238 2017-07-31T15:55:27Z 2017-07-31T15:55:27Z MEMBER

I think similar logic (flatten -> lookup -> reshape) will be necessary to improve the .sel method (or the indexing.get_indexer() function), as our new sel should support multi-dimensional lookup.

Yes, agreed!

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: indexing with broadcasting 241578773
319028665 https://github.com/pydata/xarray/pull/1473#issuecomment-319028665 https://api.github.com/repos/pydata/xarray/issues/1473 MDEyOklzc3VlQ29tbWVudDMxOTAyODY2NQ== fujiisoup 6815844 2017-07-31T10:22:44Z 2017-07-31T10:23:04Z MEMBER

Thanks, @shoyer. I will take a look shortly.

Added pointwise indexing support for dask using vindex.

Thanks. It's a great help! I think similar logic (flatten -> lookup -> reshape) will be necessary to improve the .sel method (or the indexing.get_indexer() function), as our new sel should support multi-dimensional lookup.
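A minimal NumPy sketch of the flatten -> lookup -> reshape pattern for exact label matching. The function positions_nd is a hypothetical name for illustration, it assumes the index holds unique values, and it is not the actual get_indexer implementation.

```python
import numpy as np


def positions_nd(index, labels):
    """Look up integer positions of N-dimensional `labels` in a 1-D `index`.

    Hypothetical helper; assumes `index` is non-empty with unique values.
    """
    index = np.asarray(index)
    labels = np.asarray(labels)
    flat = labels.ravel()                                # flatten
    order = np.argsort(index)
    found = np.searchsorted(index, flat, sorter=order)   # lookup
    found = np.clip(found, 0, len(index) - 1)
    positions = order[found]
    if not np.array_equal(index[positions], flat):
        raise KeyError("not all labels found in index")
    return positions.reshape(labels.shape)               # reshape


pos = positions_nd([10, 30, 20], [[20, 10], [30, 30]])
```

The 1-D lookup does all the work; the only multi-dimensional part is remembering the original shape of the labels and restoring it at the end.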

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: indexing with broadcasting 241578773
318975595 https://github.com/pydata/xarray/pull/1473#issuecomment-318975595 https://api.github.com/repos/pydata/xarray/issues/1473 MDEyOklzc3VlQ29tbWVudDMxODk3NTU5NQ== shoyer 1217238 2017-07-31T05:59:46Z 2017-07-31T06:00:03Z MEMBER

I added a few more commits to my PR to your branch (https://github.com/fujiisoup/xarray/pull/2):
- Reorganized test_variable.py. TestVariable_withDask is a good idea, but it needs to inherit from VariableSubclassTestCases, not TestVariable. Otherwise many of the base-class tests get run twice.
- Added pointwise indexing support for dask using vindex. The logic is somewhat convoluted but I think it works (review would be appreciated!). This will let us deprecate isel_points/sel_points.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: indexing with broadcasting 241578773
318813583 https://github.com/pydata/xarray/pull/1473#issuecomment-318813583 https://api.github.com/repos/pydata/xarray/issues/1473 MDEyOklzc3VlQ29tbWVudDMxODgxMzU4Mw== fujiisoup 6815844 2017-07-29T08:26:30Z 2017-07-29T08:26:30Z MEMBER

@shoyer

Thanks for the detailed review.

I don't think we want to index non-xarray types with IndexerTuple subclasses. It's probably best to convert them into base tuple() objects before indexing.

Yes. Actually, some backends seem to check something like if type(key) is tuple when key is empty. So, in my previous implementation, I manually converted the instance type in the case of an empty tuple. I have now added a to_tuple() method to the IndexerTuple class and call it in all the basic ArrayWrappers. I think this made the code a bit cleaner.
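A short sketch of why the conversion matters: an exact type check such as type(key) is tuple fails for any tuple subclass. The to_tuple() body below is a guess at a minimal implementation, not the code from the PR.

```python
class IndexerTuple(tuple):
    """Base class for indexing tuples (sketch)."""

    def __repr__(self):
        return type(self).__name__ + tuple.__repr__(self)

    def to_tuple(self):
        # Assumed implementation: build a plain tuple with the same items.
        return tuple(self)


class BasicIndexer(IndexerTuple):
    """Tuple for basic indexing."""


key = BasicIndexer(())

passes_isinstance = isinstance(key, tuple)        # True: subclass of tuple
passes_exact_check = type(key) is tuple           # False: exact check fails
converted_passes = type(key.to_tuple()) is tuple  # True after conversion
```

Backends that use isinstance would accept the subclass directly; the conversion is only needed for code that compares types exactly.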

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: indexing with broadcasting 241578773
318111741 https://github.com/pydata/xarray/pull/1473#issuecomment-318111741 https://api.github.com/repos/pydata/xarray/issues/1473 MDEyOklzc3VlQ29tbWVudDMxODExMTc0MQ== shoyer 1217238 2017-07-26T16:41:40Z 2017-07-26T16:41:40Z MEMBER

I'm slightly hesitant to deprecate this indexing in this PR. I guess it should be handled in a separate issue. (Some tests assume this indexing behavior.)

OK, we can certainly discuss this more broadly. But I think this test was just broken.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: indexing with broadcasting 241578773
317902607 https://github.com/pydata/xarray/pull/1473#issuecomment-317902607 https://api.github.com/repos/pydata/xarray/issues/1473 MDEyOklzc3VlQ29tbWVudDMxNzkwMjYwNw== fujiisoup 6815844 2017-07-25T23:29:10Z 2017-07-25T23:29:10Z MEMBER

No, for 1D boolean arrays I think we should insist that sizes match exactly.

OK. Thanks for the suggestion.

I'm slightly hesitant to deprecate this indexing in this PR. I guess it should be handled in a separate issue. (Some tests assume this indexing behavior.)

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: indexing with broadcasting 241578773
317795244 https://github.com/pydata/xarray/pull/1473#issuecomment-317795244 https://api.github.com/repos/pydata/xarray/issues/1473 MDEyOklzc3VlQ29tbWVudDMxNzc5NTI0NA== shoyer 1217238 2017-07-25T16:35:09Z 2017-07-25T16:35:09Z MEMBER

As you suggested, I prepared our own indexer class. I think the code became much cleaner.

Thanks! I'll take a look shortly.

Maybe this behavior was deprecated in numpy?

Yes, I think NumPy has deprecated/removed this sort of indexing.

In my current PR, the boolean index is simply converted to an integer array by the .nonzero() method, so xarray accepts a boolean array whose size differs from that of the indexed axis. Is this what we want?

No, for 1D boolean arrays I think we should insist that sizes match exactly. There is no obvious map between boolean indexer positions and array values otherwise.
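A sketch of the stricter behaviour described here: convert a boolean indexer with .nonzero(), but only after checking that it is 1-D and matches the axis size. The helper name boolean_to_positions and the error messages are assumptions for illustration.

```python
import numpy as np


def boolean_to_positions(mask, axis_size):
    """Convert a 1-D boolean indexer to integer positions (hypothetical helper)."""
    mask = np.asarray(mask)
    if mask.dtype != bool:
        raise TypeError("expected a boolean indexer")
    if mask.ndim != 1:
        raise IndexError("multi-dimensional boolean indexers are not supported")
    if mask.shape[0] != axis_size:
        # Without this check, a short mask would silently select by position.
        raise IndexError("boolean indexer has size %d but the indexed axis "
                         "has size %d" % (mask.shape[0], axis_size))
    return mask.nonzero()[0]


positions = boolean_to_positions(np.arange(10) < 5, axis_size=10)
```

A mask such as np.arange(8) < 5 applied to an axis of size 10 raises IndexError instead of being accepted.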

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: indexing with broadcasting 241578773
317742791 https://github.com/pydata/xarray/pull/1473#issuecomment-317742791 https://api.github.com/repos/pydata/xarray/issues/1473 MDEyOklzc3VlQ29tbWVudDMxNzc0Mjc5MQ== fujiisoup 6815844 2017-07-25T13:48:12Z 2017-07-25T13:48:12Z MEMBER

@shoyer Thanks for the suggestion. As you suggested, I prepared our own indexer class. I think the code became much cleaner.

I struggled with the boolean index behavior np.random.randn(10, 20)[np.arange(8) < 5], which works on my laptop but fails on Travis. Maybe this behavior was deprecated in numpy?

In my current PR, the boolean index is simply converted to an integer array by the .nonzero() method, so xarray accepts a boolean array whose size differs from that of the indexed axis. Is this what we want?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: indexing with broadcasting 241578773
317465968 https://github.com/pydata/xarray/pull/1473#issuecomment-317465968 https://api.github.com/repos/pydata/xarray/issues/1473 MDEyOklzc3VlQ29tbWVudDMxNzQ2NTk2OA== shoyer 1217238 2017-07-24T15:48:33Z 2017-07-24T15:48:33Z MEMBER

With the current logic, we normalize everything into a standard indexer tuple in Variable.__getitem__. I think we should explicitly create different kinds of indexers, and then handle them explicitly in various backends/array wrappers, e.g.,

```python
# in core/indexing.py

class IndexerTuple(tuple):
    """Base class for xarray indexing tuples."""

    def __repr__(self):
        """Repr that shows type name."""
        return type(self).__name__ + tuple.__repr__(self)


class BasicIndexer(IndexerTuple):
    """Tuple for basic indexing."""


class OuterIndexer(IndexerTuple):
    """Tuple for outer/orthogonal indexing (.oindex)."""


class VectorizedIndexer(IndexerTuple):
    """Tuple for vectorized indexing (.vindex)."""


# in core/variable.py

class Variable(...):
    def _broadcast_indexes(self, key):
        # return a BasicIndexer if possible, otherwise an OuterIndexer if possible
        # and finally a VectorizedIndexer


# in adapters for various backends/storage types

class DaskArrayAdapter(...):
    def __getitem__(self, key):
        if isinstance(key, VectorizedIndexer):
            raise IndexError("dask doesn't yet support vectorized indexing")
        ...
```

This is a little more work at the outset because we have to handle each indexer type in each backend, but it avoids the error prone broadcasting/un-broadcasting logic.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: indexing with broadcasting 241578773
317247671 https://github.com/pydata/xarray/pull/1473#issuecomment-317247671 https://api.github.com/repos/pydata/xarray/issues/1473 MDEyOklzc3VlQ29tbWVudDMxNzI0NzY3MQ== fujiisoup 6815844 2017-07-23T11:51:00Z 2017-07-23T11:51:00Z MEMBER

1. Backends support only "basic indexing" (int and slice). This is pretty common.
2. Backends support some of the "advanced indexing" use cases but not everything (e.g., restricted to at most one list). This is also pretty common (e.g., dask and h5py).
3. Backends support "orthogonal indexing" instead of advanced indexing. NetCDF4 does this (but performance can be pretty terrible).
4. Backends support NumPy's fully vectorized "advanced indexing". This is quite rare -- I've only seen this for backends that actually store their data in the form of NumPy arrays (e.g., scipy.io.netcdf).

I am wondering what the cleanest design is. Because cases 3 and 4 that you suggested are pretty much mutually exclusive, I tried to distinguish cases 1, 2, and 4 in the Variable._broadcast_indexes(key) method. For backends that only accept orthogonal indexing, I think case 4 indexers can be orthogonalized in each ArrayWrapper (by the indexing._unbroadcast_indexes function).
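The relationship between orthogonal and vectorized indexers can be sketched in NumPy. The unbroadcast helper below is hypothetical and much simpler than whatever indexing._unbroadcast_indexes does: it only flattens indexers that each vary along a single axis, and does not check that different indexers vary along different axes.

```python
import numpy as np

arr = np.arange(12).reshape(3, 4)
rows = np.array([0, 2])
cols = np.array([1, 3, 0])

# Orthogonal (outer) indexing: each 1-D indexer applies to its own axis.
outer = arr[rows][:, cols]

# Vectorized indexing: the same selection using indexers broadcast against
# each other, here with shapes (2, 1) and (1, 3).
b_rows, b_cols = np.ix_(rows, cols)
vectorized = arr[b_rows, b_cols]


def unbroadcast(key):
    """Flatten broadcast indexers back to 1-D (hypothetical helper)."""
    flat = []
    for k in key:
        k = np.asarray(k)
        if sum(size > 1 for size in k.shape) > 1:
            raise IndexError("indexer varies along more than one axis")
        flat.append(k.ravel())
    return tuple(flat)


recovered = unbroadcast((b_rows, b_cols))
```

Both selections give the same (2, 3) result, and recovered holds the original 1-D rows and cols, which is the form an orthogonal-only backend could accept.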

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: indexing with broadcasting 241578773
317137217 https://github.com/pydata/xarray/pull/1473#issuecomment-317137217 https://api.github.com/repos/pydata/xarray/issues/1473 MDEyOklzc3VlQ29tbWVudDMxNzEzNzIxNw== fujiisoup 6815844 2017-07-21T23:54:38Z 2017-07-21T23:54:38Z MEMBER

Currently, this line in backends/rasterio_.py fails. This is because the new indexing logic converts integer arrays into slices wherever possible, e.g. [0, 2] is converted to slice(0, 3, 2), which is currently regarded as an invalid indexer in rasterio.

However, other array wrappers seem to require the automatic slice conversion. As I am not familiar with rasterio, I would appreciate it if anyone could help.
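For readers unfamiliar with this conversion, a pure-Python sketch of turning evenly spaced integer positions into an equivalent slice. as_slice_if_possible is a hypothetical helper, not the function used in the PR, and it only handles non-negative, increasing positions.

```python
def as_slice_if_possible(positions):
    """Return an equivalent slice for evenly spaced, increasing positions.

    Hypothetical helper; returns the positions unchanged (as a list) when
    no equivalent slice with a positive step exists.
    """
    positions = list(positions)
    if len(positions) < 2 or positions[0] < 0:
        return positions
    step = positions[1] - positions[0]
    if step <= 0:
        return positions
    if any(b - a != step for a, b in zip(positions, positions[1:])):
        return positions
    return slice(positions[0], positions[-1] + 1, step)


converted = as_slice_if_possible([0, 2])
unchanged = as_slice_if_possible([0, 2, 3])
```

The first call yields slice(0, 3, 2), the case mentioned above; a backend that rejects strided slices would need the conversion skipped or reversed.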

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: indexing with broadcasting 241578773
317012255 https://github.com/pydata/xarray/pull/1473#issuecomment-317012255 https://api.github.com/repos/pydata/xarray/issues/1473 MDEyOklzc3VlQ29tbWVudDMxNzAxMjI1NQ== fujiisoup 6815844 2017-07-21T14:13:44Z 2017-07-21T14:13:44Z MEMBER

I think we'll also want to make a "vectorized to orthogonal" indexing adapter that we can use with netCDF4.Variable

I implemented BroadcastIndexedAdapter, which converts a broadcasted indexer back to an orthogonal indexer. The former LazilyIndexedArray is renamed to OrthogonalLazilyIndexedArray, and the new LazilyIndexedArray now accepts broadcasted indexers. (Some tests related to backends still fail.)

Now some array adapters accept orthogonal indexers and others accept broadcasted indexers. I think this is a little confusing. Maybe clearer terminology is necessary?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: indexing with broadcasting 241578773
315746067 https://github.com/pydata/xarray/pull/1473#issuecomment-315746067 https://api.github.com/repos/pydata/xarray/issues/1473 MDEyOklzc3VlQ29tbWVudDMxNTc0NjA2Nw== fujiisoup 6815844 2017-07-17T12:49:46Z 2017-07-17T12:50:55Z MEMBER

@shoyer Thanks for your help.

Let's not worry about supporting every indexing type with dask.

Yes. Thanks to your patch, dask-based variable is now indexed fine.

Some replies to your comments on the outdated code:
+ multidimensional boolean indexer
  Agreed. I added a sanity check that raises IndexError for a multi-dimensional boolean array.
+ indexer type in DaskIndexingAdapter
  Because I changed how indexing.broadcasted_indexable (formerly indexing.orthogonally_indexable) is called, indexers passed to DaskIndexingAdapter are already broadcasted to Variables (in case of _broadcast_indexes_advanced).

I will try to adapt the other array wrappers (LazilyIndexedArray, CopyOnWriteArray, MemoryCachedArray), whose tests currently fail, to the broadcasted indexers. (Maybe I will need more help.)

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: indexing with broadcasting 241578773
315659641 https://github.com/pydata/xarray/pull/1473#issuecomment-315659641 https://api.github.com/repos/pydata/xarray/issues/1473 MDEyOklzc3VlQ29tbWVudDMxNTY1OTY0MQ== shoyer 1217238 2017-07-17T02:58:57Z 2017-07-17T02:58:57Z MEMBER

Let's not worry about supporting every indexing type with dask. I think that with my patch we can do everything we currently do. We'll want vindex support eventually as well so we can remove isel_points(), but that can come later.

I think we'll also want to make a "vectorized to orthogonal" indexing adapter that we can use with netCDF4.Variable.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: indexing with broadcasting 241578773
315631661 https://github.com/pydata/xarray/pull/1473#issuecomment-315631661 https://api.github.com/repos/pydata/xarray/issues/1473 MDEyOklzc3VlQ29tbWVudDMxNTYzMTY2MQ== shoyer 1217238 2017-07-16T19:35:40Z 2017-07-16T19:35:40Z MEMBER

dask.array does support a limited form of fancy indexing via .vindex. I think we already use it in .isel_points(). It would certainly be better to use that and error in edge cases rather than silently converting to numpy arrays, which we never want to do.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: indexing with broadcasting 241578773
315593569 https://github.com/pydata/xarray/pull/1473#issuecomment-315593569 https://api.github.com/repos/pydata/xarray/issues/1473 MDEyOklzc3VlQ29tbWVudDMxNTU5MzU2OQ== fujiisoup 6815844 2017-07-16T08:19:11Z 2017-07-16T08:19:11Z MEMBER

I just realized that dask's indexing is limited, e.g. it does not support nd-array indexing. I will try to work around this issue, but I am not very familiar with dask. If anyone has an idea for this, it would be helpful.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: indexing with broadcasting 241578773
314718320 https://github.com/pydata/xarray/pull/1473#issuecomment-314718320 https://api.github.com/repos/pydata/xarray/issues/1473 MDEyOklzc3VlQ29tbWVudDMxNDcxODMyMA== fujiisoup 6815844 2017-07-12T10:13:48Z 2017-07-12T10:13:48Z MEMBER

Thanks, @shoyer. I updated the _broadcast_indexes method based on your reference script.

As you pointed out, we may need a better error message here. I guess we should raise our own Exception class in as_variable and replace the message here? Duplicating the as_variable function only for a better exception message sounds like a bad idea.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: indexing with broadcasting 241578773

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);