pull_requests: 122418207
id: 122418207
node_id: MDExOlB1bGxSZXF1ZXN0MTIyNDE4MjA3
number: 1426
state: closed
locked: 0
title: scalar_level in MultiIndex
user: 6815844
created_at: 2017-05-25T11:03:05Z
updated_at: 2019-01-14T21:20:28Z
closed_at: 2019-01-14T21:20:27Z
merged_at:
merge_commit_sha: 5821b1de3713a3513bdce890e77999fd4c4b0688
assignee:
milestone:
draft: 0
head: 38dbbbca748b0f22d1c49d63e5e5524ac093295f
base: bb87a9441d22b390e069d0fde58f297a054fd98a
author_association: MEMBER
auto_merge:
repo: 13221727
url: https://github.com/pydata/xarray/pull/1426
merged_by:

body:

- [x] Closes #1408
- [x] Tests added / passed
- [x] Passes ``git diff upstream/master | flake8 --diff``
- [ ] Fully documented, including `whats-new.rst` for all changes and `api.rst` for new API

[Edit for more clarity] I restarted a new branch to fix #1408 (I closed the older one, #1412). Because the changes I made are relatively large, I summarize this PR here.

# Summary

In this PR, I added two new kinds of levels to MultiIndex: `index-level` and `scalar-level`. An `index-level` is an ordinary level in a MultiIndex (as in the current implementation), while a `scalar-level` indicates a dropped level (newly added in this PR).

# Changes in behavior

1. Indexing a scalar at a particular level changes that level to a `scalar-level` instead of dropping it (changed from #767).
2. Indexing a scalar from a MultiIndex now yields a `MultiIndex-scalar` rather than a scalar tuple.
3. Indexing along an `index-level` is enabled if the MultiIndex has only a single `index-level`.

Examples of the output are shown below. Any suggestions on these behaviors are welcome.

```python
In [1]: import numpy as np
   ...: import xarray as xr
   ...:
   ...: ds1 = xr.Dataset({'foo': (('x',), [1, 2, 3])}, {'x': [1, 2, 3], 'y': 'a'})
   ...: ds2 = xr.Dataset({'foo': (('x',), [4, 5, 6])}, {'x': [1, 2, 3], 'y': 'b'})
   ...: # example data
   ...: ds = xr.concat([ds1, ds2], dim='y').stack(yx=['y', 'x'])
   ...: ds
Out[1]:
<xarray.Dataset>
Dimensions:  (yx: 6)
Coordinates:
  * yx       (yx) MultiIndex
  - y        (yx) object 'a' 'a' 'a' 'b' 'b' 'b'  # <--- this is index-level
  - x        (yx) int64 1 2 3 1 2 3               # <--- this is also index-level
Data variables:
    foo      (yx) int64 1 2 3 4 5 6

In [2]: # 1. indexing a scalar converts `index-level` x to `scalar-level`.
   ...: ds.sel(x=1)
Out[2]:
<xarray.Dataset>
Dimensions:  (yx: 2)
Coordinates:
  * yx       (yx) MultiIndex
  - y        (yx) object 'a' 'b'  # <--- this is index-level
  - x        int64 1              # <--- this is scalar-level
Data variables:
    foo      (yx) int64 1 4

In [3]: # 2. indexing a single element from a MultiIndex makes a `MultiIndex-scalar`
   ...: ds.isel(yx=0)
Out[3]:
<xarray.Dataset>
Dimensions:  ()
Coordinates:
    yx       MultiIndex  # <--- this is MultiIndex-scalar
  - y        <U1 'a'
  - x        int64 1
Data variables:
    foo      int64 1

In [6]: # 3. enables selecting along an `index-level` if only one `index-level` exists in the MultiIndex
   ...: ds.sel(x=1).isel(y=[0, 1])
Out[6]:
<xarray.Dataset>
Dimensions:  (yx: 2)
Coordinates:
  * yx       (yx) MultiIndex
  - y        (yx) object 'a' 'b'
  - x        int64 1
Data variables:
    foo      (yx) int64 1 4
```

# Changes in the public APIs

Some changes to the public APIs were necessary, though I tried to minimize them.

+ The `level_names` and `get_level_values` methods were moved from `IndexVariable` to `Variable`. This is because `IndexVariable` cannot handle a 0-d array, which I want to support in behavior change 2 above.
+ `scalar_level_names` and `all_level_names` properties were added to `Variable`.
+ A `reset_levels` method was added to the `Variable` class to control `scalar-level` and `index-level`.

# Implementation summary

The main change in the implementation is the addition of our own wrapper of `pd.MultiIndex`, `PandasMultiIndexAdapter`. This does most of the `MultiIndex`-related operations, such as indexing, concatenation, and conversion between `scalar-level` and `index-level`.

# What we can do now

The main merit of this proposal is that it enables us to handle a `MultiIndex` in a way more consistent with a normal `Variable`. Now we can

+ recover a MultiIndex with a dropped level:

```python
In [5]: ds.sel(x=1).expand_dims('x')
Out[5]:
<xarray.Dataset>
Dimensions:  (yx: 2)
Coordinates:
  * yx       (yx) MultiIndex
  - y        (yx) object 'a' 'b'
  - x        (yx) int64 1 1
Data variables:
    foo      (yx) int64 1 4
```

+ construct a MultiIndex by concatenation of MultiIndex-scalars:

```python
In [8]: xr.concat([ds.isel(yx=i) for i in range(len(ds['yx']))], dim='yx')
Out[8]:
<xarray.Dataset>
Dimensions:  (yx: 6)
Coordinates:
  * yx       (yx) MultiIndex
  - y        (yx) object 'a' 'a' 'a' 'b' 'b' 'b'
  - x        (yx) int64 1 2 3 1 2 3
Data variables:
    foo      (yx) int64 1 2 3 4 5 6
```

# What we cannot do now

With the current implementation, we can do

```python
ds.sel(y='a').rolling(x=2)
```

but with this PR we cannot, because `x` is not yet an ordinary coordinate, but a MultiIndex with a single `index-level`. I think it would be better if we could handle such a single-`index-level` MultiIndex in a way very similar to an ordinary coordinate. Similarly, we can no longer do `ds.sel(y='a').mean(dim='x')`, nor `ds.sel(y='a').to_netcdf('file')` (#719).

# What needs to be decided

+ How to `repr` these new levels (the current formatting is shown in Out[2] and Out[3] above).
+ Are terms such as `index-level`, `scalar-level`, and `MultiIndex-scalar` clear enough?
+ How many operations should we support for a single-`index-level` MultiIndex? Do we support `ds.sel(y='a').rolling(x=2)` and `ds.sel(y='a').mean(dim='x')`?

# TODOs

- [ ] Support indexing with DataArray, `ds.sel(x=ds.x[0])`
- [ ] Support the `stack`, `unstack`, `set_index`, and `reset_index` methods with a `scalar-level` MultiIndex
- [ ] Add full documentation
- [ ] Clean up the code related to MultiIndex
- [ ] Fix issues (#1428, #1430, #1431) related to MultiIndex
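For background, the level-dropping behavior that this PR's `scalar-level` is designed to avoid comes from `pandas.MultiIndex` itself: selecting a single label at one level normally returns an index with that level removed, so the information that `x == 1` is lost. A minimal pandas-only sketch of that behavior (the variable names here are illustrative, not from the PR):

```python
import pandas as pd

# Build a MultiIndex like the stacked ('y', 'x') coordinate in the PR example.
idx = pd.MultiIndex.from_product([['a', 'b'], [1, 2, 3]], names=['y', 'x'])
s = pd.Series([1, 2, 3, 4, 5, 6], index=idx)

# Selecting x == 1 with xs() drops the 'x' level entirely: the result only
# remembers 'y'. This is the information loss the proposed `scalar-level`
# is meant to prevent.
sel = s.xs(1, level='x')
print(list(sel.index.names))  # ['y'] -- the 'x' level is gone
print(sel.tolist())           # [1, 4]

# xs(..., drop_level=False) retains 'x', but as a full index level rather
# than the PR's lighter-weight scalar annotation.
kept = s.xs(1, level='x', drop_level=False)
print(list(kept.index.names))  # ['y', 'x']
```

This also illustrates why the PR wraps `pd.MultiIndex` in an adapter rather than using it directly: plain pandas offers only "drop the level" or "keep the full level", with no intermediate scalar state.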
Links from other tables:

- 4 rows from `pull_requests_id` in `labels_pull_requests`