html_url,issue_url,id,node_id,user,created_at,updated_at,author_association,body,reactions,performed_via_github_app,issue
https://github.com/pydata/xarray/issues/1560#issuecomment-412407141,https://api.github.com/repos/pydata/xarray/issues/1560,412407141,MDEyOklzc3VlQ29tbWVudDQxMjQwNzE0MQ==,1217238,2018-08-13T04:44:09Z,2018-08-13T04:44:09Z,MEMBER,"@maahn yes, that would look fine to me. Please add an ASV benchmark so we can monitor this for regressions:
https://github.com/pydata/xarray/tree/master/asv_bench/benchmarks
It would be nice to push this optimization up into `reindex_variables`, but it's not necessary (and I'm not even sure it could be done as efficiently as the equals check in `unstack`).","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,255989233
https://github.com/pydata/xarray/issues/1560#issuecomment-411815694,https://api.github.com/repos/pydata/xarray/issues/1560,411815694,MDEyOklzc3VlQ29tbWVudDQxMTgxNTY5NA==,222557,2018-08-09T16:21:41Z,2018-08-09T16:21:41Z,NONE,"What about a quick fix with `index.equals` like this (without the prints of course): https://github.com/maahn/xarray/commit/cf83991a161fbd89af2029a69cb50f1e09a5ed45. For the example above
arr = xr.DataArray(np.empty([1, 8996, 9223]))
arr = arr.stack(flat_dim=['dim_1', 'dim_2'])
%time arr.unstack('flat_dim')
the modified routine takes 5.75 s, compared with 6 min 40 s with xarray 0.10.7 and pandas 0.23.3. Not sure whether this is related to a newer version, but `index.equals(full_idx)` actually takes only 2e-4 s in that example. When slicing or reordering is applied to the MultiIndex
arr = xr.DataArray(np.arange(20).reshape((1, 10, 2))).stack(flat_dim=['dim_1', 'dim_2'])
arr.isel(flat_dim = [1,2]).unstack('flat_dim')
or
arr = xr.DataArray(np.arange(20).reshape((1, 10, 2))).stack(flat_dim=['dim_1', 'dim_2'])
arr[:,::-1].unstack('flat_dim')
it will fall back to the old method with `reindex`. ","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,255989233
https://github.com/pydata/xarray/issues/1560#issuecomment-327896138,https://api.github.com/repos/pydata/xarray/issues/1560,327896138,MDEyOklzc3VlQ29tbWVudDMyNzg5NjEzOA==,1217238,2017-09-07T19:12:50Z,2017-09-07T19:12:50Z,MEMBER,Though possibly we should just be using `Index.reindex` directly inside `reindex_variables` (in `xarray/core/alignment.py`) instead of calling `get_indexer`.,"{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,255989233
https://github.com/pydata/xarray/issues/1560#issuecomment-327895477,https://api.github.com/repos/pydata/xarray/issues/1560,327895477,MDEyOklzc3VlQ29tbWVudDMyNzg5NTQ3Nw==,1217238,2017-09-07T19:10:03Z,2017-09-07T19:10:03Z,MEMBER,"@davidh-ssec Yes, but we need it for `MultiIndex.get_indexer`, not `MultiIndex.reindex`:
https://github.com/pandas-dev/pandas/blob/ee6185e2fb9461632949f3ba52a28b37a1f7296e/pandas/core/indexes/multi.py#L1781","{""total_count"": 1, ""+1"": 1, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,255989233
https://github.com/pydata/xarray/issues/1560#issuecomment-327894887,https://api.github.com/repos/pydata/xarray/issues/1560,327894887,MDEyOklzc3VlQ29tbWVudDMyNzg5NDg4Nw==,1828519,2017-09-07T19:07:40Z,2017-09-07T19:07:40Z,CONTRIBUTOR,"@shoyer As for the equals shortcut, isn't that what this line is doing: https://github.com/pandas-dev/pandas/blob/master/pandas/core/indexes/multi.py#L1864","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,255989233
https://github.com/pydata/xarray/issues/1560#issuecomment-327891893,https://api.github.com/repos/pydata/xarray/issues/1560,327891893,MDEyOklzc3VlQ29tbWVudDMyNzg5MTg5Mw==,167802,2017-09-07T18:55:39Z,2017-09-07T18:55:39Z,CONTRIBUTOR,"Yes, I have the latest version, and it still takes some time with a 9000x9000 array:
```
In [4]: %time arr.unstack('flat_dim')
CPU times: user 26.1 s, sys: 7.8 s, total: 33.9 s
Wall time: 35.3 s
```","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,255989233
https://github.com/pydata/xarray/issues/1560#issuecomment-327890644,https://api.github.com/repos/pydata/xarray/issues/1560,327890644,MDEyOklzc3VlQ29tbWVudDMyNzg5MDY0NA==,1217238,2017-09-07T18:50:50Z,2017-09-07T18:50:50Z,MEMBER,"The MultiIndex speed/memory improvements already seem to be present in pandas 0.20.3, the latest release, so definitely make sure your pandas install is up to date.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,255989233
https://github.com/pydata/xarray/issues/1560#issuecomment-327886998,https://api.github.com/repos/pydata/xarray/issues/1560,327886998,MDEyOklzc3VlQ29tbWVudDMyNzg4Njk5OA==,1217238,2017-09-07T18:36:48Z,2017-09-07T18:36:48Z,MEMBER,"This is still somewhat annoyingly slow, but for an 8000 x 9000 MultiIndex on pandas 0.21-dev, I measure 41 seconds for `get_indexer()` vs 3.8 seconds for `equals()`.
So a fast-path might still be a good idea, but to get to truly interactive speeds, we might need a faster way to validate a MultiIndex as equal to the outer-product of its levels. Potentially we could save some metadata in `PandasIndexAdapter` as part of `stack()` to indicate that the levels are from an outer product:
https://github.com/pydata/xarray/blob/98a05f11c6f38489c82e86c9e9df796e7fb65fd2/xarray/core/indexing.py#L502-L505","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,255989233
https://github.com/pydata/xarray/issues/1560#issuecomment-327884467,https://api.github.com/repos/pydata/xarray/issues/1560,327884467,MDEyOklzc3VlQ29tbWVudDMyNzg4NDQ2Nw==,1217238,2017-09-07T18:27:27Z,2017-09-07T18:27:27Z,MEMBER,"Actually, the timings above were with pandas 0.19. It's still somewhat slow using the dev version of pandas, but it's more like 10x slower rather than 100x slower:
```
In [4]: idx1 = pd.MultiIndex.from_product([np.arange(1000), np.arange(1000)])
...: idx2 = pd.MultiIndex.from_product([np.arange(1000), np.arange(1000)])
...: %time idx1.get_indexer(idx2)
...:
CPU times: user 215 ms, sys: 81.8 ms, total: 297 ms
Wall time: 319 ms
Out[4]: array([ 0, 1, 2, ..., 999997, 999998, 999999])
In [5]: idx1 = pd.MultiIndex.from_product([np.arange(1000), np.arange(1000)])
...: idx2 = pd.MultiIndex.from_product([np.arange(1000), np.arange(1000)])
...: %time idx1.equals(idx2)
...:
CPU times: user 19.8 ms, sys: 9.29 ms, total: 29.1 ms
Wall time: 32.1 ms
Out[5]: True
```","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,255989233
https://github.com/pydata/xarray/issues/1560#issuecomment-327882144,https://api.github.com/repos/pydata/xarray/issues/1560,327882144,MDEyOklzc3VlQ29tbWVudDMyNzg4MjE0NA==,1217238,2017-09-07T18:19:03Z,2017-09-07T18:19:03Z,MEMBER,"Indeed, unstack does seem to be quite slow on large dimensions. For 1000x1000, I measure only 10ms to stack, but 4 seconds to unstack:
```
%time arr = DataArray(np.empty([1, 1000, 1000])).stack(flat_dim=['dim_1', 'dim_2'])
%time arr.unstack('flat_dim')
```
Profiling suggests the culprit is the `reindex` call in `unstack()`:
https://github.com/pydata/xarray/blob/98a05f11c6f38489c82e86c9e9df796e7fb65fd2/xarray/core/dataset.py#L1896
And, in turn, the call to `pandas.MultiIndex.get_indexer()`. To reproduce with pure pandas:
```
idx1 = pd.MultiIndex.from_product([np.arange(1000), np.arange(1000)])
idx2 = pd.MultiIndex.from_product([np.arange(1000), np.arange(1000)])
%time idx1.get_indexer(idx2)
# CPU times: user 4.1 s, sys: 128 ms, total: 4.23 s
# Wall time: 4.41 s
```
We do need this reindex for correctness, but we should have a separate fast-path of some sort (either here or in pandas) to speed this up when the two indexes are identical. For example, note:
```
idx1 = pd.MultiIndex.from_product([np.arange(1000), np.arange(1000)])
idx2 = pd.MultiIndex.from_product([np.arange(1000), np.arange(1000)])
%time idx1.equals(idx2)
# CPU times: user 19 ms, sys: 0 ns, total: 19 ms
# Wall time: 18.5 ms
```
I'll file an issue on the pandas tracker.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,255989233
https://github.com/pydata/xarray/issues/1560#issuecomment-327849071,https://api.github.com/repos/pydata/xarray/issues/1560,327849071,MDEyOklzc3VlQ29tbWVudDMyNzg0OTA3MQ==,1828519,2017-09-07T16:15:06Z,2017-09-07T16:15:06Z,CONTRIBUTOR,I was able to reproduce this on my mac by watching Activity Monitor and saw a peak of ~8GB of memory during the `unstack` call.,"{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,255989233