
issue_comments: 327882144


html_url issue_url id node_id user created_at updated_at author_association body reactions performed_via_github_app issue
https://github.com/pydata/xarray/issues/1560#issuecomment-327882144 https://api.github.com/repos/pydata/xarray/issues/1560 327882144 MDEyOklzc3VlQ29tbWVudDMyNzg4MjE0NA== 1217238 2017-09-07T18:19:03Z 2017-09-07T18:19:03Z MEMBER

Indeed, unstack does seem to be quite slow on large dimensions. For 1000x1000, I measure only 10ms to stack, but 4 seconds to unstack:

    %time arr = DataArray(np.empty([1, 1000, 1000])).stack(flat_dim=['dim_1', 'dim_2'])
    %time arr.unstack('flat_dim')

Profiling suggests the culprit is the reindex call in unstack(): https://github.com/pydata/xarray/blob/98a05f11c6f38489c82e86c9e9df796e7fb65fd2/xarray/core/dataset.py#L1896

And, in turn, the call to pandas.MultiIndex.get_indexer(). To reproduce with pure pandas:
```
idx1 = pd.MultiIndex.from_product([np.arange(1000), np.arange(1000)])
idx2 = pd.MultiIndex.from_product([np.arange(1000), np.arange(1000)])
%time idx1.get_indexer(idx2)
CPU times: user 4.1 s, sys: 128 ms, total: 4.23 s
Wall time: 4.41 s
```

We do need this reindex for correctness, but we should have a separate fast-path of some sort (either here or in pandas) to speed this up when the two indexes are identical. For example, note:
```
idx1 = pd.MultiIndex.from_product([np.arange(1000), np.arange(1000)])
idx2 = pd.MultiIndex.from_product([np.arange(1000), np.arange(1000)])
%time idx1.equals(idx2)
CPU times: user 19 ms, sys: 0 ns, total: 19 ms
Wall time: 18.5 ms
```
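
As a rough sketch of what such a fast path could look like: check `equals()` first and only fall back to `get_indexer()` when the indexes differ. The helper name `get_indexer_with_fast_path` is hypothetical (it is not part of pandas or xarray), and the shortcut assumes the index is unique, since `get_indexer()` itself requires that.

```python
import numpy as np
import pandas as pd


def get_indexer_with_fast_path(source, target):
    """Hypothetical helper, not an existing pandas or xarray function.

    Assumes `source` is unique, as get_indexer() itself requires.
    """
    # Cheap check: identical indexes map position i to position i.
    if source.equals(target):
        return np.arange(len(source), dtype=np.intp)
    # Otherwise pay for the full lookup.
    return source.get_indexer(target)


idx1 = pd.MultiIndex.from_product([np.arange(1000), np.arange(1000)])
idx2 = pd.MultiIndex.from_product([np.arange(1000), np.arange(1000)])
indexer = get_indexer_with_fast_path(idx1, idx2)
```

Whether a check like this belongs in xarray's unstack or inside pandas is an open question; the sketch only shows that the equality test can gate the expensive lookup.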

I'll file an issue on the pandas tracker.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  255989233