issue_comments

11 rows where issue = 255989233 sorted by updated_at descending

user 4

  • shoyer 7
  • djhoese 2
  • mraspaud 1
  • maahn 1

author_association 3

  • MEMBER 7
  • CONTRIBUTOR 3
  • NONE 1

issue 1

  • DataArray.unstack taking unreasonable amounts of memory · 11
id html_url issue_url node_id user created_at updated_at author_association body reactions performed_via_github_app issue
412407141 https://github.com/pydata/xarray/issues/1560#issuecomment-412407141 https://api.github.com/repos/pydata/xarray/issues/1560 MDEyOklzc3VlQ29tbWVudDQxMjQwNzE0MQ== shoyer 1217238 2018-08-13T04:44:09Z 2018-08-13T04:44:09Z MEMBER

@maahn yes, that would look fine to me. Please add an ASV benchmark so we can monitor this for regressions: https://github.com/pydata/xarray/tree/master/asv_bench/benchmarks

It would be nice to push this optimization up into reindex_variables, but it's not necessary (and I'm not even sure it could be done as efficiently as the equals check in unstack).

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  DataArray.unstack taking unreasonable amounts of memory 255989233
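The ASV benchmark requested above could be sketched roughly as follows. This is a minimal sketch in ASV's usual style (a class with `setup` and `time_*` methods); the class name, method names, and the small array sizes are illustrative, not the benchmark actually added to xarray's asv_bench.

```python
import numpy as np
import xarray as xr


class Unstacking:
    """Benchmark stack/unstack round-trips (sizes kept small here; a real
    benchmark would use something closer to the thread's 1000x1000 case)."""

    def setup(self):
        data = np.random.RandomState(0).randn(1, 250, 200)
        self.da = xr.DataArray(data).stack(flat_dim=["dim_1", "dim_2"])

    def time_unstack_fast(self):
        # The index is untouched after stack(), so the equals() fast path
        # discussed in this thread would apply.
        self.da.unstack("flat_dim")

    def time_unstack_slow(self):
        # Reversing the index defeats the fast path and falls back to reindex.
        self.da[:, ::-1].unstack("flat_dim")
```

ASV would time each `time_*` method repeatedly, so a regression in either path shows up separately.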
411815694 https://github.com/pydata/xarray/issues/1560#issuecomment-411815694 https://api.github.com/repos/pydata/xarray/issues/1560 MDEyOklzc3VlQ29tbWVudDQxMTgxNTY5NA== maahn 222557 2018-08-09T16:21:41Z 2018-08-09T16:21:41Z NONE

What about a quick fix with index.equals like this (without the prints, of course): https://github.com/maahn/xarray/commit/cf83991a161fbd89af2029a69cb50f1e09a5ed45. For the example above,

```
arr = xr.DataArray(np.empty([1, 8996, 9223]))
arr = arr.stack(flat_dim=['dim_1', 'dim_2'])
%time arr.unstack('flat_dim')
```

the modified routine takes 5.75 s, compared to 6 min 40 s with xr 0.10.7 and pd 0.23.3. Not sure whether this is related to a newer version, but index.equals(full_idx) actually takes only 2e-4 s in that example. When slicing or reordering is applied to the MultiIndex,

```
arr = xr.DataArray(np.arange(20).reshape((1, 10, 2))).stack(flat_dim=['dim_1', 'dim_2'])
arr.isel(flat_dim=[1, 2]).unstack('flat_dim')
```

or

```
arr = xr.DataArray(np.arange(20).reshape((1, 10, 2))).stack(flat_dim=['dim_1', 'dim_2'])
arr[:, ::-1].unstack('flat_dim')
```

it will fall back to the old method with reindex.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  DataArray.unstack taking unreasonable amounts of memory 255989233
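The fast path described in the comment above can be paraphrased in pure pandas. This is a sketch of the idea behind the linked commit, not its actual code; `unstack_indexer` is a hypothetical helper name.

```python
import pandas as pd


def unstack_indexer(index):
    """Return None when unstack could skip reindexing, else the indexer
    that maps the full outer-product index onto `index`."""
    full_idx = pd.MultiIndex.from_product(index.levels, names=index.names)
    if index.equals(full_idx):
        # Fast path: the stacked index is still the complete, ordered
        # product of its levels, so the data can simply be reshaped.
        return None
    # Slow path, equivalent to what reindex does: the position of each
    # full-index entry in `index` (-1 where missing, i.e. NaN after reindex).
    return index.get_indexer(full_idx)


idx = pd.MultiIndex.from_product([range(3), range(2)], names=["a", "b"])
assert unstack_indexer(idx) is None            # untouched stack: fast path
assert unstack_indexer(idx[::-1]) is not None  # reordered: falls back
```

The cheap `equals` check is what makes the 5.75 s vs 6 min 40 s difference reported above plausible: the common case (unstack right after stack) never pays for `get_indexer` at all.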
327896138 https://github.com/pydata/xarray/issues/1560#issuecomment-327896138 https://api.github.com/repos/pydata/xarray/issues/1560 MDEyOklzc3VlQ29tbWVudDMyNzg5NjEzOA== shoyer 1217238 2017-09-07T19:12:50Z 2017-09-07T19:12:50Z MEMBER

Though possibly we should just be using Index.reindex directly inside reindex_variables (in xarray/core/alignment.py) instead of calling get_indexer.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  DataArray.unstack taking unreasonable amounts of memory 255989233
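For reference, the two pandas calls contrasted in the comment above look like this (tiny indexes so it runs instantly; the thread's benchmarks use 1000x1000 products). Whether `reindex` short-circuits for equal indexes is an internal pandas detail, so the sketch avoids depending on it.

```python
import numpy as np
import pandas as pd

source = pd.MultiIndex.from_product([np.arange(3), np.arange(2)])
target = pd.MultiIndex.from_product([np.arange(3), np.arange(2)])

# What reindex_variables effectively does today: an elementwise lookup of
# every target entry, even when the two indexes are identical.
indexer = source.get_indexer(target)

# The suggested alternative: Index.reindex returns the conformed index plus
# an indexer, letting pandas apply its own shortcuts internally.
new_index, reindexer = source.reindex(target)
```

Routing through `Index.reindex` would let any future pandas-side fast path benefit xarray automatically.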
327895477 https://github.com/pydata/xarray/issues/1560#issuecomment-327895477 https://api.github.com/repos/pydata/xarray/issues/1560 MDEyOklzc3VlQ29tbWVudDMyNzg5NTQ3Nw== shoyer 1217238 2017-09-07T19:10:03Z 2017-09-07T19:10:03Z MEMBER

@davidh-ssec Yes, but we need it for MultiIndex.get_indexer, not MultiIndex.reindex: https://github.com/pandas-dev/pandas/blob/ee6185e2fb9461632949f3ba52a28b37a1f7296e/pandas/core/indexes/multi.py#L1781

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  DataArray.unstack taking unreasonable amounts of memory 255989233
327894887 https://github.com/pydata/xarray/issues/1560#issuecomment-327894887 https://api.github.com/repos/pydata/xarray/issues/1560 MDEyOklzc3VlQ29tbWVudDMyNzg5NDg4Nw== djhoese 1828519 2017-09-07T19:07:40Z 2017-09-07T19:07:40Z CONTRIBUTOR

@shoyer As for the equals shortcut, isn't that what this line is doing: https://github.com/pandas-dev/pandas/blob/master/pandas/core/indexes/multi.py#L1864

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  DataArray.unstack taking unreasonable amounts of memory 255989233
327891893 https://github.com/pydata/xarray/issues/1560#issuecomment-327891893 https://api.github.com/repos/pydata/xarray/issues/1560 MDEyOklzc3VlQ29tbWVudDMyNzg5MTg5Mw== mraspaud 167802 2017-09-07T18:55:39Z 2017-09-07T18:55:39Z CONTRIBUTOR

Yes, I have the latest version; it still takes some time with a 9000x9000 array:

```
In [4]: %time arr.unstack('flat_dim')
CPU times: user 26.1 s, sys: 7.8 s, total: 33.9 s
Wall time: 35.3 s
```

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  DataArray.unstack taking unreasonable amounts of memory 255989233
327890644 https://github.com/pydata/xarray/issues/1560#issuecomment-327890644 https://api.github.com/repos/pydata/xarray/issues/1560 MDEyOklzc3VlQ29tbWVudDMyNzg5MDY0NA== shoyer 1217238 2017-09-07T18:50:50Z 2017-09-07T18:50:50Z MEMBER

The MultiIndex speed/memory improvements seem to be around even in pandas 0.20.3, the latest release. So definitely make sure your pandas install is up to date here.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  DataArray.unstack taking unreasonable amounts of memory 255989233
327886998 https://github.com/pydata/xarray/issues/1560#issuecomment-327886998 https://api.github.com/repos/pydata/xarray/issues/1560 MDEyOklzc3VlQ29tbWVudDMyNzg4Njk5OA== shoyer 1217238 2017-09-07T18:36:48Z 2017-09-07T18:36:48Z MEMBER

This is still somewhat annoyingly slow, but for an 8000 x 9000 MultiIndex on pandas 0.21-dev, I measure 41 seconds for get_indexer() vs 3.8 seconds for equals().

So a fast-path might still be a good idea, but to get to truly interactive speeds, we might need a faster way to validate a MultiIndex as equal to the outer-product of its levels. Potentially we could save some metadata in PandasIndexAdapter as part of stack() to indicate that the levels are from an outer product: https://github.com/pydata/xarray/blob/98a05f11c6f38489c82e86c9e9df796e7fb65fd2/xarray/core/indexing.py#L502-L505

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  DataArray.unstack taking unreasonable amounts of memory 255989233
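The metadata idea in the comment above could look roughly like this. Everything here is hypothetical: `ProductIndex`, `is_outer_product`, and the helper names are illustrative; xarray's PandasIndexAdapter has no such attribute.

```python
import pandas as pd


class ProductIndex:
    """Sketch of an index wrapper that remembers its provenance."""

    def __init__(self, index, is_outer_product=False):
        self.index = index
        self.is_outer_product = is_outer_product


def stack_levels(levels):
    # stack() builds its index with MultiIndex.from_product, so it knows by
    # construction that the result is a complete outer product...
    return ProductIndex(pd.MultiIndex.from_product(levels), is_outer_product=True)


def needs_reindex(wrapped):
    # ...letting unstack() skip even the equals() check, as long as any
    # selection or reordering in between clears the flag.
    return not wrapped.is_outer_product
```

This would get unstack to truly interactive speeds in the common case, since validating an 8e7-row MultiIndex with equals() still costs seconds.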
327884467 https://github.com/pydata/xarray/issues/1560#issuecomment-327884467 https://api.github.com/repos/pydata/xarray/issues/1560 MDEyOklzc3VlQ29tbWVudDMyNzg4NDQ2Nw== shoyer 1217238 2017-09-07T18:27:27Z 2017-09-07T18:27:27Z MEMBER

Actually, the timings above were with pandas 0.19. It's still somewhat slow using the dev version of pandas, but it's more like 10x slower rather than 100x slower:

```
In [4]: idx1 = pd.MultiIndex.from_product([np.arange(1000), np.arange(1000)])
   ...: idx2 = pd.MultiIndex.from_product([np.arange(1000), np.arange(1000)])
   ...: %time idx1.get_indexer(idx2)
CPU times: user 215 ms, sys: 81.8 ms, total: 297 ms
Wall time: 319 ms
Out[4]: array([     0,      1,      2, ..., 999997, 999998, 999999])

In [5]: idx1 = pd.MultiIndex.from_product([np.arange(1000), np.arange(1000)])
   ...: idx2 = pd.MultiIndex.from_product([np.arange(1000), np.arange(1000)])
   ...: %time idx1.equals(idx2)
CPU times: user 19.8 ms, sys: 9.29 ms, total: 29.1 ms
Wall time: 32.1 ms
Out[5]: True
```

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  DataArray.unstack taking unreasonable amounts of memory 255989233
327882144 https://github.com/pydata/xarray/issues/1560#issuecomment-327882144 https://api.github.com/repos/pydata/xarray/issues/1560 MDEyOklzc3VlQ29tbWVudDMyNzg4MjE0NA== shoyer 1217238 2017-09-07T18:19:03Z 2017-09-07T18:19:03Z MEMBER

Indeed, unstack does seem to be quite slow on large dimensions. For 1000x1000, I measure only 10 ms to stack, but 4 seconds to unstack:

```
%time arr = DataArray(np.empty([1, 1000, 1000])).stack(flat_dim=['dim_1', 'dim_2'])
%time arr.unstack('flat_dim')
```

Profiling suggests the culprit is the reindex call in unstack(): https://github.com/pydata/xarray/blob/98a05f11c6f38489c82e86c9e9df796e7fb65fd2/xarray/core/dataset.py#L1896

And, in turn, the call to pandas.MultiIndex.get_indexer(). To reproduce with pure pandas:

```
idx1 = pd.MultiIndex.from_product([np.arange(1000), np.arange(1000)])
idx2 = pd.MultiIndex.from_product([np.arange(1000), np.arange(1000)])
%time idx1.get_indexer(idx2)
CPU times: user 4.1 s, sys: 128 ms, total: 4.23 s
Wall time: 4.41 s
```

We do need this reindex for correctness, but we should have a separate fast-path of some sort (either here or in pandas) to speed this up when the two indexes are identical. For example, note:

```
idx1 = pd.MultiIndex.from_product([np.arange(1000), np.arange(1000)])
idx2 = pd.MultiIndex.from_product([np.arange(1000), np.arange(1000)])
%time idx1.equals(idx2)
CPU times: user 19 ms, sys: 0 ns, total: 19 ms
Wall time: 18.5 ms
```

I'll file an issue on the pandas tracker.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  DataArray.unstack taking unreasonable amounts of memory 255989233
327849071 https://github.com/pydata/xarray/issues/1560#issuecomment-327849071 https://api.github.com/repos/pydata/xarray/issues/1560 MDEyOklzc3VlQ29tbWVudDMyNzg0OTA3MQ== djhoese 1828519 2017-09-07T16:15:06Z 2017-09-07T16:15:06Z CONTRIBUTOR

I was able to reproduce this on my mac by watching Activity Monitor and saw a peak of ~8GB of memory during the unstack call.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  DataArray.unstack taking unreasonable amounts of memory 255989233

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);