
issue_comments: 824459878


html_url: https://github.com/pydata/xarray/issues/5202#issuecomment-824459878
issue_url: https://api.github.com/repos/pydata/xarray/issues/5202
user: 1217238 (MEMBER)
created_at: 2021-04-22T00:57:56Z

> Do we have any ideas on how expensive the MultiIndex creation is as a share of stack?

It depends, but it can easily be 50% to nearly 100% of the runtime. stack() uses reshape() on data variables, which is either free (for arrays that are still contiguous and can use views) or can be delayed until compute-time (with dask). In contrast, the MultiIndex is always created eagerly.
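The view-vs-copy behavior of reshape() can be checked directly in NumPy (a minimal sketch; the array sizes are arbitrary and not from the original comment):

```python
import numpy as np

# C-contiguous array: flattening it is free, because NumPy can return a view.
c = np.ones((1000, 1000))
flat_c = c.reshape(-1)
print(flat_c.base is c)  # True -> a view, no data copied

# Fortran-ordered array: the same reshape must copy the data.
f = np.ones((1000, 1000), order='F')
flat_f = f.reshape(-1)
print(flat_f.base is f)  # False -> a fresh copy
```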

If we use Fortran-ordered arrays, we can get a rough lower bound on the time for MultiIndex creation, e.g., consider:

```python
import xarray
import numpy as np

a = xarray.DataArray(np.ones((5000, 5000), order='F'), dims=['x', 'y'])
%prun a.stack(z=['x', 'y'])
```

Not surprisingly, making the multi-index takes about half the runtime here.
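Outside IPython (no `%prun`), a rough self-contained comparison can be done with pandas and NumPy alone, timing the forced reshape copy against building the equivalent MultiIndex directly (sizes chosen arbitrarily, a sketch rather than a rigorous benchmark):

```python
import time
import numpy as np
import pandas as pd

n = 2000
data = np.ones((n, n), order='F')

# Fortran order means this flattening reshape must actually copy the data.
t0 = time.perf_counter()
flat = data.reshape(-1)
t_reshape = time.perf_counter() - t0

# Building the product MultiIndex that stack(z=['x', 'y']) would create.
t0 = time.perf_counter()
idx = pd.MultiIndex.from_product([np.arange(n), np.arange(n)], names=['x', 'y'])
t_index = time.perf_counter() - t0

print(f"reshape copy: {t_reshape:.4f}s  MultiIndex creation: {t_index:.4f}s")
```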

Pandas does delay creating the actual hash-table behind a MultiIndex until it's needed, so I guess the main expense here is just allocating the new coordinate arrays.
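The eager part is the levels and codes arrays that define the MultiIndex; the hash table only materializes once a lookup is performed. A small illustration (the exact lazy-engine behavior is a pandas internal, so this only shows the public surface):

```python
import numpy as np
import pandas as pd

# Creating the index allocates the levels (unique values per dimension)
# and the codes (integer positions into each level) up front.
idx = pd.MultiIndex.from_product([np.arange(3), ['a', 'b']], names=['x', 'y'])
print(idx.levels)
print(idx.codes)

# A positional lookup is what needs the hash table behind the index.
print(idx.get_loc((2, 'b')))  # -> 5, the last entry of the 3 x 2 product
```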
