
Issue #1255: PeriodIndex causes severe slow down
state: closed · user: 5635139 · author_association: MEMBER · created_at: 2017-02-08T16:01:16Z · closed_at: 2017-02-09T15:31:15Z · comments: 1

I need some guidance on how to handle this.

Background

PeriodIndex has a 'non-numpy' dtype now:

```python
In [2]: i = pd.PeriodIndex(start=2000, freq='A', periods=10)

In [3]: i
Out[3]:
PeriodIndex(['2000', '2001', '2002', '2003', '2004', '2005', '2006', '2007',
             '2008', '2009'],
            dtype='period[A-DEC]', freq='A-DEC')

In [6]: i.dtype
Out[6]: period[A-DEC]
```

When .values or .__array__() are called, the Periods are boxed, which is really slow. The underlying ints are stored in ._values:

```python
In [25]: i.values
Out[25]:
array([Period('2000', 'A-DEC'), Period('2001', 'A-DEC'), Period('2002', 'A-DEC'),
       Period('2003', 'A-DEC'), Period('2004', 'A-DEC'), Period('2005', 'A-DEC'),
       Period('2006', 'A-DEC'), Period('2007', 'A-DEC'), Period('2008', 'A-DEC'),
       Period('2009', 'A-DEC')], dtype=object)

In [27]: all(i.__array__() == i.values)
Out[27]: True
```

underlying:

```python
In [28]: i._values
Out[28]: array([30, 31, 32, 33, 34, 35, 36, 37, 38, 39])
```
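To make the boxing cost concrete, here is a small self-contained sketch. Note it uses pd.period_range and .asi8, since in current pandas the PeriodIndex(start=..., periods=...) constructor shown above no longer exists and .asi8 is the public counterpart of ._values:

```python
import numpy as np
import pandas as pd

# pd.period_range replaces the removed PeriodIndex(start=..., periods=...) form.
i = pd.period_range("2000-01", periods=10, freq="M")

# np.asarray() triggers __array__(), boxing every Period into a Python object.
boxed = np.asarray(i)

# .asi8 exposes the underlying int64 ordinals without any boxing.
ords = i.asi8

print(boxed.dtype)  # object
print(ords.dtype)   # int64
```

The object-dtype array is built by instantiating one Period per element, which is where the per-element cost comes from; the int64 view involves no per-element work.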

Problem

In pandas, we limit calling .values directly from outside Index, instead accessing Index functionality through a smaller API.

But in xarray, I think there are a fair few functions that call .values or implicitly call .__array__() by passing the index into numpy.

As a result, there is a severe slow down when using PeriodIndex. As an example:

```python
In [51]: indexes = [pd.PeriodIndex(start=str(1776 + i), freq='A', periods=300) for i in range(50)]

In [53]: das = [xr.DataArray(range(300), coords=[index]) for index in indexes]

In [54]: %timeit xr.concat(das)
1 loop, best of 3: 1.38 s per loop
```

vs DTI:

```python
In [55]: indexes_dt = [pd.DatetimeIndex(start=str(1776 + i), freq='A', periods=300) for i in range(50)]

In [56]: das_dt = [xr.DataArray(range(300), coords=[index]) for index in indexes_dt]

In [57]: %timeit xr.concat(das_dt)
10 loops, best of 3: 69.2 ms per loop
```

...a 20x slowdown, on fairly short indexes.

@shoyer do you have any ideas on how to resolve this? Is it feasible to not pass Indexes directly into numpy? I haven't gone through the code in enough depth to have a view, so I was hoping you could cut through the options. Thank you.
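For what it's worth, here is a hedged sketch (toy data, written against current pandas, not xarray's actual code path) of the difference between staying inside pandas and round-tripping through numpy: Index.append concatenates the underlying ordinals without boxing, while np.asarray boxes every Period first:

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the indexes in the benchmark above (pd.period_range with
# freq='M', since the start=/periods= constructor is gone in current pandas).
indexes = [pd.period_range(str(1776 + i), periods=300, freq="M") for i in range(5)]

# Fast path: stay inside pandas. Index.append concatenates the underlying
# int64 ordinals directly, with no per-element boxing.
combined = indexes[0].append(indexes[1:])

# Slow path (roughly what happens when an Index is passed into numpy):
# __array__() boxes every Period into a Python object first.
boxed = np.concatenate([np.asarray(idx) for idx in indexes])
rebuilt = pd.PeriodIndex(boxed)

assert combined.equals(rebuilt)  # same result, very different cost
```

The sketch only illustrates where the cost sits; whether xarray can route its concatenation through a pandas-level API like this is exactly the open question.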

ref https://github.com/pandas-dev/pandas/issues/14822 CC @sinhkrs @jreback

