
issues: 365973662


id: 365973662
node_id: MDU6SXNzdWUzNjU5NzM2NjI=
number: 2459
title: Stack + to_array before to_xarray is much faster that a simple to_xarray
user: 5635139
state: closed
locked: 0
assignee:
milestone:
comments: 13
created_at: 2018-10-02T16:13:26Z
updated_at: 2020-07-02T20:39:01Z
closed_at: 2020-07-02T20:39:01Z
author_association: MEMBER
active_lock_reason:
draft:
pull_request:

I was seeing slow performance from to_xarray() on MultiIndexed series, and found that unstacking one of the dimensions before running to_xarray(), and then restacking with to_array(), was ~30x faster. The difference persists at larger data sizes.

To reproduce:

Create a series with a MultiIndex, ensuring the MultiIndex isn't a simple product:

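```python
import numpy as np
import pandas as pd

s = pd.Series(
    np.random.rand(100000),
    index=pd.MultiIndex.from_product([
        list('abcdefhijk'),
        list('abcdefhijk'),
        # pd.date_range replaces the DatetimeIndex(start=...) constructor,
        # which newer pandas versions no longer accept
        pd.date_range(start='2000-01-01', periods=1000, freq='B'),
    ]))

# Keep every third row so the index is no longer a full product
cropped = s[::3]
cropped.index = pd.MultiIndex.from_tuples(cropped.index, names=list('xyz'))

cropped.head()

x  y  z
a  a  2000-01-03    0.993989
      2000-01-06    0.850518
      2000-01-11    0.068944
      2000-01-14    0.237197
      2000-01-19    0.784254
dtype: float64
```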
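As a quick sanity check that the cropped index really is sparse relative to the full product, the sketch below compares the number of rows with the size of the Cartesian product of the index levels. It rebuilds the series from scratch so it runs on its own; `full_size` and `fill_fraction` are names introduced here for illustration, and `.iloc[::3]` is used in place of `s[::3]` only to make the positional slice explicit.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product(
    [
        list("abcdefhijk"),
        list("abcdefhijk"),
        pd.date_range("2000-01-01", periods=1000, freq="B"),
    ],
    names=list("xyz"),
)
s = pd.Series(np.random.rand(len(index)), index=index)
cropped = s.iloc[::3]

# Drop level values that no longer occur, so level sizes reflect the data.
observed = cropped.index.remove_unused_levels()

# Size of the dense (x, y, z) grid that a full product index would have.
full_size = int(np.prod([len(level) for level in observed.levels]))

# Fraction of grid cells that actually hold a value; the rest become NaN.
fill_fraction = len(cropped) / full_size
print(f"{len(cropped)} rows of {full_size} grid cells ({fill_fraction:.1%})")
```

Roughly one cell in three is populated, so any conversion to a dense array has to fill the remaining cells with NaN.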

Two approaches for getting this into xarray:

1 - Simple .to_xarray():

```python

current_version = cropped.to_xarray()

<xarray.DataArray (x: 10, y: 10, z: 1000)>
array([[[0.993989,      nan, ...,      nan, 0.721663],
        [     nan,      nan, ..., 0.58224 ,      nan],
        ...,
        [     nan, 0.369382, ...,      nan,      nan],
        [0.98558 ,      nan, ...,      nan, 0.403732]],

       [[     nan,      nan, ..., 0.493711,      nan],
        [     nan, 0.126761, ...,      nan,      nan],
        ...,
        [0.976758,      nan, ...,      nan, 0.816612],
        [     nan,      nan, ..., 0.982128,      nan]],

       ...,

       [[     nan, 0.971525, ...,      nan,      nan],
        [0.146774,      nan, ...,      nan, 0.419806],
        ...,
        [     nan,      nan, ..., 0.700764,      nan],
        [     nan, 0.502058, ...,      nan,      nan]],

       [[0.246768,      nan, ...,      nan, 0.079266],
        [     nan,      nan, ..., 0.802297,      nan],
        ...,
        [     nan, 0.636698, ...,      nan,      nan],
        [0.025195,      nan, ...,      nan, 0.629305]]])
Coordinates:
  * x        (x) object 'a' 'b' 'c' 'd' 'e' 'f' 'h' 'i' 'j' 'k'
  * y        (y) object 'a' 'b' 'c' 'd' 'e' 'f' 'h' 'i' 'j' 'k'
  * z        (z) datetime64[ns] 2000-01-03 2000-01-04 ... 2003-10-30 2003-10-31
```

This takes 536 ms

2 - Unstack in pandas first, and then use to_array to do the equivalent of a restack:

```python
proposed_version = (
    cropped
    .unstack('y')
    .to_xarray()
    .to_array('y')
)
```

This takes 17.3 ms
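The two timings can be reproduced with a small harness along the following lines. This is a minimal sketch, not the method used to get the numbers above: `best_of`, `direct` and `via_unstack` are helper names introduced here, xarray must be installed for `Series.to_xarray()` to work, and the absolute times (and possibly the ratio) will vary with library versions and hardware.

```python
import time

import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product(
    [
        list("abcdefhijk"),
        list("abcdefhijk"),
        pd.date_range("2000-01-01", periods=1000, freq="B"),
    ],
    names=list("xyz"),
)
cropped = pd.Series(np.random.rand(len(index)), index=index).iloc[::3]


def best_of(fn, repeats=3):
    """Return the fastest wall-clock time in seconds over `repeats` calls."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)


def direct():
    # Approach 1: convert the MultiIndexed series in one step.
    return cropped.to_xarray()


def via_unstack():
    # Approach 2: move 'y' into columns first, then restack it in xarray.
    return cropped.unstack("y").to_xarray().to_array("y")


direct_s = best_of(direct)
via_unstack_s = best_of(via_unstack)
print(f"direct: {direct_s * 1e3:.1f} ms, unstack first: {via_unstack_s * 1e3:.1f} ms")
```

Taking the minimum over a few repeats reduces noise from one-off costs such as imports and cache warm-up.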

To confirm these are identical:

```
proposed_version_adj = (
    proposed_version
    .assign_coords(y=proposed_version['y'].astype(object))
    .transpose(*current_version.dims)
)

proposed_version_adj.equals(current_version)

True
```

Problem description

A default operation is much slower than a (potentially) equivalent operation that's not the default.

I need to look more into what's causing the slowdown. I think it has to do with the .reindex(full_idx), but I'm unclear why the alternative route is so much faster, and whether there's a fix that would make the default path fast.
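For readers unfamiliar with the step being suspected, here is a rough pandas-only sketch of what reindexing onto a full product index looks like. It assumes `full_idx` means the Cartesian product of the index levels, which the issue does not spell out, and it is not taken from xarray's source; `levels`, `dense` and `values` are illustrative names.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product(
    [
        list("abcdefhijk"),
        list("abcdefhijk"),
        pd.date_range("2000-01-01", periods=1000, freq="B"),
    ],
    names=list("xyz"),
)
cropped = pd.Series(np.random.rand(len(index)), index=index).iloc[::3]

# Assumption: full_idx is the product of the level values present in the data.
levels = cropped.index.remove_unused_levels().levels
full_idx = pd.MultiIndex.from_product(levels, names=cropped.index.names)

# Reindexing inserts NaN for every (x, y, z) combination missing from `cropped`.
dense = cropped.reindex(full_idx)

# With a full product index, the values reshape into one axis per level.
shape = tuple(len(level) for level in levels)
values = dense.to_numpy().reshape(shape)
print(values.shape, int(dense.notna().sum()))
```

The reindex has to look up every one of the product's labels in the sparse three-level index, which is one plausible place for the cost to come from; the issue itself leaves the cause open.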

Output of xr.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.14.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.93-linuxkit-aufs
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.utf8
LOCALE: None.None

xarray: 0.10.9
pandas: 0.23.4
numpy: 1.15.2
scipy: 1.1.0
netCDF4: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: None
PseudonetCDF: None
rasterio: None
iris: None
bottleneck: 1.2.1
cyordereddict: None
dask: None
distributed: None
matplotlib: 2.2.3
cartopy: 0.16.0
seaborn: 0.9.0
setuptools: 40.4.3
pip: 18.0
conda: None
pytest: 3.8.1
IPython: 5.8.0
sphinx: None
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2459/reactions",
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
state_reason: completed
repo: 13221727
type: issue
