Issue #7376: groupby+map performance regression on MultiIndex dataset

  • id: 1495605827
  • node_id: I_kwDOAMm_X85ZJSJD
  • user: 1419010
  • state: closed
  • locked: 0
  • comments: 11
  • created_at: 2022-12-14T03:56:06Z
  • updated_at: 2023-08-22T14:47:13Z
  • closed_at: 2023-08-22T14:47:13Z
  • author_association: NONE

What happened?

We upgraded to xarray 2022.12.0 and noticed a significant performance regression (orders of magnitude) in code that uses groupby+map. The regression appears to have been present since the 2022.6.0 release, which I understand included a number of changes, including to the groupby code paths (see the release notes).

What did you expect to happen?

The groupby+map should run about as fast as it did in xarray 2022.3.0, i.e. the performance regression should be fixed.

Minimal Complete Verifiable Example

```Python
import contextlib
import os
import time
from collections.abc import Iterator

import numpy as np
import pandas as pd
import xarray as xr


@contextlib.contextmanager
def log_time(label: str) -> Iterator[None]:
    """Logs execution time of the context block"""
    t_0 = time.time()
    yield
    print(f"{label} took {time.time() - t_0} seconds")


def main() -> None:
    m = 100_000
    with log_time("creating df"):
        # 4 groups in "i1", each with m rows; ("i1", "i2") becomes a MultiIndex
        df = pd.DataFrame(
            {
                "i1": [1] * m + [2] * m + [3] * m + [4] * m,
                "i2": list(range(m)) * 4,
                "d3": np.random.randint(0, 2, 4 * m).astype(bool),
            }
        )

        ds = df.to_xarray().set_coords(["i1", "i2"]).set_index(index=["i1", "i2"])

    with log_time("groupby"):

        # Identity function: all measured time is groupby + map overhead
        def per_grp(da: xr.DataArray) -> xr.DataArray:
            return da

        (ds.assign(x=lambda ds: ds["d3"].groupby("i1").map(per_grp)))


if __name__ == "__main__":
    main()
```
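To see whether the slowdown is a fixed overhead or grows with the group size, it may help to time the same workload at several values of `m`. The following is a minimal sketch using only the standard library. The helpers `time_call` and `measure_scaling` and the `placeholder_workload` function are hypothetical names introduced here for illustration; they are not part of xarray or of the original report. The placeholder would need to be swapped for a function that builds the dataset for a given `m` and runs the groupby+map.

```python
import time
from collections.abc import Callable, Iterable


def time_call(fn: Callable[[int], object], size: int, repeats: int = 3) -> float:
    """Return the best wall-clock time in seconds of fn(size) over `repeats` runs."""
    best = float("inf")
    for _ in range(repeats):
        t_0 = time.perf_counter()  # monotonic clock, better suited to timing than time.time()
        fn(size)
        best = min(best, time.perf_counter() - t_0)
    return best


def measure_scaling(
    fn: Callable[[int], object], sizes: Iterable[int], repeats: int = 3
) -> dict[int, float]:
    """Map each problem size to its best observed run time."""
    return {size: time_call(fn, size, repeats) for size in sizes}


def placeholder_workload(m: int) -> int:
    # Hypothetical stand-in: replace with code that builds the dataset
    # for this m and runs the groupby + map from the example above.
    return sum(range(m))


if __name__ == "__main__":
    for size, seconds in measure_scaling(placeholder_workload, [1_000, 10_000, 100_000]).items():
        print(f"m={size}: {seconds:.6f} seconds")
```

Taking the best of a few repeats reduces noise from other processes; comparing how the times grow as `m` increases tenfold gives a rough idea of the scaling behaviour.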

MVCE confirmation

  • [x] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • [x] Complete example — the example is self-contained, including all data and the text of any traceback.
  • [X] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • [x] New issue — a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

```Python
xarray current main (2022.12.1.dev7+g021c73e1); affects all versions since 2022.6.0 (inclusive):

creating df took 0.10657930374145508 seconds
groupby took 129.5521149635315 seconds


xarray 2022.3.0:

creating df took 0.09968900680541992 seconds
groupby took 0.19161295890808105 seconds
```
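To put a number on "orders of magnitude", the two groupby timings reported above can be compared directly. This is only arithmetic on the figures in the log; the constant and function names are made up for this sketch. The ratio works out to roughly 676, a little under three orders of magnitude.

```python
import math

# Timings in seconds, copied from the log output in this report.
GROUPBY_SECONDS_MAIN = 129.5521149635315        # xarray 2022.12.1.dev7+g021c73e1
GROUPBY_SECONDS_2022_3_0 = 0.19161295890808105  # xarray 2022.3.0


def slowdown(new_seconds: float, old_seconds: float) -> float:
    """Return how many times slower the new timing is than the old one."""
    return new_seconds / old_seconds


ratio = slowdown(GROUPBY_SECONDS_MAIN, GROUPBY_SECONDS_2022_3_0)
orders = math.log10(ratio)  # base-10 orders of magnitude
print(f"slowdown: {ratio:.0f}x ({orders:.1f} orders of magnitude)")
```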

Anything else we need to know?

No response

Environment

Environment of the version installed from source (2022.12.1.dev7+g021c73e1):

INSTALLED VERSIONS
------------------
commit: None
python: 3.10.8 | packaged by conda-forge | (main, Nov 22 2022, 08:25:29) [Clang 14.0.6 ]
python-bits: 64
OS: Darwin
OS-release: 22.1.0
machine: arm64
processor: arm
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: None
libnetcdf: None
xarray: 2022.12.1.dev7+g021c73e1
pandas: 1.5.2
numpy: 1.23.5
scipy: None
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: None
distributed: None
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
fsspec: None
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 65.5.1
pip: 22.3.1
conda: None
pytest: None
mypy: None
IPython: None
sphinx: None
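When comparing timings across environments, it can be handy to record just the packages most relevant to this report next to each measurement. The sketch below uses the standard library's `importlib.metadata`. The function name `installed_versions` and the choice of packages are assumptions made for illustration; the full report above has the format produced by `xr.show_versions()`.

```python
from importlib import metadata
from typing import Optional

# Packages picked for this sketch; adjust as needed.
PACKAGES = ("xarray", "pandas", "numpy", "flox", "numpy_groupies")


def installed_versions(packages: tuple = PACKAGES) -> dict:
    """Map each distribution name to its installed version, or None if it is absent."""
    versions: dict = {}
    for name in packages:
        version: Optional[str]
        try:
            version = metadata.version(name)
        except metadata.PackageNotFoundError:
            version = None  # distribution not installed in this environment
        versions[name] = version
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version}")
```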
  • state_reason: completed
  • repo: 13221727
  • type: issue
