github: issues: 2 rows where repo = 13221727 and user = 1419010 sorted by updated

2 rows where repo = 13221727 and user = 1419010 sorted by updated_at descending

id	node_id	number	title	user	state	locked	assignee	milestone	comments	created_at	updated_at ▲	closed_at	author_association	active_lock_reason	draft	pull_request	body	reactions	performed_via_github_app	state_reason	repo	type
1495605827	I_kwDOAMm_X85ZJSJD	7376	groupby+map performance regression on MultiIndex dataset	ravwojdyla 1419010	closed	0			11	2022-12-14T03:56:06Z	2023-08-22T14:47:13Z	2023-08-22T14:47:13Z	NONE				What happened?

We upgraded to version 2022.12.0 and noticed a significant performance regression (orders of magnitude) in code that involves a groupby+map. The regression seems to have been present since the 2022.6.0 release, which I understand included a number of changes, including to the groupby code paths (release notes).

What did you expect to happen?

Fix the performance regression.

Minimal Complete Verifiable Example

```Python
import contextlib
import os
import time
from collections.abc import Iterator

import numpy as np
import pandas as pd
import xarray as xr


@contextlib.contextmanager
def log_time(label: str) -> Iterator[None]:
    """Logs execution time of the context block"""
    t_0 = time.time()
    yield
    print(f"{label} took {time.time() - t_0} seconds")


def main() -> None:
    m = 100_000
    with log_time("creating df"):
        df = pd.DataFrame(
            {
                "i1": [1] * m + [2] * m + [3] * m + [4] * m,
                "i2": list(range(m)) * 4,
                "d3": np.random.randint(0, 2, 4 * m).astype(bool),
            }
        )

    # Build a dataset whose "index" dimension is a MultiIndex over (i1, i2).
    ds = df.to_xarray().set_coords(["i1", "i2"]).set_index(index=["i1", "i2"])

    with log_time("groupby"):
        # Identity function: the slowdown comes from groupby+map itself.
        def per_grp(da: xr.DataArray) -> xr.DataArray:
            return da

        (ds.assign(x=lambda ds: ds["d3"].groupby("i1").map(per_grp)))


if __name__ == "__main__":
    main()
```

MVCE confirmation

- [x] Minimal example: the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
- [x] Complete example: the example is self-contained, including all data and the text of any traceback.
- [x] Verifiable example: the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
- [x] New issue: a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

```Python
# xarray current main `2022.12.1.dev7+g021c73e1`, but affects all versions since 2022.6.0 (inclusive).
creating df took 0.10657930374145508 seconds
groupby took 129.5521149635315 seconds

# xarray 2022.3.0:
creating df took 0.09968900680541992 seconds
groupby took 0.19161295890808105 seconds
```

Anything else we need to know?

No response

Environment

Environment of the version installed from source (`2022.12.1.dev7+g021c73e1`):

INSTALLED VERSIONS
------------------
commit: None
python: 3.10.8 | packaged by conda-forge | (main, Nov 22 2022, 08:25:29) [Clang 14.0.6 ]
python-bits: 64
OS: Darwin
OS-release: 22.1.0
machine: arm64
processor: arm
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: None
libnetcdf: None
xarray: 2022.12.1.dev7+g021c73e1
pandas: 1.5.2
numpy: 1.23.5
scipy: None
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: None
distributed: None
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
fsspec: None
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 65.5.1
pip: 22.3.1
conda: None
pytest: None
mypy: None
IPython: None
sphinx: None	{ "url": "https://api.github.com/repos/pydata/xarray/issues/7376/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }		completed	xarray 13221727	issue
753374426	MDU6SXNzdWU3NTMzNzQ0MjY=	4623	Allow chunk spec per variable	ravwojdyla 1419010	open	0			3	2020-11-30T10:56:39Z	2020-12-19T17:17:23Z		NONE				Say I have a zarr dataset with multiple variables `Foo`, `Bar` and `Baz` (and potentially many more) and 2 dimensions, `x` and `y` (potentially more). Say both `Foo` and `Bar` are large 2D arrays with dims `x, y`, and `Baz` is a relatively small 1D array with dim `y`. I would like to read that dataset with xarray and increase the chunk size from the native zarr chunk size for `x` and `y`, but only for `Foo` and `Bar`; I would like to keep native chunking for `Baz`. As far as I understand, I would currently do that with the `chunks` parameter to `open_dataset`/`open_zarr`, but if I pass, say, `dict(x=N, y=M)`, that will change chunking for all variables that use those dimensions, which isn't what I need: I need the chunking changed only for `Foo` and `Bar`. Is there a way to do that? Should that be part of the "harmonisation"? One could imagine that xarray could accept a dict of dicts akin to `{var: {dim: chunk_spec}}` to specify chunking for specific variables. Note that `rechunk` after reading is not what I want; I would like to specify chunking at the read op. Originally posted by @ravwojdyla in https://github.com/pydata/xarray/issues/4496#issuecomment-732486436	{ "url": "https://api.github.com/repos/pydata/xarray/issues/4623/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }			xarray 13221727	issue
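
To make the `{var: {dim: chunk_spec}}` idea from issue 4623 concrete, here is a minimal sketch in plain Python. It is not xarray's API: `resolve_chunks` and its arguments are hypothetical names introduced only for illustration, and the chunk sizes are made-up values. It expands a per-variable spec into one chunk dict per variable, leaving variables that are not mentioned with an empty dict, which here stands for "keep native chunking".

```python
from collections.abc import Mapping


def resolve_chunks(
    var_dims: Mapping[str, tuple[str, ...]],
    per_var: Mapping[str, Mapping[str, int]],
) -> dict[str, dict[str, int]]:
    """Hypothetical helper (not part of xarray): expand a
    {var: {dim: chunk}} spec into one chunk dict per variable.

    An empty dict for a variable means "keep its native chunking".
    """
    resolved: dict[str, dict[str, int]] = {}
    for name, dims in var_dims.items():
        spec = per_var.get(name, {})
        # Reject chunk sizes given for dimensions the variable does not have.
        unknown = set(spec) - set(dims)
        if unknown:
            raise ValueError(f"{name!r} has no dimension(s) {sorted(unknown)}")
        resolved[name] = {dim: spec[dim] for dim in dims if dim in spec}
    return resolved


# Variables and dimensions from the issue text; chunk sizes are illustrative.
var_dims = {"Foo": ("x", "y"), "Bar": ("x", "y"), "Baz": ("y",)}
per_var = {"Foo": {"x": 1000, "y": 500}, "Bar": {"x": 1000, "y": 500}}

chunks = resolve_chunks(var_dims, per_var)
print(chunks)
```

With a single `dict(x=N, y=M)`, `Baz` would pick up the new `y` chunking as well; in this sketch it stays untouched because the spec is keyed by variable first.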

CREATE TABLE [issues] (
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [number] INTEGER,
   [title] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [state] TEXT,
   [locked] INTEGER,
   [assignee] INTEGER REFERENCES [users]([id]),
   [milestone] INTEGER REFERENCES [milestones]([id]),
   [comments] INTEGER,
   [created_at] TEXT,
   [updated_at] TEXT,
   [closed_at] TEXT,
   [author_association] TEXT,
   [active_lock_reason] TEXT,
   [draft] INTEGER,
   [pull_request] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [state_reason] TEXT,
   [repo] INTEGER REFERENCES [repos]([id]),
   [type] TEXT
);
CREATE INDEX [idx_issues_repo]
    ON [issues] ([repo]);
CREATE INDEX [idx_issues_milestone]
    ON [issues] ([milestone]);
CREATE INDEX [idx_issues_assignee]
    ON [issues] ([assignee]);
CREATE INDEX [idx_issues_user]
    ON [issues] ([user]);
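
The filter and sort shown at the top of this page can be reproduced against this schema with Python's standard `sqlite3` module. The sketch below is an illustration under assumptions: it uses an in-memory database, keeps only a subset of the columns, and drops the foreign-key clauses because the `users`, `milestones` and `repos` tables are not shown here. The two rows are copied from the table above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Trimmed copy of the [issues] schema: foreign keys and most columns omitted.
conn.execute(
    """
    CREATE TABLE [issues] (
       [id] INTEGER PRIMARY KEY,
       [number] INTEGER,
       [title] TEXT,
       [user] INTEGER,
       [state] TEXT,
       [updated_at] TEXT,
       [repo] INTEGER
    )
    """
)
conn.execute("CREATE INDEX [idx_issues_repo] ON [issues] ([repo])")
conn.execute("CREATE INDEX [idx_issues_user] ON [issues] ([user])")

conn.executemany(
    "INSERT INTO [issues] VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        (753374426, 4623, "Allow chunk spec per variable",
         1419010, "open", "2020-12-19T17:17:23Z", 13221727),
        (1495605827, 7376,
         "groupby+map performance regression on MultiIndex dataset",
         1419010, "closed", "2023-08-22T14:47:13Z", 13221727),
    ],
)

# ISO 8601 timestamps in a single time zone sort correctly as plain text,
# so ORDER BY on the TEXT column gives chronological order.
result = conn.execute(
    """
    SELECT [number], [title], [updated_at]
    FROM [issues]
    WHERE [repo] = ? AND [user] = ?
    ORDER BY [updated_at] DESC
    """,
    (13221727, 1419010),
).fetchall()

for number, title, updated_at in result:
    print(number, updated_at, title)
```

Passing the repo and user ids as bound parameters rather than formatting them into the SQL string keeps the query safe to reuse with other values.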