github: issues: 1 row where "created_at" is on date 2021-09-20 and user = 2448579 sorted by updated_at descending

id: 1001197796
node_id: I_kwDOAMm_X847rRDk
number: 5804
title: vectorized groupby binary ops
user: dcherian 2448579
state: closed
locked: 0
assignee:
milestone:
comments: 1
created_at: 2021-09-20T17:04:47Z
updated_at: 2022-03-29T07:11:28Z
closed_at: 2022-03-29T07:11:28Z
author_association: MEMBER
active_lock_reason:
draft:
pull_request:
body:

By switching to `numpy_groupies` we are vectorizing our groupby reductions. I think we can do the same for groupby's binary ops.

Here's an example array

``` python
import numpy as np
import xarray as xr

%load_ext memory_profiler

N = 4 * 2000
da = xr.DataArray(
    np.random.random((N, N)),
    dims=("x", "y"),
    coords={"labels": ("x", np.repeat(["a", "b", "c", "d", "e", "f", "g", "h"], repeats=N//8))},
)
```

Consider this "anomaly" calculation, anomaly defined relative to the group mean

``` python
def anom_current(da):
    grouped = da.groupby("labels")
    mean = grouped.mean()
    anom = grouped - mean
    return anom
```

With this approach, we loop over each group and apply the binary operation: https://github.com/pydata/xarray/blob/a1635d324753588e353e4e747f6058936fa8cf1e/xarray/core/computation.py#L502-L525

This saves some memory, but becomes slow for large number of groups.

We could instead do

``` python
def anom_vectorized(da):
    mean = da.groupby("labels").mean()
    mean_expanded = mean.sel(labels=da.labels)
    anom = da - mean_expanded
    return anom
```

Now we are faster, but construct an extra array as big as the original array (I think this is an OK tradeoff).

```
%timeit anom_current(da)
1.4 s ± 20.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit anom_vectorized(da)
937 ms ± 5.26 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

(I haven't experimented with dask yet, so the following is just a theory).

I think the real benefit comes with dask. Depending on where the groups are located relative to chunking, we could end up creating a lot of tiny chunks by splitting up existing chunks. With the vectorized approach we can do better.

Ideally we would reindex the "mean" dask array with a numpy-array-of-repeated-ints such that the chunking of `mean_expanded` exactly matches the chunking of `da` along the grouped dimension. ~In practice, dask.array.take doesn't allow specifying "output chunks" so we'd end up chunking "mean_expanded" based on dask's automatic heuristics, and then rechunking again for the binary operation.~

Thoughts?

cc @rabernat

reactions: { "url": "https://api.github.com/repos/pydata/xarray/issues/5804/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }
performed_via_github_app:
state_reason: completed
repo: xarray 13221727
type: issue
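
The "reindex the mean with an array of repeated ints" idea from the issue body can be sketched in plain NumPy. This is only an illustration of the indexing trick, not xarray's implementation: the function name `anom_vectorized_np` and the small test array are made up for this sketch, and no dask chunking is involved.

```python
import numpy as np


def anom_vectorized_np(data, labels):
    """Subtract the per-group mean from each row of `data`.

    Hypothetical NumPy-only helper: `labels` is a 1-D array with one
    group label per row (axis 0) of `data`.
    """
    # codes[i] is the integer index of the group that row i belongs to
    uniques, codes = np.unique(labels, return_inverse=True)

    # Accumulate per-group sums with an unbuffered scatter-add
    sums = np.zeros((uniques.size,) + data.shape[1:], dtype=float)
    np.add.at(sums, codes, data)

    # Divide by group sizes to get the per-group means
    counts = np.bincount(codes, minlength=uniques.size)
    means = sums / counts.reshape((-1,) + (1,) * (data.ndim - 1))

    # Integer indexing expands the means back to the shape of `data`;
    # this is the extra full-size array mentioned in the issue
    mean_expanded = means[codes]
    return data - mean_expanded


# Small made-up example: 8 rows split into 4 groups of 2
rng = np.random.default_rng(0)
data = rng.random((8, 3))
labels = np.repeat(["a", "b", "c", "d"], 2)
anom = anom_vectorized_np(data, labels)
```

As in the issue, the speed comes at the cost of materializing `mean_expanded`, which is as large as the input.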

CREATE TABLE [issues] (
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [number] INTEGER,
   [title] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [state] TEXT,
   [locked] INTEGER,
   [assignee] INTEGER REFERENCES [users]([id]),
   [milestone] INTEGER REFERENCES [milestones]([id]),
   [comments] INTEGER,
   [created_at] TEXT,
   [updated_at] TEXT,
   [closed_at] TEXT,
   [author_association] TEXT,
   [active_lock_reason] TEXT,
   [draft] INTEGER,
   [pull_request] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [state_reason] TEXT,
   [repo] INTEGER REFERENCES [repos]([id]),
   [type] TEXT
);
CREATE INDEX [idx_issues_repo]
    ON [issues] ([repo]);
CREATE INDEX [idx_issues_milestone]
    ON [issues] ([milestone]);
CREATE INDEX [idx_issues_assignee]
    ON [issues] ([assignee]);
CREATE INDEX [idx_issues_user]
    ON [issues] ([user]);
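
The filter in the page title can be reproduced against this schema with Python's built-in `sqlite3` module. The sketch below makes several assumptions: it uses an in-memory table with only a subset of the columns, the helper name `issues_on_date` is hypothetical, the second inserted row is made-up filler, and the `date(created_at) = ?` clause is one plausible way to express the "is on date" filter rather than the exact SQL the page runs.

```python
import sqlite3


def issues_on_date(conn, day, user_id):
    """Return issues created on `day` by `user_id`, newest update first."""
    sql = (
        "SELECT id, number, title, updated_at FROM issues "
        "WHERE date(created_at) = ? AND user = ? "
        "ORDER BY updated_at DESC"
    )
    return conn.execute(sql, (day, user_id)).fetchall()


conn = sqlite3.connect(":memory:")
# Reduced version of the [issues] table: only the columns the query touches
conn.execute(
    "CREATE TABLE issues ("
    "id INTEGER PRIMARY KEY, number INTEGER, title TEXT, "
    "user INTEGER, created_at TEXT, updated_at TEXT)"
)
conn.execute("CREATE INDEX idx_issues_user ON issues (user)")
conn.executemany(
    "INSERT INTO issues VALUES (?, ?, ?, ?, ?, ?)",
    [
        # The row shown on this page
        (1001197796, 5804, "vectorized groupby binary ops", 2448579,
         "2021-09-20T17:04:47Z", "2022-03-29T07:11:28Z"),
        # Made-up row on a different date, so the filter has something to exclude
        (1, 1, "placeholder issue", 2448579,
         "2021-09-21T00:00:00Z", "2021-09-22T00:00:00Z"),
    ],
)

rows = issues_on_date(conn, "2021-09-20", 2448579)
print(rows)
```

The timestamps are stored as ISO 8601 text, so `ORDER BY updated_at DESC` sorts them chronologically without any conversion.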
