issues: 117039129
id: 117039129
node_id: MDU6SXNzdWUxMTcwMzkxMjk=
number: 659
title: groupby very slow compared to pandas
user: 1322974
state: closed
locked: 0
comments: 9
created_at: 2015-11-16T02:43:57Z
updated_at: 2022-05-15T02:38:30Z
closed_at: 2022-05-15T02:38:30Z
author_association: CONTRIBUTOR
state_reason: completed
repo: 13221727
type: issue
reactions: { "url": "https://api.github.com/repos/pydata/xarray/issues/659/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }

body:

```
import timeit

import numpy as np
from pandas import DataFrame
from xray import Dataset, DataArray

df = DataFrame({"a": np.r_[np.arange(500.), np.arange(500.)],
                "b": np.arange(1000.)})
print(timeit.repeat('df.groupby("a").agg("mean")', globals={"df": df}, number=10))
print(timeit.repeat('df.groupby("a").agg(np.mean)', globals={"df": df, "np": np}, number=10))

ds = Dataset({"a": DataArray(np.r_[np.arange(500.), np.arange(500.)]),
              "b": DataArray(np.arange(1000.))})
print(timeit.repeat('ds.groupby("a").mean()', globals={"ds": ds}, number=10))
```

This outputs

i.e. xray's groupby is ~100 times slower than pandas' (and 200 times slower than passing …). (This is the actual order of magnitude of the data size and redundancy I want to handle, i.e. thousands of points with very limited duplication.)
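Since the old `xray` package name no longer imports, the pandas half of the benchmark above can be reproduced on its own. This is a sketch under the assumption that pandas and NumPy are installed; the data shape matches the issue (1000 rows, 500 distinct keys, each key repeated twice), and the comparison of `agg("mean")` versus `agg(np.mean)` is the same one the reporter timed. Reproducing the xarray side would need the modern `xarray` package in place of `xray`.

```python
import numpy as np
import pandas as pd

# Same data as the issue's benchmark: 1000 rows, each key repeated twice.
df = pd.DataFrame({
    "a": np.r_[np.arange(500.0), np.arange(500.0)],
    "b": np.arange(1000.0),
})

# The string spelling dispatches to pandas' cythonized group-mean kernel.
fast = df.groupby("a").agg("mean")

# Passing the ufunc historically took a slower, more generic path;
# the results are identical either way.
slow = df.groupby("a").agg(np.mean)

assert fast.equals(slow)
assert fast.loc[0.0, "b"] == 250.0  # mean of rows 0 and 500
```

Timing the two calls with `timeit` (as in the issue's script) shows the gap the reporter describes between the string and callable aggregation paths.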