issue_comments


4 rows where issue = 1037894157 sorted by updated_at descending




issue 1

  • Slow performance of `DataArray.unstack()` from checking `variable.data` · 4

953379569 · TomAugspurger · MEMBER · 2021-10-27T23:19:49Z
https://github.com/pydata/xarray/issues/5902#issuecomment-953379569

Thanks @dcherian, that seems to fix this performance problem. I'll see if the tests pass and will submit a PR.

I came across #5582 while searching, thanks :)

953351408 · dcherian · MEMBER · 2021-10-27T22:16:17Z (updated 2021-10-27T22:18:33Z)
https://github.com/pydata/xarray/issues/5902#issuecomment-953351408

(warning: untested code)

Instead of looking at all of `self.variables` we could

```python
nonindexes = set(self.variables) - set(self.indexes)
```

or alternatively make a list of multi-index variable names and exclude those.

Then the condition becomes

```python
any(is_duck_dask_array(self.variables[v].data) for v in nonindexes)
```
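The point of the filtering above can be exercised with a small standalone sketch. Everything in it (the `LazyVariable` stand-in, the string-based `is_duck_dask_array`) is hypothetical scaffolding invented for illustration, not xarray's real code; it only demonstrates that excluding index names keeps `.data` on index variables from ever being touched:

```python
class LazyVariable:
    """Stand-in for an xarray Variable whose .data is expensive to access."""

    def __init__(self, name, is_dask=False):
        self.name = name
        self.accessed = False
        self._is_dask = is_dask

    @property
    def data(self):
        self.accessed = True  # record that .data was materialized
        return "dask-array" if self._is_dask else "np-array"


def is_duck_dask_array(x):
    # Toy substitute for the real dask check.
    return x == "dask-array"


variables = {"x": LazyVariable("x"), "temp": LazyVariable("temp", is_dask=True)}
indexes = {"x": object()}  # "x" is an index coordinate

# Only non-index variables are inspected, so the expensive index data
# is never loaded just to answer this question:
nonindexes = set(variables) - set(indexes)
needs_full_reindex = any(is_duck_dask_array(variables[v].data) for v in nonindexes)
```

After running this, `variables["x"].accessed` stays `False`: the check decided the unstack path without ever materializing the index variable's data, which is exactly the cost being avoided.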

953352129 · dcherian · MEMBER · 2021-10-27T22:17:39Z
https://github.com/pydata/xarray/issues/5902#issuecomment-953352129

PS: It doesn't seem like the bottleneck in your case but #5582 has an alternative proposal for unstacking dask arrays.

953344052 · TomAugspurger · MEMBER · 2021-10-27T22:02:58Z (updated 2021-10-27T22:03:35Z)
https://github.com/pydata/xarray/issues/5902#issuecomment-953344052

Oh, hmm... I'm noticing now that `IndexVariable` (currently) eagerly loads its data into memory, so that check will always be false for the problematic `IndexVariable`.

So perhaps a slight adjustment to `is_duck_dask_array` to handle `xarray.Variable`?

```diff
diff --git a/xarray/core/dataset.py b/xarray/core/dataset.py
index 550c3587..16637574 100644
--- a/xarray/core/dataset.py
+++ b/xarray/core/dataset.py
@@ -4159,14 +4159,14 @@ class Dataset(DataWithCoords, DatasetArithmetic, Mapping):
             # Dask arrays don't support assignment by index, which the fast unstack
             # function requires.
             # https://github.com/pydata/xarray/pull/4746#issuecomment-753282125
-            any(is_duck_dask_array(v.data) for v in self.variables.values())
+            any(is_duck_dask_array(v) for v in self.variables.values())
             # Sparse doesn't currently support (though we could special-case
             # it)
             # https://github.com/pydata/sparse/issues/422
-            or any(
-                isinstance(v.data, sparse_array_type)
-                for v in self.variables.values()
-            )
+            # or any(
+            #     isinstance(v.data, sparse_array_type)
+            #     for v in self.variables.values()
+            # )
             or sparse
             # Until https://github.com/pydata/xarray/pull/4751 is resolved,
             # we check explicitly whether it's a numpy array. Once that is
@@ -4177,9 +4177,9 @@ class Dataset(DataWithCoords, DatasetArithmetic, Mapping):
             # # or any(
             # #     isinstance(v.data, pint_array_type) for v in self.variables.values()
             # # )
-            or any(
-                not isinstance(v.data, np.ndarray) for v in self.variables.values()
-            )
+            # or any(
+            #     not isinstance(v.data, np.ndarray) for v in self.variables.values()
+            # )
         ):
             result = result._unstack_full_reindex(dim, fill_value, sparse)
         else:
diff --git a/xarray/core/pycompat.py b/xarray/core/pycompat.py
index d1649235..e9669105 100644
--- a/xarray/core/pycompat.py
+++ b/xarray/core/pycompat.py
@@ -44,6 +44,12 @@ class DuckArrayModule:
 
 
 def is_duck_dask_array(x):
+    from xarray.core.variable import IndexVariable, Variable
+
+    if isinstance(x, IndexVariable):
+        return False
+    elif isinstance(x, Variable):
+        x = x.data
+
     if DuckArrayModule("dask").available:
         from dask.base import is_dask_collection
```

That's completely ignoring the accesses to `v.data` for the sparse and pint checks, which don't look quite as easy to solve.
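The dispatch proposed in the `pycompat.py` hunk can be condensed into a runnable sketch. The `Variable`, `IndexVariable`, and `FakeDaskArray` classes below are stand-ins invented for illustration (the real check delegates to `dask.base.is_dask_collection`); the point is only the dispatch order: short-circuit on `IndexVariable`, unwrap a plain `Variable` to its underlying array, then test that array:

```python
class Variable:
    """Stand-in for xarray.Variable: wraps an underlying array as .data."""

    def __init__(self, data):
        self.data = data


class IndexVariable(Variable):
    """Stand-in for xarray.IndexVariable: data is always loaded eagerly."""


class FakeDaskArray:
    """Stand-in for a dask array."""


def is_duck_dask_array(x):
    # Index variables eagerly load their data, so they can never hold a
    # dask array; returning early avoids touching .data at all.
    if isinstance(x, IndexVariable):
        return False
    # For a plain Variable, unwrap to the underlying array first.
    elif isinstance(x, Variable):
        x = x.data
    # The real implementation asks dask.base.is_dask_collection here;
    # this sketch just tests against the stand-in type.
    return isinstance(x, FakeDaskArray)
```

With this ordering, `IndexVariable` wins even when its stand-in data happens to be dask-like, which mirrors the early `return False` in the diff.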


CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);
Powered by Datasette · Queries took 13.498ms · About: xarray-datasette