
issue_comments


6 rows where issue = 759709924 (Fancy indexing a Dataset with dask DataArray triggers multiple computes) and user = 2448579 (dcherian, MEMBER), sorted by updated_at descending

id: 802953719 · user: dcherian (2448579) · author_association: MEMBER · created_at: 2021-03-19T16:23:32Z · updated_at: 2021-03-19T16:23:32Z · reactions: none
html_url: https://github.com/pydata/xarray/issues/4663#issuecomment-802953719 · issue_url: https://api.github.com/repos/pydata/xarray/issues/4663 · node_id: MDEyOklzc3VlQ29tbWVudDgwMjk1MzcxOQ==

> currently calling the dask array method compute_chunk_sizes() is inefficient for n-d arrays. Raised here: dask/dask#7416

ouch. thanks for raising that issue.

> I'm currently using this hacked implementation of a compress() style function that operates on an xarray dataset, which includes more efficient computation of chunk sizes.

I think we'd be open to adding a .compress for Dataset, DataArray, & Variable without the chunks hack :)
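For illustration only, a minimal sketch of what a compress()-style helper could look like at the Dataset level; the helper name compress_along and its use of isel are assumptions for this sketch, not the API proposed above:

```python
import dask.array as da
import xarray as xr

def compress_along(ds, mask, dim):
    """Hypothetical compress()-style helper: keep the entries along `dim`
    where the boolean `mask` is True, for every variable in `ds`."""
    if isinstance(mask, xr.DataArray):
        mask = mask.data  # hand xarray the underlying (dask) array, not the DataArray wrapper
    return ds.isel({dim: mask})

ds = xr.Dataset(
    dict(
        a=("x", da.random.random(100, chunks=10)),
        b=(("x", "y"), da.random.random((100, 10), chunks=(10, 10))),
    )
)
subset = compress_along(ds, ds.a > 0.5, "x")
```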

id: 802058819 · user: dcherian (2448579) · author_association: MEMBER · created_at: 2021-03-18T16:15:19Z · updated_at: 2021-03-18T16:15:19Z · reactions: none
html_url: https://github.com/pydata/xarray/issues/4663#issuecomment-802058819 · issue_url: https://api.github.com/repos/pydata/xarray/issues/4663 · node_id: MDEyOklzc3VlQ29tbWVudDgwMjA1ODgxOQ==

I would start by trying to fix

``` python
import dask.array as da
import numpy as np
import xarray as xr

from xarray.tests import raise_if_dask_computes

with raise_if_dask_computes(max_computes=0):
    ds = xr.Dataset(
        dict(
            a=("x", da.from_array(np.random.randint(0, 100, 100))),
            b=(("x", "y"), da.random.random((100, 10))),
        )
    )
    ds.b.sel(x=ds.a.data)
```

specifically this np.asarray call:

```
~/work/python/xarray/xarray/core/dataset.py in _validate_indexers(self, indexers, missing_dims)
   1960                     yield k, np.empty((0,), dtype="int64")
   1961                 else:
-> 1962                     v = np.asarray(v)
   1963 
   1964                     if v.dtype.kind in "US":
```

Then the next issue is the multiple computes that happen when we pass a DataArray instead of a dask array (ds.b.sel(x=ds.a)). In general, I guess we can only support this for unindexed dimensions, since we wouldn't know what index labels to use without computing the indexer array. This seems to be the case in that colab notebook.
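To make the proposed starting point concrete, a rough sketch (not the actual xarray code) of the kind of guard that could skip the eager np.asarray coercion when the indexer is already a dask collection:

```python
import dask
import numpy as np

def coerce_indexer(v):
    """Hypothetical stand-in for the coercion step in _validate_indexers:
    leave dask-backed indexers lazy instead of materializing them."""
    if dask.is_dask_collection(v):
        return v  # keep the lazy array rather than triggering a compute
    return np.asarray(v)
```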

id: 802050621 · user: dcherian (2448579) · author_association: MEMBER · created_at: 2021-03-18T16:05:21Z · updated_at: 2021-03-18T16:05:36Z · reactions: none
html_url: https://github.com/pydata/xarray/issues/4663#issuecomment-802050621 · issue_url: https://api.github.com/repos/pydata/xarray/issues/4663 · node_id: MDEyOklzc3VlQ29tbWVudDgwMjA1MDYyMQ==

From @alimanfoo in #5054


> I have a dataset comprising several variables. All variables are dask arrays (e.g., backed by zarr). I would like to use one of these variables, which is a 1d boolean array, to index the other variables along a large single dimension. The boolean indexing array is ~40 million items long, with ~20 million true values.
>
> If I do this all via dask (i.e., not using xarray), then I can index one dask array with another dask array via fancy indexing. The indexing array is not loaded into memory or computed. If I need to know the shape and chunks of the resulting arrays I can call compute_chunk_sizes(), but still very little memory is required.
>
> If I do this via xarray.Dataset.isel(), then a substantial amount of memory (several GB) is allocated during isel() and retained. This is problematic, as in a real-world use case there are many arrays to be indexed and memory runs out on standard systems.
>
> There is a follow-on issue: if I then want to run a computation over one of the indexed arrays, and the indexing was done via xarray, that leads to a further blow-up of multiple GB of memory usage when using a dask distributed cluster.
>
> I think the underlying issue here is that the indexing array is loaded into memory, and then gets copied multiple times when the dask graph is constructed. If using a distributed scheduler, further copies get made during scheduling of any subsequent computation.
>
> I made a notebook which illustrates the increased memory usage during Dataset.isel() here: colab.research.google.com/drive/1bn7Sj0An7TehwltWizU8j_l2OvPeoJyo?usp=sharing
>
> This is possibly the same underlying issue (and use case) as raised by @eric-czech in #4663, so feel free to close this if you think it's a duplicate.
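To make the pure-dask workflow described in the quote concrete, a minimal sketch (array size and chunking here are illustrative, not taken from the notebook):

```python
import dask.array as da

# Lazy boolean indexer and lazy data, mirroring the use case above
mask = da.random.random(40_000_000, chunks=1_000_000) < 0.5
values = da.random.random(40_000_000, chunks=1_000_000)

selected = values[mask]          # still lazy; the mask is not loaded into memory
selected.compute_chunk_sizes()   # evaluates the mask to learn the resulting chunk sizes
```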

id: 740958441 · user: dcherian (2448579) · author_association: MEMBER · created_at: 2020-12-08T20:09:27Z · updated_at: 2020-12-08T20:09:27Z · reactions: none
html_url: https://github.com/pydata/xarray/issues/4663#issuecomment-740958441 · issue_url: https://api.github.com/repos/pydata/xarray/issues/4663 · node_id: MDEyOklzc3VlQ29tbWVudDc0MDk1ODQ0MQ==

I think the solution is to handle this case (dask-backed DataArray) in _isel_fancy in dataset.py.

id: 740941441 · user: dcherian (2448579) · author_association: MEMBER · created_at: 2020-12-08T19:58:51Z · updated_at: 2020-12-08T20:02:55Z · reactions: none
html_url: https://github.com/pydata/xarray/issues/4663#issuecomment-740941441 · issue_url: https://api.github.com/repos/pydata/xarray/issues/4663 · node_id: MDEyOklzc3VlQ29tbWVudDc0MDk0MTQ0MQ==

Thanks for the great example!

This looks like a duplicate of https://github.com/pydata/xarray/issues/2801. If you agree, can we move the conversation there?

I like using our raise_if_dask_computes context since it points out where the compute is happening

``` python
import dask.array as da
import numpy as np
import xarray as xr

from xarray.tests import raise_if_dask_computes

# Use a custom array type to know when data is being evaluated
class Array():

    def __init__(self, x):
        self.shape = (x.shape[0],)
        self.ndim = x.ndim
        self.dtype = 'bool'
        self.x = x

    def __getitem__(self, idx):
        if idx[0].stop > 0:
            print('Evaluating')
        return (self.x > .5).__getitem__(idx)

with raise_if_dask_computes(max_computes=1):
    ds = xr.Dataset(
        dict(
            a=('x', da.from_array(Array(np.random.rand(100)))),
            b=(('x', 'y'), da.random.random((100, 10))),
            c=(('x', 'y'), da.random.random((100, 10))),
            d=(('x', 'y'), da.random.random((100, 10))),
        )
    )
    ds.sel(x=ds.a)
```

```python
RuntimeError                              Traceback (most recent call last)
<ipython-input-76-8efd3a1c3fe5> in <module>
     26         d=(('x', 'y'), da.random.random((100, 10))),
     27     ))
---> 28 ds.sel(x=ds.a)

/project/mrgoodbar/dcherian/python/xarray/xarray/core/dataset.py in sel(self, indexers, method, tolerance, drop, **indexers_kwargs)
   2211             self, indexers=indexers, method=method, tolerance=tolerance
   2212         )
-> 2213         result = self.isel(indexers=pos_indexers, drop=drop)
   2214         return result._overwrite_indexes(new_indexes)
   2215 

/project/mrgoodbar/dcherian/python/xarray/xarray/core/dataset.py in isel(self, indexers, drop, missing_dims, **indexers_kwargs)
   2058         indexers = either_dict_or_kwargs(indexers, indexers_kwargs, "isel")
   2059         if any(is_fancy_indexer(idx) for idx in indexers.values()):
-> 2060             return self._isel_fancy(indexers, drop=drop, missing_dims=missing_dims)
   2061 
   2062         # Much faster algorithm for when all indexers are ints, slices, one-dimensional

/project/mrgoodbar/dcherian/python/xarray/xarray/core/dataset.py in _isel_fancy(self, indexers, drop, missing_dims)
   2122                 indexes[name] = new_index
   2123             elif var_indexers:
-> 2124                 new_var = var.isel(indexers=var_indexers)
   2125             else:
   2126                 new_var = var.copy(deep=False)

/project/mrgoodbar/dcherian/python/xarray/xarray/core/variable.py in isel(self, indexers, missing_dims, **indexers_kwargs)
   1118 
   1119         key = tuple(indexers.get(dim, slice(None)) for dim in self.dims)
-> 1120         return self[key]
   1121 
   1122     def squeeze(self, dim=None):

/project/mrgoodbar/dcherian/python/xarray/xarray/core/variable.py in __getitem__(self, key)
    766         array `x.values` directly.
    767         """
--> 768         dims, indexer, new_order = self._broadcast_indexes(key)
    769         data = as_indexable(self._data)[indexer]
    770         if new_order:

/project/mrgoodbar/dcherian/python/xarray/xarray/core/variable.py in _broadcast_indexes(self, key)
    625                 dims.append(d)
    626         if len(set(dims)) == len(dims):
--> 627             return self._broadcast_indexes_outer(key)
    628 
    629         return self._broadcast_indexes_vectorized(key)

/project/mrgoodbar/dcherian/python/xarray/xarray/core/variable.py in _broadcast_indexes_outer(self, key)
    680                 k = k.data
    681             if not isinstance(k, BASIC_INDEXING_TYPES):
--> 682                 k = np.asarray(k)
    683                 if k.size == 0:
    684                     # Slice by empty list; numpy could not infer the dtype

~/miniconda3/envs/dcpy_old_dask/lib/python3.7/site-packages/numpy/core/_asarray.py in asarray(a, dtype, order)
     83 
     84     """
---> 85     return array(a, dtype, copy=False, order=order)
     86 
     87 

~/miniconda3/envs/dcpy_old_dask/lib/python3.7/site-packages/dask/array/core.py in __array__(self, dtype, **kwargs)
   1374 
   1375     def __array__(self, dtype=None, **kwargs):
-> 1376         x = self.compute()
   1377         if dtype and x.dtype != dtype:
   1378             x = x.astype(dtype)

~/miniconda3/envs/dcpy_old_dask/lib/python3.7/site-packages/dask/base.py in compute(self, **kwargs)
    165         dask.base.compute
    166         """
--> 167         (result,) = compute(self, traverse=False, **kwargs)
    168         return result
    169 

~/miniconda3/envs/dcpy_old_dask/lib/python3.7/site-packages/dask/base.py in compute(*args, **kwargs)
    450         postcomputes.append(x.__dask_postcompute__())
    451 
--> 452     results = schedule(dsk, keys, **kwargs)
    453     return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
    454 

/project/mrgoodbar/dcherian/python/xarray/xarray/tests/__init__.py in __call__(self, dsk, keys, **kwargs)
    112             raise RuntimeError(
    113                 "Too many computes. Total: %d > max: %d."
--> 114                 % (self.total_computes, self.max_computes)
    115             )
    116         return dask.get(dsk, keys, **kwargs)

RuntimeError: Too many computes. Total: 2 > max: 1.
```

So here it looks like we don't support indexing by dask arrays, so as we loop through the dataset the indexer array gets computed each time. As you say, this should be avoided.

id: 740945474 · user: dcherian (2448579) · author_association: MEMBER · created_at: 2020-12-08T20:01:38Z · updated_at: 2020-12-08T20:01:38Z · reactions: none
html_url: https://github.com/pydata/xarray/issues/4663#issuecomment-740945474 · issue_url: https://api.github.com/repos/pydata/xarray/issues/4663 · node_id: MDEyOklzc3VlQ29tbWVudDc0MDk0NTQ3NA==

I commented too soon.

ds.sel(x=ds.a.data)

only computes once (!), so possibly there's something else going on.
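For reference, a small sketch of the two calling styles being compared; the comments simply restate what is reported in this thread, and the behavior may differ across xarray versions:

```python
import dask.array as da
import numpy as np
import xarray as xr

ds = xr.Dataset(
    dict(
        a=("x", da.from_array(np.random.rand(100) > 0.5, chunks=10)),
        b=(("x", "y"), da.random.random((100, 10))),
    )
)

ds.sel(x=ds.a)       # DataArray indexer: triggered multiple computes in this thread
ds.sel(x=ds.a.data)  # raw dask-array indexer: reported here to compute only once
```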


CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);