issue_comments
6 rows where issue = 759709924 and user = 2448579 sorted by updated_at descending
id | html_url | issue_url | node_id | user | created_at | updated_at ▲ | author_association | body | reactions | performed_via_github_app | issue |
---|---|---|---|---|---|---|---|---|---|---|---|
802953719 | https://github.com/pydata/xarray/issues/4663#issuecomment-802953719 | https://api.github.com/repos/pydata/xarray/issues/4663 | MDEyOklzc3VlQ29tbWVudDgwMjk1MzcxOQ== | dcherian 2448579 | 2021-03-19T16:23:32Z | 2021-03-19T16:23:32Z | MEMBER |
ouch. thanks for raising that issue. I think we'd be open to adding a |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
Fancy indexing a Dataset with dask DataArray triggers multiple computes 759709924 | |
802058819 | https://github.com/pydata/xarray/issues/4663#issuecomment-802058819 | https://api.github.com/repos/pydata/xarray/issues/4663 | MDEyOklzc3VlQ29tbWVudDgwMjA1ODgxOQ== | dcherian 2448579 | 2021-03-18T16:15:19Z | 2021-03-18T16:15:19Z | MEMBER |
I would start by trying to fix

``` python
import dask.array as da
import numpy as np
import xarray as xr
from xarray.tests import raise_if_dask_computes

with raise_if_dask_computes(max_computes=0):
    ds = xr.Dataset(
        dict(
            a=("x", da.from_array(np.random.randint(0, 100, 100))),
            b=(("x", "y"), da.random.random((100, 10))),
        )
    )
    ds.b.sel(x=ds.a.data)
```

specifically this. Then the next issue is the multiple computes that happen when we pass a |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
Fancy indexing a Dataset with dask DataArray triggers multiple computes 759709924 | |
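For contrast, a minimal pure-dask sketch of the same pattern as the snippet in the comment above (array names and sizes here are illustrative, not from the issue): fancy-indexing one dask array with another only builds a graph, without evaluating any chunk.

```python
import dask.array as da
import numpy as np

# An integer dask indexer and a 2d dask array to index, both lazy
idx = da.from_array(np.random.randint(0, 100, 100), chunks=50)
b = da.random.random((100, 10), chunks=(50, 10))

# Indexing along axis 0 with a dask array stays lazy: no chunk of
# idx or b is computed while the graph is constructed
out = b[idx]
```

This is the behavior the comment wants `ds.b.sel(x=ds.a.data)` to have: graph construction without any eager compute.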
802050621 | https://github.com/pydata/xarray/issues/4663#issuecomment-802050621 | https://api.github.com/repos/pydata/xarray/issues/4663 | MDEyOklzc3VlQ29tbWVudDgwMjA1MDYyMQ== | dcherian 2448579 | 2021-03-18T16:05:21Z | 2021-03-18T16:05:36Z | MEMBER |
From @alimanfoo in #5054:

I have a dataset comprising several variables. All variables are dask arrays (e.g., backed by zarr). I would like to use one of these variables, which is a 1d boolean array, to index the other variables along a large single dimension. The boolean indexing array is ~40 million items long, with ~20 million true values.

If I do this all via dask (i.e., not using xarray) then I can index one dask array with another dask array via fancy indexing. The indexing array is not loaded into memory or computed. If I need to know the shape and chunks of the resulting arrays I can call compute_chunk_sizes(), but still very little memory is required.

If I do this via xarray.Dataset.isel() then a substantial amount of memory (several GB) is allocated during isel() and retained. This is problematic because, in a real-world use case, there are many arrays to be indexed and memory runs out on standard systems.

There is a follow-on issue: if I then want to run a computation over one of the indexed arrays, and the indexing was done via xarray, that leads to a further blow-up of multiple GB of memory usage when using a dask distributed cluster.

I think the underlying issue here is that the indexing array is loaded into memory, and then gets copied multiple times when the dask graph is constructed. If using a distributed scheduler, further copies get made during scheduling of any subsequent computation.

I made a notebook which illustrates the increased memory usage during Dataset.isel() here: colab.research.google.com/drive/1bn7Sj0An7TehwltWizU8j_l2OvPeoJyo?usp=sharing

This is possibly the same underlying issue (and use case) as raised by @eric-czech in #4663, so feel free to close this if you think it's a duplicate. |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
Fancy indexing a Dataset with dask DataArray triggers multiple computes 759709924 | |
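A scaled-down sketch of the dask-only workflow described in the comment above (sizes and names are illustrative): a lazy boolean mask indexes another dask array without being loaded, and compute_chunk_sizes() is only needed once concrete shapes are required.

```python
import dask.array as da

# A dask array and a lazy 1d boolean mask derived from it
x = da.random.random(1000, chunks=100)
mask = x > 0.5            # still lazy; never materialized as a whole

# Boolean fancy indexing stays lazy; the resulting chunk sizes along
# axis 0 are unknown (nan) because the mask has not been evaluated
y = x[mask]

# Only when shapes/chunks are actually needed does dask evaluate the
# mask to resolve the unknown chunk sizes
y = y.compute_chunk_sizes()
```

This mirrors the memory profile the comment describes: very little memory until chunk sizes are explicitly requested.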
740958441 | https://github.com/pydata/xarray/issues/4663#issuecomment-740958441 | https://api.github.com/repos/pydata/xarray/issues/4663 | MDEyOklzc3VlQ29tbWVudDc0MDk1ODQ0MQ== | dcherian 2448579 | 2020-12-08T20:09:27Z | 2020-12-08T20:09:27Z | MEMBER | I think the solution is to handle this case (dask-backed DataArray) in |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
Fancy indexing a Dataset with dask DataArray triggers multiple computes 759709924 | |
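One possible shape for handling that case, as a sketch only (`normalize_indexer` is a hypothetical helper, not xarray's actual code): detect dask-backed indexers before they are coerced with `np.asarray`, which would trigger a compute.

```python
import dask.array as da
import numpy as np

def normalize_indexer(k):
    """Hypothetical helper: keep dask-backed indexers lazy instead of
    coercing them with np.asarray (which computes the graph)."""
    if isinstance(k, da.Array):
        return k  # leave lazy; downstream indexing must accept dask arrays
    return np.asarray(k)

lazy_idx = da.from_array(np.array([0, 2, 4]), chunks=2)
eager_idx = normalize_indexer([0, 1])
```

The design choice here is simply to short-circuit before the eager coercion path; the real fix also has to teach the indexing machinery to consume the lazy indexer.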
740941441 | https://github.com/pydata/xarray/issues/4663#issuecomment-740941441 | https://api.github.com/repos/pydata/xarray/issues/4663 | MDEyOklzc3VlQ29tbWVudDc0MDk0MTQ0MQ== | dcherian 2448579 | 2020-12-08T19:58:51Z | 2020-12-08T20:02:55Z | MEMBER |
Thanks for the great example! This looks like a duplicate of https://github.com/pydata/xarray/issues/2801. If you agree, can we move the conversation there?

I like using our `raise_if_dask_computes` context manager:

``` python
import dask.array as da
import numpy as np
from xarray.tests import raise_if_dask_computes

# Use a custom array type to know when data is being evaluated
class Array():
    ...

with raise_if_dask_computes(max_computes=1):
    ds = xr.Dataset(dict(
        a=('x', da.from_array(Array(np.random.rand(100)))),
        b=(('x', 'y'), da.random.random((100, 10))),
        c=(('x', 'y'), da.random.random((100, 10))),
        d=(('x', 'y'), da.random.random((100, 10))),
    ))
    ds.sel(x=ds.a)
```

```python
RuntimeError                              Traceback (most recent call last)
<ipython-input-76-8efd3a1c3fe5> in <module>
     26     d=(('x', 'y'), da.random.random((100, 10))),
     27 ))
---> 28 ds.sel(x=ds.a)

/project/mrgoodbar/dcherian/python/xarray/xarray/core/dataset.py in sel(self, indexers, method, tolerance, drop, **indexers_kwargs)
   2211             self, indexers=indexers, method=method, tolerance=tolerance
   2212         )
-> 2213         result = self.isel(indexers=pos_indexers, drop=drop)
   2214         return result._overwrite_indexes(new_indexes)
   2215

/project/mrgoodbar/dcherian/python/xarray/xarray/core/dataset.py in isel(self, indexers, drop, missing_dims, **indexers_kwargs)
   2058         indexers = either_dict_or_kwargs(indexers, indexers_kwargs, "isel")
   2059         if any(is_fancy_indexer(idx) for idx in indexers.values()):
-> 2060             return self._isel_fancy(indexers, drop=drop, missing_dims=missing_dims)
   2061
   2062         # Much faster algorithm for when all indexers are ints, slices, one-dimensional

/project/mrgoodbar/dcherian/python/xarray/xarray/core/dataset.py in _isel_fancy(self, indexers, drop, missing_dims)
   2122                 indexes[name] = new_index
   2123             elif var_indexers:
-> 2124                 new_var = var.isel(indexers=var_indexers)
   2125             else:
   2126                 new_var = var.copy(deep=False)

/project/mrgoodbar/dcherian/python/xarray/xarray/core/variable.py in isel(self, indexers, missing_dims, **indexers_kwargs)
   1118
   1119         key = tuple(indexers.get(dim, slice(None)) for dim in self.dims)
-> 1120         return self[key]
   1121
   1122     def squeeze(self, dim=None):

/project/mrgoodbar/dcherian/python/xarray/xarray/core/variable.py in __getitem__(self, key)
    766         array

/project/mrgoodbar/dcherian/python/xarray/xarray/core/variable.py in _broadcast_indexes(self, key)
    625                 dims.append(d)
    626         if len(set(dims)) == len(dims):
--> 627             return self._broadcast_indexes_outer(key)
    628
    629         return self._broadcast_indexes_vectorized(key)

/project/mrgoodbar/dcherian/python/xarray/xarray/core/variable.py in _broadcast_indexes_outer(self, key)
    680                 k = k.data
    681             if not isinstance(k, BASIC_INDEXING_TYPES):
--> 682                 k = np.asarray(k)
    683                 if k.size == 0:
    684                     # Slice by empty list; numpy could not infer the dtype

~/miniconda3/envs/dcpy_old_dask/lib/python3.7/site-packages/numpy/core/_asarray.py in asarray(a, dtype, order)
     83
     84     """
---> 85     return array(a, dtype, copy=False, order=order)
     86
     87

~/miniconda3/envs/dcpy_old_dask/lib/python3.7/site-packages/dask/array/core.py in __array__(self, dtype, **kwargs)
   1374
   1375     def __array__(self, dtype=None, **kwargs):
-> 1376         x = self.compute()
   1377         if dtype and x.dtype != dtype:
   1378             x = x.astype(dtype)

~/miniconda3/envs/dcpy_old_dask/lib/python3.7/site-packages/dask/base.py in compute(self, **kwargs)
    165     dask.base.compute
    166     """
--> 167     (result,) = compute(self, traverse=False, **kwargs)
    168     return result
    169

~/miniconda3/envs/dcpy_old_dask/lib/python3.7/site-packages/dask/base.py in compute(*args, **kwargs)
    450         postcomputes.append(x.__dask_postcompute__())
    451
--> 452     results = schedule(dsk, keys, **kwargs)
    453     return repack([f(r, a) for r, (f, a) in zip(results, postcomputes)])
    454

/project/mrgoodbar/dcherian/python/xarray/xarray/tests/__init__.py in __call__(self, dsk, keys, **kwargs)
    112             raise RuntimeError(
    113                 "Too many computes. Total: %d > max: %d."
--> 114                 % (self.total_computes, self.max_computes)
    115             )
    116         return dask.get(dsk, keys, **kwargs)

RuntimeError: Too many computes. Total: 2 > max: 1.
```

So here it looks like we don't support indexing by dask arrays, so as we loop through the dataset the |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
Fancy indexing a Dataset with dask DataArray triggers multiple computes 759709924 | |
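The traceback above bottoms out in `np.asarray(k)`, which calls dask's `__array__` and hence `.compute()`, once per variable being indexed. A small sketch demonstrating that coercion (`counting_get` here is an illustrative counting scheduler for this sketch, not xarray's `raise_if_dask_computes`):

```python
import dask
import dask.array as da
import numpy as np

computes = 0

# Illustrative counting scheduler: wraps the synchronous get so we can
# observe each time dask evaluates a graph
def counting_get(dsk, keys, **kwargs):
    global computes
    computes += 1
    return dask.get(dsk, keys, **kwargs)

x = da.arange(10, chunks=5)

with dask.config.set(scheduler=counting_get):
    # np.asarray calls x.__array__, which calls x.compute()
    arr = np.asarray(x)
```

Looping `np.asarray` over several dask-backed indexers in this way is exactly how one compute per variable adds up.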
740945474 | https://github.com/pydata/xarray/issues/4663#issuecomment-740945474 | https://api.github.com/repos/pydata/xarray/issues/4663 | MDEyOklzc3VlQ29tbWVudDc0MDk0NTQ3NA== | dcherian 2448579 | 2020-12-08T20:01:38Z | 2020-12-08T20:01:38Z | MEMBER | I commented too soon.
only computes once (!) so there's something else going on possibly |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
Fancy indexing a Dataset with dask DataArray triggers multiple computes 759709924 |
CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue] ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user] ON [issue_comments] ([user]);