issues: 759709924
id | node_id | number | title | user | state | locked | assignee | milestone | comments | created_at | updated_at | closed_at | author_association | active_lock_reason | draft | pull_request | body | reactions | performed_via_github_app | state_reason | repo | type |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
759709924 | MDU6SXNzdWU3NTk3MDk5MjQ= | 4663 | Fancy indexing a Dataset with dask DataArray triggers multiple computes | 6130352 | closed | 0 | | | 8 | 2020-12-08T19:17:08Z | 2023-03-15T02:48:01Z | 2023-03-15T02:48:01Z | NONE | | | | It appears that boolean arrays (or any slicing array presumably) are evaluated many more times than necessary when applied to multiple variables in a Dataset. Is this intentional? Here is an example that demonstrates this:

```python
# Use a custom array type to know when data is being evaluated
class Array():
    ...  # class body missing from this export

# Control case -- this shows that the print statement is only reached once
da.from_array(Array(np.random.rand(100))).compute();
# Evaluating

# This usage somehow results in two evaluations of this one array?
ds = xr.Dataset(dict(
    a=('x', da.from_array(Array(np.random.rand(100))))
))
ds.sel(x=ds.a)
# Evaluating
# Evaluating
# <xarray.Dataset>
# Dimensions:  (x: 51)
# Dimensions without coordinates: x
# Data variables:
#     a        (x) bool dask.array<chunksize=(51,), meta=np.ndarray>

# The array is evaluated an extra time for each new variable
ds = xr.Dataset(dict(
    a=('x', da.from_array(Array(np.random.rand(100)))),
    b=(('x', 'y'), da.random.random((100, 10))),
    c=(('x', 'y'), da.random.random((100, 10))),
    d=(('x', 'y'), da.random.random((100, 10))),
))
ds.sel(x=ds.a)
# Evaluating
# Evaluating
# Evaluating
# Evaluating
# Evaluating
# <xarray.Dataset>
# Dimensions:  (x: 48, y: 10)
# Dimensions without coordinates: x, y
# Data variables:
#     a        (x) bool dask.array<chunksize=(48,), meta=np.ndarray>
#     b        (x, y) float64 dask.array<chunksize=(48, 10), meta=np.ndarray>
#     c        (x, y) float64 dask.array<chunksize=(48, 10), meta=np.ndarray>
#     d        (x, y) float64 dask.array<chunksize=(48, 10), meta=np.ndarray>
```

Given that slicing is already not lazy, why does the same predicate array need to be computed more than once?

@tomwhite originally pointed this out in https://github.com/pystatgen/sgkit/issues/299. |
{ "url": "https://api.github.com/repos/pydata/xarray/issues/4663/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } | | completed | 13221727 | issue |
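
The issue body above reports that a lazy boolean indexer is evaluated once per variable. One possible way to sidestep that, sketched here as an illustration and not as the fix adopted by xarray, is to compute the predicate once up front and index with the resulting concrete positions. The `calls` counter and the `predicate` function are hypothetical stand-ins for the issue's print-on-evaluate array type; the sketch also assumes a single chunk and the synchronous scheduler so that the count is easy to reason about.

```python
import dask.array as da
import numpy as np
import xarray as xr

# Hypothetical counter standing in for the issue's print-on-evaluate array type.
calls = {"n": 0}


def predicate(block):
    """Return a boolean mask and record that a real evaluation happened."""
    calls["n"] += 1
    return block > 0.5


rng = np.random.default_rng(0)
values = da.from_array(rng.random(100), chunks=100)  # one chunk only

# Passing meta explicitly keeps dask from calling predicate() just to infer
# the output type, so the counter only reflects real computes.
mask = values.map_blocks(predicate, dtype=bool, meta=np.empty((0,), dtype=bool))

ds = xr.Dataset(dict(
    a=("x", mask),
    b=(("x", "y"), da.random.random((100, 10), chunks=(100, 10))),
    c=(("x", "y"), da.random.random((100, 10), chunks=(100, 10))),
    d=(("x", "y"), da.random.random((100, 10), chunks=(100, 10))),
))

# Evaluate the predicate once, outside of the indexing call.
mask_np = mask.compute(scheduler="synchronous")
idx = np.flatnonzero(mask_np)  # integer positions where the mask is True

# Indexing with a concrete NumPy array leaves every variable lazy, so no
# further evaluation of the predicate is triggered here.
subset = ds.isel(x=idx)

print("predicate evaluations:", calls["n"])
print(subset.sizes)
```

Note that `a` in `subset` is still a lazy dask array built on `predicate`, so computing `subset` later will evaluate the predicate again as part of that variable's own graph. Only the indexing step is covered by this sketch.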