issues: 759709924

  • id: 759709924
  • node_id: MDU6SXNzdWU3NTk3MDk5MjQ=
  • number: 4663
  • title: Fancy indexing a Dataset with dask DataArray triggers multiple computes
  • user: 6130352
  • state: closed
  • locked: 0
  • comments: 8
  • created_at: 2020-12-08T19:17:08Z
  • updated_at: 2023-03-15T02:48:01Z
  • closed_at: 2023-03-15T02:48:01Z
  • author_association: NONE

It appears that boolean arrays (and presumably any indexing array) are evaluated more times than necessary when applied to multiple variables in a Dataset. Is this intentional? Here is an example that demonstrates this:

```python
import numpy as np
import dask.array as da
import xarray as xr

# Use a custom array type to know when data is being evaluated
class Array:

    def __init__(self, x):
        self.shape = (x.shape[0],)
        self.ndim = x.ndim
        self.dtype = 'bool'
        self.x = x

    def __getitem__(self, idx):
        # Ignore the zero-length slices dask uses for meta inference
        if idx[0].stop > 0:
            print('Evaluating')
        return (self.x > .5).__getitem__(idx)

# Control case -- this shows that the print statement is only reached once
da.from_array(Array(np.random.rand(100))).compute();
# Evaluating

# This usage somehow results in two evaluations of this one array?
ds = xr.Dataset(dict(
    a=('x', da.from_array(Array(np.random.rand(100))))
))
ds.sel(x=ds.a)
# Evaluating
# Evaluating
# <xarray.Dataset>
# Dimensions:  (x: 51)
# Dimensions without coordinates: x
# Data variables:
#     a        (x) bool dask.array<chunksize=(51,), meta=np.ndarray>

# The array is evaluated an extra time for each new variable
ds = xr.Dataset(dict(
    a=('x', da.from_array(Array(np.random.rand(100)))),
    b=(('x', 'y'), da.random.random((100, 10))),
    c=(('x', 'y'), da.random.random((100, 10))),
    d=(('x', 'y'), da.random.random((100, 10))),
))
ds.sel(x=ds.a)
# Evaluating
# Evaluating
# Evaluating
# Evaluating
# Evaluating
# <xarray.Dataset>
# Dimensions:  (x: 48, y: 10)
# Dimensions without coordinates: x, y
# Data variables:
#     a        (x) bool dask.array<chunksize=(48,), meta=np.ndarray>
#     b        (x, y) float64 dask.array<chunksize=(48, 10), meta=np.ndarray>
#     c        (x, y) float64 dask.array<chunksize=(48, 10), meta=np.ndarray>
#     d        (x, y) float64 dask.array<chunksize=(48, 10), meta=np.ndarray>
```
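For contrast, dask itself only evaluates shared graph keys once when dependent results are computed together in a single call. A minimal sketch of that behaviour in plain dask (the names here are hypothetical, not from the issue):

```python
import numpy as np
import dask
import dask.array as da

# Two outputs that both depend on the same predicate array.
pred = da.from_array(np.random.rand(100), chunks=50) > .5
n_true = pred.sum()
n_false = (~pred).sum()

# Computing both in ONE dask.compute call merges the graphs, so the
# shared chunks of `pred` are evaluated only once for both outputs.
n_true, n_false = dask.compute(n_true, n_false)
assert int(n_true) + int(n_false) == 100
```

This is why evaluating the same predicate once per variable looks wasteful: the deduplication machinery already exists at the dask level.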

Given that slicing is already not lazy, why does the same predicate array need to be computed more than once?
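One workaround sketch (not from the original report, and using positional `isel` rather than `sel`): materialize the predicate once with `.compute()`, then index with the resulting in-memory mask so the dask graph behind the predicate runs only a single time:

```python
import numpy as np
import dask.array as da
import xarray as xr

# Assumed setup mirroring the example above (sizes are arbitrary).
ds = xr.Dataset(dict(
    a=('x', da.from_array(np.random.rand(100), chunks=50) > .5),
    b=(('x', 'y'), da.random.random((100, 10))),
))

# Evaluate the boolean predicate exactly once...
mask = ds.a.compute()

# ...then index positionally with the NumPy-backed mask; the other
# variables are sliced lazily without re-running the predicate graph.
subset = ds.isel(x=mask.values)
assert subset.sizes['x'] == int(mask.values.sum())
```

This sidesteps the repeated computes but loses laziness for the predicate itself, which may matter for very large indexers.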

@tomwhite originally pointed this out in https://github.com/pystatgen/sgkit/issues/299.

  • state_reason: completed
  • repo: 13221727
  • type: issue
