issue_comments: 935769790

html_url issue_url id node_id user created_at updated_at author_association body reactions performed_via_github_app issue
https://github.com/pydata/xarray/issues/2511#issuecomment-935769790 https://api.github.com/repos/pydata/xarray/issues/2511 935769790 IC_kwDOAMm_X843xra- 22492773 2021-10-06T08:47:24Z 2021-10-06T08:47:24Z NONE

@bzah I've been testing your code and I can confirm the increase in run time once .compute() isn't used. I've also noticed that, with your modification, the dask array seems to be computed more than once per sample. I ran some tests using a modified version of the scheduler from #3237, and here are my observations:

Assuming that we have only one sample object after the resample, the expected result is 1 compute, and that's what we get if we call .compute() before .argmax(). If .compute() is removed, I get 3 computes in total. As a confirmation, if you increase the number of samples, the total number of computes is a multiple of 3.

I still don't know the reason, or whether this behaviour is correct, but it looks odd to me. It could explain the increase in run time, though.

@dcherian @shyer do you know if all this makes any sense? Should .isel() automatically trigger the computation, or should it return a lazy array?

Here is the code I've been using (it only works with the modification proposed by @bzah):

```
import numpy as np
import pandas as pd  # needed for pd.date_range below
import dask
import xarray as xr


class Scheduler:
    """Counts dask computes. From: https://stackoverflow.com/questions/53289286/"""

    def __init__(self, max_computes=20):
        self.max_computes = max_computes
        self.total_computes = 0

    def __call__(self, dsk, keys, **kwargs):
        # Called once per dask compute; delegate to the synchronous scheduler.
        self.total_computes += 1
        if self.total_computes > self.max_computes:
            raise RuntimeError(
                "Too many dask computations were scheduled: {}".format(
                    self.total_computes
                )
            )
        return dask.get(dsk, keys, **kwargs)


scheduler = Scheduler()

with dask.config.set(scheduler=scheduler):

    COORDS = dict(dim_0=pd.date_range("2042-01-01", periods=31, freq="D"),
                  dim_1=range(0, 500),
                  dim_2=range(0, 500))

    da = xr.DataArray(np.random.rand(31 * 500 * 500).reshape((31, 500, 500)),
                      coords=COORDS).chunk(dict(dim_0=-1, dim_1=100, dim_2=100))

    print(da)

    # 31 daily steps in a single month, so "MS" yields exactly one sample
    resampled = da.resample(dim_0="MS")

    for label, sample in resampled:

        # sample = sample.compute()
        idx = sample.argmax("dim_0")
        sampled = sample.isel(dim_0=idx)

    print("Total number of computes: %d" % scheduler.total_computes)

```
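As a reduced, dask-only illustration of how the counter behaves (this is a sketch, not a confirmed explanation of the 3 computes reported above): each implicit conversion of a lazy dask array to NumPy goes through the scheduler once, whereas an array that has already been computed triggers no further computes. The names `CountingScheduler`, `dsa`, and the small 4x6 array are assumptions made for this example and do not come from the code above.

```python
import dask
import dask.array as dsa
import numpy as np


class CountingScheduler:
    """Hypothetical helper: counts scheduler invocations, then runs synchronously."""

    def __init__(self):
        self.total_computes = 0

    def __call__(self, dsk, keys, **kwargs):
        self.total_computes += 1
        return dask.get(dsk, keys, **kwargs)


counter = CountingScheduler()

with dask.config.set(scheduler=counter):
    x = dsa.ones((4, 6), chunks=(2, 3))
    idx = x.argmax(axis=0)  # only builds a graph, nothing is computed yet
    after_graph_build = counter.total_computes

    # Each implicit conversion of the lazy array is a separate compute.
    np.asarray(idx)
    np.asarray(idx)
    after_two_conversions = counter.total_computes

    # Computing once up front and reusing the NumPy result adds a single compute.
    idx_np = idx.compute()
    np.asarray(idx_np)
    np.asarray(idx_np)
    after_explicit = counter.total_computes

print(after_graph_build, after_two_conversions, after_explicit)
```

If something inside the indexing path converts a lazy indexer to NumPy more than once, the count would grow in the same way. Whether that is what happens in .isel() here is an open question, not something this sketch establishes.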

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  374025325