
issue_comments


6 rows where issue = 627600168 sorted by updated_at descending



Issue: Unexpected chunking behavior when using `xr.align` with `join='outer'` (#4112, 6 comments)
jbusecke (CONTRIBUTOR) · 2020-10-06T20:20:34Z · https://github.com/pydata/xarray/issues/4112#issuecomment-704530619

Just tried this with the newest dask version and can confirm that I do not get huge chunks anymore if I specify `dask.config.set({"array.slicing.split_large_chunks": True})`. I also needed to modify the example to exceed the internal chunk size limitation:

```python
import numpy as np
import xarray as xr
import dask

dask.config.set({"array.slicing.split_large_chunks": True})

short_time = xr.cftime_range('2000', periods=12)
long_time = xr.cftime_range('2000', periods=120)

data_short = np.random.rand(len(short_time))
data_long = np.random.rand(len(long_time))
n = 1000
a = xr.DataArray(data_short, dims=['time'], coords={'time': short_time}).expand_dims(a=n, b=n).chunk({'time': 3})
b = xr.DataArray(data_long, dims=['time'], coords={'time': long_time}).expand_dims(a=n, b=n).chunk({'time': 3})

a, b = xr.align(a, b, join='outer')
```

With the option turned on I get nicely split chunks for `a`; with the defaults, I still get one giant chunk.

I'll try this soon in the real-world scenario described above. Just wanted to report back here.
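For reference, the padding that drives all of this can be seen without dask at all: the outer join reindexes the short array onto the union of both time axes and fills the missing labels with NaN. This sketch uses pandas dates instead of `cftime` just to stay minimal:

```python
import numpy as np
import pandas as pd
import xarray as xr

short_time = pd.date_range('2000', periods=12)
long_time = pd.date_range('2000', periods=120)

a = xr.DataArray(np.random.rand(12), dims=['time'], coords={'time': short_time})
b = xr.DataArray(np.random.rand(120), dims=['time'], coords={'time': long_time})

# join='outer' reindexes both onto the union of the two time axes
a2, b2 = xr.align(a, b, join='outer')
print(dict(a2.sizes))          # {'time': 120}
print(int(a2.isnull().sum()))  # 108 padded (fill-value) entries
```

With a chunked array, that block of 108 fill entries is exactly what becomes the one giant output chunk.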

dcherian (MEMBER) · 2020-06-12T22:55:12Z · https://github.com/pydata/xarray/issues/4112#issuecomment-643513541

> One option might be to rewrite Dask's indexing functionality to "split" chunks that are much larger than their inputs into smaller pieces, even if they all come from the same input chunk?

This is Tom's proposed solution in https://github.com/dask/dask/issues/6270

shoyer (MEMBER) · 2020-06-12T22:50:57Z · https://github.com/pydata/xarray/issues/4112#issuecomment-643512625

The problem with chunking indexers is that then dask doesn't have any visibility into the indexing values, which means the graph now grows like the square of the number of chunks along an axis, instead of proportional to the number of chunks.

The real operation that xarray needs here is `Variable._getitem_with_mask`, i.e., indexing with -1 remapped to a fill value: https://github.com/pydata/xarray/blob/e8bd8665e8fd762031c2d9c87987d21e113e41cc/xarray/core/variable.py#L715

The padded portion of the array is used in indexing, but only so the result is aligned for `np.where` to replace with the fill value. We actually don't look at those values at all.

I don't know the best way to handle this. One option might be to rewrite Dask's indexing functionality to "split" chunks that are much larger than their inputs into smaller pieces, even if they all come from the same input chunk?
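The masked-indexing pattern described here can be sketched in plain NumPy; this is a simplified stand-in, not xarray's actual `Variable._getitem_with_mask` implementation:

```python
import numpy as np

def getitem_with_mask(arr, indexer, fill_value=np.nan):
    """Index `arr` with `indexer`, remapping -1 entries to `fill_value`.

    The clipped index only exists so the gather produces a result of the
    right shape; np.where overwrites the masked positions, so the values
    gathered there are never actually used.
    """
    indexer = np.asarray(indexer)
    mask = indexer == -1
    # Replace -1 with a valid index (0); those positions are discarded below.
    safe = np.where(mask, 0, indexer)
    result = arr[safe].astype(float)
    return np.where(mask, fill_value, result)

data = np.array([10.0, 20.0, 30.0])
print(getitem_with_mask(data, [0, 2, -1, 1]))  # [10. 30. nan 20.]
```

The inefficiency in the dask case comes from that gather step: the run of `-1` entries all index the same element, so they all land in one output chunk, even though their values are thrown away.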

dcherian (MEMBER) · 2020-06-12T15:51:31Z (edited 15:52:58Z) · https://github.com/pydata/xarray/issues/4112#issuecomment-643346497

Thanks @TomAugspurger

I think an upstream dask solution would be useful.

xarray automatically aligns objects everywhere, and this alignment is what is blowing things up. For this reason I think xarray should explicitly chunk the indexer when aligning. We could use a reasonable chunk size, like the median chunk size of the DataArray along that axis; this would respect the user's chunksize choices.

@shoyer What do you think?
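The suggestion above can be sketched with dask directly; the median-chunk heuristic here is illustrative, not existing xarray code:

```python
import numpy as np
import dask.array as da

# A toy version of the reindex that align performs: 12 real values,
# then 20 fill positions marked with -1.
arr = da.from_array(np.arange(12), chunks=4)
indexer = np.concatenate([np.arange(12), np.full(20, -1)])

# Plain indexer: the whole fill region lands in one output chunk.
print(arr[indexer].chunks)

# Chunk the indexer at the median input chunk size (the illustrative
# heuristic) so the output inherits comparable chunk sizes.
median_chunk = int(np.median(arr.chunks[0]))
lazy_indexer = da.from_array(indexer, chunks=median_chunk)
print(arr[lazy_indexer].chunks)
```

With a dask-array indexer, the output chunk structure follows the indexer's chunks, so no output chunk can exceed the chosen size.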

TomAugspurger (MEMBER) · 2020-06-01T11:44:23Z · https://github.com/pydata/xarray/issues/4112#issuecomment-636808986

Rechunking the indexer array is how I would be explicit about the desired chunk size. Opened https://github.com/dask/dask/issues/6270 to discuss this on the dask side.

Reactions: +1 × 2
dcherian (MEMBER) · 2020-05-30T13:52:33Z (edited 13:53:31Z) · https://github.com/pydata/xarray/issues/4112#issuecomment-636334010

Great diagnosis, @jbusecke.

Ultimately this comes down to dask indexing:

```python
import dask.array

arr = dask.array.from_array([0, 1, 2, 3], chunks=(1,))
print(arr.chunks)  # ((1, 1, 1, 1),)

# align calls reindex, which indexes with something like this
indexer = [0, 1, 2, 3] + [-1] * 111
print(arr[indexer].chunks)  # ((1, 1, 1, 112),)

# maybe something like this is a solution
lazy_indexer = dask.array.from_array(indexer, chunks=arr.chunks[0][0], name="idx")
print(arr[lazy_indexer].chunks)  # 115 chunks of size 1: ((1, 1, ..., 1),)
```

cc @TomAugspurger, the issue here is that the big size-112 chunk takes down the cluster in https://github.com/NCAR/intake-esm/issues/225

Reactions: +1 × 1
