
issue_comments


8 rows where issue = 702646191 sorted by updated_at descending


id html_url issue_url node_id user created_at updated_at ▲ author_association body reactions performed_via_github_app issue
712066302 https://github.com/pydata/xarray/issues/4428#issuecomment-712066302 https://api.github.com/repos/pydata/xarray/issues/4428 MDEyOklzc3VlQ29tbWVudDcxMjA2NjMwMg== TomAugspurger 1312546 2020-10-19T11:08:13Z 2020-10-19T11:43:46Z MEMBER

Sorry, my comment in https://github.com/pydata/xarray/issues/4428#issuecomment-711034128 was incorrect in a couple of ways:

  1. We still do the splitting, even when slicing with an out-of-order indexer. I'm checking whether that's appropriate.
  2. I'm also looking into a logic bug when computing the number of chunks. I don't think we properly handle non-uniform chunking on the other axes.
{
    "total_count": 2,
    "+1": 2,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Behaviour change in xarray.Dataset.sortby/sel between dask==2.25.0 and dask==2.26.0 702646191
711034128 https://github.com/pydata/xarray/issues/4428#issuecomment-711034128 https://api.github.com/repos/pydata/xarray/issues/4428 MDEyOklzc3VlQ29tbWVudDcxMTAzNDEyOA== TomAugspurger 1312546 2020-10-17T15:54:48Z 2020-10-17T15:54:48Z MEMBER

I assume that the indices [np.argsort(da.x.data)] are not going to be monotonically increasing, and that induces a different slicing pattern. The docs at https://docs.dask.org/en/latest/array-slicing.html#efficiency describe the case where the indices are sorted, but don't discuss the non-sorted case (yet).
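For concreteness, a small plain-NumPy illustration (using the x coordinate from the reproducer elsewhere in this thread) of why the argsort indexer is not monotonically increasing:

```python
import numpy as np

# The unsorted x coordinate from the reproducer in this thread
x = np.array([3, 4, 5, 6, 7, 9, 8, 0, 2, 1])

# sortby effectively indexes with argsort(x)
sorter = np.argsort(x)
print(sorter)                        # [7 9 8 0 1 2 3 4 6 5]
print(np.all(np.diff(sorter) > 0))  # False: not monotonically increasing
```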

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Behaviour change in xarray.Dataset.sortby/sel between dask==2.25.0 and dask==2.26.0 702646191
710683863 https://github.com/pydata/xarray/issues/4428#issuecomment-710683863 https://api.github.com/repos/pydata/xarray/issues/4428 MDEyOklzc3VlQ29tbWVudDcxMDY4Mzg2Mw== dcherian 2448579 2020-10-16T22:40:50Z 2020-10-16T22:40:50Z MEMBER

@TomAugspurger @jbusecke is seeing some funny behaviour in https://github.com/jbusecke/cmip6_preprocessing/issues/58

Here's a reproducer:

```python
import dask
import dask.array
import numpy as np
import xarray as xr

dask.config.set(
    **{
        "array.slicing.split_large_chunks": True,
        "array.chunk-size": "24 MiB",
    }
)

da = xr.DataArray(
    dask.array.random.random((10, 1000, 2000), chunks=(-1, -1, 200)),
    dims=["x", "y", "time"],
    coords={"x": [3, 4, 5, 6, 7, 9, 8, 0, 2, 1]},
)
da
```

Which is basically

```python
da.data[np.argsort(da.x.data), ...]
```

I don't understand why it's rechunking when we are indexing with a list along a dimension with a single chunk...

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Behaviour change in xarray.Dataset.sortby/sel between dask==2.25.0 and dask==2.26.0 702646191
709539887 https://github.com/pydata/xarray/issues/4428#issuecomment-709539887 https://api.github.com/repos/pydata/xarray/issues/4428 MDEyOklzc3VlQ29tbWVudDcwOTUzOTg4Nw== TomAugspurger 1312546 2020-10-15T19:20:53Z 2020-10-15T19:20:53Z MEMBER

Closing the loop here: with https://github.com/dask/dask/pull/6665 the behaviour of dask==2.25.0 should be restored (possibly with a warning about creating large chunks).

So this can probably be closed, though there may be parts of xarray that should be updated to avoid creating large chunks, or we could rely on the user to do that through the dask config system.
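As a hedged sketch of that config knob (assuming the `array.slicing.split_large_chunks` key referenced in the PR above; array shapes follow the reproducer elsewhere in this thread):

```python
import dask
import dask.array as darr
import numpy as np

x = darr.random.random((10000, 16, 4), chunks=(10000, 16, 4))
idx = np.repeat(np.arange(16), 64)  # upsample y from 16 to 1024

# Opt out of the automatic splitting to keep the dask==2.25.0 chunking
with dask.config.set(**{"array.slicing.split_large_chunks": False}):
    y = x[:, idx, :]

print(y.chunks[1])  # a single chunk of 1024 along the indexed axis
```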

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Behaviour change in xarray.Dataset.sortby/sel between dask==2.25.0 and dask==2.26.0 702646191
696475388 https://github.com/pydata/xarray/issues/4428#issuecomment-696475388 https://api.github.com/repos/pydata/xarray/issues/4428 MDEyOklzc3VlQ29tbWVudDY5NjQ3NTM4OA== ccarouge 8587080 2020-09-22T02:19:03Z 2020-09-22T02:19:03Z NONE

Hi. This change of behaviour broke an interpolation for me. The interpolation function does a sortby along the interpolated dimension, but you can't interpolate along a chunked dimension. I would argue the interpolation function needs to rechunk back to the original chunking after the sortby, or stop people from interpolating a dask array without assume_sorted=True.
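A minimal sketch of the "rechunk after the sortby" idea on toy data (the shapes and chunk sizes here are made up for illustration):

```python
import numpy as np
import xarray as xr

da = xr.DataArray(
    np.arange(10.0),
    dims="x",
    coords={"x": [3, 1, 2, 0, 4, 6, 5, 9, 8, 7]},
).chunk({"x": 5})

# Sort, then restore the pre-sort chunk layout so downstream operations
# (e.g. interpolation) see the chunking the user originally constructed.
out = da.sortby("x").chunk({"x": 5})
print(out.chunks)  # ((5, 5),)
```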

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Behaviour change in xarray.Dataset.sortby/sel between dask==2.25.0 and dask==2.26.0 702646191
693552440 https://github.com/pydata/xarray/issues/4428#issuecomment-693552440 https://api.github.com/repos/pydata/xarray/issues/4428 MDEyOklzc3VlQ29tbWVudDY5MzU1MjQ0MA== JSKenyon 6582745 2020-09-16T17:31:54Z 2020-09-16T17:31:54Z NONE

Thanks! I will definitely give that a go when I am back at my work PC. My personal take is that this level of automated rechunking is dangerous. I have constructed the chunking in my code with great care and for a reason. Having it changed "invisibly" by operations which didn't have this behaviour previously seems problematic to me.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Behaviour change in xarray.Dataset.sortby/sel between dask==2.25.0 and dask==2.26.0 702646191
693475844 https://github.com/pydata/xarray/issues/4428#issuecomment-693475844 https://api.github.com/repos/pydata/xarray/issues/4428 MDEyOklzc3VlQ29tbWVudDY5MzQ3NTg0NA== dcherian 2448579 2020-09-16T15:17:44Z 2020-09-16T15:17:44Z MEMBER

This looks like a consequence of https://github.com/dask/dask/pull/6514. That change helps with cases like https://github.com/pydata/xarray/issues/4112

sortby is basically an isel indexing operation, so dask is automatically rechunking to make chunks with size < the default. You could fix this by setting an appropriate value for array.chunk-size, either temporarily or permanently:

```python
with dask.config.set({"array.chunk-size": "256MiB"}):  # or an appropriate value
    ...
```

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Behaviour change in xarray.Dataset.sortby/sel between dask==2.25.0 and dask==2.26.0 702646191
693385409 https://github.com/pydata/xarray/issues/4428#issuecomment-693385409 https://api.github.com/repos/pydata/xarray/issues/4428 MDEyOklzc3VlQ29tbWVudDY5MzM4NTQwOQ== JSKenyon 6582745 2020-09-16T12:54:39Z 2020-09-16T12:54:39Z NONE

Finally managed to reproduce. Here it is:

```python
import xarray
import dask.array as da
import numpy as np

if __name__ == "__main__":

    data = da.random.random([10000, 16, 4], chunks=(10000, 16, 4))

    dtype = np.float32

    xds = xarray.Dataset(
        data_vars={"DATA1": (("x", "y", "z"), data.astype(dtype))})

    upsample_factor = 1024 // xds.dims["y"]

    # Create a selection which will upsample the y axis.
    selection = np.repeat(np.arange(xds.dims["y"]), upsample_factor)

    print("xarray.Dataset prior to resampling:\n", xds)

    xds = xds.sel({"y": selection})

    print("xarray.Dataset post resampling:\n", xds)
```

With dask==2.25.0 this gives:

```
xarray.Dataset prior to resampling:
<xarray.Dataset>
Dimensions:  (x: 10000, y: 16, z: 4)
Dimensions without coordinates: x, y, z
Data variables:
    DATA1    (x, y, z) float32 dask.array<chunksize=(10000, 16, 4), meta=np.ndarray>
xarray.Dataset post resampling:
<xarray.Dataset>
Dimensions:  (x: 10000, y: 1024, z: 4)
Dimensions without coordinates: x, y, z
Data variables:
    DATA1    (x, y, z) float32 dask.array<chunksize=(10000, 1024, 4), meta=np.ndarray>
```

With dask==2.26.0 this gives:

```
xarray.Dataset prior to resampling:
<xarray.Dataset>
Dimensions:  (x: 10000, y: 16, z: 4)
Dimensions without coordinates: x, y, z
Data variables:
    DATA1    (x, y, z) float32 dask.array<chunksize=(10000, 16, 4), meta=np.ndarray>
xarray.Dataset post resampling:
<xarray.Dataset>
Dimensions:  (x: 10000, y: 1024, z: 4)
Dimensions without coordinates: x, y, z
Data variables:
    DATA1    (x, y, z) float32 dask.array<chunksize=(10000, 512, 4), meta=np.ndarray>
```

And finally, the most distressing part: changing the dtype changes the chunking! With dtype = np.complex64, dask==2.26.0 gives:

```
xarray.Dataset prior to resampling:
<xarray.Dataset>
Dimensions:  (x: 10000, y: 16, z: 4)
Dimensions without coordinates: x, y, z
Data variables:
    DATA1    (x, y, z) complex64 dask.array<chunksize=(10000, 16, 4), meta=np.ndarray>
xarray.Dataset post resampling:
<xarray.Dataset>
Dimensions:  (x: 10000, y: 1024, z: 4)
Dimensions without coordinates: x, y, z
Data variables:
    DATA1    (x, y, z) complex64 dask.array<chunksize=(10000, 342, 4), meta=np.ndarray>
```
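The dtype dependence follows from dask's bytes-based chunk-size limit (128 MiB by default): the output chunk is split into the smallest number of pieces that fit under the byte limit, and a complex64 element is twice the size of a float32 one. A back-of-the-envelope sketch of that arithmetic, not dask's actual implementation (`split_len` is a hypothetical helper):

```python
import math

LIMIT = 128 * 2**20  # dask's default "array.chunk-size" of 128 MiB, in bytes

def split_len(axis_len, other_elems, itemsize, limit=LIMIT):
    """Rough length of the split axis after capping chunks at `limit` bytes."""
    chunk_bytes = axis_len * other_elems * itemsize
    pieces = math.ceil(chunk_bytes / limit)  # pieces needed to fit the limit
    return math.ceil(axis_len / pieces)

# (10000, 1024, 4) result, split along y (len 1024); other axes: 10000 * 4 elements
print(split_len(1024, 10000 * 4, 4))  # float32 (4 bytes)  -> 512
print(split_len(1024, 10000 * 4, 8))  # complex64 (8 bytes) -> 342
```

These reproduce the observed chunk sizes: 156 MiB of float32 needs 2 pieces (1024 / 2 = 512), while 312 MiB of complex64 needs 3 (ceil(1024 / 3) = 342).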

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Behaviour change in xarray.Dataset.sortby/sel between dask==2.25.0 and dask==2.26.0 702646191

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);
Powered by Datasette · Queries took 19.143ms · About: xarray-datasette