issue_comments


8 rows where author_association = "MEMBER", issue = 333312849 and user = 1197350 sorted by updated_at descending




id html_url issue_url node_id user created_at updated_at ▲ author_association body reactions performed_via_github_app issue
620961663 https://github.com/pydata/xarray/issues/2237#issuecomment-620961663 https://api.github.com/repos/pydata/xarray/issues/2237 MDEyOklzc3VlQ29tbWVudDYyMDk2MTY2Mw== rabernat 1197350 2020-04-29T02:45:28Z 2020-04-29T02:45:28Z MEMBER

I'm reviving this classic issue to report another quasi-failure of dask chunking, this time in the opposite direction.

Consider this dataset:

```python
import numpy as np
import dask.array as dsa
import xarray as xr

ds = xr.Dataset(
    {'foo': (['time'], dsa.ones(120, chunks=60))},
    coords={'year': (['time'], np.repeat(np.arange(10), 12))},
)
ds
```

```
<xarray.Dataset>
Dimensions:  (time: 120)
Coordinates:
    year     (time) int64 0 0 0 0 0 0 0 0 0 0 0 0 1 ... 9 9 9 9 9 9 9 9 9 9 9 9
Dimensions without coordinates: time
Data variables:
    foo      (time) float64 dask.array<chunksize=(60,), meta=np.ndarray>
```

There are just two big chunks.

Now let's try to take an "annual mean" using groupby:

```python
ds.foo.groupby('year').mean(dim='time')
```

```
<xarray.DataArray 'foo' (year: 10)>
dask.array<stack, shape=(10,), dtype=float64, chunksize=(1,), chunktype=numpy.ndarray>
Coordinates:
  * year     (year) int64 0 1 2 3 4 5 6 7 8 9
```

Now we have a chunksize of 1 and 10 chunks. That's bad: we should still just have two chunks, since we are aggregating only within chunks. Taken to the limit of very high temporal resolution, this example will blow up in terms of number of tasks. I wish dask could figure out that it doesn't have to create all those tasks.
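
To make the contrast concrete, here is a small sketch (assuming the `ds` defined above) that inspects the chunking; the chunk values follow the reprs in this comment:

```python
# Compare input and output chunking for the groupby-mean above.
annual = ds.foo.groupby('year').mean(dim='time')

print(ds.foo.data.chunks)   # ((60, 60),) -- two input chunks
print(annual.data.chunks)   # ten chunks of size 1, per the repr above

# The number of tasks in the graph grows with the number of groups,
# which is what blows up at high temporal resolution.
print(len(annual.data.__dask_graph__()))
```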

The graph looks like this

In contrast, coarsen is smart enough, probably because it relies on dask's underlying coarsen function:

```python
ds.foo.coarsen(time=12).mean()
```

```
<xarray.DataArray (time: 10)>
dask.array<mean_agg-aggregate, shape=(10,), dtype=float64, chunksize=(5,), chunktype=numpy.ndarray>
Coordinates:
    year     (time) float64 0.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0
Dimensions without coordinates: time
```
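
For contrast, a quick check of the output chunks (a sketch assuming the `ds` defined above; the expected values follow the repr just shown):

```python
# coarsen produces two output chunks of 5, one per 60-element input chunk,
# instead of splintering into per-group chunks.
print(ds.foo.coarsen(time=12).mean().data.chunks)  # ((5, 5),)
```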

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  why time grouping doesn't preserve chunks 333312849
482275708 https://github.com/pydata/xarray/issues/2237#issuecomment-482275708 https://api.github.com/repos/pydata/xarray/issues/2237 MDEyOklzc3VlQ29tbWVudDQ4MjI3NTcwOA== rabernat 1197350 2019-04-11T19:37:05Z 2019-04-11T19:37:05Z MEMBER

We had a long iteration on this in Pangeo, and big progress was made in dask. Definitely closed for now.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  why time grouping doesn't preserve chunks 333312849
398597356 https://github.com/pydata/xarray/issues/2237#issuecomment-398597356 https://api.github.com/repos/pydata/xarray/issues/2237 MDEyOklzc3VlQ29tbWVudDM5ODU5NzM1Ng== rabernat 1197350 2018-06-20T01:42:55Z 2018-06-20T01:42:55Z MEMBER

I'm glad to see that this has generated so much serious discussion and thought! I will try to catch up on it in the morning when I have some hope of understanding.

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  why time grouping doesn't preserve chunks 333312849
398240724 https://github.com/pydata/xarray/issues/2237#issuecomment-398240724 https://api.github.com/repos/pydata/xarray/issues/2237 MDEyOklzc3VlQ29tbWVudDM5ODI0MDcyNA== rabernat 1197350 2018-06-19T00:57:44Z 2018-06-19T00:57:44Z MEMBER

With groupby in xarray, we have two main cases:

  1. groupby with reduction -- (e.g. ds.groupby('baz').mean(dim='x')). There is currently no problem here. The new dimension becomes baz and the array is chunked as {'baz': 1}.
  2. groupby with no reduction -- (e.g. ds.groupby('baz').apply(lambda x: x - x.mean())). In this case, the point of the out-of-order indexing is actually to put the array back together in its original order. In my last example above, according to the dot graph, it looks like there are four chunks right up until the end. They just have to be re-ordered. I imagine this should be cheap and simple, but I am probably overlooking something.

Case 2 seems similar to @shoyer's example: `x[np.arange(4)[::-1]]`. Here we would just want to reorder the existing chunks.

If the chunk size before reindexing is not 1, then yes, one needs to do something more sophisticated. But I would argue that, if the array is being re-indexed along a dimension in which the chunk size is 1, a sensible default behavior would be to avoid aggregating into a big chunk and instead just pass the original chunks through in a new order.
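
A minimal sketch of the re-indexing behavior in question, using @shoyer's example (the single-chunk result reflects dask's behavior as described in this thread):

```python
import numpy as np
import dask.array as dsa

x = dsa.ones(4, chunks=1)
print(x.chunks)              # ((1, 1, 1, 1),)

# Out-of-order integer indexing:
y = x[np.arange(4)[::-1]]
print(y.chunks)              # at the time of this discussion: ((4,),) -- one aggregated chunk

# The default argued for above would instead keep ((1, 1, 1, 1),),
# passing the single-element chunks through in the new order.
```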

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  why time grouping doesn't preserve chunks 333312849
398158656 https://github.com/pydata/xarray/issues/2237#issuecomment-398158656 https://api.github.com/repos/pydata/xarray/issues/2237 MDEyOklzc3VlQ29tbWVudDM5ODE1ODY1Ng== rabernat 1197350 2018-06-18T18:55:08Z 2018-06-18T18:55:08Z MEMBER

Thanks for the explanation @shoyer! Yes, that appears to be the root of the issue. After literally years of struggling with this, I am happy to finally get to this level of clarity.

So basically the issue comes down to indexing with dask.array, which creates a single chunk when integer indices are not all in order.

Do we think dask is happy with that behavior? If not, then an upstream fix would be best. Pinging @mrocklin.

Otherwise we can try to work around in xarray.

{
    "total_count": 1,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 1,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  why time grouping doesn't preserve chunks 333312849
398156747 https://github.com/pydata/xarray/issues/2237#issuecomment-398156747 https://api.github.com/repos/pydata/xarray/issues/2237 MDEyOklzc3VlQ29tbWVudDM5ODE1Njc0Nw== rabernat 1197350 2018-06-18T18:48:34Z 2018-06-18T18:48:34Z MEMBER

And just because it's fun, I will show what the anomaly calculation looks like

`ds.foo.groupby('bar').apply(lambda x: x - x.mean()).data.visualize()`:

`ds.foo.groupby('baz').apply(lambda x: x - x.mean()).data.visualize()`:

It looks like everything is really ok up until the very end, where all the tasks aggregate into a single getitem call.

{
    "total_count": 2,
    "+1": 0,
    "-1": 0,
    "laugh": 2,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  why time grouping doesn't preserve chunks 333312849
398152064 https://github.com/pydata/xarray/issues/2237#issuecomment-398152064 https://api.github.com/repos/pydata/xarray/issues/2237 MDEyOklzc3VlQ29tbWVudDM5ODE1MjA2NA== rabernat 1197350 2018-06-18T18:32:42Z 2018-06-18T18:32:42Z MEMBER

I agree that single-value chunks illustrate the problem more clearly. I think this example is cleanest if you do it like this:

```python
import xarray as xr
import dask.array as dsa

ds = xr.Dataset(
    {'foo': (['x'], dsa.ones(4, chunks=1))},
    coords={'x': (['x'], [0, 1, 2, 3]),
            'bar': (['x'], ['a', 'a', 'b', 'b']),
            'baz': (['x'], ['a', 'b', 'a', 'b'])},
)
```

`ds.foo.groupby('bar').apply(lambda x: x).data.visualize()`:

`ds.foo.groupby('baz').apply(lambda x: x).data.visualize()`
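
Since the rendered graphs are not reproduced here, a hypothetical way to see the same difference without graphviz is to compare output chunks (assuming the `ds` just defined; expected values follow the behavior described in this thread):

```python
# 'bar' groups are contiguous, so reassembly uses an in-order index and the
# four single-element chunks are expected to survive.
print(ds.foo.groupby('bar').apply(lambda x: x).data.chunks)   # ((1, 1, 1, 1),)

# 'baz' groups are interleaved, so reassembly uses an out-of-order index,
# which at the time collapsed the result into a single chunk.
print(ds.foo.groupby('baz').apply(lambda x: x).data.chunks)   # ((4,),)
```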

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  why time grouping doesn't preserve chunks 333312849
398150381 https://github.com/pydata/xarray/issues/2237#issuecomment-398150381 https://api.github.com/repos/pydata/xarray/issues/2237 MDEyOklzc3VlQ29tbWVudDM5ODE1MDM4MQ== rabernat 1197350 2018-06-18T18:27:08Z 2018-06-18T18:27:08Z MEMBER

> while your example shows that chunks are lost after the groupby, does that prove for sure that the groupby operation does not use the original chunks?

One way to answer that is the following:

Here is the dask graph for groupby('bar'):

Here is the dask graph for groupby('baz'):

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  why time grouping doesn't preserve chunks 333312849
