issue_comments

8 comments where author_association = "MEMBER", issue = 333312849 ("why time grouping doesn't preserve chunks") and user = 306380 (mrocklin), sorted by updated_at descending

mrocklin (MEMBER) · 2018-06-20T17:48:49Z · https://github.com/pydata/xarray/issues/2237#issuecomment-398838600

I've implemented something here: https://github.com/dask/dask/pull/3648

Playing with it would be welcome.

mrocklin (MEMBER) · 2018-06-20T00:26:39Z · https://github.com/pydata/xarray/issues/2237#issuecomment-398586226

Thanks. This example helps.

> As you can see, if you concatenate together the first set of indices and index by the second set of indices, it would arrange them into sequential integers.

I'm not sure I understand this.

The situation on the whole does seem sensible to me though. This starts to look a little bit like a proper shuffle situation (using dataframe terminology). Each of your 365 output partitions would presumably touch 1/12th of your input partitions, leading to a quadratic number of tasks. If, after doing something, you then wanted to rearrange your data back into the original order, that would presumably require an equivalent number of extra tasks.
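
To put rough numbers on that (a back-of-the-envelope sketch; the ten-year figure is an assumption for illustration, not from the thread):

```python
# Estimated task count for the shuffle-like regrouping described above,
# assuming 10 years of daily data stored with one chunk per month.
n_years = 10
n_input_chunks = 12 * n_years                     # 120 monthly chunks
n_output_groups = 365                             # one group per day of year
chunks_touched_per_group = n_input_chunks / 12    # one month-chunk per year
n_tasks = n_output_groups * chunks_touched_per_group
print(int(n_tasks))  # 3650 tasks just for the regrouping, vs. 120 input chunks
```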

Am I understanding the situation correctly?

mrocklin (MEMBER) · 2018-06-19T23:59:58Z · https://github.com/pydata/xarray/issues/2237#issuecomment-398582100

So, if you're willing to humor me for a moment with a dask.array example: say you have an array that's currently partitioned by month:

```python
x = da.ones((1000, ...), chunks=(30, ...))  # approximately
```

And when you then do something by time.dayofyear, what do you end up doing to the array in dask.array operations? Sorry to be a bit slow here. I'm not as familiar with how XArray translates its groupby operations to dask.array operations under the hood.
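
For concreteness, here is a rough sketch (my own illustration, not necessarily what xarray actually emits) of how a day-of-year grouping can reduce to indexing with an out-of-order integer indexer:

```python
import numpy as np
import pandas as pd
import dask.array as da

# Two years of daily data, chunked roughly by month.
time = pd.date_range("2000-01-01", periods=730)
x = da.ones((730, 100), chunks=(30, 100))

# A day-of-year grouping amounts to indexing with the positions of each
# group concatenated in group order -- an out-of-order indexer.
order = np.argsort(time.dayofyear, kind="stable")
print(order[:4])           # [  0 366   1 367]: jumps between the two years
print(x[order].chunks[0])  # at the time of this thread: one big chunk, (730,)
```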

mrocklin (MEMBER) · 2018-06-19T23:56:22Z · https://github.com/pydata/xarray/issues/2237#issuecomment-398581508

So my question was "if you're grouping data by month, and it's already partitioned by month, then why are the indices out of order?" However, it may be that you've answered this in your most recent comment, I'm not sure. It may also be that I'm not understanding the situation.

mrocklin (MEMBER) · 2018-06-19T23:29:37Z · https://github.com/pydata/xarray/issues/2237#issuecomment-398577207

> That said, it's still probably more graceful to fail by creating too many small tasks rather than one giant task.

Maybe. We'll blow out the scheduler with too many tasks. With one large task we'll probably just start losing workers from memory errors.

In your example, what is the chunking of the indexed array likely to look like? And how do contiguous regions of the index interact with the chunk structure of the indexed array?

mrocklin (MEMBER) · 2018-06-19T23:20:23Z · https://github.com/pydata/xarray/issues/2237#issuecomment-398575620

It's also probably worth thinking about the kinds of operations you're trying to do, and how streamable they are. For example, if you were to take a dataset that was partitioned chronologically by month and then do some sort of day-of-month grouping, that would require the full dataset to be in memory at once.

If you're doing something like grouping on every month (keeping months of different years separate) then presumably your index is already sorted, and so you should be fine with the current behavior.
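
As a quick check of the "already sorted" claim (my own sketch, with an assumed two-year span):

```python
import numpy as np
import pandas as pd

time = pd.date_range("2000-01-01", periods=730)
# Group on (year, month): months of different years kept separate.
labels = time.year * 12 + time.month
order = np.argsort(labels, kind="stable")
print((order == np.arange(len(order))).all())  # True: the indexer is the identity
```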

It might be useful to take a look at how the various XArray cases you care about convert to dask array slicing operations.

mrocklin (MEMBER) · 2018-06-19T23:03:53Z · https://github.com/pydata/xarray/issues/2237#issuecomment-398573000

OK, so lowering this down to a dask array conversation, let's look at a couple of examples. First, the behavior of a sorted index:

```python
import dask.array as da

x = da.ones((20, 20), chunks=(4, 5))
x.chunks
# ((4, 4, 4, 4, 4), (5, 5, 5, 5))
```

If we index that array with a sorted index, we are able to efficiently preserve chunking:

```python
import numpy as np

x[np.arange(20), :].chunks
# ((4, 4, 4, 4, 4), (5, 5, 5, 5))

x[np.arange(20) // 2, :].chunks
# ((8, 8, 4), (5, 5, 5, 5))
```

However if the index isn't sorted then everything goes into one big chunk:

```python
x[np.arange(20) % 3, :].chunks
# ((20,), (5, 5, 5, 5))
```

We could imagine a few alternatives here:

  1. Make a chunk for every element in the index
  2. Make a chunk for every contiguous run in the index. So here we would have chunk dimensions of size 3 matching the 0, 1, 2, 0, 1, 2, 0, 1, 2 pattern of our index.

I don't really have a strong intuition for how the xarray operations transform into dask array operations (my brain is a bit tired right now, so thinking is hard) but my guess is that they would benefit from the second case. (A pure dask.array example would be welcome).
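
A minimal sketch of the second alternative (the helper name is mine, not dask API): one chunk per contiguous run of the index.

```python
import numpy as np

def contiguous_run_chunks(index):
    """Chunk sizes under option 2: one chunk per contiguous (+1-step) run."""
    index = np.asarray(index)
    # A new run starts wherever the step from the previous element isn't +1.
    breaks = np.flatnonzero(np.diff(index) != 1) + 1
    starts = np.concatenate(([0], breaks, [len(index)]))
    return tuple(np.diff(starts))

print(contiguous_run_chunks(np.arange(20) % 3))
# (3, 3, 3, 3, 3, 3, 2): runs of 0,1,2 plus a trailing 0,1
```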

Now we have to consider how enacting a policy like "put contiguous index regions into the same chunk" might go wrong, and how we might defend against it generally.

```python
x = da.ones(10000, chunks=(100,))  # 100 chunks of size 100
index = np.array([0, 100, 200, 300, ..., 1, 101, 201, 301, ..., 2, 102, 202, 302, ...])
x[index]
```

In the example above we have a hundred input chunks and a hundred contiguous regions in our index. Seems good. However, each output chunk touches each input chunk, so this will likely create 10,000 tasks, which we should probably consider a fail case here.

So we learn that we need to look pretty carefully at how the values within the index interact with the chunk structure in order to know if we can do this well. This isn't an insurmountable problem, but isn't trivial either.

In principle we're looking for a function that takes in two inputs:

  1. The chunks of a single dimension like x.chunks[i] or (4, 4, 4, 4, 4) from our first example
  2. An index like np.arange(20) % 3 from our first example

And outputs a set of smaller indexes to pass on to the various chunks. Ideally it does this efficiently, and fails early if it's going to emit a bunch of very small slices.
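
A minimal sketch of such a function (the name, signature, and early-failure threshold are my own assumptions, not dask API):

```python
import numpy as np

def split_index_by_chunks(chunks, index, max_pieces=None):
    """Split `index` into (input_chunk_id, chunk-local sub-index) pieces,
    one per contiguous run within a single input chunk, failing early if
    the index shatters into too many pieces.
    """
    index = np.asarray(index)
    bounds = np.cumsum((0,) + tuple(chunks))                  # chunk boundaries
    owner = np.searchsorted(bounds, index, side="right") - 1  # owning chunk per element
    # Cut wherever the owning chunk changes or the run stops being contiguous.
    cut = (np.diff(owner) != 0) | (np.diff(index) != 1)
    starts = np.concatenate(([0], np.flatnonzero(cut) + 1, [len(index)]))
    if max_pieces is not None and len(starts) - 1 > max_pieces:
        raise ValueError(f"index shatters into {len(starts) - 1} pieces")
    return [
        (int(owner[s]), index[s:e] - bounds[owner[s]])
        for s, e in zip(starts[:-1], starts[1:])
    ]

# First example: seven pieces, all drawn from input chunk 0.
print(split_index_by_chunks((4, 4, 4, 4, 4), np.arange(20) % 3))
```

Under this strict run definition, the pathological index from the example above shatters into 10,000 single-element pieces, which is exactly the case the max_pieces guard is meant to catch early.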

mrocklin (MEMBER) · 2018-06-18T22:43:25Z · https://github.com/pydata/xarray/issues/2237#issuecomment-398218407 · 1 👍

I think that it would be useful to consider many possible cases of how people might want to chunk dask arrays with out-of-order indices, and the desired chunking outputs. XArray users like those here can provide some of those use cases. We'll have to gather others from other communities. Maybe once we have gathered enough use cases, rules for what the correct behavior should be will emerge.

On Mon, Jun 18, 2018 at 5:16 PM Stephan Hoyer (notifications@github.com) wrote:

> I vaguely recall discussing chunks that result from indexing somewhere in the dask issue tracker (when we added the special case for a monotonic increasing indexer to preserve chunks), but I can't find it now.
>
> I think the challenge is that it isn't obvious what the right chunksizes should be. Chunks that are too small also have negative performance implications. Maybe the automatic chunking logic that @mrocklin has been looking into recently would be relevant here.

