Comment on pydata/xarray#2237 by user 306380 (MEMBER), posted 2018-06-19: https://github.com/pydata/xarray/issues/2237#issuecomment-398573000

OK, so lowering this down to a dask array conversation, let's look at a couple of examples. First, let's look at the behavior of a sorted index:

```python
import dask.array as da

x = da.ones((20, 20), chunks=(4, 5))
x.chunks
# ((4, 4, 4, 4, 4), (5, 5, 5, 5))
```

If we index that array with a sorted index, we are able to efficiently preserve chunking:

```python
import numpy as np

x[np.arange(20), :].chunks
# ((4, 4, 4, 4, 4), (5, 5, 5, 5))

x[np.arange(20) // 2, :].chunks
# ((8, 8, 4), (5, 5, 5, 5))
```

However, if the index isn't sorted, then everything goes into one big chunk:

```python
x[np.arange(20) % 3, :].chunks
# ((20,), (5, 5, 5, 5))
```

We could imagine a few alternatives here:

  1. Make a chunk for every element in the index
  2. Make a chunk for every contiguous run in the index. So here we would have chunks of size 3, matching the 0, 1, 2, 0, 1, 2, 0, 1, 2 pattern of our index (see the sketch after this list).
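
To make the second alternative concrete, here is a minimal sketch of a run-splitting helper, where "contiguous run" is taken to mean a maximal stretch of consecutive values. `contiguous_runs` is a hypothetical name used for illustration, not part of dask's API:

```python
import numpy as np

def contiguous_runs(index):
    # Hypothetical helper: split an integer index into maximal runs of
    # consecutive values. Under alternative 2, each run becomes one chunk.
    breaks = np.where(np.diff(index) != 1)[0] + 1
    return np.split(index, breaks)

[run.tolist() for run in contiguous_runs(np.arange(20) % 3)]
# [[0, 1, 2], [0, 1, 2], [0, 1, 2], [0, 1, 2], [0, 1, 2], [0, 1, 2], [0, 1]]
```

Under this policy the unsorted example above would come out with chunks `(3, 3, 3, 3, 3, 3, 2)` rather than a single chunk of 20.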

I don't really have a strong intuition for how the xarray operations transform into dask array operations (my brain is a bit tired right now, so thinking is hard), but my guess is that they would benefit from the second case. (A pure dask.array example would be welcome.)

Now we have to consider how enacting a policy like "put contiguous index regions into the same chunk" might go wrong, and how we might defend against it generally.

```python
x = da.ones(10000, chunks=(100,))  # 100 chunks of size 100
index = np.array([0, 100, 200, 300, ..., 1, 101, 201, 301, ..., 2, 102, 202, 302, ...])
x[index]
```

In the example above we have a hundred input chunks and a hundred contiguous regions in our index. Seems good. However, each output chunk touches every input chunk, so this will likely create 10,000 tasks, which we should probably consider a failure case here.
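
If we assume the elided pattern continues column-major through all 10,000 values (an assumption; the index above is only sketched), a runnable stand-in makes the blow-up easy to count:

```python
import numpy as np

# Assumed reconstruction of the index sketched above: visit element 0 of each
# chunk, then element 1 of each chunk, and so on.
index = np.arange(10000).reshape(100, 100).T.ravel()
index[:4], index[100:104]
# (array([  0, 100, 200, 300]), array([  1, 101, 201, 301]))

# Each of the 100 contiguous regions pulls one element from each of the
# 100 input chunks, so a chunk-per-region policy needs about 100 * 100 pieces:
regions = np.split(index, 100)
sum(len(np.unique(region // 100)) for region in regions)
# 10000
```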

So we learn that we need to look pretty carefully at how the values within the index interact with the chunk structure in order to know whether we can do this well. This isn't an insurmountable problem, but it isn't trivial either.

In principle we're looking for a function that takes in two inputs:

  1. The chunks of a single dimension, like `x.chunks[i]` or `(4, 4, 4, 4, 4)` from our first example
  2. An index, like `np.arange(20) % 3` from our first example

And outputs a bunch of smaller indexes to pass on to the various chunks. Ideally it does this efficiently, and it fails early if it's going to emit a bunch of very small slices.
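
Here is a minimal sketch of such a function, under stated assumptions: `split_index_by_chunks` and its `max_pieces` guard are hypothetical names, and a "piece" is taken to mean a maximal run of consecutive values that stays within one input chunk:

```python
import numpy as np

def split_index_by_chunks(chunks, index, max_pieces=None):
    # Hypothetical sketch: break `index` into pieces, where each piece is a
    # maximal run of consecutive values living entirely in one input chunk.
    # Returns (input_chunk_id, sub_index) pairs, failing early if the number
    # of pieces explodes.
    bounds = np.cumsum((0,) + tuple(chunks))                  # chunk boundaries
    owner = np.searchsorted(bounds, index, side="right") - 1  # chunk of each element
    new_piece = np.empty(len(index), dtype=bool)
    new_piece[0] = True
    new_piece[1:] = (np.diff(index) != 1) | (owner[1:] != owner[:-1])
    starts = np.flatnonzero(new_piece)
    if max_pieces is not None and len(starts) > max_pieces:
        raise ValueError(f"index would need {len(starts)} pieces; refusing")
    return list(zip(owner[starts], np.split(index, starts[1:])))

split_index_by_chunks((4, 4, 4, 4, 4), np.arange(20) % 3)
# 7 pieces, all addressed to input chunk 0: [(0, array([0, 1, 2])), ...]
```

Run against the failure case above, the same sketch would produce 10,000 singleton pieces, so a `max_pieces` threshold gives exactly the early failure described here.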
