issue_comments: 398573000
This data as json
html_url | issue_url | id | node_id | user | created_at | updated_at | author_association | body | reactions | performed_via_github_app | issue |
---|---|---|---|---|---|---|---|---|---|---|---|
https://github.com/pydata/xarray/issues/2237#issuecomment-398573000 | https://api.github.com/repos/pydata/xarray/issues/2237 | 398573000 | MDEyOklzc3VlQ29tbWVudDM5ODU3MzAwMA== | 306380 | 2018-06-19T23:03:53Z | 2018-06-19T23:03:53Z | MEMBER | OK, so lowering this down to a dask array conversation, let's look at a couple of examples. First, let's look at the behavior of a sorted index:

```python
import dask.array as da

x = da.ones((20, 20), chunks=(4, 5))
x.chunks
# ((4, 4, 4, 4, 4), (5, 5, 5, 5))
```

If we index that array with a sorted index, we are able to efficiently preserve chunking:

```python
import numpy as np

x[np.arange(20), :].chunks
# ((4, 4, 4, 4, 4), (5, 5, 5, 5))

x[np.arange(20) // 2, :].chunks
# ((8, 8, 4), (5, 5, 5, 5))
```

However, if the index isn't sorted, then everything goes into one big chunk:

```python
x[np.arange(20) % 3, :].chunks
# ((20,), (5, 5, 5, 5))
```

We could imagine a few alternatives here:
I don't really have a strong intuition for how the xarray operations transform into dask array operations (my brain is a bit tired right now, so thinking is hard) but my guess is that they would benefit from the second case. (A pure dask.array example would be welcome). Now we have to consider how enacting a policy like "put contiguous index regions into the same chunk" might go wrong, and how we might defend against it generally.
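As a pure dask.array illustration of the point above, here is a small sketch that packages the earlier examples into one runnable script and also compares task-graph sizes. The chunk tuples in the comments come from the examples above; exact task counts vary by dask version, so the script only prints them for comparison.

```python
import dask.array as da
import numpy as np

x = da.ones((20, 20), chunks=(4, 5))

# A monotonic index preserves the chunk structure along that axis.
sorted_result = x[np.arange(20), :]
print(sorted_result.chunks)    # ((4, 4, 4, 4, 4), (5, 5, 5, 5))

# A non-monotonic index collapses that axis into a single chunk.
unsorted_result = x[np.arange(20) % 3, :]
print(unsorted_result.chunks)  # ((20,), (5, 5, 5, 5))

# The task-graph sizes hint at the cost of each strategy; exact counts
# depend on the dask version, so we only print them side by side.
print(len(dict(sorted_result.__dask_graph__())),
      len(dict(unsorted_result.__dask_graph__())))
```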
In the example above we have a hundred input chunks and a hundred contiguous regions in our index, which seems good. However, each output chunk touches every input chunk, so this would likely create 10,000 tasks (100 × 100), which we should probably consider a failure case here. So we learn that we need to look fairly carefully at how the values within the index interact with the chunk structure in order to know whether we can do this well. That isn't an insurmountable problem, but it isn't trivial either. In principle, we're looking for a function that takes two inputs:
And outputs a bunch of smaller indexes to pass on to the various chunks. Ideally it does this efficiently, and it fails early if it's going to emit a bunch of very small slices. |
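A function along those lines could be sketched in pure numpy. This is a hypothetical illustration, not dask's actual implementation; the name `split_index` and the `min_slice` heuristic are invented for the sketch. It splits a fancy index into contiguous runs of consecutive values, maps each run onto the input chunks it touches, and raises early if the result is dominated by very small slices:

```python
import numpy as np

def split_index(chunks, index, min_slice=2):
    """Hypothetical sketch (not dask's API): split a fancy index by chunk.

    chunks: chunk sizes along one axis, e.g. (4, 4, 4, 4, 4)
    index:  1-D integer index into that axis
    Returns a list of (chunk_number, sub_index) pairs, where sub_index is
    relative to the start of that input chunk.  Fails early if the result
    would be dominated by very small slices.
    """
    index = np.asarray(index)
    bounds = np.cumsum((0,) + tuple(chunks))  # chunk boundary offsets

    # Split the index into contiguous runs of consecutive values.
    breaks = np.where(np.diff(index) != 1)[0] + 1
    runs = np.split(index, breaks)

    out = []
    for run in runs:
        # Which input chunk does each element of this run fall into?
        which = np.searchsorted(bounds, run, side="right") - 1
        for chunk_number in np.unique(which):
            sub = run[which == chunk_number] - bounds[chunk_number]
            out.append((int(chunk_number), sub))

    # Fail early if most of the emitted slices are very small.
    if out and sum(len(s) < min_slice for _, s in out) > len(out) // 2:
        raise ValueError("index fragments into many very small slices")
    return out
```

On the sorted index `np.arange(20)` with chunks `(4, 4, 4, 4, 4)` this yields one length-4 sub-index per input chunk, mirroring the chunk-preserving case above; on `np.arange(20) % 3` every run lands in the first chunk, which is exactly the kind of index/chunk interaction a real implementation would need to inspect before deciding on an output chunking.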
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
333312849 |