issue_comments: 307002325
html_url | issue_url | id | node_id | user | created_at | updated_at | author_association | body | reactions | performed_via_github_app | issue |
---|---|---|---|---|---|---|---|---|---|---|---|
https://github.com/pydata/xarray/issues/1440#issuecomment-307002325 | https://api.github.com/repos/pydata/xarray/issues/1440 | 307002325 | MDEyOklzc3VlQ29tbWVudDMwNzAwMjMyNQ== | 12229877 | 2017-06-08T05:28:04Z | 2017-06-08T05:28:04Z | CONTRIBUTOR | I love a real-world example 😄 This sounds pretty similar to how I'm thinking of doing it, with a few caveats.

Taking a step back for a moment: chunks are great for avoiding out-of-memory errors, for faster processing of reorderable operations, and for efficient indexing. The overhead is not great when the data or the chunks are small; it's bad when a single on-disk chunk is split across multiple dask chunks, and very bad when a dask chunk spans several files. (Of course all of these are generalisations with pathological cases, but IMO they're good enough to build some heuristics on.)

With that in mind, here's how I'd decide whether to use the heuristic:
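As a rough sketch of that first decision, assuming a plain size cut-off (the `should_chunk` name and the 50 MiB threshold are illustrative choices for the sketch, not part of the proposal above):

```python
import numpy as np

# Illustrative threshold -- an assumption for this sketch, not part of the proposal.
SMALL_ARRAY_BYTES = 50 * 2**20  # ~50 MiB: below this, chunking overhead outweighs the benefit

def should_chunk(shape, dtype, hint=None):
    """Decide whether automatic dask chunking is likely to pay off for one variable."""
    nbytes = np.prod(shape, dtype=np.int64) * np.dtype(dtype).itemsize
    if hint is not None:
        return True  # an explicit user hint means we should always chunk
    return nbytes >= SMALL_ARRAY_BYTES  # large arrays: chunk to avoid out-of-memory errors
```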
Having decided to use a heuristic, we know the array's shape and dimensions, the on-disk chunk shape (if any), and the hint (if any):
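A minimal sketch of that step, assuming we grow the dask chunk as whole multiples of the on-disk chunk until it approaches a byte target taken from the hint (the function name, the doubling strategy, and the ~100 MiB default are illustrative assumptions, not the actual proposal):

```python
import numpy as np

DEFAULT_TARGET_BYTES = 100 * 2**20  # illustrative default when no hint is given

def suggest_chunk_shape(shape, dtype, disk_chunks=None, target_bytes=None):
    """Pick a dask chunk shape from the array shape, the on-disk chunks, and a size hint.

    Each dask chunk stays a whole multiple of the on-disk chunk shape, so a single
    on-disk chunk never ends up split across several dask chunks.
    """
    if target_bytes is None:
        target_bytes = DEFAULT_TARGET_BYTES
    itemsize = np.dtype(dtype).itemsize
    # With no on-disk chunking, grow from single elements instead.
    chunks = list(disk_chunks) if disk_chunks is not None else [1] * len(shape)

    # Double one dimension at a time, fastest-varying axis first (C order),
    # stopping before the chunk would exceed the byte target or the array itself.
    for axis in reversed(range(len(shape))):
        while (np.prod(chunks, dtype=np.int64) * itemsize * 2 <= target_bytes
               and chunks[axis] * 2 <= shape[axis]):
            chunks[axis] *= 2
    return tuple(chunks)
```

Growing the fastest-varying axis first keeps each chunk contiguous on disk for C-ordered data, which is one way to limit the dimension-order effects mentioned below.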
It's probably a good idea to constrain this further, so that the ratio of the longest to the shortest chunk edge does not exceed the greater of 100:1 or four times the corresponding ratio of the on-disk chunks (I don't have universal profiling to back this up, but it's always worked well for me). This mitigates the potentially very large effect of dimension order, especially for unchunked files or large chunks. For datasets (as opposed to individual arrays), I'd calculate the chunk shape once for the largest dtype and just reuse it. |
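A sketch of that constraint check, using the 100:1 and 4× figures quoted above as defaults (the function and parameter names are mine):

```python
def edge_ratio_ok(chunks, disk_chunks=None, max_ratio=100, disk_factor=4):
    """Check that a proposed dask chunk isn't too lopsided.

    The longest-to-shortest edge ratio of the proposed chunk must not exceed the
    greater of max_ratio:1 or disk_factor times the same ratio measured on the
    on-disk chunks.
    """
    ratio = max(chunks) / min(chunks)
    limit = max_ratio
    if disk_chunks is not None:
        limit = max(limit, disk_factor * max(disk_chunks) / min(disk_chunks))
    return ratio <= limit
```

For a whole dataset you would run the shape-picking step once, using the widest dtype present, and reuse that shape for every variable that shares it, as described above.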
{ "total_count": 1, "+1": 1, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
233350060 |