issue_comment 309353545 on pydata/xarray issue #1440
https://github.com/pydata/xarray/issues/1440#issuecomment-309353545
user 12229877 · CONTRIBUTOR · created 2017-06-19T06:50:57Z · updated 2017-07-14T02:35:04Z

I've just had a meeting at NCI which has helped clarify what I'm trying to do and how to tell if it's working. This comment is mostly for my own notes, and public for anyone interested. I'll refer to dask chunks as 'blocks' (as in 'blocked algorithms') and netCDF chunks in a file as 'chunks', to avoid confusion.
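
For concreteness, here's one way to see both quantities side by side (a minimal sketch only; the file name and variable name are placeholders, and the `chunksizes` encoding key is only populated for chunked netCDF-4 files):

```python
import xarray as xr

# Dask "blocks": chosen at open time via the `chunks` argument.
ds = xr.open_dataset("example.nc", chunks={"time": 100})
print(ds["temperature"].data.chunks)  # dask block sizes per dimension

# netCDF "chunks": fixed on disk when the file was written.
print(ds["temperature"].encoding.get("chunksizes"))  # None/absent for contiguous files
```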

The approximate algorithm I'm thinking about is outlined in this comment above. Considerations, in rough order of performance impact, are listed below, with a rough code sketch after the list:

  • No block should include data from multiple files (a near-absolute constraint due to file locking - though concurrent reads may be supported at lower levels?)
  • Contiguous access is critical in un-chunked files - reading the fastest-changing dimension in small parts murders IO performance. It may be important in chunked files too, at the level of chunks, but that's mostly down to Dask and user access patterns.
  • Each chunk, if the file is chunked, must fall entirely into a single block.
  • Blocks should usually be around 100MB, to balance scheduler overhead with memory usage of intermediate results.
  • Blocks should be of even(ish) size - if a dimension is of size 100 with chunks of 30, it's better to have blocks of 60 at the expense of relative shape than blocks of 90 with one block almost empty.
  • Chunks are cached by underlying libraries, so benchmarks need to clear the caches each run for valid results. Note that this affects IO but typically not decompression of chunks.
  • Contra contiguous access (above), it might be good to prevent very skewed block shapes (i.e. 1*X*Y or T*1*1), perhaps by limiting the smallest edge length of a block (10px?) or the edge ratio of a block (20:1?)
  • If a dimension hint is given (eg 'my analysis will be spatial'), making blocks larger along other dimensions will make whole-file processing more efficient, and subsetting less efficient. (Unless Dask can optimise away loading of part of a block?)
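
To make that concrete, here is the rough kind of heuristic I have in mind, for a single variable in a single file (so the multi-file and dimension-hint considerations aren't shown). Everything here - the name `suggest_blocks`, the 100 MB default, the grow-then-even-out strategy - is a placeholder sketch for discussion, not an implementation:

```python
import math

def suggest_blocks(shape, chunks=None, itemsize=8, target_bytes=100_000_000):
    """Suggest dask block sizes for one variable of one netCDF file.

    shape  -- tuple of dimension lengths, slowest-varying first
    chunks -- tuple of on-disk netCDF chunk sizes, or None if contiguous
    """
    if chunks is None:
        # Contiguous file: always read the fastest-changing dimension whole,
        # and treat it as one "chunk" so the logic below never splits it.
        chunks = (1,) * (len(shape) - 1) + (shape[-1],)

    def nbytes(blocks):
        return itemsize * math.prod(min(b, s) for b, s in zip(blocks, shape))

    # Start with one on-disk chunk per block, then grow a whole chunk at a
    # time (fastest-changing dimension first, to keep reads contiguous)
    # until the block reaches roughly the target size.
    blocks = list(chunks)
    for axis in reversed(range(len(shape))):
        while blocks[axis] < shape[axis] and nbytes(blocks) < target_bytes:
            blocks[axis] += chunks[axis]

    # Even out the split along each dimension while staying chunk-aligned:
    # e.g. for a dimension of 100 with chunks of 30, prefer 60 + 40 over 90 + 10.
    for axis, size in enumerate(shape):
        n_splits = math.ceil(size / blocks[axis])
        per_block_chunks = math.ceil(size / (chunks[axis] * n_splits))
        blocks[axis] = min(size, per_block_chunks * chunks[axis])

    return tuple(blocks)
```

The result is always a whole number of on-disk chunks per block (or the full fastest-changing dimension for contiguous files), slightly overshooting the byte target rather than splitting a chunk. On the benchmarking point above: as far as I understand, the HDF5 chunk cache is per open file, so reopening the file in a fresh process between runs clears it, and the OS page cache can be dropped separately (on Linux, `sync; echo 3 | sudo tee /proc/sys/vm/drop_caches`).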

Bottom line, I could come up with something pretty quickly, but would prefer to take a little longer to write and explore some benchmarks.
