issue_comment 309353545 on pydata/xarray issue #1440
https://github.com/pydata/xarray/issues/1440#issuecomment-309353545
user 12229877 · CONTRIBUTOR · created 2017-06-19T06:50:57Z · updated 2017-07-14T02:35:04Z

I've just had a meeting at NCI which has helped clarify what I'm trying to do and how to tell if it's working. This comment is mostly for my own notes, and public for anyone interested. I'll refer to dask chunks as 'blocks' (as in 'blocked algorithms') and netCDF chunks in a file as 'chunks', to avoid confusion.
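
For concreteness, here's one way to see both quantities side by side (a minimal sketch only; the file name and variable name are placeholders, and the `chunksizes` encoding key is only populated for chunked netCDF-4 files):

```python
import xarray as xr

# Dask "blocks": chosen at open time via the `chunks` argument.
ds = xr.open_dataset("example.nc", chunks={"time": 100})
print(ds["temperature"].data.chunks)  # dask block sizes per dimension

# netCDF "chunks": fixed on disk when the file was written.
print(ds["temperature"].encoding.get("chunksizes"))  # None/absent for contiguous files
```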

The approximate algorithm I'm thinking about is outlined in this comment above. Considerations, in rough order of performance impact, are listed below, with a rough code sketch after the list:

  • No block should include data from multiple files (a near-absolute constraint due to file locking - though concurrent reads may be supported at lower levels?)
  • Contiguous access is critical in un-chunked files - reading the fastest-changing dimension in small parts murders IO performance. It may be important in chunked files too, at the level of chunks, but that's mostly down to Dask and user access patterns.
  • Each chunk, if the file is chunked, must fall entirely into a single block.
  • Blocks should usually be around 100MB, to balance scheduler overhead with memory usage of intermediate results.
  • Blocks should be of even(ish) size - if a dimension is of size 100 with chunks of 30, it's better to have blocks of 60 at the expense of relative shape than blocks of 90 with one block almost empty.
  • Chunks are cached by underlying libraries, so benchmarks need to clear the caches each run for valid results. Note that this affects IO but typically not decompression of chunks.
  • Contra contiguous access (above), it might be good to prevent very skewed block shapes (i.e. 1*X*Y or T*1*1), perhaps by limiting the smallest edge length of a block (10px?) or the edge ratio of a block (20:1?)
  • If a dimension hint is given (eg 'my analysis will be spatial'), making blocks larger along other dimensions will make whole-file processing more efficient, and subsetting less efficient. (Unless Dask can optimise away loading of part of a block?)
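
To make that concrete, here is the rough kind of heuristic I have in mind, for a single variable in a single file (so the multi-file and dimension-hint considerations aren't shown). Everything here - the name `suggest_blocks`, the 100 MB default, the grow-then-even-out strategy - is a placeholder sketch for discussion, not an implementation:

```python
import math

def suggest_blocks(shape, chunks=None, itemsize=8, target_bytes=100_000_000):
    """Suggest dask block sizes for one variable of one netCDF file.

    shape  -- tuple of dimension lengths, slowest-varying first
    chunks -- tuple of on-disk netCDF chunk sizes, or None if contiguous
    """
    if chunks is None:
        # Contiguous file: always read the fastest-changing dimension whole,
        # and treat it as one "chunk" so the logic below never splits it.
        chunks = (1,) * (len(shape) - 1) + (shape[-1],)

    def nbytes(blocks):
        return itemsize * math.prod(min(b, s) for b, s in zip(blocks, shape))

    # Start with one on-disk chunk per block, then grow a whole chunk at a
    # time (fastest-changing dimension first, to keep reads contiguous)
    # until the block reaches roughly the target size.
    blocks = list(chunks)
    for axis in reversed(range(len(shape))):
        while blocks[axis] < shape[axis] and nbytes(blocks) < target_bytes:
            blocks[axis] += chunks[axis]

    # Even out the split along each dimension while staying chunk-aligned:
    # e.g. for a dimension of 100 with chunks of 30, prefer 60 + 40 over 90 + 10.
    for axis, size in enumerate(shape):
        n_splits = math.ceil(size / blocks[axis])
        per_block_chunks = math.ceil(size / (chunks[axis] * n_splits))
        blocks[axis] = min(size, per_block_chunks * chunks[axis])

    return tuple(blocks)
```

The result is always a whole number of on-disk chunks per block (or the full fastest-changing dimension for contiguous files), slightly overshooting the byte target rather than splitting a chunk. On the benchmarking point above: as far as I understand, the HDF5 chunk cache is per open file, so reopening the file in a fresh process between runs clears it, and the OS page cache can be dropped separately (on Linux, `sync; echo 3 | sudo tee /proc/sys/vm/drop_caches`).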

Bottom line, I could come up with something pretty quickly, but would prefer to take a little longer to write and explore some benchmarks.
