issue_comments: 307002325

html_url: https://github.com/pydata/xarray/issues/1440#issuecomment-307002325
issue_url: https://api.github.com/repos/pydata/xarray/issues/1440
id: 307002325
node_id: MDEyOklzc3VlQ29tbWVudDMwNzAwMjMyNQ==
user: 12229877
created_at: 2017-06-08T05:28:04Z
updated_at: 2017-06-08T05:28:04Z
author_association: CONTRIBUTOR

body:

I love a real-world example 😄 This sounds pretty similar to how I'm thinking of doing it, with a few caveats - mostly that cate assumes the data is 3D, has lat and lon, has a single time step, and has spatial dimensions wholly divisible by some small N. Obviously this is fine for CCI data, but it's not generally true of things Xarray might open.

Taking a step back for a moment: chunks are great for avoiding out-of-memory errors, for faster processing of reorderable operations, and for efficient indexing. The trade-off is not great when the data or the chunks are small, it's bad when a single on-disk chunk is split across multiple dask chunks, and it's very bad when a single dask chunk spans several files. (Of course these are all generalisations with pathological cases, but IMO they're good enough to build some heuristics on.)

With that in mind, here's how I'd decide whether to use the heuristic:

  • If chunks is a dict, never use the heuristic (always use the explicit user chunks).
  • If chunks is a hint - e.g. a set or list as discussed above, or a later proposal - always use heuristic mode, guided by the hint of course. Files which would otherwise default to non-heuristic or non-chunking mode (e.g. in mfdataset) could pass e.g. the empty set to activate the heuristics without any hints.
  • If chunks is None and the uncompressed data is above a size threshold (e.g. 500 MB or 1 GB), use the chunks given by the heuristic. (A rough sketch of this dispatch follows the list.)
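
Here's roughly what I mean, as a sketch only - the helper name `_heuristic_chunks`, the threshold constant, and where this would hook into the backends are all illustrative assumptions, not existing xarray API:

```python
# Sketch of the chunks-argument dispatch; names and threshold are illustrative.
HEURISTIC_THRESHOLD_BYTES = 1_000_000_000  # e.g. ~1 GB of uncompressed data


def _heuristic_chunks(variable, hint):
    """Placeholder for the heuristic sketched after the next list."""
    raise NotImplementedError


def resolve_chunks(chunks, variable):
    """Decide how to chunk a single on-disk variable."""
    if isinstance(chunks, dict):
        # Explicit per-dimension chunks: always respect the user's request.
        return chunks
    if isinstance(chunks, (set, list)):
        # A hint (possibly the empty set): always run the heuristic,
        # guided by whichever dimensions the hint names.
        return _heuristic_chunks(variable, hint=chunks)
    if chunks is None:
        # Nothing requested: only chunk if the data is big enough to matter.
        if variable.nbytes > HEURISTIC_THRESHOLD_BYTES:
            return _heuristic_chunks(variable, hint=None)
        return None  # small data: no dask chunking at all
    raise TypeError(f"unexpected chunks argument: {chunks!r}")
```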

Having decided to use a heuristic, we know the array shape and dimensions, the on-disk chunk shape if any, and the hint if any (the whole procedure is sketched in code after this list):

  • Start by selecting a maximum nbytes per chunk, e.g. 250 MB.
  • If the total array nbytes is <= max_nbytes, use a single chunk for the whole thing; return
  • If the array is stored as (a, b, ...) chunks on disk, our dask chunks must be (m*a, n*b, ...), i.e. each dimension gets its own independent integer multiplier.
  • Loop over the dimensions not to chunk (per above, either those not in the set or those named in a string), adding one to the respective multiplier. Alternatively, if the file is now five or fewer dask chunks across, simply divide the on-disk chunks into 4, 3, 2, or 1 dask chunks (avoiding a dimension of 1.1 chunks, etc.).
  • For each step, if this would make the chunk larger than max_nbytes, return.
  • Repeat the loop for remaining dimensions.
  • If the array is not chunked on disk, increase a divisor for each dimension to chunk until the chunk size is <= max_nbytes and return.
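
Something like the following, with the caveats that this is only a sketch, it works with dimension indices rather than names, it skips the "five or fewer chunks across" shortcut, and it ignores the aspect-ratio constraint discussed below; MAX_CHUNK_NBYTES and the function name are assumptions:

```python
import itertools
import math

MAX_CHUNK_NBYTES = 250_000_000  # e.g. 250 MB target per dask chunk


def heuristic_chunks(shape, itemsize, disk_chunks=None, keep_whole=()):
    """Pick a dask chunk shape for one array.

    shape       -- full array shape
    itemsize    -- bytes per element
    disk_chunks -- on-disk chunk shape, or None for unchunked storage
    keep_whole  -- indices of dimensions the hint says not to split
    """
    ndim = len(shape)

    def nbytes(chunk):
        return math.prod(chunk) * itemsize

    # Small enough already: one chunk for the whole array.
    if nbytes(shape) <= MAX_CHUNK_NBYTES:
        return tuple(shape)

    if disk_chunks is not None:
        # Dask chunk edges must be integer multiples of the on-disk chunk edges.
        multipliers = [1] * ndim

        def current(mult):
            return tuple(min(m * c, s) for m, c, s in zip(mult, disk_chunks, shape))

        # Grow the dimensions we'd rather not split first, then the rest.
        order = list(keep_whole) + [d for d in range(ndim) if d not in keep_whole]
        for dim in order:
            while multipliers[dim] * disk_chunks[dim] < shape[dim]:
                trial = list(multipliers)
                trial[dim] += 1
                if nbytes(current(trial)) > MAX_CHUNK_NBYTES:
                    return current(multipliers)  # the next step would be too big
                multipliers = trial
        return current(multipliers)

    # Unchunked on disk: grow a divisor per splittable dimension until small enough.
    divisors = [1] * ndim
    split_dims = [d for d in range(ndim) if d not in keep_whole] or list(range(ndim))
    for dim in itertools.cycle(split_dims):
        chunk = tuple(max(1, s // d) for s, d in zip(shape, divisors))
        if nbytes(chunk) <= MAX_CHUNK_NBYTES:
            return chunk
        divisors[dim] += 1
```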

It's probably a good idea to constrain this further, so that the ratio of chunk edge lengths between dimensions does not exceed the greater of 100:1 or four times the corresponding ratio of the chunks on disk (I don't have universal profiling to back this up, but it has always worked well for me). This will mitigate the potentially very large effects of dimension order, especially in unchunked files or files with large on-disk chunks.
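
A check along these lines (again just a sketch; only the 100:1 and 4x figures come from the paragraph above) could veto a growth step inside the loops:

```python
def edge_ratio_ok(chunk, disk_chunks=None, max_ratio=100, disk_factor=4):
    """True if no pair of chunk edge lengths is too lopsided.

    Allows up to 100:1, or four times the on-disk chunk edge ratio
    when the file is chunked and that ratio is larger.
    """
    limit = max_ratio
    if disk_chunks is not None:
        limit = max(limit, disk_factor * max(disk_chunks) / min(disk_chunks))
    return max(chunk) / min(chunk) <= limit
```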

For datasets (as opposed to arrays), I'd calculate chunks once for the largest dtype and just reuse that shape.
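
Sketching that last point, assuming the data variables share their dimensions and reusing the hypothetical heuristic_chunks from above:

```python
def dataset_chunks(data_vars, **kwargs):
    """data_vars: mapping of name -> object with .shape and .dtype.

    Compute chunks once, for the variable with the widest dtype, and
    reuse that chunk shape for every variable in the dataset.
    """
    widest = max(data_vars.values(), key=lambda v: v.dtype.itemsize)
    return heuristic_chunks(widest.shape, widest.dtype.itemsize, **kwargs)
```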

reactions:
{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
issue: 233350060