issue_comments: 720785384

html_url	issue_url	id	node_id	user	created_at	updated_at	author_association	body	reactions	performed_via_github_app	issue
https://github.com/pydata/xarray/issues/4496#issuecomment-720785384	https://api.github.com/repos/pydata/xarray/issues/4496	720785384	MDEyOklzc3VlQ29tbWVudDcyMDc4NTM4NA==	35919497	2020-11-02T23:32:48Z	2020-11-03T09:28:48Z	COLLABORATOR	I think we can keep talking here about the xarray chunking interface. Chunking seems to be a tricky interface problem in xarray, because several interfaces are already implemented:
- dask: `da.rechunk`, `da.from_array`
- xarray: `xr.open_dataset`
- xarray: `ds.chunk`
- xarray-zarr: `xr.open_dataset(engine="zarr")` (≈ `xr.open_zarr`)

They are similar, but there are some inconsistencies.

dask
The allowed values for chunking in dask are:
- dictionary (or tuple)
- integers > 0
- `-1`: no chunking (along this dimension)
- `auto`: allow the chunking (in this dimension) to accommodate ideal chunk sizes (default 128MiB)

The allowed values in the dictionary are: integers > 0, `-1`, `auto`, `None` (no change to the chunking along this dimension).
Note: `None` isn't supported outside the dictionary.
Note: If chunking along some dimension is not specified, then the chunking along this dimension will not change (e.g. `{}` is equivalent to `{0: None}`).

xarray: `xr.open_dataset` for all engines other than "zarr"
It works like dask, but `None` is also supported. If `chunks` is `None`, then it doesn't use dask at all.

xarray: `ds.chunk`
It works like dask, but `None` is also supported. `None` is equivalent to a dictionary with all values `None` (and equivalent to the empty dictionary).

xarray: `xr.open_dataset(engine="zarr")`
It works like dask except for:
- `None` is supported. If `chunks` is `None`, then it doesn't use dask at all.
- If chunking along some dimension is not specified, then the encoded chunks are used.
- `auto` is equivalent to the empty dictionary: the encoded chunks are used.
- `auto` inside the dictionary is passed on to dask and behaves as in dask.

Points to be discussed:

1) `auto` and `{}`
The main problem is how to unify dask and xarray-zarr.
Option 1
Maybe the encoded chunking provided by the backend can be seen simply as the current on-disk chunking of the data. According to the dask interface, if the chunks for some dimension are `None` or not defined in a dictionary, then the current chunking along that dimension doesn't change. From this perspective, we would have:
- with `auto` it uses dask auto-chunking.
- with `-1` it uses dask but no chunking.
- with `{}` it uses the backend encoded chunks (when available) for on-disk data (`xr.open_dataset`) and the current chunking for already opened datasets (`ds.chunk`).

Note: `ds.chunk` behavior would be unchanged.
Note: `xr.open_dataset` would be unchanged, except for `engine="zarr"`, since currently `var.encoding["chunks"]` is defined only by zarr.

Option 2
We could use a new, distinct value for the encoded chunks (e.g. `encoded`, name TBC). Something like:
`open_dataset(chunks="encoded")`
`open_dataset(chunks={"x": "encoded", "y": 10,...})`
Both expressions could be supported.
cons:
- `chunks="encoded"`: with zarr, the user would probably always need to request the encoded chunks explicitly.
- `chunks={"x": "encoded", ...}`: the user must explicitly specify in the dictionary which dimensions should be chunked with the encoded chunks, which is very inconvenient (but is it really used? @weiji14 do you have some idea about it?).

2) `None`
`chunks=None` should produce the same result in `xr.open_dataset` and `ds.chunk`.

@shoyer, @alexamici, @jhamman, @dcherian, @weiji14 suggestions are welcome	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }		717410970
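To make the Option 1 proposal in the comment above easier to reason about, here is a minimal pure-Python sketch of how a `chunks` argument could be resolved per dimension. It is not xarray or dask code: `resolve_chunks`, `sizes` and `encoded_chunks` are hypothetical names introduced only for illustration. One point the comment leaves open is what happens when the backend provides no encoded chunks; the sketch assumes a single chunk spanning the whole dimension in that case.

```python
def resolve_chunks(chunks, sizes, encoded_chunks=None):
    """Hypothetical helper: resolve a `chunks` argument under Option 1.

    sizes          -- dict mapping dimension name -> dimension length
    encoded_chunks -- dict mapping dimension name -> on-disk chunk length,
                      or None when the backend provides no encoded chunks
    Returns None (do not use dask) or a dict mapping every dimension to
    either a positive integer or the string "auto".
    """
    if chunks is None:
        # chunks=None: do not use dask at all.
        return None

    encoded_chunks = encoded_chunks or {}

    if not isinstance(chunks, dict):
        # A bare "auto" or integer applies to every dimension.
        chunks = {dim: chunks for dim in sizes}

    resolved = {}
    for dim, size in sizes.items():
        value = chunks.get(dim)
        if value is None:
            # Missing or None: keep the "current" chunking, i.e. the
            # encoded chunks when available. ASSUMPTION: otherwise fall
            # back to one chunk covering the whole dimension.
            resolved[dim] = encoded_chunks.get(dim, size)
        elif value == -1:
            # -1: no chunking along this dimension.
            resolved[dim] = size
        elif value == "auto":
            # "auto": leave the decision to dask auto-chunking.
            resolved[dim] = "auto"
        elif isinstance(value, int) and value > 0:
            resolved[dim] = value
        else:
            raise ValueError(f"unsupported chunk value for {dim!r}: {value!r}")
    return resolved


sizes = {"x": 100, "y": 50}
encoded = {"x": 10, "y": 25}

print(resolve_chunks({}, sizes, encoded))         # encoded chunks for every dimension
print(resolve_chunks({"x": 20}, sizes, encoded))  # explicit x, encoded y
print(resolve_chunks("auto", sizes, encoded))     # delegated to dask
```

Under this reading, `{}` and `"auto"` are no longer equivalent for zarr: `{}` keeps the encoded chunks, while `"auto"` always means dask auto-chunking, which is the unification Option 1 aims for.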