
issue_comments: 720785384


html_url: https://github.com/pydata/xarray/issues/4496#issuecomment-720785384
issue_url: https://api.github.com/repos/pydata/xarray/issues/4496
id: 720785384
node_id: MDEyOklzc3VlQ29tbWVudDcyMDc4NTM4NA==
user: 35919497
created_at: 2020-11-02T23:32:48Z
updated_at: 2020-11-03T09:28:48Z
author_association: COLLABORATOR
body:

I think we can keep discussing the xarray chunking interface here.

The chunking interface seems to be a tricky problem in xarray. Several different interfaces are already implemented:
- dask: da.rechunk, da.from_array
- xarray: xr.open_dataset
- xarray: ds.chunk
- xarray-zarr: xr.open_dataset(engine="zarr") (≈ xr.open_zarr)

They are similar, but there are some inconsistencies.

dask

The allowed values for chunking in dask are:
- dictionary (or tuple)
- integers > 0
- -1: no chunking (along this dimension)
- "auto": allow the chunking (in this dimension) to accommodate ideal chunk sizes (default 128MiB)

The allowed values in the dictionary are: -1, "auto", None (no change to the chunking along this dimension).

Note: None isn't supported outside the dictionary.

Note: If the chunking along some dimension is not specified, the chunking along that dimension does not change (e.g. {} is equivalent to {0: None}).

xarray: xr.open_dataset for all engines other than "zarr"

It works as in dask, but None is also supported. If chunks is None, dask is not used at all.

xarray: ds.chunk

It works as in dask, but None is also supported. None is equivalent to a dictionary with all values None (and equivalent to the empty dictionary).

xarray: xr.open_dataset(engine="zarr")

It works as in dask, except that:
- None is supported. If chunks is None, dask is not used at all.
- If the chunking along some dimension is not specified, the encoded chunks are used.
- "auto" is equivalent to the empty dictionary: the encoded chunks are used.
- "auto" inside the dictionary is passed on to dask and behaves as in dask.

Points to be discussed:

1) "auto" and {}

The main problem is how to make dask and xarray-zarr behave consistently.

Option 1

Maybe the encoded chunking provided by the backend can be seen simply as the current on-disk chunking of the data. According to the dask interface, if the chunks for some dimension are None or not defined in a dictionary, the current chunking along that dimension doesn't change. From this perspective, we would have:
- with "auto", it uses dask auto-chunking.
- with -1, it uses dask but with no chunking.
- with {}, it uses the backend encoded chunks (when available) for on-disk data (xr.open_dataset) and the current chunking for already opened datasets (ds.chunk).

Note: the behavior of ds.chunk would be unchanged.

Note: xr.open_dataset would be unchanged, except for engine="zarr", since currently var.encoding["chunks"] is defined only by zarr.
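As a rough illustration of Option 1, here is a minimal pure-Python sketch of how a chunks argument could be resolved per dimension. The function name resolve_chunks_option1 and its parameters are hypothetical and not part of xarray; the fallback to -1 when no stored chunking is available is also an assumption.

```python
def resolve_chunks_option1(chunks, dims, stored_chunks):
    """Hypothetical helper, not an xarray API.

    stored_chunks maps a dimension name to the backend encoded chunk size
    (when opening from disk) or to the current chunking (for an already
    opened dataset). Dimensions missing from it have no stored chunking.
    """
    if chunks is None:
        # open_dataset semantics from the discussion: do not use dask at all.
        return None
    if not isinstance(chunks, dict):
        # A single value such as "auto", -1 or an integer applies to every dimension.
        chunks = {dim: chunks for dim in dims}
    resolved = {}
    for dim in dims:
        value = chunks.get(dim)
        if value is None:
            # None or unspecified: keep the stored chunking. Falling back to
            # -1 (a single chunk) when nothing is stored is an assumption.
            value = stored_chunks.get(dim, -1)
        resolved[dim] = value
    return resolved


encoded = {"x": 100, "y": 50}
print(resolve_chunks_option1({}, ("x", "y"), encoded))
print(resolve_chunks_option1("auto", ("x", "y"), encoded))
print(resolve_chunks_option1({"x": 10}, ("x", "y"), encoded))
```

With this reading, {} and "auto" are no longer equivalent for zarr: {} keeps the encoded chunks, while "auto" is handed to dask.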

Option 2

We could use a new, dedicated value for the encoded chunks (e.g. "encoded", name TBC). Something like:
- open_dataset(chunks="encoded")
- open_dataset(chunks={"x": "encoded", "y": 10, ...})

Both expressions could be supported.

cons:
- chunks="encoded": with zarr, the user would probably always need to specify explicitly that the encoded chunks should be used.
- dictionary form (e.g. chunks={"x": "encoded", ...}): the user must specify explicitly in the dictionary which dimensions should be chunked with the encoded chunks, which is very inconvenient (but is it really used? @weiji14, do you have any idea about it?).
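For comparison, Option 2 could be sketched along the same lines. The sentinel name "encoded" is the placeholder from the proposal (still TBC), the function resolve_chunks_option2 is hypothetical, and returning None when no encoded chunking exists for a dimension is an assumption.

```python
ENCODED = "encoded"  # placeholder name from the proposal, still to be confirmed


def resolve_chunks_option2(chunks, dims, encoded_chunks):
    """Hypothetical helper, not an xarray API.

    Only the explicit "encoded" value selects the backend encoded chunks;
    None or unspecified dimensions keep the dask meaning (no change).
    """
    if chunks is None:
        return None  # do not use dask at all
    if not isinstance(chunks, dict):
        # A single value, including "encoded", applies to every dimension.
        chunks = {dim: chunks for dim in dims}
    resolved = {}
    for dim in dims:
        value = chunks.get(dim)
        if value == ENCODED:
            # Assumption: if the backend has no encoded chunks for this
            # dimension, fall back to None (leave the chunking unchanged).
            value = encoded_chunks.get(dim)
        resolved[dim] = value
    return resolved


encoded = {"x": 100, "y": 50}
print(resolve_chunks_option2("encoded", ("x", "y"), encoded))
print(resolve_chunks_option2({"x": "encoded", "y": 10}, ("x", "y"), encoded))
print(resolve_chunks_option2({}, ("x", "y"), encoded))
```

The last call shows the inconvenience listed under cons: without an explicit "encoded", the encoded chunks are not picked up.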

2) None

chunks=None should produce the same result in xr.open_dataset and ds.chunk.

@shoyer, @alexamici, @jhamman, @dcherian, @weiji14 suggestions are welcome

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
issue: 717410970