issues: 1965161886
id | node_id | number | title | user | state | locked | assignee | milestone | comments | created_at | updated_at | closed_at | author_association | active_lock_reason | draft | pull_request | body | reactions | performed_via_github_app | state_reason | repo | type |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1965161886 | I_kwDOAMm_X851If2e | 8382 | Zarr Chunks: Too many chunks created if there is one small initial chunk. | 105014161 | closed | 0 |  |  | 12 | 2023-10-27T09:49:07Z | 2023-11-07T17:40:19Z | 2023-11-06T11:53:24Z | NONE |  |  |  | (see below) | { "url": "https://api.github.com/repos/pydata/xarray/issues/8382/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |  | completed | 13221727 | issue |

body:

What is your issue?

If the first Zarr chunk is small (a few items), every subsequent chunk created will be tiny, and this will cause massive issues reading back the dataset. Consider the following code (MCVE):

```python
import numpy as np
import xarray as xr

# Create and write a dataset with ONE tiny chunk per variable
ds = xr.Dataset()
ds.coords["x"] = "x", np.zeros((1,), dtype=np.uint64)
ds["data"] = "x", np.zeros((1,), dtype=np.bool_)
ds.to_zarr("/tmp/temp.zarr")

# Append to that dataset a larger amount of data
ds2 = xr.Dataset()
ds2.coords["x"] = "x", np.arange(1, 1000, dtype=np.uint64)
ds2["data"] = "x", np.zeros(999, dtype=np.bool_)
ds2.to_zarr("/tmp/temp.zarr", append_dim="x")

# These chunks should be MUCH larger, but they're one item each for me.
ds_read = xr.open_zarr("/tmp/temp.zarr")
for var in ds_read.variables:
    print(f"{var=}, {ds_read[var].encoding['chunks']=}")
```

Is there a way to change this behaviour by default, ideally within xarray (or zarr-python if it is responsible)? Perhaps, if only one chunk is present, the heuristic should consider appending to it instead of creating new chunks?
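A minimal workaround sketch, assuming the eventual array size is roughly known in advance: passing an explicit `chunks` value through the `encoding` argument of the initial `to_zarr` call creates the Zarr arrays with a larger on-disk chunk size, so later appends fill those chunks instead of inheriting the one-element layout. The `(1000,)` chunk size here is an arbitrary illustrative choice, not a recommendation from the issue itself.

```python
import numpy as np
import xarray as xr

# Initial write: request a larger on-disk chunk size via encoding,
# even though each variable currently holds a single element.
# The (1000,) chunk size is an arbitrary illustrative choice.
ds = xr.Dataset()
ds.coords["x"] = "x", np.zeros((1,), dtype=np.uint64)
ds["data"] = "x", np.zeros((1,), dtype=np.bool_)
ds.to_zarr(
    "/tmp/temp.zarr",
    mode="w",
    encoding={"x": {"chunks": (1000,)}, "data": {"chunks": (1000,)}},
)

# The append reuses the chunk grid already defined in the store,
# so the new data lands in 1000-element chunks rather than 1-element ones.
ds2 = xr.Dataset()
ds2.coords["x"] = "x", np.arange(1, 1000, dtype=np.uint64)
ds2["data"] = "x", np.zeros(999, dtype=np.bool_)
ds2.to_zarr("/tmp/temp.zarr", append_dim="x")

# Inspect the chunk layout seen on read-back.
ds_read = xr.open_zarr("/tmp/temp.zarr")
for var in ds_read.variables:
    print(f"{var=}, {ds_read[var].encoding['chunks']=}")
```

With this layout, the read-back `encoding['chunks']` should report `(1000,)` for both variables rather than `(1,)`.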