issues: 884209406


id: 884209406 · number: 5286
title: Zarr chunks would overlap multiple dask chunks error
user: 6130352 · state: closed · comments: 3
created_at: 2021-05-10T13:20:46Z · closed_at: 2021-05-12T16:16:05Z
author_association: NONE

Would it be possible to get an explanation of how this situation results in a zarr chunk overlapping multiple dask chunks?

The code below generates an array with two chunks, selects one row from each chunk, and then writes the resulting two-row array back to zarr. I don't see how it's possible in this case for one zarr chunk to correspond to different dask chunks. There are clearly two resulting dask chunks, two input zarr chunks, and a correspondence between them that should be 1-to-1 ... so what does this error message really mean?
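For what it's worth, here is my reading of the message, as a plain-Python sketch (a simplified stand-in, not xarray's actual implementation): the check fails when an interior dask chunk boundary falls in the middle of a zarr chunk, because two dask tasks would then both have to rewrite that zarr chunk. After the selection, the dask chunks along `a` are `(1, 1)` while `encoding['chunks']` still says 5, so both one-row dask chunks land inside the same 5-row zarr chunk:

```python
def chunks_aligned(dask_chunks, zarr_chunk):
    """True if every interior dask chunk boundary lands on a multiple of
    the zarr chunk size (hypothetical simplification of xarray's check)."""
    pos = 0
    for size in dask_chunks[:-1]:   # the final chunk is allowed to be ragged
        pos += size
        if pos % zarr_chunk != 0:
            return False
    return True

print(chunks_aligned((5, 5), 5))  # True: original layout, one dask chunk per zarr chunk
print(chunks_aligned((1, 1), 5))  # False: both one-row chunks share one 5-row zarr chunk
```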

```python
import xarray as xr
import dask.array as da

ds = xr.Dataset(dict(
    x=(('a', 'b'), da.ones(shape=(10, 10), chunks=(5, 10))),
)).assign(a=list(range(10)))
ds
```

```
<xarray.Dataset>
Dimensions:  (a: 10, b: 10)
Coordinates:
  * a        (a) int64 0 1 2 3 4 5 6 7 8 9
Dimensions without coordinates: b
Data variables:
    x        (a, b) float64 dask.array<chunksize=(5, 10), meta=np.ndarray>
```

Write the dataset out:

```python
!rm -rf /tmp/test.zarr
ds.to_zarr('/tmp/test.zarr')
```

Read it back in, subset to one record in each of the two chunks (two rows total), write back out:

```python
!rm -rf /tmp/test2.zarr
xr.open_zarr('/tmp/test.zarr').sel(a=[0, 11]).to_zarr('/tmp/test2.zarr')
```

```
NotImplementedError: Specified zarr chunks encoding['chunks']=(5, 10) for variable
named 'x' would overlap multiple dask chunks ((1, 1), (10,)). Writing this array in
parallel with dask could lead to corrupted data. Consider either rechunking using
chunk(), deleting or modifying encoding['chunks'], or specify safe_chunks=False.
```

Also, what is the difference between "deleting or modifying encoding['chunks']" and "specify safe_chunks=False"? That wasn't clear to me from https://github.com/pydata/xarray/issues/5056.
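My current understanding of the two escape hatches, sketched as plain Python (`resolve_zarr_chunks` is a hypothetical helper of mine, not a real xarray function): deleting `encoding['chunks']` makes the new store inherit the dask chunking, so each dask task owns whole zarr chunks; `safe_chunks=False` keeps the encoded chunks and simply skips the overlap check.

```python
def resolve_zarr_chunks(encoding_chunks, dask_chunks, safe_chunks=True):
    """Hypothetical sketch of the choice xarray faces, along one dimension."""
    if encoding_chunks is None:
        # encoding['chunks'] deleted: chunk the store like the dask array,
        # so no zarr chunk is shared between tasks
        return dask_chunks
    interior_aligned = all(
        sum(dask_chunks[:i + 1]) % encoding_chunks == 0
        for i in range(len(dask_chunks) - 1)
    )
    if interior_aligned or not safe_chunks:
        return (encoding_chunks,)   # safe_chunks=False accepts the risk
    raise NotImplementedError(
        f"zarr chunks {encoding_chunks} would overlap dask chunks {dask_chunks}")

print(resolve_zarr_chunks(None, (1, 1)))                  # (1, 1)
print(resolve_zarr_chunks(5, (1, 1), safe_chunks=False))  # (5,)
```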

Lastly, and most importantly, can data be corrupted when using parallel zarr writes and simply deleting encoding['chunks'] in these situations?
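To make the failure mode concrete, here is a toy model of why the overlapping case is dangerous (my own simplification, not the real library): zarr writes whole chunks, so a task touching part of a chunk must read it, modify its rows, and write the whole chunk back, and a worst-case interleaving of two tasks loses an update:

```python
def racy_chunk_write(chunk, writers):
    """Toy model of read-modify-write at chunk granularity: each writer
    reads the whole chunk, updates one row, and writes the whole chunk
    back.  All reads happen before any write-back, i.e. the worst-case
    interleaving of two parallel tasks sharing one zarr chunk."""
    snapshots = [list(chunk) for _ in writers]   # every task reads first
    for (row, value), snap in zip(writers, snapshots):
        snap[row] = value                        # each task updates its row
    for snap in snapshots:                       # write-backs race:
        chunk = snap                             # last writer wins
    return chunk

# two one-row dask chunks mapped into the SAME 5-row zarr chunk:
print(racy_chunk_write([0, 0, 0, 0, 0], [(0, 1), (1, 2)]))
# -> [0, 2, 0, 0, 0]: the first task's row-0 update is silently lost
```

As I understand it, after deleting encoding['chunks'] the zarr chunks match the dask chunks, so no zarr chunk is written by two tasks and this interleaving cannot occur; safe_chunks=False leaves the shared-chunk layout in place and only disables the check.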

Environment:

Output of `xr.show_versions()`:

```
INSTALLED VERSIONS
------------------
commit: None
python: 3.9.2 | packaged by conda-forge | (default, Feb 21 2021, 05:02:46) [GCC 9.3.0]
python-bits: 64
OS: Linux
OS-release: 4.19.0-16-cloud-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: C.UTF-8
LOCALE: en_US.UTF-8
libhdf5: None
libnetcdf: None

xarray: 0.18.0
pandas: 1.2.4
numpy: 1.20.2
scipy: 1.6.3
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: 2.8.1
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2021.04.1
distributed: 2021.04.1
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
pint: None
setuptools: 49.6.0.post20210108
pip: 21.1.1
conda: None
pytest: 6.2.4
IPython: 7.23.1
sphinx: None
```