issues: 1845132891


id: 1845132891
node_id: I_kwDOAMm_X85t-n5b
number: 8062
title: Dataset.chunk() does not overwrite encoding["chunks"]
user: 2466330
state: open
locked: 0
comments: 4
created_at: 2023-08-10T12:54:12Z
updated_at: 2023-08-14T18:23:36Z
author_association: CONTRIBUTOR
repo: 13221727
type: issue

What happened?

When using the chunk function to change the chunk sizes of a Dataset (or DataArray, which uses the Dataset implementation of chunk), the chunk sizes of the Dask arrays are changed, but the "chunks" entry of the encoding attribute is not updated accordingly. This causes a NotImplementedError to be raised when attempting to write the Dataset to a zarr store (and presumably to other formats as well).

Looking at the implementation of chunk, every variable is rechunked using the _maybe_chunk function, which actually has a parameter, overwrite_encoded_chunks, to control exactly this behavior. However, it is an optional parameter which defaults to False, and the call in chunk neither provides a value for it nor offers the caller a way to influence it (for example, by exposing an overwrite_encoded_chunks parameter itself).
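The branch in question can be illustrated with a small stand-in (a hypothetical sketch, not xarray's actual code; the real _maybe_chunk also performs the Dask rechunking itself):

```python
# Hypothetical stand-in for the overwrite_encoded_chunks branch of xarray's
# internal _maybe_chunk helper (the real function also rechunks the variable).
def maybe_chunk(encoding: dict, new_chunks, overwrite_encoded_chunks: bool = False) -> dict:
    if overwrite_encoded_chunks and new_chunks is not None:
        # Keep encoding["chunks"] in sync with the new dask chunking
        encoding["chunks"] = tuple(new_chunks)
    return encoding

# Default (False): the stale entry survives, which is the behavior reported here
print(maybe_chunk({"chunks": (50, 50)}, (25, 25)))
# {'chunks': (50, 50)}

# With True, the entry would track the new chunk sizes
print(maybe_chunk({"chunks": (50, 50)}, (25, 25), overwrite_encoded_chunks=True))
# {'chunks': (25, 25)}
```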

I do not know why False was chosen as the default, or what could break if it were changed to True, but it appears to be the opposite of the documented behavior. From the documentation of to_zarr:

Zarr chunks are determined in the following way: From the chunks attribute in each variable’s encoding (can be set via Dataset.chunk).

Which is exactly what it does not do.

What did you expect to happen?

I would expect the "chunks" entry of the encoding attribute to be changed to reflect the new chunking scheme.

Minimal Complete Verifiable Example

```python
import xarray as xr
import numpy as np

# Create a test Dataset with dimensions x and y, each of size 100, and a chunksize of 50
ds_original = xr.Dataset({"my_var": (["x", "y"], np.random.randn(100, 100))})

# Since 'chunk' does not work, manually set the encoding
ds_original.my_var.encoding["chunks"] = (50, 50)

# To best showcase the real-life example, write it to file and read it back again.
# The same could be achieved by just calling .chunk() with chunksizes of 25,
# but this feels more 'complete'
filepath = "~/chunk_test.zarr"
ds_original.to_zarr(filepath)
ds = xr.open_zarr(filepath)

# Check the chunksizes and "chunks" encoding
print(ds.my_var.chunks)
# >>> ((50, 50), (50, 50))
print(ds.my_var.encoding["chunks"])
# >>> (50, 50)

# Rechunk the Dataset
ds = ds.chunk({"x": 25, "y": 25})

# The chunksizes have changed
print(ds.my_var.chunks)
# >>> ((25, 25, 25, 25), (25, 25, 25, 25))

# But the encoding value remains the same
print(ds.my_var.encoding["chunks"])
# >>> (50, 50)

# Attempting to write this back to zarr raises an error
ds.to_zarr("~/chunk_test_rechunked.zarr")
# NotImplementedError: Specified zarr chunks encoding['chunks']=(50, 50) for variable
# named 'my_var' would overlap multiple dask chunks ((25, 25, 25, 25), (25, 25, 25, 25)).
# Writing this array in parallel with dask could lead to corrupted data. Consider
# either rechunking using chunk(), deleting or modifying encoding['chunks'], or
# specify safe_chunks=False.
```
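Until chunk() updates the encoding itself, the workaround suggested by the error message is to drop the stale entry before writing. Sketched here on a plain dict standing in for my_var.encoding, so the snippet runs without xarray or zarr installed:

```python
# Stand-in for ds.my_var.encoding after rechunking: "chunks" is stale.
encoding = {"chunks": (50, 50), "preferred_chunks": {"x": 50, "y": 50}}

# Drop the stale entry; pop with a default is safe even if the key is absent.
encoding.pop("chunks", None)
print(encoding)
# {'preferred_chunks': {'x': 50, 'y': 50}}

# On the real Dataset this would be:
#     ds.my_var.encoding.pop("chunks", None)
#     ds.to_zarr("~/chunk_test_rechunked.zarr")
```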

MVCE confirmation

  • [X] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • [X] Complete example — the example is self-contained, including all data and the text of any traceback.
  • [X] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • [X] New issue — a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

No response

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS
------------------
commit: None
python: 3.10.6 (main, May 29 2023, 11:10:38) [GCC 11.3.0]
python-bits: 64
OS: Linux
OS-release: 5.10.16.3-microsoft-standard-WSL2
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: C.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.10.7
libnetcdf: 4.8.1

xarray: 2023.7.0
pandas: 1.5.3
numpy: 1.24.2
scipy: 1.10.0
netCDF4: 1.5.8
pydap: None
h5netcdf: 0.12.0
h5py: 3.6.0
Nio: None
zarr: 2.14.1
cftime: 1.5.2
nc_time_axis: None
PseudoNetCDF: None
iris: None
bottleneck: 1.3.6
dask: 2022.01.0+dfsg
distributed: 2022.01.0+ds.1
matplotlib: 3.5.1
cartopy: None
seaborn: None
numbagg: None
fsspec: 2023.1.0
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 59.6.0
pip: 23.2.1
conda: None
pytest: 7.2.2
mypy: 1.1.1
IPython: 7.31.1
sphinx: None
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/8062/reactions",
    "total_count": 2,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 1
}
