id,node_id,number,title,user,state,locked,assignee,milestone,comments,created_at,updated_at,closed_at,author_association,active_lock_reason,draft,pull_request,body,reactions,performed_via_github_app,state_reason,repo,type 2206243581,I_kwDOAMm_X86DgJr9,8876,Possible race condition when appending to an existing zarr,157591329,closed,0,,,4,2024-03-25T16:59:52Z,2024-04-03T15:23:14Z,2024-03-29T14:35:52Z,NONE,,,,"### What happened? When appending to an existing zarr along a dimension (`to_zarr(..., mode='a', append_dim=""x"" ,..)`), if the dask chunking of the dataset to append does not align with the chunking of the existing zarr, the resulting _consolidated_ zarr store may have `NaN`s instead of the actual values it is supposed to have. ### What did you expect to happen? We would expected that zarr append to have the same behaviour as if we concatenate dataset _in memory_ (using `concat`) and write the whole result on a new zarr store in one go ### Minimal Complete Verifiable Example ```Python from distributed import Client, LocalCluster import xarray as xr import tempfile ds1 = xr.Dataset({""a"": (""x"", [1., 1.])}, coords={'x': [1, 2]}).chunk({""x"": 3}) ds2 = xr.Dataset({""a"": (""x"", [1., 1., 1., 1.])}, coords={'x': [3, 4, 5, 6]}).chunk({""x"": 3}) with Client(LocalCluster(processes=False, n_workers=1, threads_per_worker=2)): # The issue happens only when: threads_per_worker > 1 for i in range(0, 100): with tempfile.TemporaryDirectory() as store: print(store) ds1.to_zarr(store, mode=""w"") # write first dataset ds2.to_zarr(store, mode=""a"", append_dim=""x"") # append first dataset rez = xr.open_zarr(store).compute() # open consolidated dataset nb_values = rez.a.count().item(0) # count non NaN values if nb_values != 6: print(""found NaNs:"") print(rez.to_dataframe()) break ``` ### MVCE confirmation - [X] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray. - [X] Complete example — the example is self-contained, including all data and the text of any traceback. - [X] Verifiable example — the example copy & pastes into an IPython prompt or [Binder notebook](https://mybinder.org/v2/gh/pydata/xarray/main?urlpath=lab/tree/doc/examples/blank_template.ipynb), returning the result. - [X] New issue — a search of GitHub Issues suggests this is not a duplicate. - [X] Recent environment — the issue occurs with the latest version of xarray and its dependencies. ### Relevant log output ```Python /tmp/tmptg_pe6ox /tmp/tmpm7ncmuxd /tmp/tmpiqcgoiw2 /tmp/tmppma1ieo7 /tmp/tmpw5vi4cf0 /tmp/tmp1rmgwju0 /tmp/tmpm6tfswzi found NaNs: a x 1 1.0 2 1.0 3 1.0 4 1.0 5 1.0 6 NaN ``` ### Anything else we need to know? The example code snippet provided here, reproduces the issue. Since the issue occurs randomly, we loop in the example for a few times and stop when the issue occurs. In the example, when `ds1` is first written, since it only contains 2 values along the `x` dimension, the resulting .zarr store have the chunking: `{'x': 2}`, even though we called `.chunk({""x"": 3})`. Side note: This behaviour in itself is not problematic in this case, but the fact that the chunking is _silently_ changed made this issue harder to spot. However, when we try to append the second dataset `ds2`, that contains 4 values, the `.chunk({""x"": 3})` in the begining splits the dask array into 2 **dask chunks**, but in a way that does not align with **zarr chunks**. Zarr chunks: + chunk1 : `x: [1; 2]` + chunk2 : `x: [3; 4]` + chunk3 : `x: [5; 6]` Dask chunks for `ds2`: + chunk A: `x: [3; 4; 5]` + chunk B: `x: [6]` Both **dask** chunks A and B, are supposed to write on **zarr** chunk3 And depending on who writes first, we can end up with NaN on `x = 5` or `x = 6` instead of actual values. The issue obviously happens only when dask tasks are run in parallel. Using `safe_chunks = True` when calling `to_zarr` does not seem to help. We couldn't figure out from the documentation how to detect this kind of issues, and how to prevent them from happening (maybe using a synchronizer?) ### Environment
INSTALLED VERSIONS ------------------ commit: None python: 3.11.0rc1 (main, Aug 12 2022, 10:02:14) [GCC 11.2.0] python-bits: 64 OS: Linux OS-release: 5.15.133.1-microsoft-standard-WSL2 machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: C.UTF-8 LOCALE: ('en_US', 'UTF-8') libhdf5: 1.12.2 libnetcdf: 4.9.3-development xarray: 2024.2.0 pandas: 2.2.1 numpy: 1.26.4 scipy: 1.12.0 netCDF4: 1.6.5 pydap: None h5netcdf: 1.3.0 h5py: 3.10.0 Nio: None zarr: 2.17.1 cftime: 1.6.3 nc_time_axis: 1.4.1 iris: None bottleneck: 1.3.8 dask: 2024.3.1 distributed: 2024.3.1 matplotlib: 3.8.3 cartopy: None seaborn: 0.13.2 numbagg: 0.8.1 fsspec: 2024.3.1 cupy: None pint: None sparse: None flox: 0.9.5 numpy_groupies: 0.10.2 setuptools: 69.2.0 pip: 24.0 conda: None pytest: 8.1.1 mypy: None IPython: 8.22.2 sphinx: None
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/8876/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue