# pydata/xarray #7672: `to_zarr` writes unexpected NaNs with `chunks=-1`

**State:** open · **Comments:** 5 · **Opened:** 2023-03-24 · **Last updated:** 2023-11-09

### What happened?

I'm running into some unexpected behavior with `ds.to_zarr` when my encoding includes `chunks=-1` and `ds` is a dataset that I created by operating on zarr files opened from disk. When I run the following example code, many of my values are `NaN`. When I run the same code, but with `ds.load()` before `ds.to_zarr()`, the correct, non-NaN values are saved.

### What did you expect to happen?

My data would be written the same regardless of whether I explicitly loaded the dataset. The documentation for `xarray.Dataset.load` includes the following:

> Normally, it should not be necessary to call this method in user code, because all xarray functions should either work on deferred data or load data automatically. However, this method can be necessary when working with many file objects on disk.

I encountered this situation when operating on datasets that had been loaded from disk (`.sel`, then `concat`), so this seems like a situation that the second sentence addresses, but I did not expect it to silently fail to write the correct data in the way that it did.
### Minimal Complete Verifiable Example

```python
import pandas as pd
import xarray as xr
import numpy as np


def create_dataset(time, site):
    temperature = 15 + 8 * np.random.randn(1, 3)
    precipitation = 10 * np.random.rand(1, 3)
    ds = xr.Dataset(
        data_vars=dict(
            temperature=(["site", "time"], temperature),
            precipitation=(["site", "time"], precipitation),
        ),
        coords=dict(
            site=site,
            time=time,
        ),
        attrs=dict(description="Weather related data."),
    )
    return ds


time_1 = pd.date_range("2014-09-06", periods=3)
time_2 = pd.date_range("2014-09-09", periods=3)

# create and save the first dataset as a zarr
ds_a = create_dataset(time_1, ["site_1"])
fname_a = '/tmp/ds_a.zarr'
ds_a.to_zarr(fname_a, mode='w')
ds_a_from_disk = xr.open_dataset(fname_a, engine='zarr', chunks={})

# create and save the second dataset as a zarr
ds_b = create_dataset(time_2, ["site_1"])
fname_b = '/tmp/ds_b.zarr'
ds_b.to_zarr(fname_b, mode='w')
ds_b_from_disk = xr.open_dataset(fname_b, engine='zarr', chunks={})

# concatenate the datasets
ds = xr.concat(
    [ds_a_from_disk.sel(site="site_1"), ds_b_from_disk.sel(site="site_1")],
    dim='time',
)

# save all data in one chunk
encoding = {var: {'chunks': -1} for var in list(ds) + list(ds.coords)}
fname = '/tmp/concated.zarr'

# Uncomment the following line to fix this issue
# ds.load()

# save the dataset
ds.to_zarr(fname, mode='w', encoding=encoding)

ds_from_disk = xr.open_dataset(fname, engine='zarr')
print(ds_from_disk.to_dataframe())
```

### MVCE confirmation

- [X] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
- [X] Complete example — the example is self-contained, including all data and the text of any traceback.
- [X] Verifiable example — the example copy & pastes into an IPython prompt or [Binder notebook](https://mybinder.org/v2/gh/pydata/xarray/main?urlpath=lab/tree/doc/examples/blank_template.ipynb), returning the result.
- [X] New issue — a search of GitHub Issues suggests this is not a duplicate.

### Relevant log output

_No response_

### Anything else we need to know?

Example output without `ds.load()`:

```
            precipitation    site  temperature
time
2014-09-06            NaN  site_1     9.805297
2014-09-07            NaN  site_1    16.119194
2014-09-08            NaN  site_1     4.226150
2014-09-09       7.275470  site_1          NaN
2014-09-10       2.899134  site_1          NaN
2014-09-11       5.777094  site_1          NaN
```

Example output with `ds.load()`:

```
            precipitation    site  temperature
time
2014-09-06       3.445305  site_1    18.144503
2014-09-07       7.708728  site_1    20.289742
2014-09-08       7.358939  site_1    19.996060
2014-09-09       6.211692  site_1     9.748291
2014-09-10       4.981796  site_1    -7.676436
2014-09-11       8.667885  site_1    31.934328
```

My hunch is that this has to do with a mismatch between the Dask chunks in the unloaded dataset and the chunks specified in `to_zarr`, but if they are incompatible I would expect an error to be surfaced.

### Environment
```
INSTALLED VERSIONS
------------------
commit: None
python: 3.9.16 (main, Dec 7 2022, 01:11:51) [GCC 9.4.0]
python-bits: 64
OS: Linux
OS-release: 5.19.0-35-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: None
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.0
libnetcdf: None
xarray: 2023.3.0
pandas: 1.4.0
numpy: 1.22.4
scipy: 1.8.0
netCDF4: None
pydap: None
h5netcdf: None
...
pytest: 6.2.2
mypy: None
IPython: 8.3.0
sphinx: None
```
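The chunk-mismatch hunch above can be illustrated without touching disk. The following is a minimal sketch, not a confirmed diagnosis: it uses in-memory Dask-backed datasets as stand-ins for the two lazily opened zarr stores (so the file paths and the `to_zarr` call itself are omitted), and shows that after `concat` the dataset carries two Dask chunks along `time`, while `chunks: -1` in the encoding requests a single zarr chunk of length 6 — meaning two independent Dask tasks would each write into the same zarr chunk.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Stand-ins for the two datasets opened lazily from disk: each one is a
# single Dask chunk along time, as after open_dataset(..., chunks={}).
ds_a = xr.Dataset(
    {"temperature": ("time", 15 + 8 * np.random.randn(3))},
    coords={"time": pd.date_range("2014-09-06", periods=3)},
).chunk({"time": 3})
ds_b = xr.Dataset(
    {"temperature": ("time", 15 + 8 * np.random.randn(3))},
    coords={"time": pd.date_range("2014-09-09", periods=3)},
).chunk({"time": 3})

# concat preserves the per-source chunking: two Dask chunks of length 3
ds = xr.concat([ds_a, ds_b], dim="time")
print(dict(ds.chunks))  # {'time': (3, 3)}

# The encoding {'chunks': -1} asks for ONE zarr chunk of length 6, so two
# Dask tasks would target the same zarr chunk. Rechunking so the Dask
# layout matches the requested zarr layout removes the overlap:
ds = ds.chunk({"time": -1})
print(dict(ds.chunks))  # {'time': (6,)}
```

If this reading of the failure mode is right, `ds.chunk(-1)` before `to_zarr` should work as an alternative to the `ds.load()` workaround in the MVCE, since each zarr chunk then receives exactly one writer.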