issues: 1983891070
id | node_id | number | title | user | state | locked | comments | created_at | updated_at | closed_at | author_association | state_reason | repo | type
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
1983891070 | I_kwDOAMm_X852P8Z- | 8427 | Ambiguous behavior with coordinates when appending to Zarr store with append_dim | 1197350 | closed | 0 | 4 | 2023-11-08T15:40:19Z | 2023-12-01T03:58:56Z | 2023-12-01T03:58:55Z | MEMBER | completed | 13221727 | issue

### What happened?

There are two quite different scenarios covered by "append" with Zarr:

- adding new variables to an existing store, and
- extending existing arrays along a particular dimension (`append_dim`).
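For concreteness, a minimal sketch of the two scenarios, assuming an in-memory store; the names `base` and `extra` are illustrative and not from the original report:

```python
import numpy as np
import xarray as xr
import zarr

store = zarr.storage.MemoryStore()
base = xr.Dataset({"foo": ("time", np.arange(3))})
base.to_zarr(store, consolidated=False)

# scenario 1: add a brand-new variable to the existing store
base.assign(extra=2.0).drop_vars("foo").to_zarr(store, mode="a", consolidated=False)

# scenario 2: extend an existing array along a dimension
xr.Dataset({"foo": ("time", np.arange(3, 5))}).to_zarr(
    store, mode="a", append_dim="time", consolidated=False
)
assert xr.open_zarr(store, consolidated=False).sizes["time"] == 5
```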
This issue is about what should happen when using `append_dim`. Here's the current behavior:

```python
import numpy as np
import xarray as xr
import zarr

ds1 = xr.DataArray(
    np.array([1, 2, 3]).reshape(3, 1, 1),
    dims=("time", "y", "x"),
    coords={"x": [1], "y": [2]},
    name="foo",
).to_dataset()
ds2 = xr.DataArray(
    np.array([4, 5]).reshape(2, 1, 1),
    dims=("time", "y", "x"),
    coords={"x": [-1], "y": [-2]},
    name="foo",
).to_dataset()

# how concat works: data are aligned
ds_concat = xr.concat([ds1, ds2], dim="time")
assert ds_concat.dims == {"time": 5, "y": 2, "x": 2}

# now do a Zarr append
store = zarr.storage.MemoryStore()
ds1.to_zarr(store, consolidated=False)
# we do not check that the coordinates are aligned--just that they
# have the same shape and dtype
ds2.to_zarr(store, append_dim="time", consolidated=False)
ds_append = xr.open_zarr(store, consolidated=False)

# the coordinate data have been overwritten...
assert ds_append.dims == {"time": 5, "y": 1, "x": 1}
# ...with the latest values
assert ds_append.x.data[0] == -1
```

Currently, we always write all data variables in this scenario. That includes overwriting the coordinates every time we append, which makes appending more expensive than it needs to be. I don't think that is the behavior most users want or expect.

### What did you expect to happen?

There are a couple of different options we could consider for how to handle this "extending" situation (with `append_dim`):

1. For variables that do not include `append_dim` (the `x` and `y` coordinates above):
    a. overwrite them with the values from the appended dataset
    b. do not write them at all

We currently do 1a. I propose to switch to 1b. I think it is closer to what users want, and it requires less I/O.
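For comparison, a sketch of how 1b can be approximated from the user side today, assuming a fresh `store` that contains only `ds1` from the example above:

```python
# approximate option 1b: leave `x` and `y` out of the appended dataset;
# variables absent from the append are not rewritten in the store
ds2.drop_vars(["x", "y"]).to_zarr(store, append_dim="time", consolidated=False)

ds_append = xr.open_zarr(store, consolidated=False)
assert ds_append.dims == {"time": 5, "y": 1, "x": 1}
# the original coordinate values survive the append
assert ds_append.x.data[0] == 1
```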
### Anything else we need to know?

No response

### Environment

```
INSTALLED VERSIONS
------------------
commit: None
python: 3.11.6 | packaged by conda-forge | (main, Oct 3 2023, 10:40:35) [GCC 12.3.0]
python-bits: 64
OS: Linux
OS-release: 5.10.176-157.645.amzn2.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: C.UTF-8
LANG: C.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.14.2
libnetcdf: 4.9.2
xarray: 2023.10.1
pandas: 2.1.2
numpy: 1.24.4
scipy: 1.11.3
netCDF4: 1.6.5
pydap: installed
h5netcdf: 1.2.0
h5py: 3.10.0
Nio: None
zarr: 2.16.0
cftime: 1.6.2
nc_time_axis: 1.4.1
PseudoNetCDF: None
iris: None
bottleneck: 1.3.7
dask: 2023.10.1
distributed: 2023.10.1
matplotlib: 3.8.0
cartopy: 0.22.0
seaborn: 0.13.0
numbagg: 0.6.0
fsspec: 2023.10.0
cupy: None
pint: 0.22
sparse: 0.14.0
flox: 0.8.1
numpy_groupies: 0.10.2
setuptools: 68.2.2
pip: 23.3.1
conda: None
pytest: 7.4.3
mypy: None
IPython: 8.16.1
sphinx: None
```
{ "url": "https://api.github.com/repos/pydata/xarray/issues/8427/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
completed | 13221727 | issue |