pydata/xarray issue #8427: Ambiguous behavior with coordinates when appending to Zarr store with append_dim

State: closed (completed) · Comments: 4 · Author association: MEMBER · Opened: 2023-11-08T15:40:19Z · Closed: 2023-12-01T03:58:55Z

What happened?

There are two quite different scenarios covered by "append" with Zarr:

  • Adding new variables to a dataset
  • Extending arrays along a dimension (via append_dim)

This issue is about what should happen when using append_dim with variables that do not contain append_dim.

Here's the current behavior (note the example needs `import numpy as np` as well):

```python
import numpy as np
import xarray as xr
import zarr

ds1 = xr.DataArray(
    np.array([1, 2, 3]).reshape(3, 1, 1),
    dims=("time", "y", "x"),
    coords={"x": [1], "y": [2]},
    name="foo",
).to_dataset()

ds2 = xr.DataArray(
    np.array([4, 5]).reshape(2, 1, 1),
    dims=("time", "y", "x"),
    coords={"x": [-1], "y": [-2]},
    name="foo",
).to_dataset()

# how concat works: data are aligned
ds_concat = xr.concat([ds1, ds2], dim="time")
assert ds_concat.dims == {"time": 5, "y": 2, "x": 2}

# now do a Zarr append
store = zarr.storage.MemoryStore()
ds1.to_zarr(store, consolidated=False)

# we do not check that the coordinates are aligned--just that they
# have the same shape and dtype
ds2.to_zarr(store, append_dim="time", consolidated=False)
ds_append = xr.open_zarr(store, consolidated=False)

# coordinate data have been overwritten...
assert ds_append.dims == {"time": 5, "y": 1, "x": 1}

# ...with the latest values
assert ds_append.x.data[0] == -1
```

Currently, we always write all data variables in this scenario. That includes overwriting the coordinates every time we append. That makes appending more expensive than it needs to be. I don't think that is the behavior most users want or expect.

What did you expect to happen?

There are a couple of different options we could consider for how to handle this "extending" situation (with append_dim):

  1. Do not attempt to align coordinates
     a. [current behavior] Overwrite coordinates with new data
     b. Keep original coordinates
     c. Force the user to explicitly drop the coordinates, as we do for region operations
  2. Attempt to align coordinates
     a. Fail if coordinates don't match
     b. Extend the arrays to replicate the behavior of concat
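For option 2a, the alignment check could be expressed with `xr.align(..., join="exact")`, which raises when indexes differ. This is only an illustrative sketch of the check (the `on_disk`/`incoming` names are hypothetical), not something to_zarr does today:

```python
import xarray as xr

on_disk = xr.Dataset(coords={"x": [1], "y": [2]})
incoming = xr.Dataset(coords={"x": [-1], "y": [-2]})

# join="exact" refuses to align objects whose indexes differ,
# which is the failure mode option 2a would surface at append time
try:
    xr.align(on_disk, incoming, join="exact")
    raised = False
except ValueError as e:
    raised = True
    print(f"append would fail: {e}")
```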

We currently do 1a. I propose to switch to 1b. I think it is closer to what users want, and it requires less I/O.

Anything else we need to know?

No response

Environment

```
INSTALLED VERSIONS
------------------
commit: None
python: 3.11.6 | packaged by conda-forge | (main, Oct 3 2023, 10:40:35) [GCC 12.3.0]
python-bits: 64
OS: Linux
OS-release: 5.10.176-157.645.amzn2.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: C.UTF-8
LANG: C.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.14.2
libnetcdf: 4.9.2

xarray: 2023.10.1
pandas: 2.1.2
numpy: 1.24.4
scipy: 1.11.3
netCDF4: 1.6.5
pydap: installed
h5netcdf: 1.2.0
h5py: 3.10.0
Nio: None
zarr: 2.16.0
cftime: 1.6.2
nc_time_axis: 1.4.1
PseudoNetCDF: None
iris: None
bottleneck: 1.3.7
dask: 2023.10.1
distributed: 2023.10.1
matplotlib: 3.8.0
cartopy: 0.22.0
seaborn: 0.13.0
numbagg: 0.6.0
fsspec: 2023.10.0
cupy: None
pint: 0.22
sparse: 0.14.0
flox: 0.8.1
numpy_groupies: 0.10.2
setuptools: 68.2.2
pip: 23.3.1
conda: None
pytest: 7.4.3
mypy: None
IPython: 8.16.1
sphinx: None
```