
xarray issue #8882: to_zarr silently loses data when using append_dim, if chunks are different to zarr store

Opened by harryC-space-intelligence · state: closed (completed) · 4 comments · author association: NONE
Created 2024-03-27T15:27:02Z · updated 2024-03-29T14:35:51Z · closed 2024-03-29T14:35:51Z

What happened?

When writing a chunked DataArray to an existing zarr store, appending along an existing dimension of the store, I have found that some data are not written if multiple array chunks map onto a single zarr chunk.

I appreciate it is probably bad practice to have different chunk sizes in my DataArray and zarr store, but I think it's a realistic scenario that needs to be caught.

This may be related to, or share the same underlying issue as, #8371. Perhaps the checks mentioned in https://github.com/pydata/xarray/issues/8371#issuecomment-1814589157 are somehow being bypassed? Using zarr's ThreadSynchronizer is the only way I have found to ensure that all the data gets written (a sketch of this follows the MVCE below).
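
To make the mismatch concrete, here is the chunk arithmetic for the MVCE below (an illustration added for clarity, not output from xarray):

```python
# Spatial chunk shapes from the MVCE below: the store is written with 5x5
# zarr chunks, while the array appended afterwards is chunked 1x1.
zarr_chunk = (5, 5)
dask_chunk = (1, 1)

# Number of dask tasks that concurrently read-modify-write one zarr chunk:
tasks_per_chunk = (zarr_chunk[0] // dask_chunk[0]) * (zarr_chunk[1] // dask_chunk[1])
print(tasks_per_chunk)  # 25 writers racing on each zarr chunk
```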

What did you expect to happen?

I expected that either

  • to_zarr would recognise the different chunk sizes, and re-chunk the array (done manually in the sketch after this list) or wait for all the chunks to be written
  • or an error would be raised, given that the current behaviour loses data in an unpredictable way
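
For reference, the first expectation can be approximated by hand today. This is a hedged sketch, not xarray API: it assumes the `da2` array and `foo.zarr` store from the MVCE in the next section, and reads the store's chunk shape from the `chunks` entry that `open_zarr` records in each variable's encoding.

```python
import xarray as xr

# Align the incoming array's dask chunks with the zarr store's chunks
# before appending, so each dask task owns whole zarr chunks.
store = xr.open_zarr('foo.zarr')
store_chunks = dict(zip(store['foo'].dims, store['foo'].encoding['chunks']))

da2_aligned = da2.chunk(store_chunks)  # da2: the array to append (see MVCE below)
da2_aligned.to_zarr('foo.zarr', append_dim='time', mode='a')
```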

Minimal Complete Verifiable Example

```python
import xarray as xr
import numpy as np
from matplotlib import pyplot as plt

x_coords = np.arange(10)
y_coords = np.arange(10)
t_coords = np.array([np.datetime64('2020-01-01').astype('datetime64[ns]')])
data = np.ones((10, 10))

# Set up a 1x4 grid of subplot axes for the output plots.
for i in range(4):
    plt.subplot(1, 4, i + 1)

# Initial write: the store is created with (time=1, x=5, y=5) chunks.
da = xr.DataArray(data.reshape((-1, 10, 10)),
                  dims=['time', 'x', 'y'],
                  coords={'x': x_coords, 'y': y_coords, 'time': t_coords},
                  ).chunk({'x': 5, 'y': 5, 'time': 1}).rename('foo')

da.to_zarr('foo.zarr', mode='w')

# Append a second time step whose dask chunks (1x1 spatially) are smaller
# than the store's zarr chunks (5x5), so many tasks write into each chunk.
new_time = np.array([np.datetime64('2021-01-01').astype('datetime64[ns]')])

da2 = xr.DataArray(data.reshape((-1, 10, 10)),
                   dims=['time', 'x', 'y'],
                   coords={'x': x_coords, 'y': y_coords, 'time': new_time},
                   ).chunk({'x': 1, 'y': 1, 'time': 1}).rename('foo')

da2.to_zarr('foo.zarr', append_dim='time', mode='a')

# The appended slice should be all ones; dropped chunks show up as gaps.
plt.imshow(xr.open_zarr('foo.zarr').isel(time=-1).foo.values)
```
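
Two things may help when reproducing, continuing from the example above (a hedged sketch, not part of the original report): a programmatic check for lost values, and the ThreadSynchronizer workaround mentioned earlier. `synchronizer` is an argument that `to_zarr` accepts and passes through to zarr.

```python
import numpy as np
import xarray as xr
import zarr

# Every value in the appended slice was written as 1.0, so any other value
# is data that was silently dropped during the append.
appended = xr.open_zarr('foo.zarr').isel(time=-1).foo.values
print('lost values:', int(np.sum(appended != 1)))

# The workaround from the report: re-run the append with zarr's
# ThreadSynchronizer so concurrent writes to one zarr chunk are serialised.
da2.to_zarr('foo.zarr', append_dim='time', mode='a',
            synchronizer=zarr.ThreadSynchronizer())
```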

MVCE confirmation

  • [ ] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • [ ] Complete example — the example is self-contained, including all data and the text of any traceback.
  • [ ] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • [ ] New issue — a search of GitHub Issues suggests this is not a duplicate.
  • [ ] Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Relevant log output

No response

Anything else we need to know?

Output from the plots above: [attached image not preserved in this export]

Environment

INSTALLED VERSIONS
------------------
commit: None
python: 3.11.4 | packaged by conda-forge | (main, Jun 10 2023, 18:08:17) [GCC 12.2.0]
python-bits: 64
OS: Linux
OS-release: 5.15.0-1041-azure
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: C.UTF-8
LANG: C.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.14.3
libnetcdf: 4.9.2

xarray: 2024.2.0
pandas: 2.2.1
numpy: 1.26.4
scipy: 1.12.0
netCDF4: 1.6.5
pydap: installed
h5netcdf: 1.3.0
h5py: 3.10.0
Nio: None
zarr: 2.17.1
cftime: 1.6.3
nc_time_axis: 1.4.1
iris: None
bottleneck: 1.3.8
dask: 2024.3.1
distributed: 2024.3.1
matplotlib: 3.8.3
cartopy: 0.22.0
seaborn: 0.13.2
numbagg: None
fsspec: 2024.3.1
cupy: None
pint: 0.23
sparse: 0.15.1
flox: 0.9.5
numpy_groupies: 0.10.2
setuptools: 69.2.0
pip: 24.0
conda: 24.1.2
pytest: 8.1.1
mypy: None
IPython: 8.22.2
sphinx: None
