to_zarr writes unexpected NaNs with chunks=-1 (#7672)

State: open · 5 comments · Created 2023-03-24 · Updated 2023-11-09

What happened?

I'm running into unexpected behavior with `ds.to_zarr` when my encoding includes `chunks=-1` and `ds` is a dataset I created by operating on zarr stores opened from disk. When I run the example code below, many of the written values are NaN. When I run the same code but call `ds.load()` before `ds.to_zarr()`, the correct, non-NaN values are saved.

What did you expect to happen?

My data would be written the same regardless of whether I explicitly loaded the dataset first. The documentation for `xarray.Dataset.load` includes the following:

> Normally, it should not be necessary to call this method in user code, because all xarray functions should either work on deferred data or load data automatically. However, this method can be necessary when working with many file objects on disk.

I encountered this situation while operating on datasets that had been opened from disk (`.sel`, then `concat`), which seems like exactly the case the second sentence addresses, but I did not expect it to silently write incorrect data the way it did.

Minimal Complete Verifiable Example

```Python
import pandas as pd
import xarray as xr
import numpy as np


def create_dataset(time, site):
    temperature = 15 + 8 * np.random.randn(1, 3)
    precipitation = 10 * np.random.rand(1, 3)

    ds = xr.Dataset(
        data_vars=dict(
            temperature=(["site", "time"], temperature),
            precipitation=(["site", "time"], precipitation),
        ),
        coords=dict(
            site=site,
            time=time,
        ),
        attrs=dict(description="Weather related data."),
    )
    return ds


time_1 = pd.date_range("2014-09-06", periods=3)
time_2 = pd.date_range("2014-09-09", periods=3)

# create and save the first dataset as a zarr
ds_a = create_dataset(time_1, ["site_1"])
fname_a = '/tmp/ds_a.zarr'
ds_a.to_zarr(fname_a, mode='w')
ds_a_from_disk = xr.open_dataset(fname_a, engine='zarr', chunks={})

# create and save the second dataset as a zarr
ds_b = create_dataset(time_2, ["site_1"])
fname_b = '/tmp/ds_b.zarr'
ds_b.to_zarr(fname_b, mode='w')
ds_b_from_disk = xr.open_dataset(fname_b, engine='zarr', chunks={})

# concatenate the datasets
ds = xr.concat([ds_a_from_disk.sel(site="site_1"), ds_b_from_disk.sel(site="site_1")], dim='time')

# save all data in one chunk
encoding = {var: {'chunks': -1} for var in list(ds) + list(ds.coords)}
fname = '/tmp/concated.zarr'

# Uncomment the following line to fix this issue
# ds.load()

# save the dataset
ds.to_zarr(fname, mode='w', encoding=encoding)

ds_from_disk = xr.open_dataset(fname, engine='zarr')
print(ds_from_disk.to_dataframe())
```

MVCE confirmation

  • [X] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • [X] Complete example — the example is self-contained, including all data and the text of any traceback.
  • [X] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • [X] New issue — a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

No response

Anything else we need to know?

Example output without `ds.load()`:

```
            precipitation    site  temperature
time
2014-09-06            NaN  site_1     9.805297
2014-09-07            NaN  site_1    16.119194
2014-09-08            NaN  site_1     4.226150
2014-09-09       7.275470  site_1          NaN
2014-09-10       2.899134  site_1          NaN
2014-09-11       5.777094  site_1          NaN
```

Example output with `ds.load()`:

```
            precipitation    site  temperature
time
2014-09-06       3.445305  site_1    18.144503
2014-09-07       7.708728  site_1    20.289742
2014-09-08       7.358939  site_1    19.996060
2014-09-09       6.211692  site_1     9.748291
2014-09-10       4.981796  site_1    -7.676436
2014-09-11       8.667885  site_1    31.934328
```

My hunch is that this has to do with a mismatch between the Dask chunks of the unloaded dataset and the chunks specified in the `to_zarr` encoding, but if they are incompatible I would expect an error to be surfaced rather than silently written NaNs.
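As a quick way to probe that hunch, here is a minimal sketch (reusing `ds`, `fname`, and `encoding` from the example above; the rechunking workaround is my own assumption, not a confirmed fix) that inspects the Dask chunk layout and aligns it with the single-chunk encoding before writing:

```Python
# After the concat, each source zarr contributes its own Dask chunk, so the
# time dimension is split in two, e.g. Frozen({'time': (3, 3)}).
print(ds.chunks)

# Assumed workaround (not verified against this issue): rechunk so the Dask
# layout matches the requested single zarr chunk, instead of calling ds.load().
ds.chunk(-1).to_zarr(fname, mode='w', encoding=encoding)
```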

Environment

```
INSTALLED VERSIONS
------------------
commit: None
python: 3.9.16 (main, Dec 7 2022, 01:11:51) [GCC 9.4.0]
python-bits: 64
OS: Linux
OS-release: 5.19.0-35-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: None
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.0
libnetcdf: None

xarray: 2023.3.0
pandas: 1.4.0
numpy: 1.22.4
scipy: 1.8.0
netCDF4: None
pydap: None
h5netcdf: None
...
pytest: 6.2.2
mypy: None
IPython: 8.3.0
sphinx: None
```
