

issues: 906748201


id: 906748201
node_id: MDU6SXNzdWU5MDY3NDgyMDE=
number: 5405
title: Control CF-encoding in to_zarr()
user: 206773
state: open
locked: 0
comments: 2
created_at: 2021-05-30T12:57:40Z
updated_at: 2021-06-23T15:47:32Z
author_association: NONE
repo: 13221727
type: issue

Is your feature request related to a problem? Please describe.

I believe xarray's dataset.to_zarr() is somewhat inconsistent between creating variables and appending data to existing variables: when creating variables, it can deal with writing already encoded data; when appending, it expects decoded data.

When appending data, xarray will always CF-encode the variable data according to the encoding information of the existing variables before it appends the new data. This is fine if the data to be appended is decoded, but if it is already encoded (e.g. because it was previously read by dataset = xr.open_dataset(..., decode_cf=False)), this leads to entirely corrupt data.
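To see why, here is a minimal sketch (independent of xarray, using only NumPy) of what happens when already-encoded uint16 values are pushed through the scale_factor encoding a second time; the wrap-around values match the test failure shown further below:

```python
import numpy as np

scale_factor = 0.0001
already_encoded = np.array([0, 10000, 15000, 20000], dtype=np.uint16)

# Re-applying the encoding computes value / scale_factor and stores the
# result back into uint16. 10000 / 0.0001 = 100_000_000, which does not
# fit into 16 bits and wraps modulo 2**16 to 57600.
re_encoded = (already_encoded.astype(np.int64) * round(1 / scale_factor)) % 2**16
print(re_encoded)  # [0 57600 53632 49664] -- corrupt
```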

See also xarray issue #5263 and my actual problem described in https://github.com/bcdev/nc2zarr/issues/35.

Describe the solution you'd like

A possible hack is to redundantly apply dataset = xr.decode_cf(dataset) before appending, so that encoding the data again effectively becomes a no-op, as described in #5263. Of course, this also costs extra CPU time for a useless computation. A sketch of that workaround is shown below.
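The sketch reuses the store paths from the test further below; xr.decode_cf is xarray's public decoding function:

```python
import xarray as xr

# The data on disk is already encoded, so decode it first, purely so that
# xarray's re-encoding on append becomes a no-op.
encoded_dataset = xr.open_zarr('src_ds_2.zarr', decode_cf=False)
decoded_dataset = xr.decode_cf(encoded_dataset)
decoded_dataset.to_zarr('dst_ds.zarr', append_dim='t')
```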

I'd like to control whether the data is encoded when appending. If I already have encoded data, I'd like to be able to call encoded_dataset.to_zarr(..., append_dim='time', encode_cf=False).
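With such a flag, the workaround above would collapse to something like this (encode_cf is the proposed parameter and does not exist in xarray's API today):

```python
import xarray as xr

encoded_dataset = xr.open_zarr('src_ds_2.zarr', decode_cf=False)
# encode_cf=False is the proposed, hypothetical parameter: write the data
# as-is instead of re-encoding it against the store's existing encoding.
encoded_dataset.to_zarr('dst_ds.zarr', append_dim='t', encode_cf=False)
```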

For example, when I uncomment line 469 in xarray/backends/zarr.py, this issue is fixed as well:

https://github.com/pydata/xarray/blob/1b4412eeb7011f53932779e1d7c3534163aedd63/xarray/backends/zarr.py#L460-L471

Minimal Complete Verifiable Example:

Here is a test that demonstrates the observed inconsistency.

```python
import shutil
import unittest

import numpy as np
import xarray as xr
import zarr

SRC_DS_1_PATH = 'src_ds_1.zarr'
SRC_DS_2_PATH = 'src_ds_2.zarr'
DST_DS_PATH = 'dst_ds.zarr'


class XarrayToZarrAppendInconsistencyTest(unittest.TestCase):
    @classmethod
    def del_paths(cls):
        for path in (SRC_DS_1_PATH, SRC_DS_2_PATH, DST_DS_PATH):
            shutil.rmtree(path, ignore_errors=True)

    def setUp(self):
        self.del_paths()

        scale_factor = 0.0001
        self.v_values_encoded = np.array([[0, 10000, 15000, 20000]], dtype=np.uint16)
        self.v_values_decoded = np.array([[np.nan, 1., 1.5, 2.]], dtype=np.float32)

        # The variable for the two source datasets
        v = xr.DataArray(self.v_values_encoded,
                         dims=('t', 'x'),
                         attrs=dict(scale_factor=scale_factor, _FillValue=0))

        # Create two source datasets
        src_ds = xr.Dataset(data_vars=dict(v=v))
        src_ds.to_zarr(SRC_DS_1_PATH)
        src_ds.to_zarr(SRC_DS_2_PATH)

        # Assert we have written encoded data
        a1 = zarr.convenience.open_array(SRC_DS_1_PATH + '/v')
        a2 = zarr.convenience.open_array(SRC_DS_2_PATH + '/v')
        np.testing.assert_equal(a1, self.v_values_encoded)  # succeeds
        np.testing.assert_equal(a2, self.v_values_encoded)  # succeeds

        # Assert we correctly decode data
        src_ds_1 = xr.open_zarr(SRC_DS_1_PATH, decode_cf=True)
        src_ds_2 = xr.open_zarr(SRC_DS_2_PATH, decode_cf=True)
        np.testing.assert_equal(src_ds_1.v.data, self.v_values_decoded)  # succeeds
        np.testing.assert_equal(src_ds_2.v.data, self.v_values_decoded)  # succeeds

    def tearDown(self):
        self.del_paths()

    def test_decode_cf_true(self):
        """
        This test succeeds.
        """
        # Open the two source datasets
        src_ds_1 = xr.open_zarr(SRC_DS_1_PATH, decode_cf=True)
        src_ds_2 = xr.open_zarr(SRC_DS_2_PATH, decode_cf=True)
        # Expect data is decoded
        np.testing.assert_equal(src_ds_1.v.data, self.v_values_decoded)  # succeeds
        np.testing.assert_equal(src_ds_2.v.data, self.v_values_decoded)  # succeeds

        # Write the 1st source dataset to a new dataset, append the 2nd source
        src_ds_1.to_zarr(DST_DS_PATH, mode='w-')
        src_ds_2.to_zarr(DST_DS_PATH, append_dim='t')

        # Open the new dataset
        dst_ds = xr.open_zarr(DST_DS_PATH, decode_cf=True)
        dst_ds_1 = dst_ds.isel(t=slice(0, 1))
        dst_ds_2 = dst_ds.isel(t=slice(1, 2))
        # Expect data is decoded
        np.testing.assert_equal(dst_ds_1.v.data, self.v_values_decoded)  # succeeds
        np.testing.assert_equal(dst_ds_2.v.data, self.v_values_decoded)  # succeeds

    def test_decode_cf_false(self):
        """
        This test fails at the last assertion with

        AssertionError:
        Arrays are not equal

        Mismatched elements: 3 / 4 (75%)
        Max absolute difference: 47600
        Max relative difference: 4.76
         x: array([[    0, 57600, 53632, 49664]], dtype=uint16)
         y: array([[    0, 10000, 15000, 20000]], dtype=uint16)
        """
        # Open the two source datasets
        src_ds_1 = xr.open_zarr(SRC_DS_1_PATH, decode_cf=False)
        src_ds_2 = xr.open_zarr(SRC_DS_2_PATH, decode_cf=False)
        # Expect data is NOT decoded (still encoded)
        np.testing.assert_equal(src_ds_1.v.data, self.v_values_encoded)  # succeeds
        np.testing.assert_equal(src_ds_2.v.data, self.v_values_encoded)  # succeeds

        # Write the 1st source dataset to a new dataset, append the 2nd source
        src_ds_1.to_zarr(DST_DS_PATH, mode='w-')
        # Avoid ValueError: failed to prevent overwriting existing key scale_factor in attrs. ...
        del src_ds_2.v.attrs['scale_factor']
        del src_ds_2.v.attrs['_FillValue']
        src_ds_2.to_zarr(DST_DS_PATH, append_dim='t')

        # Open the new dataset
        dst_ds = xr.open_zarr(DST_DS_PATH, decode_cf=False)
        dst_ds_1 = dst_ds.isel(t=slice(0, 1))
        dst_ds_2 = dst_ds.isel(t=slice(1, 2))
        # Expect data is NOT decoded (still encoded)
        np.testing.assert_equal(dst_ds_1.v.data, self.v_values_encoded)  # succeeds
        np.testing.assert_equal(dst_ds_2.v.data, self.v_values_encoded)  # fails
```
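Assuming the test is saved as e.g. `test_zarr_append.py` (a file name chosen here for illustration), it can be run with `python -m unittest test_zarr_append.py`: `test_decode_cf_true` passes, while `test_decode_cf_false` fails as described in its docstring.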

Environment:

Output of `xr.show_versions()`:

```
INSTALLED VERSIONS
------------------
commit: None
python: 3.8.8 | packaged by conda-forge | (default, Feb 20 2021, 15:50:08) [MSC v.1916 64 bit (AMD64)]
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 26 Stepping 5, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: de_DE.cp1252
libhdf5: 1.10.6
libnetcdf: 4.7.4
xarray: 0.17.0
pandas: 1.2.2
numpy: 1.20.1
scipy: 1.6.0
netCDF4: 1.5.6
pydap: installed
h5netcdf: None
h5py: None
Nio: None
zarr: 2.6.1
cftime: 1.4.1
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.2.0
cfgrib: None
iris: None
bottleneck: None
dask: 2021.02.0
distributed: 2021.02.0
matplotlib: 3.3.4
cartopy: 0.19.0.post1
seaborn: None
numbagg: None
pint: None
setuptools: 49.6.0.post20210108
pip: 21.0.1
conda: None
pytest: 6.2.2
IPython: 7.21.0
sphinx: 3.5.1
```
reactions:

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/5405/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
