id,node_id,number,title,user,state,locked,assignee,milestone,comments,created_at,updated_at,closed_at,author_association,active_lock_reason,draft,pull_request,body,reactions,performed_via_github_app,state_reason,repo,type
922804256,MDU6SXNzdWU5MjI4MDQyNTY=,5475,Is `_FillValue` really the same as zarr's `fill_value`?,6574622,open,0,,,2,2021-06-16T16:03:21Z,2024-04-02T08:17:23Z,,CONTRIBUTOR,,,,"The zarr backend uses the `fill_value` of zarr's `.zarray` key as if it were the `_FillValue` according to [CF-Conventions](http://cfconventions.org/Data/cf-conventions/cf-conventions-1.8/cf-conventions.html#missing-data):

https://github.com/pydata/xarray/blob/1a7b285be676d5404a4140fc86e8756de75ee7ac/xarray/backends/zarr.py#L373

I think this interpretation of the `fill_value` is wrong and creates problems. Here's why:

The [zarr v2 spec](https://zarr.readthedocs.io/en/stable/spec/v2.html#metadata) is still a little vague, but states that `fill_value` is

> A scalar value providing the default value to use for uninitialized portions of the array, or null if no fill_value is to be used.

Accordingly, this value should be used to fill all areas of a variable which are not backed by a stored chunk. This is also different from what [CF conventions state](http://cfconventions.org/Data/cf-conventions/cf-conventions-1.8/cf-conventions.html#missing-data) (emphasis mine):

> The scalar attribute with the name `_FillValue` and of the same type as its variable is recognized by the netCDF library as the value used to pre-fill disk space allocated to the variable. **This value is considered to be a special value that indicates undefined or missing data**, and is returned when reading values that were not written.

The difference between the two is that `fill_value` is **only** a background value, which just isn't stored as a chunk. But `_FillValue` is (possibly) a background value **and** is interpreted as not being valid data.
In my opinion, this mix of `_FillValue` and `missing_value` could be considered a defect in the CF-Conventions, but it's probably far too late to change that, as many depend on it.

As an example, when storing a density field (i.e. water droplets forming clouds) in a zarr dataset, it might be perfectly valid to set the `fill_value` to `0` and then store only chunks in regions of the atmosphere where clouds are actually present. In that case, `0` (i.e. no drops) would be a perfectly valid value, which just isn't stored. As most parts of the atmosphere are indeed cloud-free, this may save a lot of storage. Other formats (e.g. [OpenVDB](https://www.openvdb.org)) commonly use this trick.

---

The issue gets worse when looking into the upcoming [zarr v3 spec](https://zarr-specs.readthedocs.io/en/core-protocol-v3.0-dev/protocol/core/v3.0.html#array-metadata) where `fill_value` is described as:

> Provides an element value to use for uninitialised portions of the Zarr array.
>
> If the data type of the Zarr array is Boolean then the value must be the literal `false` or `true`. If the data type is one of the integer data types defined in this specification, then the value must be a number with no fraction or exponent part and must be within the range of the data type.
>
> For any data type, if the `fill_value` is the literal `null` then the fill value is undefined and the implementation may use any arbitrary value that is consistent with the data type as the fill value.
>
> [...]

Thus for boolean arrays, if the `fill_value` were interpreted as a missing value indicator, only (missing, `True`) or (`False`, missing) arrays could be represented. A (`False`, `True`) array would not be possible. The issue applies similarly to integer types.
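
The difference between the two readings can be sketched in plain Python. This is only a toy model: `read_chunked`, `as_background` and `as_missing` are hypothetical helpers made up for illustration, not zarr or xarray API, and the chunk layout is deliberately simplified to one dimension.

```python
# Toy model only: these helpers are hypothetical, not zarr or xarray API.

def read_chunked(stored, n_chunks, chunk_len, fill_value):
    '''Assemble a 1-D array; chunks that were never written yield fill_value.'''
    out = []
    for i in range(n_chunks):
        out.extend(stored.get(i, [fill_value] * chunk_len))
    return out


def as_background(values):
    '''zarr-style reading: fill_value is ordinary, valid data.'''
    return list(values)


def as_missing(values, fill_value):
    '''CF-style reading: every element equal to fill_value is masked (None).'''
    return [None if v == fill_value else v for v in values]


# Boolean array with fill_value False; only chunk 1 was ever written.
stored = {1: [True, False]}
raw = read_chunked(stored, n_chunks=2, chunk_len=2, fill_value=False)

background = as_background(raw)
masked = as_missing(raw, fill_value=False)
print(background)
print(masked)
```

Under the CF-style reading, the `False` that was explicitly written into chunk 1 is masked together with the unwritten chunk, so the array can no longer hold both `True` and `False` as valid values.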
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/5475/reactions"", ""total_count"": 2, ""+1"": 2, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,issue
1159923690,I_kwDOAMm_X85FIwfq,6329,`to_zarr` with append or region mode and `_FillValue` doesnt work,6574622,open,0,,,17,2022-03-04T18:21:32Z,2023-03-17T16:14:30Z,,CONTRIBUTOR,,,,"### What happened?

```python
import numpy as np
import xarray as xr

ds = xr.Dataset({""a"": (""x"", [3.], {""_FillValue"": np.nan})})
m = {}
ds.to_zarr(m)
ds.to_zarr(m, append_dim=""x"")
```

raises

```
ValueError: failed to prevent overwriting existing key _FillValue in attrs. This is probably an encoding field used by xarray to describe how a variable is serialized. To proceed, remove this key from the variable's attributes manually.
```

### What did you expect to happen?

I'd expect this to just work (effectively concatenating the dataset to itself).

### Anything else we need to know?

#### appears also for `region` writes

The same issue appears for region writes as in:

```python
import numpy as np
import dask.array as da
import xarray as xr

ds = xr.Dataset({""a"": (""x"", da.array([3.,4.]), {""_FillValue"": np.nan})})
m = {}
ds.to_zarr(m, compute=False, encoding={""a"": {""chunks"": (1,)}})
ds.isel(x=slice(0,1)).to_zarr(m, region={""x"": slice(0,1)})
```

raises

```
ValueError: failed to prevent overwriting existing key _FillValue in attrs. This is probably an encoding field used by xarray to describe how a variable is serialized. To proceed, remove this key from the variable's attributes manually.
```

#### there's a workaround

The workaround (deleting the `_FillValue` in subsequent writes):

```python
m = {}
ds.to_zarr(m)
del ds.a.attrs[""_FillValue""]
ds.to_zarr(m, append_dim=""x"")
```

seems to do the trick.
[There are indications that the result might still be broken](https://github.com/pydata/xarray/issues/6069#issuecomment-1059400265), but it's not yet clear how to reproduce them (see comments below). This issue has been split off from #6069
Environment

INSTALLED VERSIONS
------------------
commit: None
python: 3.9.10 (main, Jan 15 2022, 11:48:00) [Clang 13.0.0 (clang-1300.0.29.3)]
python-bits: 64
OS: Darwin
OS-release: 20.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: de_DE.UTF-8
LOCALE: ('de_DE', 'UTF-8')
libhdf5: 1.12.0
libnetcdf: 4.7.4
xarray: 0.20.1
pandas: 1.2.0
numpy: 1.21.2
scipy: 1.6.2
netCDF4: 1.5.8
pydap: installed
h5netcdf: 0.11.0
h5py: 3.2.1
Nio: None
zarr: 2.11.0
cftime: 1.3.1
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.2.10
cfgrib: None
iris: None
bottleneck: None
dask: 2021.11.1
distributed: 2021.11.1
matplotlib: 3.4.1
cartopy: 0.20.1
seaborn: 0.11.1
numbagg: None
fsspec: 2021.11.1
cupy: None
pint: 0.17
sparse: 0.13.0
setuptools: 60.5.0
pip: 21.3.1
conda: None
pytest: 6.2.2
IPython: 8.0.0.dev
sphinx: 3.5.0
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/6329/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,issue