id,node_id,number,title,user,state,locked,assignee,milestone,comments,created_at,updated_at,closed_at,author_association,active_lock_reason,draft,pull_request,body,reactions,performed_via_github_app,state_reason,repo,type
922804256,MDU6SXNzdWU5MjI4MDQyNTY=,5475,Is `_FillValue` really the same as zarr's `fill_value`?,6574622,open,0,,,2,2021-06-16T16:03:21Z,2024-04-02T08:17:23Z,,CONTRIBUTOR,,,,"The zarr backend treats the `fill_value` of zarr's `.zarray` key as if it were the `_FillValue` defined by the [CF-Conventions](http://cfconventions.org/Data/cf-conventions/cf-conventions-1.8/cf-conventions.html#missing-data):

https://github.com/pydata/xarray/blob/1a7b285be676d5404a4140fc86e8756de75ee7ac/xarray/backends/zarr.py#L373

I think this interpretation of the `fill_value` is wrong and creates problems. Here's why:

The [zarr v2 spec](https://zarr.readthedocs.io/en/stable/spec/v2.html#metadata) is still a little vague, but states that `fill_value` is
> A scalar value providing the default value to use for uninitialized portions of the array, or null if no fill_value is to be used.

Accordingly, every region of a variable that is not backed by a stored chunk should be filled with this value. This differs from what the [CF conventions state](http://cfconventions.org/Data/cf-conventions/cf-conventions-1.8/cf-conventions.html#missing-data) (emphasis mine):
> The scalar attribute with the name `_FillValue` and of the same type as its variable is recognized by the netCDF library as the value used to pre-fill disk space allocated to the variable. **This value is considered to be a special value that indicates undefined or missing data**, and is returned when reading values that were not written.

The difference between the two is that `fill_value` is **only** a background value, which simply isn't stored as a chunk, whereas `_FillValue` is (possibly) a background value **and** is interpreted as marking invalid data. In my opinion, this mix of `_FillValue` and `missing_value` could be considered a defect in the CF-Conventions, but it is probably far too late to change, as many depend on it.

As an example, when storing a density field (i.e. water droplets forming clouds) in a zarr dataset, it might be perfectly valid to set the `fill_value` to `0` and then store chunks only in regions of the atmosphere where clouds are actually present. In that case, `0` (i.e. no droplets) would be a perfectly valid value, which just isn't stored. As most parts of the atmosphere are indeed cloud-free, this can save a lot of storage. Other formats (e.g. [OpenVDB](https://www.openvdb.org)) commonly use this trick.
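
To make the distinction concrete, here is a minimal sketch of the cloud example. It deliberately does not use zarr: `ToyChunkedArray` is a hypothetical stand-in written only for illustration, and the two readings at the end are my paraphrase of the two interpretations discussed above, not the actual code path of zarr or xarray.

```python
import numpy as np


class ToyChunkedArray:
    # Hypothetical stand-in for a chunked store; this is NOT zarr's API.
    def __init__(self, shape, chunk, fill_value, dtype=float):
        self.shape, self.chunk = shape, chunk
        self.fill_value, self.dtype = fill_value, dtype
        self.chunks = {}  # chunk index -> stored values for that chunk

    def write_chunk(self, index, values):
        self.chunks[index] = np.asarray(values, dtype=self.dtype)

    def read(self):
        # Start from the background value, then overlay the stored chunks.
        out = np.full(self.shape, self.fill_value, dtype=self.dtype)
        for index, values in self.chunks.items():
            out[index * self.chunk:(index + 1) * self.chunk] = values
        return out


# Droplet density along a 1-D column of 8 cells, 2 cells per chunk.
density = ToyChunkedArray(shape=8, chunk=2, fill_value=0.0)
density.write_chunk(1, [0.3, 0.0])  # only the cloudy chunk is stored

data = density.read()

# Background reading: cells without a chunk are ordinary, valid zeros.
as_background = data

# Missing-data reading of the same number: every zero becomes NaN,
# including the genuine 0.0 that was explicitly written to the chunk.
as_missing = np.where(data == density.fill_value, np.nan, data)
```

Under the second reading, seven of the eight cells are reported as missing although all of them hold a physically meaningful value.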

---

The issue gets worse when looking into the upcoming [zarr v3 spec](https://zarr-specs.readthedocs.io/en/core-protocol-v3.0-dev/protocol/core/v3.0.html#array-metadata) where `fill_value` is described as:
> Provides an element value to use for uninitialised portions of the Zarr array.
> 
> If the data type of the Zarr array is Boolean then the value must be the literal `false` or `true`. If the data type is one of the integer data types defined in this specification, then the value must be a number with no fraction or exponent part and must be within the range of the data type.
> 
> For any data type, if the `fill_value` is the literal `null` then the fill value is undefined and the implementation may use any arbitrary value that is consistent with the data type as the fill value.
> 
> [...]

Thus, for boolean arrays, if the `fill_value` were interpreted as a missing value indicator, only (missing, `True`) or (`False`, missing) arrays could be represented. A (`False`, `True`) array would not be possible. The same issue applies to integer types.
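
The boolean case can be sketched in a few lines. `decode_as_missing` is a hypothetical helper that mimics masking every element equal to the fill value; it is an illustration of the argument above, not the decoding routine xarray actually runs.

```python
import numpy as np

flags = np.array([True, False, True, False])


def decode_as_missing(values, fill_value):
    # Report every element equal to the fill value as missing.
    # None is used as the marker because bool has no NaN.
    return [None if v == fill_value else bool(v) for v in values]


masked_false = decode_as_missing(flags, fill_value=False)
masked_true = decode_as_missing(flags, fill_value=True)
```

Whichever of the two possible fill values is chosen, one of the two legitimate boolean states disappears from the decoded result.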

","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/5475/reactions"", ""total_count"": 2, ""+1"": 2, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,issue
1159923690,I_kwDOAMm_X85FIwfq,6329,`to_zarr` with append or region mode and `_FillValue` doesnt work,6574622,open,0,,,17,2022-03-04T18:21:32Z,2023-03-17T16:14:30Z,,CONTRIBUTOR,,,,"### What happened?

```python
import numpy as np
import xarray as xr
ds = xr.Dataset({""a"": (""x"", [3.], {""_FillValue"": np.nan})})
m = {}
ds.to_zarr(m)
ds.to_zarr(m, append_dim=""x"")
```
raises
```
ValueError: failed to prevent overwriting existing key _FillValue in attrs. This is probably an encoding field used by xarray to describe how a variable is serialized. To proceed, remove this key from the variable's attributes manually.
```

### What did you expect to happen?

I'd expect this to just work (effectively concatenating the dataset to itself).

### Anything else we need to know?

#### appears also for `region` writes
The same issue appears for region writes as in:
```python
import numpy as np
import dask.array as da
import xarray as xr
ds = xr.Dataset({""a"": (""x"", da.array([3.,4.]), {""_FillValue"": np.nan})})
m = {}
ds.to_zarr(m, compute=False, encoding={""a"": {""chunks"": (1,)}})
ds.isel(x=slice(0,1)).to_zarr(m, region={""x"": slice(0,1)})
```
raises

```
ValueError: failed to prevent overwriting existing key _FillValue in attrs. This is probably an encoding field used by xarray to describe how a variable is serialized. To proceed, remove this key from the variable's attributes manually.
```

#### there's a workaround
The workaround (deleting the `_FillValue` in subsequent writes):
```python
m = {}
ds.to_zarr(m)
del ds.a.attrs[""_FillValue""]
ds.to_zarr(m, append_dim=""x"")
```
seems to do the trick.
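
If mutating `ds` in place is undesirable, the workaround could be wrapped in a small helper along these lines. `without_fill_value_attr` is a hypothetical name, not part of xarray, and the sketch assumes that `Dataset.copy()` returns new variable objects, so assigning fresh `attrs` to them leaves the original untouched.

```python
import numpy as np
import xarray as xr


def without_fill_value_attr(ds):
    # Hypothetical helper: return a copy of `ds` whose variables no longer
    # carry `_FillValue` in attrs. New attrs dicts are assigned to the copy,
    # so the original dataset keeps its attributes.
    out = ds.copy()
    for var in out.variables.values():
        var.attrs = {k: v for k, v in var.attrs.items() if k != '_FillValue'}
    return out


ds = xr.Dataset({'a': ('x', [3.0], {'_FillValue': np.nan})})
cleaned = without_fill_value_attr(ds)

# Intended use, following the workaround above (requires zarr):
# m = {}
# ds.to_zarr(m)
# cleaned.to_zarr(m, append_dim='x')
```

This only avoids the `ValueError`; it does not address the open question of whether the appended result is correct.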

[There are indications that the result might still be broken](https://github.com/pydata/xarray/issues/6069#issuecomment-1059400265), but it's not yet clear how to reproduce them (see comments below).

This issue has been split off from #6069

<details><summary>Environment</summary>

INSTALLED VERSIONS
------------------
commit: None
python: 3.9.10 (main, Jan 15 2022, 11:48:00) 
[Clang 13.0.0 (clang-1300.0.29.3)]
python-bits: 64
OS: Darwin
OS-release: 20.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: de_DE.UTF-8
LOCALE: ('de_DE', 'UTF-8')
libhdf5: 1.12.0
libnetcdf: 4.7.4

xarray: 0.20.1
pandas: 1.2.0
numpy: 1.21.2
scipy: 1.6.2
netCDF4: 1.5.8
pydap: installed
h5netcdf: 0.11.0
h5py: 3.2.1
Nio: None
zarr: 2.11.0
cftime: 1.3.1
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.2.10
cfgrib: None
iris: None
bottleneck: None
dask: 2021.11.1
distributed: 2021.11.1
matplotlib: 3.4.1
cartopy: 0.20.1
seaborn: 0.11.1
numbagg: None
fsspec: 2021.11.1
cupy: None
pint: 0.17
sparse: 0.13.0
setuptools: 60.5.0
pip: 21.3.1
conda: None
pytest: 6.2.2
IPython: 8.0.0.dev
sphinx: 3.5.0
</details>","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/6329/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,issue