html_url,issue_url,id,node_id,user,created_at,updated_at,author_association,body,reactions,performed_via_github_app,issue
https://github.com/pydata/xarray/issues/6069#issuecomment-1059405550,https://api.github.com/repos/pydata/xarray/issues/6069,1059405550,IC_kwDOAMm_X84_JT7u,6574622,2022-03-04T18:16:57Z,2022-03-04T18:16:57Z,CONTRIBUTOR,"I'll set up a new issue. @Boorhin, I couldn't confirm the weirdness with the small example, but I'll add a note referencing your comment. If you can reproduce the weirdness with the minimal example, would you add a comment to the new issue?","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1077079208
https://github.com/pydata/xarray/issues/6069#issuecomment-1059378287,https://api.github.com/repos/pydata/xarray/issues/6069,1059378287,IC_kwDOAMm_X84_JNRv,6574622,2022-03-04T17:39:24Z,2022-03-04T17:39:24Z,CONTRIBUTOR,"I've made a simpler example of the `_FillValue` append issue:
```python
import numpy as np
import xarray as xr
ds = xr.Dataset({""a"": (""x"", [3.], {""_FillValue"": np.nan})})
m = {}
ds.to_zarr(m)
ds.to_zarr(m, append_dim=""x"")
```
raises
```
ValueError: failed to prevent overwriting existing key _FillValue in attrs. This is probably an encoding field used by xarray to describe how a variable is serialized. To proceed, remove this key from the variable's attributes manually.
```
I'd expect this to just work (effectively concatenating the dataset to itself).
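To make the expectation concrete, here's a sketch of the result I'd assume the two writes to produce (`expected` is only an illustration, not output from the code above):
```python
# assumed intended semantics: the second write appends ds to itself
# along ""x"", so reading the store back should match this concatenation
expected = xr.concat([ds, ds], dim=""x"")
# xr.open_zarr(m) should then equal `expected` (modulo encoding details)
```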
The workaround:
```python
m = {}
ds.to_zarr(m)
del ds.a.attrs[""_FillValue""]
ds.to_zarr(m, append_dim=""x"")
```
does the trick, but doesn't look right.
@dcherian, @Boorhin should we make a new (CF-related) issue out of this and keep the focus here on the append and region use cases, which seem to have been the initial problem in this thread (probably by working further through your example, @Boorhin)?","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1077079208
https://github.com/pydata/xarray/issues/6069#issuecomment-1059078961,https://api.github.com/repos/pydata/xarray/issues/6069,1059078961,IC_kwDOAMm_X84_IEMx,6574622,2022-03-04T11:27:12Z,2022-03-04T11:27:44Z,CONTRIBUTOR,"btw, as a workaround, removing the `_FillValue` from `dst.air` works (you'll likely only want to do this for the append writes, not the initial write):
```python
del dst.air.attrs[""_FillValue""]
dst.to_zarr(m, append_dim=""time"")
```
But still, this might call for a separate issue.","{""total_count"": 1, ""+1"": 1, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1077079208
https://github.com/pydata/xarray/issues/6069#issuecomment-1059076885,https://api.github.com/repos/pydata/xarray/issues/6069,1059076885,IC_kwDOAMm_X84_IDsV,6574622,2022-03-04T11:23:56Z,2022-03-04T11:23:56Z,CONTRIBUTOR,"Ok, I believe I've now reproduced your error:
```python
import xarray as xr
from rasterio.enums import Resampling
import numpy as np
ds = xr.tutorial.open_dataset('air_temperature').isel(time=0)
ds = ds.rio.write_crs('EPSG:4326')
dst = ds.rio.reproject('EPSG:3857', shape=(250, 250), resampling=Resampling.bilinear, nodata=np.nan)
dst.air.encoding = {}
dst = dst.assign(air=dst.air.expand_dims(""time""), time=dst.time.expand_dims(""time""))
m = {}
dst.to_zarr(m)
dst.to_zarr(m, append_dim=""time"")
```
raises:
```
ValueError: failed to prevent overwriting existing key _FillValue in attrs. This is probably an encoding field used by xarray to describe how a variable is serialized. To proceed, remove this key from the variable's attributes manually.
```
This seems to be due to the handling of CF conventions, which might go wrong in the append case: the `CFMaskCoder` verifies that no fill value is present in the dataset before defining one [here](https://github.com/pydata/xarray/blob/f42ac28629b7b2047f859f291e1d755c36f2e834/xarray/coding/variables.py#L166). I'd guess that in the append case, one shouldn't check whether a fill value is already defined, but rather check that it is the same.
However, I don't know a lot about the CF encoding pieces of xarray...
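Purely as an illustration, a minimal sketch of what such an equality check could look like (the helper name and the NaN handling are my assumptions, not xarray's actual code):
```python
import numpy as np

def check_fill_value_compatible(attrs, new_fill):
    # hypothetical helper: allow re-defining _FillValue only when it matches
    # the existing one; NaN needs special-casing because nan != nan
    existing = attrs.get(""_FillValue"")
    if existing is not None:
        same = existing == new_fill or (
            np.isnan(existing) and np.isnan(new_fill)
        )
        if not same:
            raise ValueError(
                f""_FillValue changed from {existing!r} to {new_fill!r}""
            )
    attrs[""_FillValue""] = new_fill
```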
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1077079208
https://github.com/pydata/xarray/issues/6069#issuecomment-1059063397,https://api.github.com/repos/pydata/xarray/issues/6069,1059063397,IC_kwDOAMm_X84_IAZl,6574622,2022-03-04T11:05:07Z,2022-03-04T11:05:07Z,CONTRIBUTOR,"This error is unrelated to region or append writes. The dataset `dst` got the `_FillValue` attribute from `rio.reproject`
```
>>> dst.air.attrs
{...
'_FillValue': nan}
```
but still carries encoding-information from `ds`, i.e.:
```
>>> dst.air.encoding
{'source': '...air_temperature.nc',
'original_shape': (2920, 25, 53),
'dtype': dtype('int16'),
'scale_factor': 0.01,
'grid_mapping': 'spatial_ref'}
```
The encoding gets picked up by `to_zarr`, but as `nan` (the `_FillValue` from `rio.reproject`) can't be represented as an `int16`, it's not possible to write that data.
You'll have to get rid of the encoding or specify an encoding and `_FillValue` that fit together. #5219 might be related.
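For illustration, a minimal sketch of the first option (assuming the `dst` from the snippet above; the store path is arbitrary):
```python
# drop the stale int16 netCDF encoding so the reprojected float data
# (with its nan _FillValue) is written with a compatible dtype
dst.air.encoding = {}
dst.to_zarr(""test_reprojected.zarr"")
```
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1077079208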
https://github.com/pydata/xarray/issues/6069#issuecomment-1059025444,https://api.github.com/repos/pydata/xarray/issues/6069,1059025444,IC_kwDOAMm_X84_H3Ik,6574622,2022-03-04T10:13:40Z,2022-03-04T10:13:40Z,CONTRIBUTOR,🤷 can't help any further without a minimal reproducible example here...,"{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1077079208
https://github.com/pydata/xarray/issues/6069#issuecomment-1058381922,https://api.github.com/repos/pydata/xarray/issues/6069,1058381922,IC_kwDOAMm_X84_FaBi,6574622,2022-03-03T18:56:13Z,2022-03-03T18:56:13Z,CONTRIBUTOR,"I don't yet have a proper answer, but I have three observations:
* The `ValueError` seems to be related to the handling of CF-Conventions. I don't yet know if that's independent of this issue or if the error only appears in conjunction with this issue.
* As far as I understand, appending should be possible without dropping anything (while potentially overwriting some things).
* It shouldn't be possible to change `_FillValue` during appends, because that might require rewriting everything previously written, which you likely don't want. So if `_FillValue` differs on the append call, I'd want `xarray` to produce an error.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1077079208
https://github.com/pydata/xarray/issues/6069#issuecomment-1052252098,https://api.github.com/repos/pydata/xarray/issues/6069,1052252098,IC_kwDOAMm_X84-uBfC,6574622,2022-02-26T16:07:56Z,2022-02-26T16:07:56Z,CONTRIBUTOR,"While testing a bit further, I found another case which might be dangerous:
```python
# ds is the same as above, but chunksize is {""time"": 1, ""x"": 1}
# once on the coordinator
ds.to_zarr(""test.zarr"", compute=False, encoding={""time"": {""chunks"": [1]}, ""x"": {""chunks"": [1]}})
# in parallel
ds.isel(time=slice(0,1), x=slice(0,1)).to_zarr(""test.zarr"", mode=""r+"", region={""time"": slice(0,1), ""x"": slice(0,1)})
ds.isel(time=slice(0,1), x=slice(1,2)).to_zarr(""test.zarr"", mode=""r+"", region={""time"": slice(0,1), ""x"": slice(1,2)})
ds.isel(time=slice(0,1), x=slice(2,3)).to_zarr(""test.zarr"", mode=""r+"", region={""time"": slice(0,1), ""x"": slice(2,3)})
ds.isel(time=slice(1,2), x=slice(0,1)).to_zarr(""test.zarr"", mode=""r+"", region={""time"": slice(1,2), ""x"": slice(0,1)})
ds.isel(time=slice(1,2), x=slice(1,2)).to_zarr(""test.zarr"", mode=""r+"", region={""time"": slice(1,2), ""x"": slice(1,2)})
ds.isel(time=slice(1,2), x=slice(2,3)).to_zarr(""test.zarr"", mode=""r+"", region={""time"": slice(1,2), ""x"": slice(2,3)})
```
This example doesn't produce any error, but the `time` and `x` coordinates are re-written multiple times without any warning. However, I don't yet know how a proper error / warning should be generated in this case. Maybe the check should be whether every written variable touches *all* region-ed dimensions? But maybe that's overly restrictive?
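A minimal sketch of that check (purely illustrative; the function name and error wording are my assumptions):
```python
def check_region_coverage(ds, region):
    # hypothetical check: every variable being written should span all
    # region-ed dimensions, otherwise it gets rewritten in full by each call
    for name, var in ds.variables.items():
        missing = set(region) - set(var.dims)
        if missing:
            raise ValueError(
                f""variable {name!r} does not span region dimensions ""
                f""{missing} and would be overwritten by every region write""
            )
```
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1077079208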
https://github.com/pydata/xarray/issues/6069#issuecomment-1052240616,https://api.github.com/repos/pydata/xarray/issues/6069,1052240616,IC_kwDOAMm_X84-t-ro,6574622,2022-02-26T15:58:48Z,2022-02-26T15:58:48Z,CONTRIBUTOR,"I'm trying to picture some usage scenarios based on incrementally adding timesteps to data in a store. I hope these might help answer the questions above. In particular, I think the `append` and `region` options of `to_zarr` imply different usage patterns, so they might lead to different answers, and mixing the terms might lead to confusion.
I'll use the following dataset for demonstration code:
```python
ds = xr.Dataset({
""T"": ((""time"", ""x""), [[1.,2.,3.],[11.,12.,13.]]),
}, coords={
""time"": ((""time"",), [21., 22.]),
""x"": ((""x"",), [100., 200., 300.])
}).chunk({""time"": 1})
```
## `append`
The purpose of `append` is to add (one or many) elements along one dimension after the end of all currently existing elements. This implies a read-modify-write cycle on at least the total shape of the array. Furthermore, the place to write new chunks is determined by the current shape of the existing array. Due to these implications, it doesn't seem useful to run `append` in parallel (it would become ambiguous where to write), and it doesn't seem very useful (though possible) to write only *some* of the variables defined on the append dimension, because all other variables would implicitly be filled with `fill_value`, and those places couldn't be filled by another `append` anymore.
As a consequence, append-mode writes will always have to be **sequential**, and writes to objects touched by multiple append calls will always have defined behaviour, even if they are modified / overwritten with each call. Creating and appending works as follows:
```python
# writes 0-sized time-dimension, so only metadata and non-time dependent variables
ds.isel(time=slice(0,0)).to_zarr(""test_append.zarr"")
!tree -a test_append.zarr
ds.isel(time=slice(0,1)).to_zarr(""test_append.zarr"", mode=""a"", append_dim=""time"")
ds.isel(time=slice(1,2)).to_zarr(""test_append.zarr"", mode=""a"", append_dim=""time"")
print()
print(""final dataset:"")
!tree -a test_append.zarr
```
Output
```
test_append.zarr
├── .zattrs
├── .zgroup
├── .zmetadata
├── T
│ ├── .zarray
│ └── .zattrs
├── time
│ ├── .zarray
│ └── .zattrs
└── x
├── .zarray
├── .zattrs
└── 0
3 directories, 10 files
final dataset:
test_append.zarr
├── .zattrs
├── .zgroup
├── .zmetadata
├── T
│ ├── .zarray
│ ├── .zattrs
│ ├── 0.0
│ └── 1.0
├── time
│ ├── .zarray
│ ├── .zattrs
│ ├── 0
│ └── 1
└── x
├── .zarray
├── .zattrs
└── 0
3 directories, 14 files
```
In this case, `x` would be overwritten with each append call, but the behaviour is well defined as we only ever append sequentially: whatever the last call writes into `x` will be the final result, e.g. `[1, 2, 3]` in the following case:
```python
ds.isel(time=slice(0,1)).to_zarr(""test_append.zarr"", mode=""a"", append_dim=""time"")
ds2 = ds.assign(x=[1,2,3])
ds2.isel(time=slice(1,2)).to_zarr(""test_append.zarr"", mode=""a"", append_dim=""time"")
```
If instead, `x` shouldn't be overwritten, it's possible to append using:
```python
ds.drop([""x""]).isel(time=slice(0,1)).to_zarr(""test_append.zarr"", mode=""a"", append_dim=""time"")
ds.drop([""x""]).isel(time=slice(1,2)).to_zarr(""test_append.zarr"", mode=""a"", append_dim=""time"")
```
This also already works with current `xarray` and has well-defined behaviour. However, if there are many `time`-independent variables, it might be easier if something like `.drop_if_not(""time"")` were available.
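A minimal sketch of such a helper (the name `drop_if_not` is hypothetical, not an existing xarray API):
```python
def drop_if_not(ds, dim):
    # keep only variables that depend on `dim`, dropping everything else;
    # this mirrors the manual ds.drop([""x""]) calls above
    drop = [name for name, var in ds.variables.items() if dim not in var.dims]
    return ds.drop_vars(drop)

# usage: append only the time-dependent variables
# drop_if_not(ds, ""time"").isel(time=slice(1, 2)).to_zarr(
#     ""test_append.zarr"", mode=""a"", append_dim=""time"")
```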
## `region`
`region` behaves quite differently from `append`. It does not modify the shape of the arrays, and it does not depend on the current shape to determine where to write new data (the user specifies that explicitly). This generally enables **parallel** writes to the same dataset (as long as only distinct chunks are touched). But as metadata (e.g. the shape) is still shared, updates to metadata must be done in a coordinated (likely sequential) manner.
Generally, the workflow with `region` implies writing the metadata once (and maybe updating it from time to time, but sequentially, e.g. on a coordinating node) and writing all the chunks in parallel on worker nodes, while carefully ensuring that no shared chunks are overwritten. Let's see how this might look:
```python
ds.to_zarr(""test.zarr"", compute=False, encoding={""time"": {""chunks"": [1]}})
!rm test.zarr/time/0
!rm test.zarr/time/1
!tree -a test.zarr
# NOTE: these may run in parallel (even if that's not useful in time, but region might also be in time and space)
ds.drop(['x']).isel(time=slice(0,1)).to_zarr(""test.zarr"", mode=""r+"", region={""time"": slice(0,1)})
ds.drop(['x']).isel(time=slice(1,2)).to_zarr(""test.zarr"", mode=""r+"", region={""time"": slice(1,2)})
print()
print(""final dataset:"")
!tree -a test.zarr
```
Output
```
test.zarr
├── .zattrs
├── .zgroup
├── .zmetadata
├── T
│ ├── .zarray
│ └── .zattrs
├── time
│ ├── .zarray
│ └── .zattrs
└── x
├── .zarray
├── .zattrs
└── 0
3 directories, 10 files
final dataset:
test.zarr
├── .zattrs
├── .zgroup
├── .zmetadata
├── T
│ ├── .zarray
│ ├── .zattrs
│ ├── 0.0
│ └── 1.0
├── time
│ ├── .zarray
│ ├── .zattrs
│ ├── 0
│ └── 1
└── x
├── .zarray
├── .zattrs
└── 0
3 directories, 14 files
```
The above works and, as far as I understand, does what we'd want for parallel writes. It also avoids the ambiguous cases mentioned above (due to the `drop(['x'])` statements). However, this case is even more cumbersome to write than the append case. The parallel writes might again benefit from something like `.drop_if_not(""time"")` (which probably can't be optional in this case, due to the ambiguity). But what's even more problematic is the initial write of the array metadata. In order to start building the dataset, I have to scaffold a (potentially not yet computed) Dataset of full size and use `compute=False` to write only the metadata. However, this fails for coordinate variables (like `time`), because those are eagerly loaded and will still be written out. That's why I've removed those chunks in the example above.
If `region` should be used for parallel append, then there must be some process on a coordinating node which updates the metadata keys (at least by increasing the shape). I don't yet see how that could be written nicely using xarray.
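For what it's worth, a minimal sketch of what that coordinating step could look like using zarr directly (not an xarray API; the array names and store path follow the example above):
```python
import zarr

# grow the arrays along ""time"" by one step; workers can then fill the
# new region via to_zarr(..., region=...)
g = zarr.open_group(""test.zarr"", mode=""a"")
nt = g[""T""].shape[0]
g[""T""].resize(nt + 1, g[""T""].shape[1])
g[""time""].resize(nt + 1)
zarr.consolidate_metadata(""test.zarr"")
```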
---
So based on these two kinds of tasks, it seems to me that the actual `append` and `region` write-modes of `to_zarr` are already doing what they should do, but there could be some more convenience functions which would make those tasks much simpler:
* some method like `drop_if_not` (maybe with a better name) which drops all the things we don't want to keep (maybe we should call it `keep` instead of `drop`). This method would essentially implement and simplify mode 1 in @shoyer's answer, which I'd argue is what we actually want in both use cases, because the dropped data would already have been written by the coordinating process. I believe mode 1 shouldn't be the default for `to_zarr` though, because silently dropping data from being written isn't nice to the user.
* some better tooling for writing and updating zarr dataset metadata (I don't know whether that would fit in the realm of `xarray` though, as it amounts to handling Datasets without content. For ""appending"" metadata, I really don't know how I'd picture this properly in the `xarray` world.)","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1077079208