html_url,issue_url,id,node_id,user,created_at,updated_at,author_association,body,reactions,performed_via_github_app,issue https://github.com/pydata/xarray/issues/6069#issuecomment-1059405550,https://api.github.com/repos/pydata/xarray/issues/6069,1059405550,IC_kwDOAMm_X84_JT7u,6574622,2022-03-04T18:16:57Z,2022-03-04T18:16:57Z,CONTRIBUTOR,"I'll set up a new issue. @Boorhin, I couldn't confirm the weirdness with the small example, but I will add a note pointing to your comment. If you can reproduce the weirdness with the minimal example, would you add a comment to the new issue?","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1077079208 https://github.com/pydata/xarray/issues/6069#issuecomment-1059403646,https://api.github.com/repos/pydata/xarray/issues/6069,1059403646,IC_kwDOAMm_X84_JTd-,2448579,2022-03-04T18:14:18Z,2022-03-04T18:14:18Z,MEMBER,:+1: to creating a new issue with your minimal example (I think we're just missing a check whether the Dataset and on-disk fill values are equal). It did seem like there were two issues mixed up here. Thanks for confirming that.,"{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1077079208 https://github.com/pydata/xarray/issues/6069#issuecomment-1059400265,https://api.github.com/repos/pydata/xarray/issues/6069,1059400265,IC_kwDOAMm_X84_JSpJ,9576982,2022-03-04T18:09:44Z,2022-03-04T18:10:49Z,NONE,"@d70-t we can try to branch it off to the CF-related issue, yes. The `del` method is the one I tried, and when doing it on my files I had very weird things happen, so I would not recommend it as a proper workaround; as I wrote before, it was not appending to the file as it should have. I now have a run functioning with the `region` method, but I had to simulate my whole file, which was a bit challenging and is actually pretty easy to break, as I need to use the geometry of a single variable to generate the temporal and spatial coordinates for the whole archive. Going through all the variables is a bit of a no-go. I find the initialisation with both methods a real challenge. ","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1077079208 https://github.com/pydata/xarray/issues/6069#issuecomment-1059378287,https://api.github.com/repos/pydata/xarray/issues/6069,1059378287,IC_kwDOAMm_X84_JNRv,6574622,2022-03-04T17:39:24Z,2022-03-04T17:39:24Z,CONTRIBUTOR,"I've made a simpler example of the `_FillValue`-append issue: ```python import numpy as np import xarray as xr ds = xr.Dataset({""a"": (""x"", [3.], {""_FillValue"": np.nan})}) m = {} ds.to_zarr(m) ds.to_zarr(m, append_dim=""x"") ``` raises ``` ValueError: failed to prevent overwriting existing key _FillValue in attrs. This is probably an encoding field used by xarray to describe how a variable is serialized. To proceed, remove this key from the variable's attributes manually. ``` I'd expect this to just work (effectively concatenating the dataset to itself). The workaround: ```python m = {} ds.to_zarr(m) del ds.a.attrs[""_FillValue""] ds.to_zarr(m, append_dim=""x"") ``` does the trick, but doesn't look right.
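A user-side version of the fill-value equality check @dcherian mentions could look roughly like this (a sketch, untested, assuming the minimal example above where both fill values are NaN):

```python
import numpy as np
import zarr

# xarray stores the encoded _FillValue as the zarr array's fill_value,
# so it can be read back from the store for comparison
on_disk = zarr.open_group(m)['a'].fill_value
in_mem = ds.a.attrs['_FillValue']

# NaN != NaN, so NaNs have to be compared explicitly
if on_disk == in_mem or (np.isnan(on_disk) and np.isnan(in_mem)):
    # the values agree, so dropping the attribute before appending is safe
    del ds.a.attrs['_FillValue']
    ds.to_zarr(m, append_dim='x')
else:
    raise ValueError('inconsistent _FillValue between dataset and store')
```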
@dcherian, @Boorhin should we make a new (CF-related) issue out of this and try to keep focussing on the append and region use-cases here, which seemed to be the initial problem in this thread (probably by going further through your example, @Boorhin?).","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1077079208 https://github.com/pydata/xarray/issues/6069#issuecomment-1059274384,https://api.github.com/repos/pydata/xarray/issues/6069,1059274384,IC_kwDOAMm_X84_Iz6Q,9576982,2022-03-04T15:42:36Z,2022-03-04T15:42:36Z,NONE,"I have tried to specify the chunks before writing the dataset, and I have had some really strange behaviour with data written into the same chunks: the time dimension never went over 5, growing and shrinking through the processing...","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1077079208 https://github.com/pydata/xarray/issues/6069#issuecomment-1059121536,https://api.github.com/repos/pydata/xarray/issues/6069,1059121536,IC_kwDOAMm_X84_IOmA,9576982,2022-03-04T12:30:01Z,2022-03-04T12:30:01Z,NONE,"Effectively, I have unstable results, sometimes with errors of timesteps refusing to write. I systematically get this warning: ```python /opt/conda/lib/python3.7/site-packages/xarray/core/dataset.py:2050: SerializationWarning: saving variable None with floating point data as an integer dtype without any _FillValue to use for NaNs safe_chunks=safe_chunks, ``` The crashes are related to the time dimension itself, but time is always of size 1, so it is hard to understand: ```python /tmp/ipykernel_1629/1269180709.py in aggregate_with_time(farm_name, resolution_M, canvas, W, H, master_raster_coordinates) 39 raster.drop( 40 ['x','y']).to_zarr( ---> 41 uri, mode='a', append_dim='time') 42 #except: 43 #print('something went wrong') /opt/conda/lib/python3.7/site-packages/xarray/core/dataset.py in to_zarr(self, store, chunk_store, mode, synchronizer, group, encoding, compute, consolidated, append_dim, region, safe_chunks, storage_options) 2048 append_dim=append_dim, 2049 region=region, -> 2050 safe_chunks=safe_chunks, 2051 ) 2052 /opt/conda/lib/python3.7/site-packages/xarray/backends/api.py in to_zarr(dataset, store, chunk_store, mode, synchronizer, group, encoding, compute, consolidated, append_dim, region, safe_chunks, storage_options) 1406 _validate_datatypes_for_zarr_append(dataset) 1407 if append_dim is not None: -> 1408 existing_dims = zstore.get_dimensions() 1409 if append_dim not in existing_dims: 1410 raise ValueError( /opt/conda/lib/python3.7/site-packages/xarray/backends/zarr.py in get_dimensions(self) 450 if d in dimensions and dimensions[d] != s: 451 raise ValueError( --> 452 f""found conflicting lengths for dimension {d} "" 453 f""({s} != {dimensions[d]})"" 454 ) ValueError: found conflicting lengths for dimension time (2 != 1) ```","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1077079208 https://github.com/pydata/xarray/issues/6069#issuecomment-1059078961,https://api.github.com/repos/pydata/xarray/issues/6069,1059078961,IC_kwDOAMm_X84_IEMx,6574622,2022-03-04T11:27:12Z,2022-03-04T11:27:44Z,CONTRIBUTOR,"btw, as a work-around it works when removing the `_FillValue` from `dst.air` (you'll likely only want to do this for the append-writes, not the initial write): ```python del dst.air.attrs[""_FillValue""] dst.to_zarr(m,
append_dim=""time"") ``` works. But still, this might call for another issue to solve.","{""total_count"": 1, ""+1"": 1, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1077079208 https://github.com/pydata/xarray/issues/6069#issuecomment-1059078276,https://api.github.com/repos/pydata/xarray/issues/6069,1059078276,IC_kwDOAMm_X84_IECE,9576982,2022-03-04T11:26:04Z,2022-03-04T11:26:04Z,NONE,"In my case I specify _fillvalue in the reprojection so I would not think this is an issue to overwrite it. I just don't know how to do it","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1077079208 https://github.com/pydata/xarray/issues/6069#issuecomment-1059076885,https://api.github.com/repos/pydata/xarray/issues/6069,1059076885,IC_kwDOAMm_X84_IDsV,6574622,2022-03-04T11:23:56Z,2022-03-04T11:23:56Z,CONTRIBUTOR,"Ok, I believe, I've now reproduced your error: ```python import xarray as xr from rasterio.enums import Resampling import numpy as np ds = xr.tutorial.open_dataset('air_temperature').isel(time=0) ds = ds.rio.write_crs('EPSG:4326') dst = ds.rio.reproject('EPSG:3857', shape=(250, 250), resampling=Resampling.bilinear, nodata=np.nan) dst.air.encoding = {} dst = dst.assign(air=dst.air.expand_dims(""time""), time=dst.time.expand_dims(""time"")) m = {} dst.to_zarr(m) dst.to_zarr(m, append_dim=""time"") ``` raises: ``` ValueError: failed to prevent overwriting existing key _FillValue in attrs. This is probably an encoding field used by xarray to describe how a variable is serialized. To proceed, remove this key from the variable's attributes manually. ``` This seems to be due to handling of CF-Conventions which might go wrong in the append case: the `CFMaskCoder` verifies that there isn't any fill value present in the dataset before defining one [here](https://github.com/pydata/xarray/blob/f42ac28629b7b2047f859f291e1d755c36f2e834/xarray/coding/variables.py#L166). I'd guess in the append case, one wouldn't want to check if the fill value is already defined, but instead want to check that it is the same. However, I don't know a lot about the CF encoding pieces of xarray... ","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1077079208 https://github.com/pydata/xarray/issues/6069#issuecomment-1059063397,https://api.github.com/repos/pydata/xarray/issues/6069,1059063397,IC_kwDOAMm_X84_IAZl,6574622,2022-03-04T11:05:07Z,2022-03-04T11:05:07Z,CONTRIBUTOR,"This error ist unrelated to region or append writes. The dataset `dst` got the `_FillValue` attribute from `rio.reproject` ``` >>> dst.air.attrs {... '_FillValue': nan} ``` but still carries encoding-information from `ds`, i.e.: ``` >>> dst.air.encoding {'source': '...air_temperature.nc', 'original_shape': (2920, 25, 53), 'dtype': dtype('int16'), 'scale_factor': 0.01, 'grid_mapping': 'spatial_ref'} ``` The encoding get's picked up by `to_zarr`, but as `nan` (the `_FillValue` from `rio.reproject`) can't be expressed as an `int16`, it's not possible to write that data. You'll have to get rid of the encoding or specify some encoding and `_FillValue` which fit together. 
#5219 might be related.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1077079208 https://github.com/pydata/xarray/issues/6069#issuecomment-1059052257,https://api.github.com/repos/pydata/xarray/issues/6069,1059052257,IC_kwDOAMm_X84_H9rh,9576982,2022-03-04T10:50:09Z,2022-03-04T10:50:09Z,NONE,"OK, that's not exactly the same error message; I could not even start the appending. But that's basically one example that could be tested. A model would want to compute each of these variables step by step, variable by variable, and save them at each single iteration. There is no need for concurrent writing, as most of the resources are focused on the modelling. ```python import xarray as xr from rasterio.enums import Resampling import numpy as np ds = xr.tutorial.open_dataset('air_temperature').isel(time=0) ds = ds.rio.write_crs('EPSG:4326') dst = ds.rio.reproject('EPSG:3857', shape=(250, 250), resampling=Resampling.bilinear, nodata=np.nan) dst.to_zarr('test.zarr') ``` Returns > --------------------------------------------------------------------------- > ValueError Traceback (most recent call last) > /opt/conda/lib/python3.7/site-packages/zarr/util.py in normalize_fill_value(fill_value, dtype) > 277 else: > --> 278 fill_value = np.array(fill_value, dtype=dtype)[()] > 279 > > ValueError: cannot convert float NaN to integer > > During handling of the above exception, another exception occurred: > > ValueError Traceback (most recent call last) > /tmp/ipykernel_2604/3259577033.py in > ----> 1 dst.to_zarr('test.zarr') > > /opt/conda/lib/python3.7/site-packages/xarray/core/dataset.py in to_zarr(self, store, chunk_store, mode, synchronizer, group, encoding, compute, consolidated, append_dim, region, safe_chunks, storage_options) > 2048 append_dim=append_dim, > 2049 region=region, > -> 2050 safe_chunks=safe_chunks, > 2051 ) > 2052 > > /opt/conda/lib/python3.7/site-packages/xarray/backends/api.py in to_zarr(dataset, store, chunk_store, mode, synchronizer, group, encoding, compute, consolidated, append_dim, region, safe_chunks, storage_options) > 1429 writer = ArrayWriter() > 1430 # TODO: figure out how to properly handle unlimited_dims > -> 1431 dump_to_store(dataset, zstore, writer, encoding=encoding) > 1432 writes = writer.sync(compute=compute) > 1433 > > /opt/conda/lib/python3.7/site-packages/xarray/backends/api.py in dump_to_store(dataset, store, writer, encoder, encoding, unlimited_dims) > 1117 variables, attrs = encoder(variables, attrs) > 1118 > -> 1119 store.store(variables, attrs, check_encoding, writer, unlimited_dims=unlimited_dims) > 1120 > 1121 > > /opt/conda/lib/python3.7/site-packages/xarray/backends/zarr.py in store(self, variables, attributes, check_encoding_set, writer, unlimited_dims) > 549 > 550 self.set_variables( > --> 551 variables_encoded, check_encoding_set, writer, unlimited_dims=unlimited_dims > 552 ) > 553 if self._consolidate_on_close: > > /opt/conda/lib/python3.7/site-packages/xarray/backends/zarr.py in set_variables(self, variables, check_encoding_set, writer, unlimited_dims) > 607 dtype = str > 608 zarr_array = self.zarr_group.create( > --> 609 name, shape=shape, dtype=dtype, fill_value=fill_value, **encoding > 610 ) > 611 zarr_array.attrs.put(encoded_attrs) > > /opt/conda/lib/python3.7/site-packages/zarr/hierarchy.py in create(self, name, **kwargs) > 889 """"""Create an array. Keyword arguments as per > 890 :func:`zarr.creation.create`."""""" > --> 891 return self._write_op(self._create_nosync, name, **kwargs) > 892 > 893 def _create_nosync(self, name, **kwargs): > > /opt/conda/lib/python3.7/site-packages/zarr/hierarchy.py in _write_op(self, f, *args, **kwargs) > 659 > 660 with lock: > --> 661 return f(*args, **kwargs) > 662 > 663 def create_group(self, name, overwrite=False): > > /opt/conda/lib/python3.7/site-packages/zarr/hierarchy.py in _create_nosync(self, name, **kwargs) > 896 kwargs.setdefault('cache_attrs', self.attrs.cache) > 897 return create(store=self._store, path=path, chunk_store=self._chunk_store, > --> 898 **kwargs) > 899 > 900 def empty(self, name, **kwargs): > > /opt/conda/lib/python3.7/site-packages/zarr/creation.py in create(shape, chunks, dtype, compressor, fill_value, order, store, synchronizer, overwrite, path, chunk_store, filters, cache_metadata, cache_attrs, read_only, object_codec, dimension_separator, **kwargs) > 139 fill_value=fill_value, order=order, overwrite=overwrite, path=path, > 140 chunk_store=chunk_store, filters=filters, object_codec=object_codec, > --> 141 dimension_separator=dimension_separator) > 142 > 143 # instantiate array > > /opt/conda/lib/python3.7/site-packages/zarr/storage.py in init_array(store, shape, chunks, dtype, compressor, fill_value, order, overwrite, path, chunk_store, filters, object_codec, dimension_separator) > 356 chunk_store=chunk_store, filters=filters, > 357 object_codec=object_codec, > --> 358 dimension_separator=dimension_separator) > 359 > 360 > > /opt/conda/lib/python3.7/site-packages/zarr/storage.py in _init_array_metadata(store, shape, chunks, dtype, compressor, fill_value, order, overwrite, path, chunk_store, filters, object_codec, dimension_separator) > 392 chunks = normalize_chunks(chunks, shape, dtype.itemsize) > 393 order = normalize_order(order) > --> 394 fill_value = normalize_fill_value(fill_value, dtype) > 395 > 396 # optional array metadata > > /opt/conda/lib/python3.7/site-packages/zarr/util.py in normalize_fill_value(fill_value, dtype) > 281 # re-raise with our own error message to be helpful > 282 raise ValueError('fill_value {!r} is not valid for dtype {}; nested ' > --> 283 'exception: {}'.format(fill_value, dtype, e)) > 284 > 285 return fill_value > > ValueError: fill_value nan is not valid for dtype int16; nested exception: cannot convert float NaN to integer","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1077079208 https://github.com/pydata/xarray/issues/6069#issuecomment-1059025444,https://api.github.com/repos/pydata/xarray/issues/6069,1059025444,IC_kwDOAMm_X84_H3Ik,6574622,2022-03-04T10:13:40Z,2022-03-04T10:13:40Z,CONTRIBUTOR,🤷 can't help any further without a minimal reproducible example here...,"{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1077079208 https://github.com/pydata/xarray/issues/6069#issuecomment-1059022639,https://api.github.com/repos/pydata/xarray/issues/6069,1059022639,IC_kwDOAMm_X84_H2cv,9576982,2022-03-04T10:10:08Z,2022-03-04T10:10:08Z,NONE,"The `_FillValue` is always the same (np.nan) and is specified when I reproject with rioxarray, so I don't understand the first error then. The thing is that the `_FillValue` is attached to a variable, not the whole dataset. But it never changes.
I'm not too sure what to do.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1077079208 https://github.com/pydata/xarray/issues/6069#issuecomment-1058381922,https://api.github.com/repos/pydata/xarray/issues/6069,1058381922,IC_kwDOAMm_X84_FaBi,6574622,2022-03-03T18:56:13Z,2022-03-03T18:56:13Z,CONTRIBUTOR,"I don't yet know a proper answer, but there are three observations I can offer: * The `ValueError` seems to be related to the handling of CF-Conventions. I don't yet know if that's independent of this issue or if the error only appears in conjunction with this issue. * As far as I understand, appending should be possible without dropping anything (while potentially overwriting some things). * It shouldn't be possible to change `_FillValue` during appends, because that might require rewriting everything previously written, which you likely don't want. So if `_FillValue` is different on the append call, I'd want `xarray` to produce an error.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1077079208 https://github.com/pydata/xarray/issues/6069#issuecomment-1058323632,https://api.github.com/repos/pydata/xarray/issues/6069,1058323632,IC_kwDOAMm_X84_FLyw,9576982,2022-03-03T17:54:27Z,2022-03-03T17:54:27Z,NONE,"I did set `ds.attrs = {}`, but at each append I get a warning: ``` /opt/conda/lib/python3.7/site-packages/xarray/core/dataset.py:2050: SerializationWarning: saving variable None with floating point data as an integer dtype without any _FillValue to use for NaNs safe_chunks=safe_chunks, ```","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1077079208 https://github.com/pydata/xarray/issues/6069#issuecomment-1058315108,https://api.github.com/repos/pydata/xarray/issues/6069,1058315108,IC_kwDOAMm_X84_FJtk,9576982,2022-03-03T17:45:15Z,2022-03-03T17:45:15Z,NONE,"I have looked at these examples and I still don't manage to make it work in the real world. I find append the most logical, but I have attributes attached to the dataset that I don't seem to be able to drop before appending. This generates this error: `ValueError: failed to prevent overwriting existing key _FillValue in attrs. This is probably an encoding field used by xarray to describe how a variable is serialized.
To proceed, remove this key from the variable's attributes manually.` However, I cannot find a way of getting rid of this attribute.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1077079208 https://github.com/pydata/xarray/issues/6069#issuecomment-1052252098,https://api.github.com/repos/pydata/xarray/issues/6069,1052252098,IC_kwDOAMm_X84-uBfC,6574622,2022-02-26T16:07:56Z,2022-02-26T16:07:56Z,CONTRIBUTOR,"While testing a bit further, I found another case which might potentially be dangerous: ```python # ds is the same as above, but chunksize is {""time"": 1, ""x"": 1} # once on the coordinator ds.to_zarr(""test.zarr"", compute=False, encoding={""time"": {""chunks"": [1]}, ""x"": {""chunks"": [1]}}) # in parallel ds.isel(time=slice(0,1), x=slice(0,1)).to_zarr(""test.zarr"", mode=""r+"", region={""time"": slice(0,1), ""x"": slice(0,1)}) ds.isel(time=slice(0,1), x=slice(1,2)).to_zarr(""test.zarr"", mode=""r+"", region={""time"": slice(0,1), ""x"": slice(1,2)}) ds.isel(time=slice(0,1), x=slice(2,3)).to_zarr(""test.zarr"", mode=""r+"", region={""time"": slice(0,1), ""x"": slice(2,3)}) ds.isel(time=slice(1,2), x=slice(0,1)).to_zarr(""test.zarr"", mode=""r+"", region={""time"": slice(1,2), ""x"": slice(0,1)}) ds.isel(time=slice(1,2), x=slice(1,2)).to_zarr(""test.zarr"", mode=""r+"", region={""time"": slice(1,2), ""x"": slice(1,2)}) ds.isel(time=slice(1,2), x=slice(2,3)).to_zarr(""test.zarr"", mode=""r+"", region={""time"": slice(1,2), ""x"": slice(2,3)}) ``` This example doesn't produce any error, but the `time` and `x` coordinates are re-written multiple times without any warning. However, I don't yet know how a proper error / warning should be generated in this case. Maybe the check should be whether every written variable touches *all* region-ed dimensions? But maybe that's overly restrictive?","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1077079208 https://github.com/pydata/xarray/issues/6069#issuecomment-1052240616,https://api.github.com/repos/pydata/xarray/issues/6069,1052240616,IC_kwDOAMm_X84-t-ro,6574622,2022-02-26T15:58:48Z,2022-02-26T15:58:48Z,CONTRIBUTOR,"I'm trying to picture some usage scenarios based on incrementally adding timesteps to data in a store. I hope these might help to answer the questions from above. In particular, I think the `append` and `region` options of `to_zarr` imply different usage patterns, so they might lead to different answers, and mixing terms might lead to confusion. I'll use the following dataset for demonstration code: ```python ds = xr.Dataset({ ""T"": ((""time"", ""x""), [[1.,2.,3.],[11.,12.,13.]]), }, coords={ ""time"": ((""time"",), [21., 22.]), ""x"": ((""x"",), [100., 200., 300.]) }).chunk({""time"": 1}) ``` ## `append` The purpose of `append` is to add (one or many) elements along one dimension after the end of all currently existing elements. This implies a read-modify-write cycle on at least the total shape of the array. Furthermore, the place to write new chunks is determined by the current shape of the existing array.
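At the zarr level, an append conceptually has to do something like the following (a rough sketch, not xarray's actual implementation):

```python
import zarr

g = zarr.open_group('test_append.zarr')
n_old = g['T'].shape[0]                     # 1. read the current shape
g['T'].resize(n_old + 1, g['T'].shape[1])   # 2. grow the array (a metadata write)
g['T'][n_old:, :] = [[21., 22., 23.]]       # 3. write the new data after the old end
```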
Due to these implications, it doesn't seem to be useful to try `append` in parallel (it would become ambiguous where to write), and it doesn't seem to be too useful (but is possible) to only write *some* of the variables defined on the append-dimension, because all other variables would implicitly be filled with `fill_value` and those places couldn't be filled with another `append` anymore. As a consequence, append-mode writes will always have to be **sequential**, and writes to objects touched by multiple append calls will always have a defined behaviour, even if they are modified / overwritten with each call. Creating and appending works as follows: ```python # writes a 0-sized time-dimension, so only metadata and non-time-dependent variables ds.isel(time=slice(0,0)).to_zarr(""test_append.zarr"") !tree -a test_append.zarr ds.isel(time=slice(0,1)).to_zarr(""test_append.zarr"", mode=""a"", append_dim=""time"") ds.isel(time=slice(1,2)).to_zarr(""test_append.zarr"", mode=""a"", append_dim=""time"") print() print(""final dataset:"") !tree -a test_append.zarr ```
Output ``` test_append.zarr ├── .zattrs ├── .zgroup ├── .zmetadata ├── T │ ├── .zarray │ └── .zattrs ├── time │ ├── .zarray │ └── .zattrs └── x ├── .zarray ├── .zattrs └── 0 3 directories, 10 files final dataset: test_append.zarr ├── .zattrs ├── .zgroup ├── .zmetadata ├── T │ ├── .zarray │ ├── .zattrs │ ├── 0.0 │ └── 1.0 ├── time │ ├── .zarray │ ├── .zattrs │ ├── 0 │ └── 1 └── x ├── .zarray ├── .zattrs └── 0 3 directories, 14 files ```
In this case, `x` would be overwritten with each append call, but the behaviour is well defined as we will only ever append sequentially, so whatever the last write puts into `x` will be the final result, e.g. `[1, 2, 3]` in the following case: ```python ds.isel(time=slice(0,1)).to_zarr(""test_append.zarr"", mode=""a"", append_dim=""time"") ds2 = ds.assign(x=[1,2,3]) ds2.isel(time=slice(1,2)).to_zarr(""test_append.zarr"", mode=""a"", append_dim=""time"") ``` If, instead, `x` shouldn't be overwritten, it's possible to append using: ```python ds.drop([""x""]).isel(time=slice(0,1)).to_zarr(""test_append.zarr"", mode=""a"", append_dim=""time"") ds.drop([""x""]).isel(time=slice(1,2)).to_zarr(""test_append.zarr"", mode=""a"", append_dim=""time"") ``` This also already works with current `xarray` and has well-defined behaviour. However, if there are many `time`-independent variables, it might be easier if something like `.drop_if_not(""time"")` or something similar were available. ## `region` `region` behaves quite differently from `append`. It does not modify the shape of the arrays, and it does not depend on the shape's value to determine where to write new data (it requires user input to do so). This generally enables **parallel** writes to the same dataset (if only distinct chunks are touched). But as metadata (e.g. shape) is still shared, updates to metadata must be done in a coordinated (likely sequential) manner. Generally, the workflow with `region` would imply writing the metadata once and maybe updating it from time to time, but sequentially (e.g. on a coordinating node), and writing all the chunks in parallel on worker nodes, while carefully ensuring that no common chunks are overwritten. Let's see what this might look like: ```python ds.to_zarr(""test.zarr"", compute=False, encoding={""time"": {""chunks"": [1]}}) !rm test.zarr/time/0 !rm test.zarr/time/1 !tree -a test.zarr # NOTE: these may run in parallel (even if that's not useful in time, but region might also be in time and space) ds.drop(['x']).isel(time=slice(0,1)).to_zarr(""test.zarr"", mode=""r+"", region={""time"": slice(0,1)}) ds.drop(['x']).isel(time=slice(1,2)).to_zarr(""test.zarr"", mode=""r+"", region={""time"": slice(1,2)}) print() print(""final dataset:"") !tree -a test.zarr ```
Output ``` test.zarr ├── .zattrs ├── .zgroup ├── .zmetadata ├── T │ ├── .zarray │ └── .zattrs ├── time │ ├── .zarray │ └── .zattrs └── x ├── .zarray ├── .zattrs └── 0 3 directories, 10 files final dataset: test.zarr ├── .zattrs ├── .zgroup ├── .zmetadata ├── T │ ├── .zarray │ ├── .zattrs │ ├── 0.0 │ └── 1.0 ├── time │ ├── .zarray │ ├── .zattrs │ ├── 0 │ └── 1 └── x ├── .zarray ├── .zattrs └── 0 3 directories, 14 files ```
The above works and, as far as I understand, does what we'd want for parallel writes. It also avoids the mentioned ambiguous cases (due to the `drop(['x'])` statements). However, this case is even more cumbersome to write than the append case. The parallel writes might again benefit from something like `.drop_if_not(""time"")` (which probably can't be optional in this case due to ambiguity). But what's even more problematic is the initial write of array metadata. In order to start building the dataset, I'll have to scaffold a (potentially not yet computed) Dataset of full size and use `compute=False` to write only metadata. However, this fails for coordinate variables (like time), because those are eagerly loaded and will still be written out. That's why I've removed those chunks in the example above. If `region` should be used for parallel append, then there must be some process on a coordinating node which updates the metadata keys (at least by increasing the shape). I don't yet see how that could be written nicely using xarray. --- So based on these two kinds of tasks, it seems to me that the actual `append` and `region` write-modes of `to_zarr` are already doing what they should do, but there could be some more convenience functions which would make those tasks much simpler: * some method like `drop_if_not` (maybe with a better name) which drops all the things we don't want to keep (maybe we should call it `keep` instead of `drop`). This method would essentially result in and simplify mode 1 in @shoyer's answer, which I'd argue is what we actually want in both use cases, because the dropped data would already have been written by the coordinating process. I'd believe that mode 1 shouldn't be the default for `to_zarr` though, because silently dropping data from being written isn't nice to the user. * some better tooling for writing and updating zarr dataset metadata (I don't know if that would fit in the realm of `xarray` though, as it looks like handling Datasets without content. For ""appending"" metadata, I really don't know how I'd picture this properly in the `xarray` world.)","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1077079208 https://github.com/pydata/xarray/issues/6069#issuecomment-1034678675,https://api.github.com/repos/pydata/xarray/issues/6069,1034678675,IC_kwDOAMm_X849q_GT,9576982,2022-02-10T09:18:47Z,2022-02-10T09:18:47Z,NONE,"If Xarray/zarr is to replace netcdf, appending by time step is a really important feature. Most (all?) numerical models will output results per time step onto a multidimensional grid with different variables. Said grid will also have other parameters that help rebuild the geometry or follow standards, like CF and UGRID (the things that you are supposed to drop). The geometry of the grid is computed at the initialisation of the model. It is a bit counter-intuitive to get rid of it for incremental backups, especially as each write will not concern this part of the file. What I do at the moment is: I create a first dataset at the final dimensions based on dummy dask arrays and export it `to_zarr` with `compute=False`. With a buffer system, I create a new dataset for **each** buffer with the right data at the right place, meaning only the time interval concerned, and I write it `to_zarr` with the `region` argument. I flush the buffer dataset after it has been written. At the end I write all the parameters before closing the main dataset.
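A rough sketch of this pattern (made-up sizes and names, untested):

```python
import dask.array as da
import numpy as np
import xarray as xr

n_time, n_node, buf = 1000, 5000, 10

# scaffold the full-size dataset with dummy dask arrays; compute=False
# writes only the metadata (plus eagerly loaded coordinates)
template = xr.Dataset(
    {'T': (('time', 'node'), da.zeros((n_time, n_node), chunks=(buf, n_node)))},
    coords={'time': np.arange(n_time)},
)
template.to_zarr('run.zarr', compute=False)

# per buffer: build a small dataset holding only the timesteps concerned
# and write it into the matching region
for start in range(0, n_time, buf):
    sl = slice(start, start + buf)
    chunk = xr.Dataset(
        {'T': (('time', 'node'), np.random.rand(buf, n_node))},
        coords={'time': np.arange(start, start + buf)},
    )
    chunk.to_zarr('run.zarr', mode='r+', region={'time': sl})
```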
To my knowledge, that's the only method which works.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1077079208 https://github.com/pydata/xarray/issues/6069#issuecomment-1034196986,https://api.github.com/repos/pydata/xarray/issues/6069,1034196986,IC_kwDOAMm_X849pJf6,1217238,2022-02-09T21:12:31Z,2022-02-09T21:12:31Z,MEMBER,"The reason this isn't allowed is that it's ambiguous what to do with the other variables that are not restricted to the region (['cell', 'face', 'layer', 'max_cell_node', 'max_face_nodes', 'node', 'siglay'] in this case). I can imagine quite a few different ways this behavior could be implemented: 1. Ignore these variables entirely. 2. Ignore variables if they already exist, but write new ones. 3. Write or overwrite both new and existing variables. 4. Write new variables; ignore existing variables only if they already exist with the same values, and if not, raise an error. I believe your proposal here (removing these checks from `_validate_region`) would achieve (3), but I'm not sure that's the best option. (4) seems like perhaps the most user-friendly option, but checking existing variables can add significant overhead. When experimenting with adding `region` support to Xarray-Beam, I found many cases where it was easy to inadvertently make large parallel pipelines much slower by downloading existing variables. The current solution is not to do any of these, and to force the user to make an explicit choice by dropping these variables, or writing them in a separate call to `to_zarr`. I think it would also be OK to let a user explicitly opt in to one of these behaviors, but I don't think guessing what the user wants would be ideal.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1077079208 https://github.com/pydata/xarray/issues/6069#issuecomment-1033814820,https://api.github.com/repos/pydata/xarray/issues/6069,1033814820,IC_kwDOAMm_X849nsMk,43613877,2022-02-09T14:23:54Z,2022-02-09T14:36:48Z,CONTRIBUTOR,"You are right, the coordinates should not be dropped. I think the function [_validate_region](https://github.com/pydata/xarray/blob/39860f9bd3ed4e84a5d694adda10c82513ed519f/xarray/backends/api.py#L1244) has a bug. Currently it checks, for all `ds.variables`, whether at least one of their dimensions agrees with the ones given in the region argument. However, `ds.variables` also returns the [coordinates](https://xarray.pydata.org/en/stable/generated/xarray.Dataset.variables.html), while we actually only want to check whether the `ds.data_vars` have a dimension intersecting with the given `region`.
Changing the function to ```python def _validate_region(ds, region): if not isinstance(region, dict): raise TypeError(f""``region`` must be a dict, got {type(region)}"") for k, v in region.items(): if k not in ds.dims: raise ValueError( f""all keys in ``region`` are not in Dataset dimensions, got "" f""{list(region)} and {list(ds.dims)}"" ) if not isinstance(v, slice): raise TypeError( ""all values in ``region`` must be slice objects, got "" f""region={region}"" ) if v.step not in {1, None}: raise ValueError( ""step on all slices in ``region`` must be 1 or None, got "" f""region={region}"" ) non_matching_vars = [ k for k, v in ds.data_vars.items() if not set(region).intersection(v.dims) ] if non_matching_vars: raise ValueError( f""when setting `region` explicitly in to_zarr(), all "" f""variables in the dataset to write must have at least "" f""one dimension in common with the region's dimensions "" f""{list(region.keys())}, but that is not "" f""the case for some variables here. To drop these variables "" f""from this dataset before exporting to zarr, write: "" f"".drop({non_matching_vars!r})"" ) ``` seems to work.","{""total_count"": 1, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 1, ""eyes"": 0}",,1077079208 https://github.com/pydata/xarray/issues/6069#issuecomment-1032480933,https://api.github.com/repos/pydata/xarray/issues/6069,1032480933,IC_kwDOAMm_X849imil,9576982,2022-02-08T11:01:21Z,2022-02-08T11:01:21Z,NONE,"I don't get the second crash. It is not true that these variables have no dimension in common: they are the coordinates of each of the variables, and they are all made the same. This is a typical example of an unstructured-grid backup. Meanwhile, I found an alternative solution which is also better for memory management. I think the documentation example doesn't actually work. I will try to write up my trick, but it doesn't use this particular `region` method, which is not functioning as it should in my opinion.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1077079208 https://github.com/pydata/xarray/issues/6069#issuecomment-1031773761,https://api.github.com/repos/pydata/xarray/issues/6069,1031773761,IC_kwDOAMm_X849f55B,43613877,2022-02-07T18:19:08Z,2022-02-07T18:19:08Z,CONTRIBUTOR,"Hi @Boorhin, I just ran into the same issue.
The `region` argument has to be of type `slice`; in your case, `slice(t, t+1)` (a length-1 slice selecting timestep `t`) instead of just `t` works: ```python import xarray as xr from datetime import datetime,timedelta import numpy as np dt= datetime.now() times= np.arange(dt,dt+timedelta(days=6), timedelta(hours=1)) nodesx,nodesy,layers=np.arange(10,50), np.arange(10,50)+15, np.arange(10) ds=xr.Dataset() ds.coords['time']=('time', times) ds.coords['node_x']=('node', nodesx) ds.coords['node_y']=('node', nodesy) ds.coords['layer']=('layer', layers) outfile='my_zarr' varnames=['potato','banana', 'apple'] for var in varnames: ds[var]=(('time', 'layer', 'node'), np.zeros((len(times), len(layers),len(nodesx)))) ds.to_zarr(outfile, mode='a') for t in range(len(times)): for var in varnames: ds[var].isel(time=slice(t, t+1)).values += np.random.random((len(layers),len(nodesx))) ds.isel(time=slice(t, t+1)).to_zarr(outfile, region={""time"": slice(t, t+1)}) ``` However, this leads to another issue: ```python --------------------------------------------------------------------------- ValueError Traceback (most recent call last) in 18 for var in varnames: 19 ds[var].isel(time=slice(t, t+1)).values += np.random.random((len(layers),len(nodesx))) ---> 20 ds.isel(time=slice(t, t+1)).to_zarr(outfile, region={""time"": slice(t, t+1)}) ~/.local/lib/python3.8/site-packages/xarray/core/dataset.py in to_zarr(self, store, chunk_store, mode, synchronizer, group, encoding, compute, consolidated, append_dim, region, safe_chunks) 2029 encoding = {} 2030 -> 2031 return to_zarr( 2032 self, 2033 store=store, ~/.local/lib/python3.8/site-packages/xarray/backends/api.py in to_zarr(dataset, store, chunk_store, mode, synchronizer, group, encoding, compute, consolidated, append_dim, region, safe_chunks) 1359 1360 if region is not None: -> 1361 _validate_region(dataset, region) 1362 if append_dim is not None and append_dim in region: 1363 raise ValueError( ~/.local/lib/python3.8/site-packages/xarray/backends/api.py in _validate_region(ds, region) 1272 ] 1273 if non_matching_vars: -> 1274 raise ValueError( 1275 f""when setting `region` explicitly in to_zarr(), all "" 1276 f""variables in the dataset to write must have at least "" ValueError: when setting `region` explicitly in to_zarr(), all variables in the dataset to write must have at least one dimension in common with the region's dimensions ['time'], but that is not the case for some variables here. To drop these variables from this dataset before exporting to zarr, write: .drop(['node_x', 'node_y', 'layer']) ``` Here, however, the solution is provided in the error message.
Following the instructions, the snippet below finally works (as far as I can tell): ```python import xarray as xr from datetime import datetime,timedelta import numpy as np dt= datetime.now() times= np.arange(dt,dt+timedelta(days=6), timedelta(hours=1)) nodesx,nodesy,layers=np.arange(10,50), np.arange(10,50)+15, np.arange(10) ds=xr.Dataset() ds.coords['time']=('time', times) # ds.coords['node_x']=('node', nodesx) # ds.coords['node_y']=('node', nodesy) # ds.coords['layer']=('layer', layers) outfile='my_zarr' varnames=['potato','banana', 'apple'] for var in varnames: ds[var]=(('time', 'layer', 'node'), np.zeros((len(times), len(layers),len(nodesx)))) ds.to_zarr(outfile, mode='a') for t in range(len(times)): for var in varnames: ds[var].isel(time=slice(t, t+1)).values += np.random.random((len(layers),len(nodesx))) ds.isel(time=slice(t, t+1)).to_zarr(outfile, region={""time"": slice(t, t+1)}) ``` Maybe one would like to generalise `region` in `api.py` to allow for single indices, or to throw a hint in case a type different from a slice is provided. Cheers","{""total_count"": 1, ""+1"": 1, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1077079208