id,node_id,number,title,user,state,locked,assignee,milestone,comments,created_at,updated_at,closed_at,author_association,active_lock_reason,draft,pull_request,body,reactions,performed_via_github_app,state_reason,repo,type 1966264258,I_kwDOAMm_X851Ms_C,8385,The method to_netcdf does not preserve chunks,40218891,open,0,,,3,2023-10-27T22:29:45Z,2023-10-31T18:51:45Z,,NONE,,,,"### What happened? Methods ``to_zarr`` and ``to_netcdf`` behave inconsistently for chunked datasets. The latter does not preserve existing chunk information; the chunks must be specified within the ``encoding`` dictionary. ### What did you expect to happen? I expected the behaviour to be consistent for all ``to_XXX()`` methods. ### Minimal Complete Verifiable Example ```Python import xarray as xr import dask.array as da rng = da.random.RandomState() shape = (20, 20) chunks = [10, 10] dims = [""x"", ""y""] z = rng.standard_normal(shape, chunks=chunks) ds = xr.DataArray(z, dims=dims, name=""z"").to_dataset() ds.chunks # This one is rechunked ds.to_netcdf(""/tmp/test1.nc"", encoding={""z"": {""chunksizes"": (5, 5)}}) # This one is not rechunked, also original chunks are lost ds.chunk({""x"": 5, ""y"": 5}).to_netcdf(""/tmp/test2.nc"") # This one is rechunked ds.chunk({""x"": 5, ""y"": 5}).to_zarr(""/tmp/test2"", mode=""w"") Frozen({'x': (10, 10), 'y': (10, 10)}) xr.open_mfdataset(""/tmp/test1.nc"").chunks xr.open_mfdataset(""/tmp/test2.nc"").chunks xr.open_mfdataset(""/tmp/test2"", engine=""zarr"").chunks Frozen({'x': (5, 5, 5, 5), 'y': (5, 5, 5, 5)}) Frozen({'x': (20,), 'y': (20,)}) Frozen({'x': (5, 5, 5, 5), 'y': (5, 5, 5, 5)}) ``` ### MVCE confirmation - [X] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray. - [X] Complete example — the example is self-contained, including all data and the text of any traceback. - [X] Verifiable example — the example copy & pastes into an IPython prompt or [Binder notebook](https://mybinder.org/v2/gh/pydata/xarray/main?urlpath=lab/tree/doc/examples/blank_template.ipynb), returning the result. - [X] New issue — a search of GitHub Issues suggests this is not a duplicate. - [X] Recent environment — the issue occurs with the latest version of xarray and its dependencies. ### Relevant log output _No response_ ### Anything else we need to know? I did get the same results for ``h5netcdf`` and ``scipy`` backends, so I am not sure whether this is a bug or not. The above code is a modified version of #2198. A suggestion: the documentation provides only examples of encoding styles. It would be helpful to provide links to a full specification. ### Environment
INSTALLED VERSIONS ------------------ commit: None python: 3.11.6 | packaged by conda-forge | (main, Oct 3 2023, 10:40:35) [GCC 12.3.0] python-bits: 64 OS: Linux OS-release: 6.5.5-1-MANJARO machine: x86_64 processor: byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: ('en_US', 'UTF-8') libhdf5: 1.14.2 libnetcdf: 4.9.2 xarray: 2023.10.1 pandas: 2.1.1 numpy: 1.24.4 scipy: 1.11.3 netCDF4: 1.6.4 pydap: None h5netcdf: 1.2.0 h5py: 3.10.0 Nio: None zarr: 2.16.1 cftime: 1.6.2 nc_time_axis: None PseudoNetCDF: None iris: None bottleneck: 1.3.7 dask: 2023.10.0 distributed: 2023.10.0 matplotlib: 3.8.0 cartopy: 0.22.0 seaborn: None numbagg: 0.5.1 fsspec: 2023.10.0 cupy: None pint: None sparse: 0.14.0 flox: 0.8.1 numpy_groupies: 0.10.2 setuptools: 68.2.2 pip: 23.3.1 conda: None pytest: None mypy: None IPython: 8.16.1 sphinx: None
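Coming back to the example above, a possible stop-gap until the two writers behave the same (this is only a sketch of mine, not an xarray feature; the ``chunk_encoding`` helper name is made up) is to mirror the dask chunk sizes into the netCDF ``chunksizes`` encoding by hand:
```python
import xarray as xr

def chunk_encoding(ds: xr.Dataset) -> dict:
    # Repeat each variable's leading dask chunk size per dimension so that
    # to_netcdf stores the same chunking that to_zarr would pick up.
    return {
        name: {'chunksizes': tuple(sizes[0] for sizes in var.chunks)}
        for name, var in ds.data_vars.items()
        if var.chunks is not None
    }

chunked = ds.chunk({'x': 5, 'y': 5})
chunked.to_netcdf('/tmp/test3.nc', encoding=chunk_encoding(chunked))
```
With the 20x20 example this should write ``(5, 5)`` chunks, matching the explicit ``encoding`` call in the MVCE.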
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/8385/reactions"", ""total_count"": 3, ""+1"": 3, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,issue 1953059418,I_kwDOAMm_X850aVJa,8345,`.stack` produces large chunks,40218891,closed,0,,,4,2023-10-19T21:09:56Z,2023-10-26T21:20:05Z,2023-10-26T21:20:05Z,NONE,,,,"### What happened? Xarray ``stack`` does not chunk along the last coordinate, producing huge chunks, as described in #5754. Dask, seeing code like this: ``` da2 = da.stack(new=(""z"", ""t"")).groupby(""new"").map(sum).unstack(""new"") ``` produces warning and suggestion to use context manager: ``` with dask.config.set(**{""array.slicing.split_large_chunks"": True}): da2 = da.stack(new=(""z"", ""t"")).groupby(""new"").map(sum).unstack(""new"") ``` This fails with message ``IndexError: tuple index out of range``. ### What did you expect to happen? I expect this to work. #5754 is closed. ### Minimal Complete Verifiable Example ```Python import dask.array import numpy as np import xarray as xr var = xr.Variable( (""t"", ""z"", ""u"", ""x"", ""y""), dask.array.random.random((1200, 4, 2, 1000, 100), chunks=(1, 1, -1, -1, -1)), ) da = xr.DataArray(var) def sum(ds): return ds.sum(dim=""u"") with dask.config.set(**{""array.slicing.split_large_chunks"": True}): da2 = da.stack(new=(""z"", ""t"")).groupby(""new"").map(sum).unstack(""new"") da2 ``` ### MVCE confirmation - [ ] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray. - [X] Complete example — the example is self-contained, including all data and the text of any traceback. - [ ] Verifiable example — the example copy & pastes into an IPython prompt or [Binder notebook](https://mybinder.org/v2/gh/pydata/xarray/main?urlpath=lab/tree/doc/examples/blank_template.ipynb), returning the result. - [ ] New issue — a search of GitHub Issues suggests this is not a duplicate. - [X] Recent environment — the issue occurs with the latest version of xarray and its dependencies. ### Relevant log output ```Python --------------------------------------------------------------------------- IndexError Traceback (most recent call last) Cell In[21], line 5 2 return ds.sum(dim=""u"") 4 with dask.config.set(**{""array.slicing.split_large_chunks"": True}): ----> 5 da2 = da.stack(new=(""z"", ""t"")).groupby(""new"").map(sum).unstack(""new"") 6 da2 File ~/mambaforge/envs/icec/lib/python3.11/site-packages/xarray/core/dataarray.py:2855, in DataArray.unstack(self, dim, fill_value, sparse) 2795 def unstack( 2796 self, 2797 dim: Dims = None, 2798 fill_value: Any = dtypes.NA, 2799 sparse: bool = False, 2800 ) -> Self: 2801 """""" 2802 Unstack existing dimensions corresponding to MultiIndexes into 2803 multiple new dimensions. (...) 
2853 DataArray.stack 2854 """""" -> 2855 ds = self._to_temp_dataset().unstack(dim, fill_value, sparse) 2856 return self._from_temp_dataset(ds) File ~/mambaforge/envs/icec/lib/python3.11/site-packages/xarray/core/dataset.py:5500, in Dataset.unstack(self, dim, fill_value, sparse) 5498 for d in dims: 5499 if needs_full_reindex: -> 5500 result = result._unstack_full_reindex( 5501 d, stacked_indexes[d], fill_value, sparse 5502 ) 5503 else: 5504 result = result._unstack_once(d, stacked_indexes[d], fill_value, sparse) File ~/mambaforge/envs/icec/lib/python3.11/site-packages/xarray/core/dataset.py:5395, in Dataset._unstack_full_reindex(self, dim, index_and_vars, fill_value, sparse) 5393 if name not in index_vars: 5394 if dim in var.dims: -> 5395 variables[name] = var.unstack({dim: new_dim_sizes}) 5396 else: 5397 variables[name] = var File ~/mambaforge/envs/icec/lib/python3.11/site-packages/xarray/core/variable.py:1930, in Variable.unstack(self, dimensions, **dimensions_kwargs) 1928 result = self 1929 for old_dim, dims in dimensions.items(): -> 1930 result = result._unstack_once_full(dims, old_dim) 1931 return result File ~/mambaforge/envs/icec/lib/python3.11/site-packages/xarray/core/variable.py:1820, in Variable._unstack_once_full(self, dims, old_dim) 1817 reordered = self.transpose(*dim_order) 1819 new_shape = reordered.shape[: len(other_dims)] + new_dim_sizes -> 1820 new_data = reordered.data.reshape(new_shape) 1821 new_dims = reordered.dims[: len(other_dims)] + new_dim_names 1823 return type(self)( 1824 new_dims, new_data, self._attrs, self._encoding, fastpath=True 1825 ) File ~/mambaforge/envs/icec/lib/python3.11/site-packages/dask/array/core.py:2219, in Array.reshape(self, merge_chunks, limit, *shape) 2217 if len(shape) == 1 and not isinstance(shape[0], Number): 2218 shape = shape[0] -> 2219 return reshape(self, shape, merge_chunks=merge_chunks, limit=limit) File ~/mambaforge/envs/icec/lib/python3.11/site-packages/dask/array/reshape.py:285, in reshape(x, shape, merge_chunks, limit) 283 else: 284 chunk_plan.append(""auto"") --> 285 outchunks = normalize_chunks( 286 chunk_plan, 287 shape=shape, 288 limit=limit, 289 dtype=x.dtype, 290 previous_chunks=inchunks, 291 ) 293 x2 = x.rechunk(inchunks) 295 # Construct graph File ~/mambaforge/envs/icec/lib/python3.11/site-packages/dask/array/core.py:3095, in normalize_chunks(chunks, shape, limit, dtype, previous_chunks) 3092 chunks = tuple(""auto"" if isinstance(c, str) and c != ""auto"" else c for c in chunks) 3094 if any(c == ""auto"" for c in chunks): -> 3095 chunks = auto_chunks(chunks, shape, limit, dtype, previous_chunks) 3097 if shape is not None: 3098 chunks = tuple(c if c not in {None, -1} else s for c, s in zip(chunks, shape)) File ~/mambaforge/envs/icec/lib/python3.11/site-packages/dask/array/core.py:3218, in auto_chunks(chunks, shape, limit, dtype, previous_chunks) 3212 largest_block = math.prod( 3213 cs if isinstance(cs, Number) else max(cs) for cs in chunks if cs != ""auto"" 3214 ) 3216 if previous_chunks: 3217 # Base ideal ratio on the median chunk size of the previous chunks -> 3218 result = {a: np.median(previous_chunks[a]) for a in autos} 3220 ideal_shape = [] 3221 for i, s in enumerate(shape): File ~/mambaforge/envs/icec/lib/python3.11/site-packages/dask/array/core.py:3218, in (.0) 3212 largest_block = math.prod( 3213 cs if isinstance(cs, Number) else max(cs) for cs in chunks if cs != ""auto"" 3214 ) 3216 if previous_chunks: 3217 # Base ideal ratio on the median chunk size of the previous chunks -> 3218 result = {a: 
np.median(previous_chunks[a]) for a in autos} 3220 ideal_shape = [] 3221 for i, s in enumerate(shape): IndexError: tuple index out of range ``` ### Anything else we need to know? The most recent traceback entry points to an issue in the dask code. ### Environment
INSTALLED VERSIONS ------------------ commit: None python: 3.11.6 | packaged by conda-forge | (main, Oct 3 2023, 10:40:35) [GCC 12.3.0] python-bits: 64 OS: Linux OS-release: 6.5.5-1-MANJARO machine: x86_64 processor: byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: ('en_US', 'UTF-8') libhdf5: 1.14.2 libnetcdf: 4.9.2 xarray: 2023.9.0 pandas: 2.1.1 numpy: 1.24.4 scipy: 1.11.3 netCDF4: 1.6.4 pydap: None h5netcdf: None h5py: None Nio: None zarr: 2.16.1 cftime: 1.6.2 nc_time_axis: None PseudoNetCDF: None iris: None bottleneck: 1.3.7 dask: 2023.9.3 distributed: 2023.9.3 matplotlib: 3.8.0 cartopy: 0.22.0 seaborn: None numbagg: None fsspec: 2023.9.2 cupy: None pint: None sparse: 0.14.0 flox: 0.7.2 numpy_groupies: 0.10.2 setuptools: 68.2.2 pip: 23.2.1 conda: None pytest: None mypy: None IPython: 8.16.1 sphinx: None
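One thing that may be worth trying instead of the ``split_large_chunks`` option (untested on my side, and the chunk size of 100 below is arbitrary) is to rechunk the stacked dimension explicitly, so ``unstack`` never goes through dask's automatic chunk sizing that raises the ``IndexError`` above:
```python
# Pick the chunking of the stacked 'new' dimension by hand rather than letting
# dask choose it via array.slicing.split_large_chunks.
stacked = da.stack(new=('z', 't')).chunk({'new': 100})
da2 = stacked.groupby('new').map(sum).unstack('new')
```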
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/8345/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue 1956383344,I_kwDOAMm_X850nApw,8358,Writing to zarr archive fails on resampled dataset,40218891,closed,0,,,1,2023-10-23T05:30:36Z,2023-10-23T15:46:20Z,2023-10-23T15:46:19Z,NONE,,,,"### What happened? I am not sure where this belongs: xarray, dask or zarr. When a dataset is resampled to a semi-monthly frequency, the method ``to_zarr`` complains about invalid chunks. ### What did you expect to happen? I think this should work without having to rechunk the result before writing to the archive. ### Minimal Complete Verifiable Example ```Python time = pd.date_range(""2001-01-01"", freq=""D"", periods=365) ds = xr.Dataset({""foo"": (""time"", np.arange(1, 366)), ""time"": time}).chunk(time=5) dsr = ds.resample(time=""SM"").mean() dsr.to_zarr('/tmp/foo', mode='w') ``` ### MVCE confirmation - [ ] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray. - [X] Complete example — the example is self-contained, including all data and the text of any traceback. - [ ] Verifiable example — the example copy & pastes into an IPython prompt or [Binder notebook](https://mybinder.org/v2/gh/pydata/xarray/main?urlpath=lab/tree/doc/examples/blank_template.ipynb), returning the result. - [X] New issue — a search of GitHub Issues suggests this is not a duplicate. - [X] Recent environment — the issue occurs with the latest version of xarray and its dependencies. ### Relevant log output ```Python --------------------------------------------------------------------------- ValueError Traceback (most recent call last) Cell In[63], line 4 2 ds = xr.Dataset({""foo"": (""time"", np.arange(1, 366)), ""time"": time}).chunk(time=5) 3 dsr = ds.resample(time=""SM"").mean() ----> 4 dsr.to_zarr('/tmp/foo', mode='w') 5 #dsr.isel(time=slice(0, -1)).to_zarr('/tmp/foo', mode='w') File ~/mambaforge/envs/icec/lib/python3.11/site-packages/xarray/core/dataset.py:2490, in Dataset.to_zarr(self, store, chunk_store, mode, synchronizer, group, encoding, compute, consolidated, append_dim, region, safe_chunks, storage_options, zarr_version, write_empty_chunks, chunkmanager_store_kwargs) 2358 """"""Write dataset contents to a zarr group. 2359 2360 Zarr chunks are determined in the following way: (...) 2486 The I/O user guide, with more details and examples. 
2487 """""" 2488 from xarray.backends.api import to_zarr -> 2490 return to_zarr( # type: ignore[call-overload,misc] 2491 self, 2492 store=store, 2493 chunk_store=chunk_store, 2494 storage_options=storage_options, 2495 mode=mode, 2496 synchronizer=synchronizer, 2497 group=group, 2498 encoding=encoding, 2499 compute=compute, 2500 consolidated=consolidated, 2501 append_dim=append_dim, 2502 region=region, 2503 safe_chunks=safe_chunks, 2504 zarr_version=zarr_version, 2505 write_empty_chunks=write_empty_chunks, 2506 chunkmanager_store_kwargs=chunkmanager_store_kwargs, 2507 ) File ~/mambaforge/envs/icec/lib/python3.11/site-packages/xarray/backends/api.py:1708, in to_zarr(dataset, store, chunk_store, mode, synchronizer, group, encoding, compute, consolidated, append_dim, region, safe_chunks, storage_options, zarr_version, write_empty_chunks, chunkmanager_store_kwargs) 1706 writer = ArrayWriter() 1707 # TODO: figure out how to properly handle unlimited_dims -> 1708 dump_to_store(dataset, zstore, writer, encoding=encoding) 1709 writes = writer.sync( 1710 compute=compute, chunkmanager_store_kwargs=chunkmanager_store_kwargs 1711 ) 1713 if compute: File ~/mambaforge/envs/icec/lib/python3.11/site-packages/xarray/backends/api.py:1308, in dump_to_store(dataset, store, writer, encoder, encoding, unlimited_dims) 1305 if encoder: 1306 variables, attrs = encoder(variables, attrs) -> 1308 store.store(variables, attrs, check_encoding, writer, unlimited_dims=unlimited_dims) File ~/mambaforge/envs/icec/lib/python3.11/site-packages/xarray/backends/zarr.py:631, in ZarrStore.store(self, variables, attributes, check_encoding_set, writer, unlimited_dims) 628 self.set_attributes(attributes) 629 self.set_dimensions(variables_encoded, unlimited_dims=unlimited_dims) --> 631 self.set_variables( 632 variables_encoded, check_encoding_set, writer, unlimited_dims=unlimited_dims 633 ) 634 if self._consolidate_on_close: 635 zarr.consolidate_metadata(self.zarr_group.store) File ~/mambaforge/envs/icec/lib/python3.11/site-packages/xarray/backends/zarr.py:687, in ZarrStore.set_variables(self, variables, check_encoding_set, writer, unlimited_dims) 684 zarr_array = self.zarr_group[name] 685 else: 686 # new variable --> 687 encoding = extract_zarr_variable_encoding( 688 v, raise_on_invalid=check, name=vn, safe_chunks=self._safe_chunks 689 ) 690 encoded_attrs = {} 691 # the magic for storing the hidden dimension data File ~/mambaforge/envs/icec/lib/python3.11/site-packages/xarray/backends/zarr.py:281, in extract_zarr_variable_encoding(variable, raise_on_invalid, name, safe_chunks) 278 if k not in valid_encodings: 279 del encoding[k] --> 281 chunks = _determine_zarr_chunks( 282 encoding.get(""chunks""), variable.chunks, variable.ndim, name, safe_chunks 283 ) 284 encoding[""chunks""] = chunks 285 return encoding File ~/mambaforge/envs/icec/lib/python3.11/site-packages/xarray/backends/zarr.py:138, in _determine_zarr_chunks(enc_chunks, var_chunks, ndim, name, safe_chunks) 132 raise ValueError( 133 ""Zarr requires uniform chunk sizes except for final chunk. "" 134 f""Variable named {name!r} has incompatible dask chunks: {var_chunks!r}. "" 135 ""Consider rechunking using `chunk()`."" 136 ) 137 if any((chunks[0] < chunks[-1]) for chunks in var_chunks): --> 138 raise ValueError( 139 ""Final chunk of Zarr array must be the same size or smaller "" 140 f""than the first. 
Variable named {name!r} has incompatible Dask chunks {var_chunks!r}."" 141 ""Consider either rechunking using `chunk()` or instead deleting "" 142 ""or modifying `encoding['chunks']`."" 143 ) 144 # return the first chunk for each dimension 145 return tuple(chunk[0] for chunk in var_chunks) ValueError: Final chunk of Zarr array must be the same size or smaller than the first. Variable named 'foo' has incompatible Dask chunks ((1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2),).Consider either rechunking using `chunk()` or instead deleting or modifying `encoding['chunks']`. ``` ### Anything else we need to know? I can also achieve what I want without having to rechunk with ``` dsr = ds.resample(time=""SM"", closed=""right"", label=""right"").mean().isel(time=slice(0, -1)) ``` ### Environment
INSTALLED VERSIONS ------------------ commit: None python: 3.11.6 | packaged by conda-forge | (main, Oct 3 2023, 10:40:35) [GCC 12.3.0] python-bits: 64 OS: Linux OS-release: 6.5.5-1-MANJARO machine: x86_64 processor: byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: ('en_US', 'UTF-8') libhdf5: 1.14.2 libnetcdf: 4.9.2 xarray: 2023.10.1 pandas: 2.1.1 numpy: 1.24.4 scipy: 1.11.3 netCDF4: 1.6.4 pydap: None h5netcdf: None h5py: None Nio: None zarr: 2.16.1 cftime: 1.6.2 nc_time_axis: None PseudoNetCDF: None iris: None bottleneck: 1.3.7 dask: 2023.10.0 distributed: 2023.10.0 matplotlib: 3.8.0 cartopy: 0.22.0 seaborn: None numbagg: 0.5.1 fsspec: 2023.10.0 cupy: None pint: None sparse: 0.14.0 flox: 0.8.1 numpy_groupies: 0.10.2 setuptools: 68.2.2 pip: 23.3.1 conda: None pytest: None mypy: None IPython: 8.16.1 sphinx: None
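For completeness, the write also goes through if the resampled result is forced onto uniform chunks first (a workaround sketch, not a fix for the underlying inconsistency); a single chunk along ``time`` trivially satisfies the Zarr check quoted in the traceback:
```python
# One chunk along time guarantees uniform chunk sizes, so the Zarr writer
# no longer rejects the trailing chunk of size 2.
dsr.chunk({'time': -1}).to_zarr('/tmp/foo', mode='w')
```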
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/8358/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue 1940650207,I_kwDOAMm_X85zq_jf,8300,Inconsistent behaviour of xarray.concat,40218891,closed,0,,,2,2023-10-12T19:23:32Z,2023-10-12T19:48:01Z,2023-10-12T19:48:00Z,NONE,,,,"### What is your issue? I am not sure if it is a bug or a feature: ``` import numpy as np import pandas as pd import xarray as xr temp = 15 + 8 * np.random.randn(2, 2, 2) lon = [[-99.83, -99.32], [-99.79, -99.23]] lat = [[42.25, 42.21], [42.63, 42.59]] ds = xr.Dataset( {""temperature"": ([""x"", ""y"", ""time""], temp), ""latitude_longitude"": 0}, coords={ ""lon"": ([""x"", ""y""], lon), ""lat"": ([""x"", ""y""], lat), ""time"": (""time"", pd.date_range(""2014-09-05"", periods=2)), }, ) print( xr.concat( [ds.isel(time=0), ds.isel(time=1)], ""time"", data_vars=""minimal"" ).latitude_longitude ) print( xr.concat( [ds.isel(time=slice(0, 1)), ds.isel(time=slice(1, 2))], ""time"", data_vars=""minimal"" ).latitude_longitude ) ``` I expected the output to be the same. It appears that ``data_vars=""minimal""`` has no effect when the concatenation dimension does not exist. ``` array([0, 0]) Coordinates: * time (time) datetime64[ns] 2014-09-05 2014-09-06 array(0) ``` The documentation states: ``` These data variables will be concatenated together: “minimal”: Only data variables in which the dimension already appears are included. ``` BTW, this is xarray 2023.9.0.","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/8300/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue 1806386948,I_kwDOAMm_X85rq0cE,7990,Random crashes in netcdf when dask client has multiple threads,40218891,closed,0,,,1,2023-07-16T01:00:55Z,2023-08-23T00:18:18Z,2023-08-23T00:18:17Z,NONE,,,,"### What happened? The data files can be found here: https://noaadata.apps.nsidc.org/NOAA/G02202_V4/north/monthly/. The example code below crashes randomly: the file processed when the crash occurs differs between runs. This happens only when ``threads_per_worker`` is > 1 in the ``client()`` call . ``n_workers`` does not matter, at least I could not make it to crash. The traceback points to hdf5. ### What did you expect to happen? 
_No response_ ### Minimal Complete Verifiable Example ```Python from pathlib import Path import pandas as pd from dask.distributed import Client import xarray as xr client = Client(n_workers=1, threads_per_worker=4) DATADIR = Path(""/mnt/sdc1/icec/NSIDC"") year = 2020 times = pd.date_range(f""{year}-01-01"", f""{year}-12-01"", freq=""MS"", name=""time"") paths = [ DATADIR / ""monthly"" / f""seaice_conc_monthly_nh_{t.strftime('%Y%m')}_f17_v04r00.nc"" for t in times ] for n in range(10): ds = xr.open_mfdataset( paths, combine=""nested"", concat_dim=""tdim"", parallel=True, engine=""netcdf4"", ) del ds HDF5-DIAG: Error detected in HDF5 (1.14.0) thread 0: #000: H5G.c line 442 in H5Gopen2(): unable to synchronously open group major: Symbol table minor: Unable to create file #001: H5G.c line 399 in H5G__open_api_common(): can't set object access arguments major: Symbol table minor: Can't set value #002: H5VLint.c line 2669 in H5VL_setup_acc_args(): invalid location identifier major: Invalid arguments to routine minor: Inappropriate type #003: H5VLint.c line 1787 in H5VL_vol_object(): invalid identifier type to function major: Invalid arguments to routine minor: Inappropriate type HDF5-DIAG: Error detected in HDF5 (1.14.0) thread 0: #000: H5G.c line 887 in H5Gclose(): not a group ID major: Invalid arguments to routine minor: Inappropriate type 2023-07-16 00:35:47,833 - distributed.worker - WARNING - Compute Failed Key: open_dataset-09a155bb-5079-406a-83c4-737933c409c7 Function: execute_task args: ((, , ['/mnt/sdc1/icec/NSIDC/monthly/seaice_conc_monthly_nh_202001_f17_v04r00.nc'], (, [['engine', 'netcdf4'], ['chunks', (, [])]]))) kwargs: {} Exception: ""OSError(-101, 'NetCDF: HDF error')"" 2023-07-16 00:35:47,834 - distributed.worker - WARNING - Compute Failed Key: open_dataset-14e239f4-7e16-4891-a350-b55979d4a754 Function: execute_task args: ((, , ['/mnt/sdc1/icec/NSIDC/monthly/seaice_conc_monthly_nh_202011_f17_v04r00.nc'], (, [['engine', 'netcdf4'], ['chunks', (, [])]]))) kwargs: {} Exception: ""OSError(-101, 'NetCDF: HDF error')"" --------------------------------------------------------------------------- OSError Traceback (most recent call last) Cell In[1], line 19 14 paths = [ 15 DATADIR / ""monthly"" / f""seaice_conc_monthly_nh_{t.strftime('%Y%m')}_f17_v04r00.nc"" 16 for t in times 17 ] 18 for n in range(10): ---> 19 ds = xr.open_mfdataset( 20 paths, 21 combine=""nested"", 22 concat_dim=""tdim"", 23 parallel=True, 24 engine=""netcdf4"", 25 ) 26 del ds File ~/mambaforge/envs/icec/lib/python3.10/site-packages/xarray/backends/api.py:1050, in open_mfdataset(paths, chunks, concat_dim, compat, preprocess, engine, data_vars, coords, combine, parallel, join, attrs_file, combine_attrs, **kwargs) 1045 datasets = [preprocess(ds) for ds in datasets] 1047 if parallel: 1048 # calling compute here will return the datasets/file_objs lists, 1049 # the underlying datasets will still be stored as dask arrays -> 1050 datasets, closers = dask.compute(datasets, closers) 1052 # Combine all datasets, closing them in case of a ValueError 1053 try: File ~/mambaforge/envs/icec/lib/python3.10/site-packages/xarray/backends/api.py:570, in open_dataset() 558 decoders = _resolve_decoders_kwargs( 559 decode_cf, 560 open_backend_dataset_parameters=backend.open_dataset_parameters, (...) 
566 decode_coords=decode_coords, 567 ) 569 overwrite_encoded_chunks = kwargs.pop(""overwrite_encoded_chunks"", None) --> 570 backend_ds = backend.open_dataset( 571 filename_or_obj, 572 drop_variables=drop_variables, 573 **decoders, 574 **kwargs, 575 ) 576 ds = _dataset_from_backend_dataset( 577 backend_ds, 578 filename_or_obj, (...) 588 **kwargs, 589 ) 590 return ds File ~/mambaforge/envs/icec/lib/python3.10/site-packages/xarray/backends/netCDF4_.py:590, in open_dataset() 569 def open_dataset( # type: ignore[override] # allow LSP violation, not supporting **kwargs 570 self, 571 filename_or_obj: str | os.PathLike[Any] | BufferedIOBase | AbstractDataStore, (...) 587 autoclose=False, 588 ) -> Dataset: 589 filename_or_obj = _normalize_path(filename_or_obj) --> 590 store = NetCDF4DataStore.open( 591 filename_or_obj, 592 mode=mode, 593 format=format, 594 group=group, 595 clobber=clobber, 596 diskless=diskless, 597 persist=persist, 598 lock=lock, 599 autoclose=autoclose, 600 ) 602 store_entrypoint = StoreBackendEntrypoint() 603 with close_on_error(store): File ~/mambaforge/envs/icec/lib/python3.10/site-packages/xarray/backends/netCDF4_.py:391, in open() 385 kwargs = dict( 386 clobber=clobber, diskless=diskless, persist=persist, format=format 387 ) 388 manager = CachingFileManager( 389 netCDF4.Dataset, filename, mode=mode, kwargs=kwargs 390 ) --> 391 return cls(manager, group=group, mode=mode, lock=lock, autoclose=autoclose) File ~/mambaforge/envs/icec/lib/python3.10/site-packages/xarray/backends/netCDF4_.py:338, in __init__() 336 self._group = group 337 self._mode = mode --> 338 self.format = self.ds.data_model 339 self._filename = self.ds.filepath() 340 self.is_remote = is_remote_uri(self._filename) File ~/mambaforge/envs/icec/lib/python3.10/site-packages/xarray/backends/netCDF4_.py:400, in ds() 398 @property 399 def ds(self): --> 400 return self._acquire() File ~/mambaforge/envs/icec/lib/python3.10/site-packages/xarray/backends/netCDF4_.py:394, in _acquire() 393 def _acquire(self, needs_lock=True): --> 394 with self._manager.acquire_context(needs_lock) as root: 395 ds = _nc4_require_group(root, self._group, self._mode) 396 return ds File ~/mambaforge/envs/icec/lib/python3.10/contextlib.py:135, in __enter__() 133 del self.args, self.kwds, self.func 134 try: --> 135 return next(self.gen) 136 except StopIteration: 137 raise RuntimeError(""generator didn't yield"") from None File ~/mambaforge/envs/icec/lib/python3.10/site-packages/xarray/backends/file_manager.py:199, in acquire_context() 196 @contextlib.contextmanager 197 def acquire_context(self, needs_lock=True): 198 """"""Context manager for acquiring a file."""""" --> 199 file, cached = self._acquire_with_cache_info(needs_lock) 200 try: 201 yield file File ~/mambaforge/envs/icec/lib/python3.10/site-packages/xarray/backends/file_manager.py:217, in _acquire_with_cache_info() 215 kwargs = kwargs.copy() 216 kwargs[""mode""] = self._mode --> 217 file = self._opener(*self._args, **kwargs) 218 if self._mode == ""w"": 219 # ensure file doesn't get overridden when opened again 220 self._mode = ""a"" File src/netCDF4/_netCDF4.pyx:2464, in netCDF4._netCDF4.Dataset.__init__() File src/netCDF4/_netCDF4.pyx:2027, in netCDF4._netCDF4._ensure_nc_success() OSError: [Errno -101] NetCDF: HDF error: '/mnt/sdc1/icec/NSIDC/monthly/seaice_conc_monthly_nh_202011_f17_v04r00.nc' ``` ### MVCE confirmation - [X] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray. 
- [X] Complete example — the example is self-contained, including all data and the text of any traceback. - [X] Verifiable example — the example copy & pastes into an IPython prompt or [Binder notebook](https://mybinder.org/v2/gh/pydata/xarray/main?urlpath=lab/tree/doc/examples/blank_template.ipynb), returning the result. - [X] New issue — a search of GitHub Issues suggests this is not a duplicate. ### Relevant log output _No response_ ### Anything else we need to know? _No response_ ### Environment
INSTALLED VERSIONS ------------------ commit: None python: 3.10.12 | packaged by conda-forge | (main, Jun 23 2023, 22:40:32) [GCC 12.3.0] python-bits: 64 OS: Linux OS-release: 6.1.38-1-MANJARO machine: x86_64 processor: byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: ('en_US', 'UTF-8') libhdf5: 1.14.0 libnetcdf: 4.9.2 xarray: 2023.6.0 pandas: 2.0.3 numpy: 1.24.4 scipy: 1.11.1 netCDF4: 1.6.4 pydap: None h5netcdf: None h5py: 3.9.0 Nio: None zarr: 2.15.0 cftime: 1.6.2 nc_time_axis: None PseudoNetCDF: None iris: None bottleneck: None dask: 2023.7.0 distributed: 2023.7.0 matplotlib: 3.7.1 cartopy: 0.21.1 seaborn: None numbagg: None fsspec: 2023.6.0 cupy: None pint: None sparse: 0.14.0 flox: None numpy_groupies: None setuptools: 68.0.0 pip: 23.2 conda: None pytest: None mypy: None IPython: 8.14.0 sphinx: None
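For reference, the only kind of configuration I could not make crash is the single-threaded-worker setup mentioned above; roughly (the worker count is arbitrary):
```python
from dask.distributed import Client

# n_workers did not seem to matter; only threads_per_worker > 1 produced the
# random HDF5 errors, so keep each netCDF4/HDF5 handle on a single thread.
client = Client(n_workers=4, threads_per_worker=1)
```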
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/7990/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue 787947436,MDU6SXNzdWU3ODc5NDc0MzY=,4822,h5netcdf fails to decode attribute coordinates.,40218891,closed,0,,,10,2021-01-18T06:01:40Z,2022-03-29T13:39:46Z,2022-03-29T13:39:45Z,NONE,,,," **What happened**: The engine ``h5netcdf`` fail to decode attribute *coordinates*. **What you expected to happen**: It should work. **Minimal Complete Verifiable Example**: ```python # Put your MCVE code here import xarray as xr ds = xr.open_dataset('/tmp/x.nc', engine='h5netcdf') ========H5 coordinates ['x y'] --------------------------------------------------------------------------- AttributeError Traceback (most recent call last) in 1 import xarray as xr 2 ----> 3 ds = xr.open_dataset('/tmp/x.nc', engine='h5netcdf') ~/miniconda3/envs/aws/lib/python3.7/site-packages/xarray/backends/api.py in open_dataset(filename_or_obj, group, decode_cf, mask_and_scale, decode_times, autoclose, concat_characters, decode_coords, engine, chunks, lock, cache, drop_variables, backend_kwargs, use_cftime, decode_timedelta) 572 573 with close_on_error(store): --> 574 ds = maybe_decode_store(store, chunks) 575 576 # Ensure source filename always stored in dataset object (GH issue #2550) ~/miniconda3/envs/aws/lib/python3.7/site-packages/xarray/backends/api.py in maybe_decode_store(store, chunks) 476 drop_variables=drop_variables, 477 use_cftime=use_cftime, --> 478 decode_timedelta=decode_timedelta, 479 ) 480 ~/miniconda3/envs/aws/lib/python3.7/site-packages/xarray/conventions.py in decode_cf(obj, concat_characters, mask_and_scale, decode_times, decode_coords, drop_variables, use_cftime, decode_timedelta) 596 drop_variables=drop_variables, 597 use_cftime=use_cftime, --> 598 decode_timedelta=decode_timedelta, 599 ) 600 ds = Dataset(vars, attrs=attrs) ~/miniconda3/envs/aws/lib/python3.7/site-packages/xarray/conventions.py in decode_cf_variables(variables, attributes, concat_characters, mask_and_scale, decode_times, decode_coords, drop_variables, use_cftime, decode_timedelta) 504 if ""coordinates"" in var_attrs: 505 coord_str = var_attrs[""coordinates""] --> 506 var_coord_names = coord_str.split() 507 if all(k in variables for k in var_coord_names): 508 new_vars[k].encoding[""coordinates""] = coord_str AttributeError: 'numpy.ndarray' object has no attribute 'split' ``` **Anything else we need to know?**: The test file was created from CDL: ``` netcdf x { dimensions: x = 1 ; y = 1 ; variables: int foo(y, x) ; string foo:coordinates = ""x y"" ; data: foo = 0 ; } ``` The line ``========H5 coordinates ['x y']`` comes from me adding print statement on line 56 in function *_read_attributes*, file *api/h5netcdf.py*. Obviously the problem is caused by the attribute being a list instead of a string, as it is when *netcdf4* is used. **Environment**:
Output of xr.show_versions() INSTALLED VERSIONS ------------------ commit: None python: 3.7.7 (default, Mar 23 2020, 22:36:06) [GCC 7.3.0] python-bits: 64 OS: Linux OS-release: 5.9.12-200.fc33.x86_64 machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: en.UTF-8 LOCALE: en_US.UTF-8 libhdf5: 1.10.4 libnetcdf: 4.6.1 xarray: 0.16.2 pandas: 1.2.0 numpy: 1.19.2 scipy: 1.5.2 netCDF4: 1.4.2 pydap: None h5netcdf: 0.8.1 h5py: 2.10.0 Nio: None zarr: None cftime: 1.3.0 nc_time_axis: None PseudoNetCDF: None rasterio: None cfgrib: None iris: None bottleneck: None dask: 2021.01.0 distributed: 2021.01.0 matplotlib: 3.2.1 cartopy: 0.17.0 seaborn: None numbagg: None pint: None setuptools: 51.1.2.post20210112 pip: 20.3.3 conda: None pytest: 5.4.3 IPython: 7.19.0 sphinx: None
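To illustrate the shape of a defensive fix (purely a sketch of mine, not the actual xarray patch), the decoder could coerce an array-valued ``coordinates`` attribute to a plain string before calling ``split()``:
```python
import numpy as np

def coordinates_attr_to_names(value):
    # h5netcdf can return a length-1 ndarray of strings where netCDF4 returns
    # a str; join it into one whitespace-separated string before splitting.
    if isinstance(value, np.ndarray):
        value = ' '.join(str(item) for item in value.ravel())
    return value.split()

print(coordinates_attr_to_names(np.array(['x y'])))  # ['x', 'y']
print(coordinates_attr_to_names('x y'))              # ['x', 'y']
```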
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/4822/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue 863477424,MDU6SXNzdWU4NjM0Nzc0MjQ=,5199,"Better error message for setting encoding[""units""] ",40218891,closed,0,,,0,2021-04-21T05:57:14Z,2021-05-13T18:27:13Z,2021-05-13T18:27:13Z,NONE,,,," **What happened**: Setting invalid units for time axis encoding results in an exception ``AttributeError: 'NoneType' object has no attribute 'groups'`` **What you expected to happen**: It should say ""invalid time units"", like this (see commented out line below) ``ValueError: invalid time units: days after 1/3/2000`` **Minimal Complete Verifiable Example**: ```python # Put your MCVE code here import pandas as pd import xarray as xr ds = xr.Dataset(data_vars={'v': (('t',), [0,])}, coords={'t': [pd.Timestamp(2000, 1, 1)]}) ds.t.encoding['units'] = 'days since Big Bang' #ds.t.encoding['units'] = 'days after 1/3/2000' ds.to_netcdf('/tmp/x.nc', mode='w') --------------------------------------------------------------------------- AttributeError Traceback (most recent call last) in 5 ds.t.encoding['units'] = 'days since Big Bang' 6 #ds.t.encoding['units'] = 'days after 1/3/2000' ----> 7 ds.to_netcdf('/tmp/x.nc', mode='w') ~/miniconda3/envs/xarray/lib/python3.9/site-packages/xarray-0.0.0-py3.9.egg/xarray/core/dataset.py in to_netcdf(self, path, mode, format, group, engine, encoding, unlimited_dims, compute, invalid_netcdf) 1752 from ..backends.api import to_netcdf 1753 -> 1754 return to_netcdf( 1755 self, 1756 path, ~/miniconda3/envs/xarray/lib/python3.9/site-packages/xarray-0.0.0-py3.9.egg/xarray/backends/api.py in to_netcdf(dataset, path_or_file, mode, format, group, engine, encoding, unlimited_dims, compute, multifile, invalid_netcdf) 1066 # TODO: allow this work (setting up the file for writing array data) 1067 # to be parallelized with dask -> 1068 dump_to_store( 1069 dataset, store, writer, encoding=encoding, unlimited_dims=unlimited_dims 1070 ) ~/miniconda3/envs/xarray/lib/python3.9/site-packages/xarray-0.0.0-py3.9.egg/xarray/backends/api.py in dump_to_store(dataset, store, writer, encoder, encoding, unlimited_dims) 1113 variables, attrs = encoder(variables, attrs) 1114 -> 1115 store.store(variables, attrs, check_encoding, writer, unlimited_dims=unlimited_dims) 1116 1117 ~/miniconda3/envs/xarray/lib/python3.9/site-packages/xarray-0.0.0-py3.9.egg/xarray/backends/common.py in store(self, variables, attributes, check_encoding_set, writer, unlimited_dims) 261 writer = ArrayWriter() 262 --> 263 variables, attributes = self.encode(variables, attributes) 264 265 self.set_attributes(attributes) ~/miniconda3/envs/xarray/lib/python3.9/site-packages/xarray-0.0.0-py3.9.egg/xarray/backends/common.py in encode(self, variables, attributes) 350 # All NetCDF files get CF encoded by default, without this attempting 351 # to write times, for example, would fail. 
--> 352 variables, attributes = cf_encoder(variables, attributes) 353 variables = {k: self.encode_variable(v) for k, v in variables.items()} 354 attributes = {k: self.encode_attribute(v) for k, v in attributes.items()} ~/miniconda3/envs/xarray/lib/python3.9/site-packages/xarray-0.0.0-py3.9.egg/xarray/conventions.py in cf_encoder(variables, attributes) 841 _update_bounds_encoding(variables) 842 --> 843 new_vars = {k: encode_cf_variable(v, name=k) for k, v in variables.items()} 844 845 # Remove attrs from bounds variables (issue #2921) ~/miniconda3/envs/xarray/lib/python3.9/site-packages/xarray-0.0.0-py3.9.egg/xarray/conventions.py in (.0) 841 _update_bounds_encoding(variables) 842 --> 843 new_vars = {k: encode_cf_variable(v, name=k) for k, v in variables.items()} 844 845 # Remove attrs from bounds variables (issue #2921) ~/miniconda3/envs/xarray/lib/python3.9/site-packages/xarray-0.0.0-py3.9.egg/xarray/conventions.py in encode_cf_variable(var, needs_copy, name) 267 variables.UnsignedIntegerCoder(), 268 ]: --> 269 var = coder.encode(var, name=name) 270 271 # TODO(shoyer): convert all of these to use coders, too: ~/miniconda3/envs/xarray/lib/python3.9/site-packages/xarray-0.0.0-py3.9.egg/xarray/coding/times.py in encode(self, variable, name) 510 variable 511 ): --> 512 (data, units, calendar) = encode_cf_datetime( 513 data, encoding.pop(""units"", None), encoding.pop(""calendar"", None) 514 ) ~/miniconda3/envs/xarray/lib/python3.9/site-packages/xarray-0.0.0-py3.9.egg/xarray/coding/times.py in encode_cf_datetime(dates, units, calendar) 448 units = infer_datetime_units(dates) 449 else: --> 450 units = _cleanup_netcdf_time_units(units) 451 452 if calendar is None: ~/miniconda3/envs/xarray/lib/python3.9/site-packages/xarray-0.0.0-py3.9.egg/xarray/coding/times.py in _cleanup_netcdf_time_units(units) 399 400 def _cleanup_netcdf_time_units(units): --> 401 delta, ref_date = _unpack_netcdf_time_units(units) 402 try: 403 units = ""{} since {}"".format(delta, format_timestamp(ref_date)) ~/miniconda3/envs/xarray/lib/python3.9/site-packages/xarray-0.0.0-py3.9.egg/xarray/coding/times.py in _unpack_netcdf_time_units(units) 130 131 delta_units, ref_date = [s.strip() for s in matches.groups()] --> 132 ref_date = _ensure_padded_year(ref_date) 133 134 return delta_units, ref_date ~/miniconda3/envs/xarray/lib/python3.9/site-packages/xarray-0.0.0-py3.9.egg/xarray/coding/times.py in _ensure_padded_year(ref_date) 105 # appropriately 106 matches_start_digits = re.match(r""(\d+)(.*)"", ref_date) --> 107 ref_year, everything_else = [s for s in matches_start_digits.groups()] 108 ref_date_padded = ""{:04d}{}"".format(int(ref_year), everything_else) 109 AttributeError: 'NoneType' object has no attribute 'groups' ``` **Anything else we need to know?**: Are there detail specifications for the valid units string? Setting it to 1/3/2000 surprised me (my locale: LANG=en_CA.UTF-8). **Environment**: Latest code from github: xarray version 0.0.0
Output of xr.show_versions() INSTALLED VERSIONS ------------------ commit: None python: 3.9.2 | packaged by conda-forge | (default, Feb 21 2021, 05:02:46) [GCC 9.3.0] python-bits: 64 OS: Linux OS-release: 5.10.11-200.fc33.x86_64 machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: en_CA.UTF-8 LOCALE: en_CA.UTF-8 libhdf5: 1.10.6 libnetcdf: 4.8.0 xarray: 0.0.0 pandas: 1.2.4 numpy: 1.20.2 scipy: None netCDF4: 1.5.6 pydap: None h5netcdf: None h5py: None Nio: None zarr: 2.7.1 cftime: 1.4.1 nc_time_axis: None PseudoNetCDF: None rasterio: None cfgrib: None iris: None bottleneck: None dask: 2021.04.0 distributed: 2021.04.0 matplotlib: None cartopy: None seaborn: None numbagg: None pint: None setuptools: 49.6.0.post20210108 pip: 21.0.1 conda: None pytest: None IPython: 7.22.0 sphinx: None
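To make the request concrete, this is roughly the kind of up-front validation I mean (a hypothetical sketch, not the code in ``coding/times.py``); a units string that does not look like ``<delta> since <date>`` with a numeric year should fail with a readable message:
```python
import re

def split_time_units(units):
    # Require '<delta> since <reference date starting with a digit>' so that a
    # malformed string raises a descriptive ValueError here instead of the
    # downstream AttributeError on .groups().
    match = re.fullmatch(r'\s*(\S+)\s+since\s+(\d.*?)\s*', units)
    if match is None:
        raise ValueError(f'invalid time units: {units!r}')
    return match.group(1), match.group(2)

print(split_time_units('days since 2000-01-03'))
for bad in ('days since Big Bang', 'days after 1/3/2000'):
    try:
        split_time_units(bad)
    except ValueError as err:
        print(err)
```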
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/5199/reactions"", ""total_count"": 3, ""+1"": 3, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue 849751721,MDU6SXNzdWU4NDk3NTE3MjE=,5106,to_zarr() fails on time coordinate in append mode,40218891,closed,0,,,4,2021-04-03T22:26:11Z,2021-04-20T12:04:06Z,2021-04-20T04:41:07Z,NONE,,,," **What happened**: When the append dimension coordinates are times and the dimension of the first dataset written is 1, consecutive appends forget the hour part of the coordinate. **What you expected to happen**: The time coordinate should be set correctly. **Minimal Complete Verifiable Example**: ``` import pandas as pd import xarray as xr reftime = [pd.Timestamp(2021, 2, 21, 0)] x = [0] dims = ('reftime', 'x') d = np.array([['A']]) ds1 = xr.Dataset(data_vars={'v': (dims, d)}, coords={'reftime': reftime, 'x': x}) _ = ds1.to_zarr('foo', mode='w') reftime = [pd.Timestamp(2021, 2, 21, 6)] d = np.array([['C']]) ds2 = xr.Dataset(data_vars={'v': (dims, d)}, coords={'reftime': reftime, 'x': x}) _ = ds2.to_zarr('foo', append_dim='reftime') ds = xr.open_dataset('foo', engine='zarr') ds.coords['reftime'].values array(['2021-02-21T00:00:00.000000000', '2021-02-21T00:00:00.000000000'], # should be 2021-02-21T06:00:00.000000000 dtype='datetime64[ns]') ``` **Anything else we need to know?**: When the `reftime` coordinate in the first dataset has dimension 2, the output is correct: ``` import pandas as pd import xarray as xr reftime = [pd.Timestamp(2021, 2, 21, 0), pd.Timestamp(2021, 2, 21, 3)] x = [0] dims = ('reftime', 'x') d = np.array([['A'], ['B']]) ds1 = xr.Dataset(data_vars={'v': (dims, d)}, coords={'reftime': reftime, 'x': x}) _ = ds1.to_zarr('foo', mode='w') reftime = [pd.Timestamp(2021, 2, 21, 6)] d = np.array([['C']]) ds2 = xr.Dataset(data_vars={'v': (dims, d)}, coords={'reftime': reftime, 'x': x}) _ = ds2.to_zarr('foo', append_dim='reftime') ds = xr.open_dataset('foo', engine='zarr') ds.coords['reftime'].values array(['2021-02-21T00:00:00.000000000', '2021-02-21T03:00:00.000000000', '2021-02-21T06:00:00.000000000'], dtype='datetime64[ns]') ``` Increment of a full day works fine: ``` import pandas as pd import xarray as xr reftime = [pd.Timestamp(2021, 2, 21, 0)] x = [0] dims = ('reftime', 'x') d = np.array([['A']]) ds1 = xr.Dataset(data_vars={'v': (dims, d)}, coords={'reftime': reftime, 'x': x}) _ = ds1.to_zarr('foo', mode='w') reftime = [pd.Timestamp(2021, 2, 22, 0)] d = np.array([['C']]) ds2 = xr.Dataset(data_vars={'v': (dims, d)}, coords={'reftime': reftime, 'x': x}) _ = ds2.to_zarr('foo', append_dim='reftime') ds = xr.open_dataset('foo', engine='zarr') ds.coords['reftime'].values array(['2021-02-21T00:00:00.000000000', '2021-02-22T00:00:00.000000000'], dtype='datetime64[ns]') ``` **Environment**:
Output of xr.show_versions() INSTALLED VERSIONS ------------------ commit: None python: 3.8.8 | packaged by conda-forge | (default, Feb 20 2021, 16:22:27) [GCC 9.3.0] python-bits: 64 OS: Linux OS-release: 5.10.11-200.fc33.x86_64 machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: en_CA.UTF-8 LOCALE: en_CA.UTF-8 libhdf5: 1.10.6 libnetcdf: 4.7.4 xarray: 0.17.0 pandas: 1.2.3 numpy: 1.20.2 scipy: 1.6.2 netCDF4: 1.5.6 pydap: None h5netcdf: 0.10.0 h5py: 3.1.0 Nio: None zarr: 2.7.0 cftime: 1.4.1 nc_time_axis: None PseudoNetCDF: None rasterio: None cfgrib: 0.9.8.5 iris: None bottleneck: 1.3.2 dask: 2021.04.0 distributed: 2021.04.0 matplotlib: 3.4.1 cartopy: None seaborn: None numbagg: None pint: None setuptools: 49.6.0.post20210108 pip: 21.0.1 conda: None pytest: None IPython: 7.22.0 sphinx: None
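One observation that fits the symptom (my interpretation only, and the encoding below is just an illustration): the first ``to_zarr`` call seems to fix the on-disk time ``units``, and with a single midnight timestamp they get inferred at day resolution, so 6-hour appends have nowhere to go. Pinning a finer unit on the initial write appears to avoid the truncation:
```python
# Ask for hour-resolution time units on the first write so appended timestamps
# that are not whole days stay representable.
ds1.to_zarr('foo', mode='w',
            encoding={'reftime': {'units': 'hours since 2021-02-21'}})
ds2.to_zarr('foo', append_dim='reftime')
```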
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/5106/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue 789653499,MDU6SXNzdWU3ODk2NTM0OTk=,4830,GH2550 revisited,40218891,open,0,,,2,2021-01-20T05:40:16Z,2021-01-25T23:06:01Z,,NONE,,,," **Is your feature request related to a problem? Please describe.** I am retrieving files from AWS: https://registry.opendata.aws/wrf-se-alaska-snap/. An example: ``` import s3fs import xarray as xr s3 = s3fs.S3FileSystem(anon=True) s3path = 's3://wrf-se-ak-ar5/gfdl/hist/daily/1980/WRFDS_1980-01-0[12].nc' remote_files = s3.glob(s3path) fileset = [s3.open(file) for file in remote_files] ds = xr.open_mfdataset(fileset, concat_dim='Time', decode_cf=False) ds ``` Data files for 1980 are missing time coordinate, so the above code fails. The time could be obtained by parsing file name, however in the current implementation the *source* attribute is available only when the fileset consists of strings or *Path*s. **Describe the solution you'd like** I would suggest to return to the original suggestion in #2550 - pass *filename_or_object* as an argument to *preprocess* function, but with necessary inspection. Here is my attempt (code in *open_mfdataset*): ``` open_kwargs = dict( engine=engine, chunks=chunks or {}, lock=lock, autoclose=autoclose, **kwargs ) if preprocess is not None: # Get number of free arguments from inspect import signature parms = signature(preprocess).parameters num_preprocess_args = len([p for p in parms.values() if p.default == p.empty]) if num_preprocess_args not in (1, 2): raise ValueError('preprocess accepts only 1 or 2 arguments') if parallel: import dask # wrap the open_dataset, getattr, and preprocess with delayed open_ = dask.delayed(open_dataset) getattr_ = dask.delayed(getattr) if preprocess is not None: preprocess = dask.delayed(preprocess) else: open_ = open_dataset getattr_ = getattr datasets = [open_(p, **open_kwargs) for p in paths] file_objs = [getattr_(ds, ""_file_obj"") for ds in datasets] if preprocess is not None: if num_preprocess_args == 1: datasets = [preprocess(ds) for ds in datasets] else: datasets = [preprocess(ds, p) for (ds, p) in zip(datasets, paths)] ``` With this, I can define function *fix* as follows: ``` def fix(ds, source): vtime = datetime.strptime(os.path.basename(source.path), 'WRFDS_%Y-%m-%d.nc') return ds.assign_coords(Time=[vtime]) ds = xr.open_mfdataset(fileset, preprocess=fix, concat_dim='Time', decode_cf=False) ``` This is backward compatible, *preprocess* can accept any number of arguments: ``` from functools import partial import xarray as xr def fix1(ds): print('fix1') return ds def fix2(ds, file): print('fix2:', file.as_uri()) return ds def fix3(ds, file, arg): print('fix3:', file.as_uri(), arg) return ds fileset = [Path('/home/george/Downloads/WRFDS_1988-04-23.nc'), Path('/home/george/Downloads/WRFDS_1988-04-24.nc') ] ds = xr.open_mfdataset(fileset, preprocess=fix1, concat_dim='Time', parallel=True) ds = xr.open_mfdataset(fileset, preprocess=fix2, concat_dim='Time') ds = xr.open_mfdataset(fileset, preprocess=partial(fix3, arg='additional argument'), concat_dim='Time') ``` ``` fix1 fix1 fix2: file:///home/george/Downloads/WRFDS_1988-04-23.nc fix2: file:///home/george/Downloads/WRFDS_1988-04-24.nc fix3: file:///home/george/Downloads/WRFDS_1988-04-23.nc additional argument fix3: file:///home/george/Downloads/WRFDS_1988-04-24.nc additional argument ``` **Describe 
alternatives you've considered** The simple solution would be to make xarray s3fs aware. IMHO this is not particularly elegant. Either a check for an attribute, or an import within a *try/except* block would be needed. ","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/4830/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,issue 429914958,MDU6SXNzdWU0Mjk5MTQ5NTg=,2871,xr.open_dataset(f1).to_netcdf(file2) is not idempotent,40218891,closed,0,,,5,2019-04-05T20:06:35Z,2019-06-12T15:32:27Z,2019-06-12T15:32:27Z,NONE,,,,"Here is the original (much truncated) file. ``` > ncdump ak.nc netcdf ak { dimensions: npts = UNLIMITED ; // (2 currently) ntimes = 4 ; variables: short tmpk(npts, ntimes) ; tmpk:description = ""2 m Temperature - closest to top of hour"" ; tmpk:units = ""K"" ; tmpk:level = ""2 m"" ; tmpk:period_variable = ""ntimes1"" ; tmpk:missing_value = -9999s ; tmpk:scale_factor = 0.01 ; // global attributes: :source = ""ak-obs"" ; data: tmpk = 26915, 27755, -9999, 27705, 25595, -9999, 28315, -9999 ; } ``` Python code: ``` ds = xr.open_dataset('ak.nc') ds.to_netcdf('akbad.nc') ds ds['tmpk'] Dimensions: (npts: 2, ntimes: 4) Dimensions without coordinates: npts, ntimes Data variables: tmpk (npts, ntimes) float32 ... Attributes: source: ak-obs array([[269.15, 277.55, nan, 277.05], [255.95, nan, 283.15, nan]], dtype=float32) Dimensions without coordinates: npts, ntimes Attributes: description: 2 m Temperature - closest to top of hour units: K level: 2 m period_variable: ntimes1 ``` File written to disk: ``` > ncdump akbad.nc netcdf akbad { dimensions: npts = UNLIMITED ; // (2 currently) ntimes = 4 ; variables: short tmpk(npts, ntimes) ; tmpk:description = ""2 m Temperature - closest to top of hour"" ; tmpk:units = ""K"" ; tmpk:level = ""2 m"" ; tmpk:period_variable = ""ntimes1"" ; tmpk:scale_factor = 0.01 ; // global attributes: :source = ""ak-obs"" ; data: tmpk = 26915, 27755, 0, 27705, 25595, 0, 28315, 0 ; } ``` To confuse matter more, I am getting a warning: ``` SerializationWarning: saving variable tmpk with floating point data as an integer dtype without any _FillValue to use for NaNs ``` This might make sense, since `tmpk` after decoding has datatype `float32`, however somehow the original variable type and `scale_factor` are preserved, but `missing_value` attribute disappears and value written back is wrong: 0. I want to write back `tmpk` as float number and let zlib worry about disk space. 
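A sketch of how that float round trip can be forced today (my own workaround, not a fix for the lost ``missing_value``; the ``zlib`` settings are illustrative):
```python
import xarray as xr

ds = xr.open_dataset('ak.nc')

# Option 1: discard the inherited short/scale_factor encoding and write the
# decoded float32 values directly, compressed.
ds['tmpk'].encoding = {'zlib': True, 'complevel': 4}
ds.to_netcdf('ak_float.nc')

# Option 2: keep the packed int16 representation but give NaN a fill value so
# it is not silently written as 0.
ds2 = xr.open_dataset('ak.nc')
ds2['tmpk'].encoding['_FillValue'] = -9999
ds2.to_netcdf('ak_packed.nc')
```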
xr.show_versions() INSTALLED VERSIONS ------------------ commit: None python: 3.7.1 | packaged by conda-forge | (default, Feb 18 2019, 01:42:00) [GCC 7.3.0] python-bits: 64 OS: Linux OS-release: 3.10.0-957.5.1.el7.x86_64 machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: en_US.UTF-8 libhdf5: 1.10.4 libnetcdf: 4.6.2 xarray: 0.12.0 pandas: 0.24.2 numpy: 1.15.4 scipy: 1.2.1 netCDF4: 1.5.0 pydap: None h5netcdf: None h5py: None Nio: None zarr: None cftime: 1.0.3.4 nc_time_axis: None PseudonetCDF: None rasterio: None cfgrib: None iris: None bottleneck: None dask: 1.1.4 distributed: 1.26.0 matplotlib: 3.0.3 cartopy: 0.17.0 seaborn: 0.9.0 setuptools: 40.8.0 pip: 19.0.3 conda: 4.6.10 pytest: 4.3.1 IPython: 7.4.0 sphinx: 1.8.5 ​","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/2871/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue