id,node_id,number,title,user,state,locked,assignee,milestone,comments,created_at,updated_at,closed_at,author_association,active_lock_reason,draft,pull_request,body,reactions,performed_via_github_app,state_reason,repo,type
1902108672,I_kwDOAMm_X85xX-AA,8207,Getting `NETCDF: HDF error` while writing a NetCDF file opened using `open_mfdataset`,50383939,open,0,,,4,2023-09-19T02:44:02Z,2023-12-01T22:29:49Z,,NONE,,,,"### What is your issue?
I am simply reading 366 small (~15 MB each) NetCDF files to create one big NetCDF file at the end. Below is the relevant workflow:
```python-console
In [1]: import os; import dask
In [2]: import xarray as xr
In [3]: from dask.distributed import Client, LocalCluster
In [4]: cluster = LocalCluster(n_workers=4, threads_per_worker=1) # 1 core to each worker
In [5]: client = Client(cluster)
In [6]: os.environ['HDF5_USE_FILE_LOCKING'] = 'FALSE'
In [7]: ds = xr.open_mfdataset('./remapped/*.nc', chunks={'COMID': 1400}, parallel=True)
In [8]: ds.to_netcdf('./out2.nc')
```
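A side note on the session above: the environment variable is set in `In [6]`, after the cluster was created in `In [4]`. Assuming the `LocalCluster` workers run as separate processes (I believe this is the default, but I have not verified it for my setup), they would inherit the environment as it was when they were started, so they may never see `HDF5_USE_FILE_LOCKING`. The following stdlib-only sketch (no Dask involved; the helper names are made up for illustration) shows the general behaviour with a plain child process:

```python
import os
import subprocess
import sys

VAR = 'HDF5_USE_FILE_LOCKING'
# The child waits for one line on stdin, then prints the value it sees.
CHILD_CODE = 'import os, sys; sys.stdin.readline(); print(os.environ.get(sys.argv[1]))'


def value_in_child_started_before_set(name, value):
    # Hypothetical helper: start the child first, set the variable afterwards.
    os.environ.pop(name, None)  # clean slate, independent of the shell
    child = subprocess.Popen(
        [sys.executable, '-c', CHILD_CODE, name],
        stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True,
    )
    os.environ[name] = value  # too late for the already running child
    out, _ = child.communicate('go\n')
    return out.strip()


def value_in_child_started_after_set(name, value):
    # Hypothetical helper: set the variable first, then start the child.
    os.environ[name] = value
    done = subprocess.run(
        [sys.executable, '-c', CHILD_CODE, name],
        input='go\n', capture_output=True, text=True, check=True,
    )
    return done.stdout.strip()


if __name__ == '__main__':
    print('started before set:', value_in_child_started_before_set(VAR, 'FALSE'))
    print('started after set: ', value_in_child_started_after_set(VAR, 'FALSE'))
```

If the same applies to the Dask workers, setting the variable before creating the cluster might be worth a try. I do not know whether this is related to the error I am getting.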
Below is the error I am getting:
```python-console
In [8]: ds.to_netcdf('./out2.nc')
/home/kasra545/virtual-envs/meshflow/lib/python3.10/site-packages/distributed/client.py:3149: UserWarning: Sending large graph of size 9.97 MiB.
This may cause some slowdown.
Consider scattering data ahead of time and using futures.
warnings.warn(
2023-09-18 22:26:14,279 - distributed.worker - WARNING - Compute Failed
Key: ('open_dataset-concatenate-concatenate-be7dd534c459e2f316d9149df2d9ec95', 178, 0)
Function: getter
args: (ImplicitToExplicitIndexingAdapter(array=CopyOnWriteArray(array=LazilyIndexedArray(array=_ElementwiseFunctionArray(LazilyIndexedArray(array=, key=BasicIndexer((slice(None, None, None), slice(None, None, None)))), func=functools.partial(, encoded_fill_values={-9999.0}, decoded_fill_value=nan, dtype=dtype('float64')), dtype=dtype('float64')), key=BasicIndexer((slice(None, None, None), slice(None, None, None)))))), (slice(0, 24, None), slice(0, 1400, None)))
kwargs: {}
Exception: ""RuntimeError('NetCDF: HDF error')""
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
Cell In[8], line 1
----> 1 ds.to_netcdf('./out2.nc')
File ~/virtual-envs/meshflow/lib/python3.10/site-packages/xarray/core/dataset.py:2252, in Dataset.to_netcdf(self, path, mode, format, group, engine, encoding, unlimited_dims, compute, invalid_netcdf)
2249 encoding = {}
2250 from xarray.backends.api import to_netcdf
-> 2252 return to_netcdf( # type: ignore # mypy cannot resolve the overloads:(
2253 self,
2254 path,
2255 mode=mode,
2256 format=format,
2257 group=group,
2258 engine=engine,
2259 encoding=encoding,
2260 unlimited_dims=unlimited_dims,
2261 compute=compute,
2262 multifile=False,
2263 invalid_netcdf=invalid_netcdf,
2264 )
File ~/virtual-envs/meshflow/lib/python3.10/site-packages/xarray/backends/api.py:1255, in to_netcdf(dataset, path_or_file, mode, format, group, engine, encoding, unlimited_dims, compute, multifile, invalid_netcdf)
1252 if multifile:
1253 return writer, store
-> 1255 writes = writer.sync(compute=compute)
1257 if isinstance(target, BytesIO):
1258 store.sync()
File ~/virtual-envs/meshflow/lib/python3.10/site-packages/xarray/backends/common.py:256, in ArrayWriter.sync(self, compute, chunkmanager_store_kwargs)
253 if chunkmanager_store_kwargs is None:
254 chunkmanager_store_kwargs = {}
--> 256 delayed_store = chunkmanager.store(
257 self.sources,
258 self.targets,
259 lock=self.lock,
260 compute=compute,
261 flush=True,
262 regions=self.regions,
263 **chunkmanager_store_kwargs,
264 )
265 self.sources = []
266 self.targets = []
File ~/virtual-envs/meshflow/lib/python3.10/site-packages/xarray/core/daskmanager.py:211, in DaskManager.store(self, sources, targets, **kwargs)
203 def store(
204 self,
205 sources: DaskArray | Sequence[DaskArray],
206 targets: Any,
207 **kwargs,
208 ):
209 from dask.array import store
--> 211 return store(
212 sources=sources,
213 targets=targets,
214 **kwargs,
215 )
File ~/virtual-envs/meshflow/lib/python3.10/site-packages/dask/array/core.py:1236, in store(***failed resolving arguments***)
1234 elif compute:
1235 store_dsk = HighLevelGraph(layers, dependencies)
-> 1236 compute_as_if_collection(Array, store_dsk, map_keys, **kwargs)
1237 return None
1239 else:
File ~/virtual-envs/meshflow/lib/python3.10/site-packages/dask/base.py:369, in compute_as_if_collection(cls, dsk, keys, scheduler, get, **kwargs)
367 schedule = get_scheduler(scheduler=scheduler, cls=cls, get=get)
368 dsk2 = optimization_function(cls)(dsk, keys, **kwargs)
--> 369 return schedule(dsk2, keys, **kwargs)
File ~/virtual-envs/meshflow/lib/python3.10/site-packages/distributed/client.py:3267, in Client.get(self, dsk, keys, workers, allow_other_workers, resources, sync, asynchronous, direct, retries, priority, fifo_timeout, actors, **kwargs)
3265 should_rejoin = False
3266 try:
-> 3267 results = self.gather(packed, asynchronous=asynchronous, direct=direct)
3268 finally:
3269 for f in futures.values():
File ~/virtual-envs/meshflow/lib/python3.10/site-packages/distributed/client.py:2393, in Client.gather(self, futures, errors, direct, asynchronous)
2390 local_worker = None
2392 with shorten_traceback():
-> 2393 return self.sync(
2394 self._gather,
2395 futures,
2396 errors=errors,
2397 direct=direct,
2398 local_worker=local_worker,
2399 asynchronous=asynchronous,
2400 )
File ~/virtual-envs/meshflow/lib/python3.10/site-packages/xarray/core/indexing.py:484, in __array__()
483 def __array__(self, dtype: np.typing.DTypeLike = None) -> np.ndarray:
--> 484 return np.asarray(self.get_duck_array(), dtype=dtype)
File ~/virtual-envs/meshflow/lib/python3.10/site-packages/xarray/core/indexing.py:487, in get_duck_array()
486 def get_duck_array(self):
--> 487 return self.array.get_duck_array()
File ~/virtual-envs/meshflow/lib/python3.10/site-packages/xarray/core/indexing.py:664, in get_duck_array()
663 def get_duck_array(self):
--> 664 return self.array.get_duck_array()
File ~/virtual-envs/meshflow/lib/python3.10/site-packages/xarray/core/indexing.py:557, in get_duck_array()
552 # self.array[self.key] is now a numpy array when
553 # self.array is a BackendArray subclass
554 # and self.key is BasicIndexer((slice(None, None, None),))
555 # so we need the explicit check for ExplicitlyIndexed
556 if isinstance(array, ExplicitlyIndexed):
--> 557 array = array.get_duck_array()
558 return _wrap_numpy_scalars(array)
File ~/virtual-envs/meshflow/lib/python3.10/site-packages/xarray/coding/variables.py:74, in get_duck_array()
73 def get_duck_array(self):
---> 74 return self.func(self.array.get_duck_array())
File ~/virtual-envs/meshflow/lib/python3.10/site-packages/xarray/core/indexing.py:551, in get_duck_array()
550 def get_duck_array(self):
--> 551 array = self.array[self.key]
552 # self.array[self.key] is now a numpy array when
553 # self.array is a BackendArray subclass
554 # and self.key is BasicIndexer((slice(None, None, None),))
555 # so we need the explicit check for ExplicitlyIndexed
556 if isinstance(array, ExplicitlyIndexed):
File ~/virtual-envs/meshflow/lib/python3.10/site-packages/xarray/backends/netCDF4_.py:100, in __getitem__()
99 def __getitem__(self, key):
--> 100 return indexing.explicit_indexing_adapter(
101 key, self.shape, indexing.IndexingSupport.OUTER, self._getitem
102 )
File ~/virtual-envs/meshflow/lib/python3.10/site-packages/xarray/core/indexing.py:858, in explicit_indexing_adapter()
836 """"""Support explicit indexing by delegating to a raw indexing method.
837
838 Outer and/or vectorized indexers are supported by indexing a second time
(...)
855 Indexing result, in the form of a duck numpy-array.
856 """"""
857 raw_key, numpy_indices = decompose_indexer(key, shape, indexing_support)
--> 858 result = raw_indexing_method(raw_key.tuple)
859 if numpy_indices.tuple:
860 # index the loaded np.ndarray
861 result = NumpyIndexingAdapter(result)[numpy_indices]
File ~/virtual-envs/meshflow/lib/python3.10/site-packages/xarray/backends/netCDF4_.py:112, in _getitem()
110 try:
111 with self.datastore.lock:
--> 112 original_array = self.get_array(needs_lock=False)
113 array = getitem(original_array, key)
114 except IndexError:
115 # Catch IndexError in netCDF4 and return a more informative
116 # error message. This is most often called when an unsorted
117 # indexer is used before the data is loaded from disk.
File ~/virtual-envs/meshflow/lib/python3.10/site-packages/xarray/backends/netCDF4_.py:91, in get_array()
90 def get_array(self, needs_lock=True):
---> 91 ds = self.datastore._acquire(needs_lock)
92 variable = ds.variables[self.variable_name]
93 variable.set_auto_maskandscale(False)
File ~/virtual-envs/meshflow/lib/python3.10/site-packages/xarray/backends/netCDF4_.py:403, in _acquire()
402 def _acquire(self, needs_lock=True):
--> 403 with self._manager.acquire_context(needs_lock) as root:
404 ds = _nc4_require_group(root, self._group, self._mode)
405 return ds
File /cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/python/3.10.2/lib/python3.10/contextlib.py:135, in __enter__()
133 del self.args, self.kwds, self.func
134 try:
--> 135 return next(self.gen)
136 except StopIteration:
137 raise RuntimeError(""generator didn't yield"") from None
File ~/virtual-envs/meshflow/lib/python3.10/site-packages/xarray/backends/file_manager.py:199, in acquire_context()
196 @contextlib.contextmanager
197 def acquire_context(self, needs_lock=True):
198 """"""Context manager for acquiring a file.""""""
--> 199 file, cached = self._acquire_with_cache_info(needs_lock)
200 try:
201 yield file
File ~/virtual-envs/meshflow/lib/python3.10/site-packages/xarray/backends/file_manager.py:217, in _acquire_with_cache_info()
215 kwargs = kwargs.copy()
216 kwargs[""mode""] = self._mode
--> 217 file = self._opener(*self._args, **kwargs)
218 if self._mode == ""w"":
219 # ensure file doesn't get overridden when opened again
220 self._mode = ""a""
File src/netCDF4/_netCDF4.pyx:2487, in netCDF4._netCDF4.Dataset.__init__()
File src/netCDF4/_netCDF4.pyx:1928, in netCDF4._netCDF4._get_vars()
File src/netCDF4/_netCDF4.pyx:2029, in netCDF4._netCDF4._ensure_nc_success()
RuntimeError: NetCDF: HDF error
```
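The traceback ends in `netCDF4.Dataset.__init__`, i.e. while one of the input files is being opened, so one thing I can rule out cheaply is a truncated or damaged input file. Below is a rough sketch, assuming the inputs are NetCDF-4 (HDF5-based) files without a user block, in which case each file should start with the 8-byte HDF5 signature. The helper names are hypothetical, and passing this check does not prove a file is otherwise valid:

```python
import glob

# First 8 bytes of an HDF5 file (and so of a NetCDF-4 file without a
# user block). Classic NetCDF-3 files would fail this check by design.
HDF5_SIGNATURE = b'\x89HDF\r\n\x1a\n'


def has_hdf5_signature(path):
    # Hypothetical helper: True if the file starts with the HDF5 signature.
    with open(path, 'rb') as f:
        return f.read(8) == HDF5_SIGNATURE


def find_suspect_files(pattern):
    # Return, in sorted order, the matching files that fail the check.
    return [p for p in sorted(glob.glob(pattern)) if not has_hdf5_signature(p)]


if __name__ == '__main__':
    for path in find_suspect_files('./remapped/*.nc'):
        print('suspect:', path)
```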
The header of one of the individual NetCDF files is shown below:
```console
$ ncdump -h ab_models_remapped_1980-04-20-13-00-00.nc
netcdf ab_models_remapped_1980-04-20-13-00-00 {
dimensions:
COMID = 14980 ;
time = UNLIMITED ; // (24 currently)
variables:
int time(time) ;
time:long_name = ""time"" ;
time:units = ""hours since 1980-04-20 12:00:00"" ;
time:calendar = ""gregorian"" ;
time:standard_name = ""time"" ;
time:axis = ""T"" ;
double latitude(COMID) ;
latitude:long_name = ""latitude"" ;
latitude:units = ""degrees_north"" ;
latitude:standard_name = ""latitude"" ;
double longitude(COMID) ;
longitude:long_name = ""longitude"" ;
longitude:units = ""degrees_east"" ;
longitude:standard_name = ""longitude"" ;
double COMID(COMID) ;
COMID:long_name = ""shape ID"" ;
COMID:units = ""1"" ;
double RDRS_v2.1_P_P0_SFC(time, COMID) ;
RDRS_v2.1_P_P0_SFC:_FillValue = -9999. ;
RDRS_v2.1_P_P0_SFC:long_name = ""Forecast: Surface pressure"" ;
RDRS_v2.1_P_P0_SFC:units = ""mb"" ;
double RDRS_v2.1_P_HU_1.5m(time, COMID) ;
RDRS_v2.1_P_HU_1.5m:_FillValue = -9999. ;
RDRS_v2.1_P_HU_1.5m:long_name = ""Forecast: Specific humidity"" ;
RDRS_v2.1_P_HU_1.5m:units = ""kg kg**-1"" ;
double RDRS_v2.1_P_TT_1.5m(time, COMID) ;
RDRS_v2.1_P_TT_1.5m:_FillValue = -9999. ;
RDRS_v2.1_P_TT_1.5m:long_name = ""Forecast: Air temperature"" ;
RDRS_v2.1_P_TT_1.5m:units = ""deg_C"" ;
double RDRS_v2.1_P_UVC_10m(time, COMID) ;
RDRS_v2.1_P_UVC_10m:_FillValue = -9999. ;
RDRS_v2.1_P_UVC_10m:long_name = ""Forecast: Wind Modulus (derived using UU and VV)"" ;
RDRS_v2.1_P_UVC_10m:units = ""kts"" ;
double RDRS_v2.1_A_PR0_SFC(time, COMID) ;
RDRS_v2.1_A_PR0_SFC:_FillValue = -9999. ;
RDRS_v2.1_A_PR0_SFC:long_name = ""Analysis: Quantity of precipitation"" ;
RDRS_v2.1_A_PR0_SFC:units = ""m"" ;
double RDRS_v2.1_P_FB_SFC(time, COMID) ;
RDRS_v2.1_P_FB_SFC:_FillValue = -9999. ;
RDRS_v2.1_P_FB_SFC:long_name = ""Forecast: Downward solar flux"" ;
RDRS_v2.1_P_FB_SFC:units = ""W m**-2"" ;
double RDRS_v2.1_P_FI_SFC(time, COMID) ;
RDRS_v2.1_P_FI_SFC:_FillValue = -9999. ;
RDRS_v2.1_P_FI_SFC:long_name = ""Forecast: Surface incoming infrared flux"" ;
RDRS_v2.1_P_FI_SFC:units = ""W m**-2"" ;
```
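For scale, here is a back-of-the-envelope calculation of what the combined dataset should look like. It assumes all 366 files share the header above (24 time steps, 14980 `COMID` values, 7 `double` data variables) and ignores the small coordinate variables:

```python
import math

n_files = 366
n_time = 24            # time steps per file, from the header above
n_comid = 14980
n_vars = 7             # the RDRS_v2.1_* data variables
bytes_per_value = 8    # double
comid_chunk = 1400     # chunks={'COMID': 1400} in open_mfdataset

# Uncompressed size of the data variables, per file and in total.
bytes_per_file = n_time * n_comid * n_vars * bytes_per_value
total_bytes = n_files * bytes_per_file

# Number of Dask chunks: the last COMID chunk of each variable is partial.
chunks_per_var_per_file = math.ceil(n_comid / comid_chunk)
total_chunks = n_files * n_vars * chunks_per_var_per_file
chunk_bytes = n_time * comid_chunk * bytes_per_value

print(f'uncompressed size per file: {bytes_per_file / 2**20:.1f} MiB')
print(f'uncompressed size in total: {total_bytes / 2**30:.2f} GiB')
print(f'chunks per variable per file: {chunks_per_var_per_file}')
print(f'chunks in total: {total_chunks}')
print(f'size of one full chunk: {chunk_bytes / 2**10:.1f} KiB')
```

So the output should be roughly 6.9 GiB uncompressed, written in about 28 thousand chunks of shape `(24, 1400)`, which matches the slices shown in the failing task of the traceback.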
I am running `xarray` and `Dask` on an HPC cluster; the ""modules"" I have loaded are the following:
```console
$ module list
Currently Loaded Modules:
 1) CCconfig                 6) ucx/1.8.0          11) netcdf-mpi/4.9.0 (io)  16) freexl/1.0.5 (t)             21) scipy-stack/2023a (math)
 2) gentoo/2020 (S)          7) libfabric/1.10.1   12) hdf5-mpi/1.12.1 (io)   17) geos/3.10.2 (geo)            22) libspatialindex/1.8.5 (phys)
 3) gcccore/.9.3.0 (H)       8) openmpi/4.0.3 (m)  13) libffi/3.3             18) librttopo-proj9/1.1.0        23) ipykernel/2023a
 4) imkl/2020.1.217 (math)   9) StdEnv/2020 (S)    14) python/3.10.2 (t)      19) proj/9.0.1 (geo)             24) sqlite/3.38.5
 5) intel/2020.1.217 (t)    10) mii/1.1.2          15) mpi4py/3.1.3 (t)       20) libspatialite-proj901/5.0.1
```
Any suggestion is greatly appreciated!","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/8207/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,issue
1878016712,I_kwDOAMm_X85v8ELI,8137,`time` variable encoding changes upon using `to_netcdf` method on a `DataSet`,50383939,open,0,,,2,2023-09-01T20:34:58Z,2023-09-15T05:32:15Z,,NONE,,,,"### What is your issue?
When I write a `Dataset` with the `to_netcdf` method, the encoding of the `time` variable (its on-disk attributes) differs from what I requested. More specifically, the `units` string is written in a different format. Here is a reproducible example:
```python-console
$ ipython
Python 3.10.2 (main, Feb 4 2022, 19:10:35) [GCC 9.3.0]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.10.0 -- An enhanced Interactive Python. Type '?' for help.
In [1]: import xarray as xr
In [2]: import numpy as np
In [3]: import pandas as pd
In [4]: np.random.seed(0)
...: temperature = 15 + 8 * np.random.randn(2, 2, 25)
...: precipitation = 10 * np.random.rand(2, 2, 25)
...: lon = [[-99.83, -99.32], [-99.79, -99.23]]
...: lat = [[42.25, 42.21], [42.63, 42.59]]
...: time = pd.date_range(""2014-09-06"", ""2014-09-07"",freq='H')
...: reference_time = pd.Timestamp(""2014-09-05"")
In [5]: ds = xr.Dataset(
...: data_vars=dict(
...: temperature=([""x"", ""y"", ""time""], temperature),
...: precipitation=([""x"", ""y"", ""time""], precipitation),
...: ),
...: coords=dict(
...: lon=([""x"", ""y""], lon),
...: lat=([""x"", ""y""], lat),
...: time=time,
...: reference_time=reference_time,
...: ),
...: attrs=dict(description=""Weather related data.""),
...: )
...: ds
Out[5]:
Dimensions: (x: 2, y: 2, time: 25)
Coordinates:
lon (x, y) float64 -99.83 -99.32 -99.79 -99.23
lat (x, y) float64 42.25 42.21 42.63 42.59
* time (time) datetime64[ns] 2014-09-06 ... 2014-09-07
reference_time datetime64[ns] 2014-09-05
Dimensions without coordinates: x, y
Data variables:
temperature (x, y, time) float64 29.11 18.2 22.83 ... 29.29 16.02 18.22
precipitation (x, y, time) float64 4.239 6.064 0.1919 ... 8.727 2.735 7.98
Attributes:
description: Weather related data.
In [6]: ds.time
Out[6]:
array(['2014-09-06T00:00:00.000000000', '2014-09-06T01:00:00.000000000',
'2014-09-06T02:00:00.000000000', '2014-09-06T03:00:00.000000000',
'2014-09-06T04:00:00.000000000', '2014-09-06T05:00:00.000000000',
'2014-09-06T06:00:00.000000000', '2014-09-06T07:00:00.000000000',
'2014-09-06T08:00:00.000000000', '2014-09-06T09:00:00.000000000',
'2014-09-06T10:00:00.000000000', '2014-09-06T11:00:00.000000000',
'2014-09-06T12:00:00.000000000', '2014-09-06T13:00:00.000000000',
'2014-09-06T14:00:00.000000000', '2014-09-06T15:00:00.000000000',
'2014-09-06T16:00:00.000000000', '2014-09-06T17:00:00.000000000',
'2014-09-06T18:00:00.000000000', '2014-09-06T19:00:00.000000000',
'2014-09-06T20:00:00.000000000', '2014-09-06T21:00:00.000000000',
'2014-09-06T22:00:00.000000000', '2014-09-06T23:00:00.000000000',
'2014-09-07T00:00:00.000000000'], dtype='datetime64[ns]')
Coordinates:
* time (time) datetime64[ns] 2014-09-06 ... 2014-09-07
reference_time datetime64[ns] 2014-09-05
In [7]: ds.time.encoding
Out[7]: {}
In [9]: ds.to_netcdf(""./test.nc"", encoding={'time': {'units': 'hours since 2014-09-01 12:00:00'}})
In [10]: !ncdump -h ./test.nc
netcdf test {
dimensions:
x = 2 ;
y = 2 ;
time = 25 ;
variables:
double temperature(x, y, time) ;
temperature:_FillValue = NaN ;
temperature:coordinates = ""lat lon reference_time"" ;
double precipitation(x, y, time) ;
precipitation:_FillValue = NaN ;
precipitation:coordinates = ""lat lon reference_time"" ;
double lon(x, y) ;
lon:_FillValue = NaN ;
double lat(x, y) ;
lat:_FillValue = NaN ;
int64 time(time) ;
time:units = ""hours since 2014-09-01T12:00:00"" ; <------- this is the problem
time:calendar = ""proleptic_gregorian"" ;
int64 reference_time ;
reference_time:units = ""days since 2014-09-05 00:00:00"" ;
reference_time:calendar = ""proleptic_gregorian"" ;
// global attributes:
:description = ""Weather related data."" ;
}
In [11]: ds.info()
xarray.Dataset {
dimensions:
x = 2 ;
y = 2 ;
time = 25 ;
variables:
float64 temperature(x, y, time) ;
float64 precipitation(x, y, time) ;
float64 lon(x, y) ;
float64 lat(x, y) ;
datetime64[ns] time(time) ;
datetime64[ns] reference_time() ;
// global attributes:
:description = Weather related data. ;
}
```
My only concern is the `T` in the `""hours since 2014-09-01T12:00:00""` string in the final netCDF file. I would like to control this format, but even when I provide an encoding dictionary with the `units` attribute, the `T` separator is inserted into the string.
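To make the difference concrete, here is a small plain-Python sketch comparing the string I requested with the one that ends up in the file. `drop_iso_t` is a hypothetical helper I wrote for illustration; it is not part of xarray and only shows the form I expect, not a way to make `to_netcdf` produce it:

```python
import re


def drop_iso_t(units):
    # Hypothetical helper, not an xarray function: replace the ISO 8601 'T'
    # between the date and the time of a CF units string with a space.
    return re.sub(r'(\d{4}-\d{2}-\d{2})T(\d{2}:\d{2})', r'\1 \2', units)


requested = 'hours since 2014-09-01 12:00:00'  # passed via `encoding`
written = 'hours since 2014-09-01T12:00:00'    # reported by ncdump

print(written == requested)
print(drop_iso_t(written) == requested)
```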
The sample dataset is taken from here: https://docs.xarray.dev/en/stable/generated/xarray.Dataset.html
How can I avoid this? Any suggestions are appreciated. I searched online but could not find a solution.
Thanks.","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/8137/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,issue