html_url,issue_url,id,node_id,user,created_at,updated_at,author_association,body,reactions,performed_via_github_app,issue
https://github.com/pydata/xarray/issues/7574#issuecomment-1460907454,https://api.github.com/repos/pydata/xarray/issues/7574,1460907454,IC_kwDOAMm_X85XE62-,127195910,2023-03-08T21:34:49Z,2023-03-15T16:54:13Z,NONE,"@jonas-constellr It's possible that the failure you're experiencing is due to an issue with how the h5netcdf library is interacting with Dask.

One potential solution is to try the netCDF4 library instead of h5netcdf. netCDF4 is another popular library for reading and writing netCDF files, and xarray's netCDF4 engine supports parallel reads through Dask. To use it, simply pass the 'netcdf4' engine to the xr.open_mfdataset function:

```python
import xarray as xr

# Open multiple netCDF files with the netCDF4 engine and parallel I/O
ds = xr.open_mfdataset('path/to/files/*.nc', engine='netcdf4', parallel=True)
```

If you need to use h5netcdf for some reason, another potential solution is to use the dask.array.from_delayed function to manually create a Dask array from the h5netcdf data. This can be done by first reading in the data using h5netcdf, and then using dask.delayed to parallelize the data loading across multiple chunks. Here's an example:

```python
import h5netcdf
import dask.array as da
from dask import delayed

# Lazily read a single chunk of data from one netCDF file
@delayed
def read_chunk(filename, varname, start, count):
    with h5netcdf.File(filename, 'r') as f:
        var = f[varname][start[0]:start[0]+count[0], start[1]:start[1]+count[1]]
    return var

# Build one Dask array covering all files using dask.array.from_delayed
def read_data(files, varname):
    chunks = (1000, 1000)  # chunk size read from each file
    start = (0, 0)  # read each file's chunk from the origin
    data = [read_chunk(f, varname, start, chunks) for f in files]
    data = [da.from_delayed(d, shape=chunks, dtype='float64') for d in data]
    data = da.concatenate(data, axis=0)
    return data

# Open multiple netCDF files with the h5netcdf engine and parallel I/O
files = ['path/to/files/file1.nc', 'path/to/files/file2.nc', ...]
varname = 'my_variable'
data = read_data(files, varname)
```

This code reads one chunk of data from each file and returns a Dask array that is the concatenation of those chunks. The read_chunk function uses h5netcdf.File to read a single chunk from a file and, being wrapped in dask.delayed, returns a delayed object representing that load. The read_data function builds one delayed load per file, uses dask.array.from_delayed to turn each delayed object into a Dask array, and finally returns the concatenated Dask array.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1605108888
https://github.com/pydata/xarray/issues/7574#issuecomment-1460903268,https://api.github.com/repos/pydata/xarray/issues/7574,1460903268,IC_kwDOAMm_X85XE51k,127195910,2023-03-08T21:30:53Z,2023-03-15T16:53:58Z,NONE,"> > ### What happened?
> >
> > I was trying to read multiple netcdf files as bytes (which requires the h5netcdf engine) with xr.open_mfdataset with parallel=True to leverage dask.delayed capabilities (parallel=False works, though), but it failed.
> >
> > The netcdf files were noaa-goes16 satellite images, but I can't tell if it matters.
> >
> > ### What did you expect to happen?
> >
> > It should have loaded all the netcdf files into a xarray.Dataset object
> >
> > ### Minimal Complete Verifiable Example
> >
> > ```Python
> > import fsspec
> > import xarray as xr
> >
> > paths = [
> >     's3://noaa-goes16/ABI-L2-LSTC/2022/185/03/OR_ABI-L2-LSTC-M6_G16_s20221850301180_e20221850303553_c20221850305091.nc',
> >     's3://noaa-goes16/ABI-L2-LSTC/2022/185/02/OR_ABI-L2-LSTC-M6_G16_s20221850201180_e20221850203553_c20221850205142.nc'
> > ]
> >
> > fs = fsspec.filesystem('s3')
> >
> > xr.open_mfdataset(
> >     [fs.open(path, mode=""rb"") for path in paths],
> >     engine=""h5netcdf"",
> >     combine=""nested"",
> >     concat_dim=""t"",
> >     parallel=True
> > )
> > ```
> >
> > ### MVCE confirmation
> >
> > - [ ] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
> > - [X] Complete example — the example is self-contained, including all data and the text of any traceback.
> > - [ ] Verifiable example — the example copy & pastes into an IPython prompt or [Binder notebook](https://mybinder.org/v2/gh/pydata/xarray/main?urlpath=lab/tree/doc/examples/blank_template.ipynb), returning the result.
> > - [X] New issue — a search of GitHub Issues suggests this is not a duplicate.
> >
> > ### Relevant log output
> >
> > ```Python
> > --------------------------------------------------------------------------
> > KeyError Traceback (most recent call last)
> > File ~/miniconda3/envs/rxr/lib/python3.11/site-packages/xarray/backends/file_manager.py:210, in CachingFileManager._acquire_with_cache_info(self, needs_lock)
> > 209 try:
> > --> 210 file = self._cache[self._key]
> > 211 except KeyError:
> >
> > File ~/miniconda3/envs/rxr/lib/python3.11/site-packages/xarray/backends/lru_cache.py:56, in LRUCache.__getitem__(self, key)
> > 55 with self._lock:
> > ---> 56 value = self._cache[key]
> > 57 self._cache.move_to_end(key)
> >
> > KeyError: [, ((b'\x89HDF\r\n', b'\x1a\n', b'\x02\x08\x08\x00\x00\x00 ... EXTREMELY STRING ... 00\x00\x00\x00\x00\x00\x00\x0ef']
> >
> > During handling of the above exception, another exception occurred:
> >
> > TypeError Traceback (most recent call last)
> > Cell In[9], line 11
> > 4 paths = [
> > 5 's3://noaa-goes16/ABI-L2-LSTC/2022/185/03/OR_ABI-L2-LSTC-M6_G16_s20221850301180_e20221850303553_c20221850305091.nc',
> > 6 's3://noaa-goes16/ABI-L2-LSTC/2022/185/02/OR_ABI-L2-LSTC-M6_G16_s20221850201180_e20221850203553_c20221850205142.nc'
> > 7 ]
> > 9 fs = fsspec.filesystem('s3')
> > ---> 11 xr.open_mfdataset(
> > 12 [fs.open(path, mode=""rb"") for path in paths],
> > 13 engine=""h5netcdf"",
> > 14 combine=""nested"",
> > 15 concat_dim=""t"",
> > 16 parallel=True
> > 17 ).LST
> >
> > File ~/miniconda3/envs/rxr/lib/python3.11/site-packages/xarray/backends/api.py:991, in open_mfdataset(paths, chunks, concat_dim, compat, preprocess, engine, data_vars, coords, combine, parallel, join, attrs_file, combine_attrs, **kwargs)
> > 986 datasets = [preprocess(ds) for ds in datasets]
> > 988 if parallel:
> > 989 # calling compute here will return the datasets/file_objs lists,
> > 990 # the underlying datasets will still be stored as dask arrays
> > --> 991 datasets, closers = dask.compute(datasets, closers)
> > 993 # Combine all datasets, closing them in case of a ValueError
> > 994 try:
> >
> > File ~/miniconda3/envs/rxr/lib/python3.11/site-packages/dask/base.py:599, in compute(traverse, optimize_graph, scheduler, get, *args, **kwargs)
> > 596 keys.append(x.__dask_keys__())
> > 597 postcomputes.append(x.__dask_postcompute__())
> > --> 599 results = schedule(dsk, keys, **kwargs)
> > 600 return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
> >
> > File ~/miniconda3/envs/rxr/lib/python3.11/site-packages/dask/threaded.py:89, in get(dsk, keys, cache, num_workers, pool, **kwargs)
> > 86 elif isinstance(pool, multiprocessing.pool.Pool):
> > 87 pool = MultiprocessingPoolExecutor(pool)
> > ---> 89 results = get_async(
> > 90 pool.submit,
> > 91 pool._max_workers,
> > 92 dsk,
> > 93 keys,
> > 94 cache=cache,
> > 95 get_id=_thread_get_id,
> > 96 pack_exception=pack_exception,
> > 97 **kwargs,
> > 98 )
> > 100 # Cleanup pools associated to dead threads
> > 101 with pools_lock:
> >
> > File ~/miniconda3/envs/rxr/lib/python3.11/site-packages/dask/local.py:511, in get_async(submit, num_workers, dsk, result, cache, get_id, rerun_exceptions_locally, pack_exception, raise_exception, callbacks, dumps, loads, chunksize, **kwargs)
> > 509 _execute_task(task, data) # Re-execute locally
> > 510 else:
> > --> 511 raise_exception(exc, tb)
> > 512 res, worker_id = loads(res_info)
> > 513 state[""cache""][key] = res
> >
> > File ~/miniconda3/envs/rxr/lib/python3.11/site-packages/dask/local.py:319, in reraise(exc, tb)
> > 317 if exc.__traceback__ is not tb:
> > 318 raise exc.with_traceback(tb)
> > --> 319 raise exc
> >
> > File ~/miniconda3/envs/rxr/lib/python3.11/site-packages/dask/local.py:224, in execute_task(key, task_info, dumps, loads, get_id, pack_exception)
> > 222 try:
> > 223 task, data = loads(task_info)
> > --> 224 result = _execute_task(task, data)
> > 225 id = get_id()
> > 226 result = dumps((result, id))
> >
> > File ~/miniconda3/envs/rxr/lib/python3.11/site-packages/dask/core.py:119, in _execute_task(arg, cache, dsk)
> > 115 func, args = arg[0], arg[1:]
> > 116 # Note: Don't assign the subtask results to a variable. numpy detects
> > 117 # temporaries by their reference count and can execute certain
> > 118 # operations in-place.
> > --> 119 return func(*(_execute_task(a, cache) for a in args))
> > 120 elif not ishashable(arg):
> > 121 return arg
> >
> > File ~/miniconda3/envs/rxr/lib/python3.11/site-packages/dask/utils.py:73, in apply(func, args, kwargs)
> > 42 """"""Apply a function given its positional and keyword arguments.
> > 43
> > 44 Equivalent to ``func(*args, **kwargs)``
> > (...)
> > ...
> > ---> 19 filename = fspath(filename)
> > 20 if sys.platform == ""win32"":
> > 21 if isinstance(filename, str):
> >
> > TypeError: expected str, bytes or os.PathLike object, not tuple
> > ```
> >
> > ### Anything else we need to know?
> >
> > _No response_
> >
> > ### Environment
> >
> > INSTALLED VERSIONS
> > ------------------
> > commit: None
> > python: 3.11.0 | packaged by conda-forge | (main, Jan 15 2023, 05:44:48) [Clang 14.0.6 ]
> > python-bits: 64
> > OS: Darwin
> > OS-release: 21.6.0
> > machine: x86_64
> > processor: i386
> > byteorder: little
> > LC_ALL: None
> > LANG: None
> > LOCALE: (None, 'UTF-8')
> > libhdf5: 1.12.2
> > libnetcdf: None
> >
> > xarray: 2023.2.0
> > pandas: 1.5.3
> > numpy: 1.24.2
> > scipy: 1.10.1
> > netCDF4: None
> > pydap: None
> > h5netcdf: 1.1.0
> > h5py: 3.8.0
> > Nio: None
> > zarr: None
> > cftime: None
> > nc_time_axis: None
> > PseudoNetCDF: None
> > rasterio: 1.3.6
> > cfgrib: None
> > iris: None
> > bottleneck: None
> > dask: 2023.2.1
> > distributed: 2023.2.1
> > matplotlib: 3.7.0
> > cartopy: 0.21.1
> > seaborn: 0.12.2
> > numbagg: None
> > fsspec: 2023.1.0
> > cupy: None
> > pint: None
> > sparse: None
> > flox: None
> > numpy_groupies: None
> > setuptools: 67.4.0
> > pip: 23.0.1
> > conda: None
> > pytest: 7.2.1
> > mypy: None
> > IPython: 8.10.0
> > sphinx: None
> > [/Users/jo/miniconda3/envs/rxr/lib/python3.11/site-packages/_distutils_hack/__init__.py:33](https://file+.vscode-resource.vscode-cdn.net/Users/jo/miniconda3/envs/rxr/lib/python3.11/site-packages/_distutils_hack/__init__.py:33): UserWarning: Setuptools is replacing distutils.
> > warnings.warn(""Setuptools is replacing distutils."")
> > ","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1605108888 https://github.com/pydata/xarray/issues/7574#issuecomment-1466263309,https://api.github.com/repos/pydata/xarray/issues/7574,1466263309,IC_kwDOAMm_X85XZWcN,5821660,2023-03-13T14:37:40Z,2023-03-13T14:37:40Z,MEMBER,"Thanks @martindurant for looking into this. I'll try to go to the bottom of this, if I find the time.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1605108888 https://github.com/pydata/xarray/issues/7574#issuecomment-1466255394,https://api.github.com/repos/pydata/xarray/issues/7574,1466255394,IC_kwDOAMm_X85XZUgi,6042212,2023-03-13T14:32:53Z,2023-03-13T14:32:53Z,CONTRIBUTOR,"Sorry, I really don't know what goes inside xarray's cache layers. It seems that fsspec is doing the right thing if it opens via one route, and parallel=True shouldn't require any serialisation for the in-process threaded scheduler.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1605108888 https://github.com/pydata/xarray/issues/7574#issuecomment-1463576675,https://api.github.com/repos/pydata/xarray/issues/7574,1463576675,IC_kwDOAMm_X85XPGhj,5821660,2023-03-10T10:12:27Z,2023-03-10T10:12:27Z,MEMBER,"The difference between `parallel=False` vs `parallel=True` is how the cache is setup: - `parallel=False`, the cache-key looks like: `[, (,), 'r', (('decode_vlen_strings', True), ('invalid_netcdf', None)), '50fa07a0-7db1-420d-b2e1-5ad4c1328be4']` - `parallel=True`, the cache-key looks like: `[, ((b'FHIB\x00\x1dn""\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00u[...cutted...],), 'r', (('decode_vlen_strings', True), ('invalid_netcdf', None)), 'be975f9d-710d-4c2e-9e03-606b6f8f0810']` So this has something to do, when and how the cache is initialized. In the first case we've got the file-like object, but in the second case we've got the binary string. @martindurant do you have some idea what's going on here and where to look for fixing this?","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1605108888 https://github.com/pydata/xarray/issues/7574#issuecomment-1463528885,https://api.github.com/repos/pydata/xarray/issues/7574,1463528885,IC_kwDOAMm_X85XO621,5821660,2023-03-10T09:36:38Z,2023-03-10T09:36:38Z,MEMBER,"@Berhinj The issue is with `parallel=True`. The following works as expected: ```python import fsspec import xarray as xr paths = [ 's3://noaa-goes16/ABI-L2-LSTC/2022/185/03/OR_ABI-L2-LSTC-M6_G16_s20221850301180_e20221850303553_c20221850305091.nc', 's3://noaa-goes16/ABI-L2-LSTC/2022/185/02/OR_ABI-L2-LSTC-M6_G16_s20221850201180_e20221850203553_c20221850205142.nc' ] fs = fsspec.filesystem('s3', anon=True) flist = [fs.open(path, mode=""rb"") for path in paths] # single dataset ds = xr.open_dataset(flist[0], engine=""h5netcdf"") print(ds) # multiple datasets, ATT: parallel=False works ts = xr.open_mfdataset( flist, engine=""h5netcdf"", combine=""nested"", concat_dim=""t"", parallel=False ) print(ts) ``` I've not digged further, but it looks like it has some issues with the file-object... 
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1605108888 https://github.com/pydata/xarray/issues/7574#issuecomment-1463454208,https://api.github.com/repos/pydata/xarray/issues/7574,1463454208,IC_kwDOAMm_X85XOooA,34525099,2023-03-10T08:35:47Z,2023-03-10T08:35:47Z,NONE,"Unfortunately, the data is stored in byte and I remember having error message when using engine=""netcdf4"" which doesn't support bytes apparently and suggests using ""h5netcdf"". I'm afraid you're right the issue is on h5netcdf side as I've tried some of the solution you provided without success, I'll try the rest of them - I guess I'll post an issue on h5netcdf repo","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1605108888