
xr.open_mfdataset doesn't work with fsspec and dask (pydata/xarray issue #7574)

State: closed as completed · opened 2023-03-01 · closed 2023-09-08 · 12 comments

What happened?

I was trying to read multiple netCDF files as byte streams (which requires the h5netcdf engine) with xr.open_mfdataset and parallel=True, to leverage dask.delayed, but it failed (parallel=False works, though).

The netCDF files were noaa-goes16 satellite images, but I can't tell whether that matters.
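For context (an editor's sketch, not part of the original report): a single remote file opened this way reads fine serially, since the h5netcdf engine accepts open file-like objects (the default netCDF4 engine does not); the failure below is specific to parallel=True. A minimal sketch, assuming anonymous access to the public bucket:

```python
import fsspec
import xarray as xr

# anon=True is an assumption here; noaa-goes16 is a public bucket.
fs = fsspec.filesystem("s3", anon=True)

path = (
    "s3://noaa-goes16/ABI-L2-LSTC/2022/185/03/"
    "OR_ABI-L2-LSTC-M6_G16_s20221850301180_e20221850303553_c20221850305091.nc"
)

# h5netcdf can read from an open byte stream, which is why the report
# uses engine="h5netcdf" rather than the default netCDF4 engine.
ds = xr.open_dataset(fs.open(path, mode="rb"), engine="h5netcdf")
print(ds)
```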

What did you expect to happen?

It should have loaded all the netCDF files into a single xarray.Dataset object.

Minimal Complete Verifiable Example

```python
import fsspec
import xarray as xr

paths = [
    's3://noaa-goes16/ABI-L2-LSTC/2022/185/03/OR_ABI-L2-LSTC-M6_G16_s20221850301180_e20221850303553_c20221850305091.nc',
    's3://noaa-goes16/ABI-L2-LSTC/2022/185/02/OR_ABI-L2-LSTC-M6_G16_s20221850201180_e20221850203553_c20221850205142.nc',
]

fs = fsspec.filesystem('s3')

xr.open_mfdataset(
    [fs.open(path, mode="rb") for path in paths],
    engine="h5netcdf",
    combine="nested",
    concat_dim="t",
    parallel=True,
)
```
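As noted above, the same call succeeds with parallel=False. For reference, a sketch of that working variant, identical except for the parallel flag:

```python
import fsspec
import xarray as xr

paths = [
    's3://noaa-goes16/ABI-L2-LSTC/2022/185/03/OR_ABI-L2-LSTC-M6_G16_s20221850301180_e20221850303553_c20221850305091.nc',
    's3://noaa-goes16/ABI-L2-LSTC/2022/185/02/OR_ABI-L2-LSTC-M6_G16_s20221850201180_e20221850203553_c20221850205142.nc',
]
fs = fsspec.filesystem('s3')

# With parallel=False (the default) the files are opened and read in the
# main thread, so nothing is shipped through dask.delayed and pickled.
ds = xr.open_mfdataset(
    [fs.open(path, mode="rb") for path in paths],
    engine="h5netcdf",
    combine="nested",
    concat_dim="t",
    parallel=False,
)
```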

MVCE confirmation

  • [ ] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • [X] Complete example — the example is self-contained, including all data and the text of any traceback.
  • [ ] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • [X] New issue — a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

```Python
KeyError                                  Traceback (most recent call last)
File ~/miniconda3/envs/rxr/lib/python3.11/site-packages/xarray/backends/file_manager.py:210, in CachingFileManager._acquire_with_cache_info(self, needs_lock)
    209 try:
--> 210     file = self._cache[self._key]
    211 except KeyError:

File ~/miniconda3/envs/rxr/lib/python3.11/site-packages/xarray/backends/lru_cache.py:56, in LRUCache.__getitem__(self, key)
     55 with self._lock:
---> 56     value = self._cache[key]
     57     self._cache.move_to_end(key)

KeyError: [<class 'h5netcdf.core.File'>, ((b'\x89HDF\r\n', b'\x1a\n', b'\x02\x08\x08\x00\x00\x00 ... EXTREMELY LONG STRING ... 00\x00\x00\x00\x00\x00\x0ef']

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
Cell In[9], line 11
      4 paths = [
      5     's3://noaa-goes16/ABI-L2-LSTC/2022/185/03/OR_ABI-L2-LSTC-M6_G16_s20221850301180_e20221850303553_c20221850305091.nc',
      6     's3://noaa-goes16/ABI-L2-LSTC/2022/185/02/OR_ABI-L2-LSTC-M6_G16_s20221850201180_e20221850203553_c20221850205142.nc'
      7 ]
      9 fs = fsspec.filesystem('s3')
---> 11 xr.open_mfdataset(
     12     [fs.open(path, mode="rb") for path in paths],
     13     engine="h5netcdf",
     14     combine="nested",
     15     concat_dim="t",
     16     parallel=True
     17 ).LST

File ~/miniconda3/envs/rxr/lib/python3.11/site-packages/xarray/backends/api.py:991, in open_mfdataset(paths, chunks, concat_dim, compat, preprocess, engine, data_vars, coords, combine, parallel, join, attrs_file, combine_attrs, **kwargs)
    986     datasets = [preprocess(ds) for ds in datasets]
    988 if parallel:
    989     # calling compute here will return the datasets/file_objs lists,
    990     # the underlying datasets will still be stored as dask arrays
--> 991     datasets, closers = dask.compute(datasets, closers)
    993 # Combine all datasets, closing them in case of a ValueError
    994 try:

File ~/miniconda3/envs/rxr/lib/python3.11/site-packages/dask/base.py:599, in compute(traverse, optimize_graph, scheduler, get, *args, **kwargs)
    596     keys.append(x.__dask_keys__())
    597     postcomputes.append(x.__dask_postcompute__())
--> 599 results = schedule(dsk, keys, **kwargs)
    600 return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])

File ~/miniconda3/envs/rxr/lib/python3.11/site-packages/dask/threaded.py:89, in get(dsk, keys, cache, num_workers, pool, **kwargs)
     86 elif isinstance(pool, multiprocessing.pool.Pool):
     87     pool = MultiprocessingPoolExecutor(pool)
---> 89 results = get_async(
     90     pool.submit,
     91     pool._max_workers,
     92     dsk,
     93     keys,
     94     cache=cache,
     95     get_id=_thread_get_id,
     96     pack_exception=pack_exception,
     97     **kwargs,
     98 )
    100 # Cleanup pools associated to dead threads
    101 with pools_lock:

File ~/miniconda3/envs/rxr/lib/python3.11/site-packages/dask/local.py:511, in get_async(submit, num_workers, dsk, result, cache, get_id, rerun_exceptions_locally, pack_exception, raise_exception, callbacks, dumps, loads, chunksize, **kwargs)
    509         _execute_task(task, data)  # Re-execute locally
    510     else:
--> 511         raise_exception(exc, tb)
    512 res, worker_id = loads(res_info)
    513 state["cache"][key] = res

File ~/miniconda3/envs/rxr/lib/python3.11/site-packages/dask/local.py:319, in reraise(exc, tb)
    317 if exc.__traceback__ is not tb:
    318     raise exc.with_traceback(tb)
--> 319 raise exc

File ~/miniconda3/envs/rxr/lib/python3.11/site-packages/dask/local.py:224, in execute_task(key, task_info, dumps, loads, get_id, pack_exception)
    222 try:
    223     task, data = loads(task_info)
--> 224     result = _execute_task(task, data)
    225     id = get_id()
    226     result = dumps((result, id))

File ~/miniconda3/envs/rxr/lib/python3.11/site-packages/dask/core.py:119, in _execute_task(arg, cache, dsk)
    115     func, args = arg[0], arg[1:]
    116     # Note: Don't assign the subtask results to a variable. numpy detects
    117     # temporaries by their reference count and can execute certain
    118     # operations in-place.
--> 119     return func(*(_execute_task(a, cache) for a in args))
    120 elif not ishashable(arg):
    121     return arg

File ~/miniconda3/envs/rxr/lib/python3.11/site-packages/dask/utils.py:73, in apply(func, args, kwargs)
     42 """Apply a function given its positional and keyword arguments.
     43
     44 Equivalent to ``func(*args, **kwargs)``
(...)
---> 19 filename = fspath(filename)
     20 if sys.platform == "win32":
     21     if isinstance(filename, str):

TypeError: expected str, bytes or os.PathLike object, not tuple
```
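Reading the two tracebacks together (an editor's note, not from the report): the KeyError prints the CachingFileManager cache key, a tuple built around the file's leading magic bytes, and the final TypeError shows a tuple, apparently that key, reaching os.fspath, which only accepts real paths. A purely illustrative reproduction of that last step, with a hypothetical stand-in tuple:

```python
import os

# Hypothetical stand-in for the magic-bytes tuple shown in the KeyError above.
key = (b"\x89HDF\r\n", b"\x1a\n")

# os.fspath rejects anything that is not str/bytes/os.PathLike, producing the
# same message as the final error:
# "expected str, bytes or os.PathLike object, not tuple".
os.fspath(key)
```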

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS
------------------
commit: None
python: 3.11.0 | packaged by conda-forge | (main, Jan 15 2023, 05:44:48) [Clang 14.0.6]
python-bits: 64
OS: Darwin
OS-release: 21.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: None
LOCALE: (None, 'UTF-8')
libhdf5: 1.12.2
libnetcdf: None

xarray: 2023.2.0
pandas: 1.5.3
numpy: 1.24.2
scipy: 1.10.1
netCDF4: None
pydap: None
h5netcdf: 1.1.0
h5py: 3.8.0
Nio: None
zarr: None
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.3.6
cfgrib: None
iris: None
bottleneck: None
dask: 2023.2.1
distributed: 2023.2.1
matplotlib: 3.7.0
cartopy: 0.21.1
seaborn: 0.12.2
numbagg: None
fsspec: 2023.1.0
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 67.4.0
pip: 23.0.1
conda: None
pytest: 7.2.1
mypy: None
IPython: 8.10.0
sphinx: None

/Users/jo/miniconda3/envs/rxr/lib/python3.11/site-packages/_distutils_hack/__init__.py:33: UserWarning: Setuptools is replacing distutils.
  warnings.warn("Setuptools is replacing distutils.")