
issue_comments


2 rows where issue = 1605108888 and user = 127195910 sorted by updated_at descending

id: 1460907454
html_url: https://github.com/pydata/xarray/issues/7574#issuecomment-1460907454
issue_url: https://api.github.com/repos/pydata/xarray/issues/7574
node_id: IC_kwDOAMm_X85XE62-
user: Karimat22 (127195910)
created_at: 2023-03-08T21:34:49Z
updated_at: 2023-03-15T16:54:13Z
author_association: NONE
body:

@jonas-constellr

It's possible that the failure you're experiencing is due to an issue with how the h5netcdf library is interacting with Dask.

One potential solution is to try the netCDF4 library instead of h5netcdf. netCDF4 is another popular library for reading and writing netCDF files, and it works well with xarray's Dask-based parallel reading in open_mfdataset.

To use netCDF4 with xarray, you can simply pass the 'netcdf4' engine to the xr.open_mfdataset function:

```python
import xarray as xr

# Open multiple netCDF files with the netCDF4 engine and parallel I/O
ds = xr.open_mfdataset('path/to/files/*.nc', engine='netcdf4', parallel=True)
```

If you need to use h5netcdf for some reason, another potential solution is to use the dask.array.from_delayed function to manually create a Dask array from the h5netcdf data. This can be done by first reading in the data using h5netcdf, and then using dask.delayed to parallelize the data loading across multiple chunks. Here's an example:

```python
import h5netcdf
import dask.array as da
from dask import delayed

# Define function to read in a single chunk of data from the netCDF file
@delayed
def read_chunk(filename, varname, start, count):
    with h5netcdf.File(filename, 'r') as f:
        var = f[varname][start[0]:start[0] + count[0], start[1]:start[1] + count[1]]
    return var

# Define function to read in the entire dataset using dask.array.from_delayed
def read_data(files, varname):
    chunks = (1000, 1000)  # Define chunk size
    start = (0, 0)         # Offset of the chunk read from each file
    data = [read_chunk(f, varname, start, chunks) for f in files]
    data = [da.from_delayed(d, shape=chunks, dtype='float64') for d in data]
    data = da.concatenate(data, axis=0)
    return data

# Open multiple netCDF files with the h5netcdf engine and parallel I/O
files = ['path/to/files/file1.nc', 'path/to/files/file2.nc', ...]
varname = 'my_variable'
data = read_data(files, varname)
```

This code reads in the data from each file in chunks and returns a Dask array that is a concatenation of all the chunks. The read_chunk function uses h5netcdf.File to read a single chunk of data from a file and returns a delayed object that represents the loading of that chunk. The read_data function uses dask.delayed to parallelize the loading of the chunks across all the files, then uses dask.array.from_delayed to create a Dask array from the delayed objects, and finally returns the concatenated Dask array.
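For completeness, here is a minimal sketch of how the resulting Dask array might then be used with xarray; the variable names follow the example above, while the dimension labels and the call to compute() are illustrative assumptions rather than part of the original suggestion.

```python
import xarray as xr

# Hypothetical follow-up: wrap the lazy Dask array in an xarray object
# (dimension names 'y' and 'x' are assumptions) and trigger the actual
# parallel reads only when the values are needed.
lazy = xr.DataArray(data, dims=('y', 'x'), name=varname)
result = lazy.compute()  # loads the chunks in parallel via Dask
```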

reactions:
{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
issue: xr.open_mfdataset doesn't work with fsspec and dask (1605108888)
id: 1460903268
html_url: https://github.com/pydata/xarray/issues/7574#issuecomment-1460903268
issue_url: https://api.github.com/repos/pydata/xarray/issues/7574
node_id: IC_kwDOAMm_X85XE51k
user: Karimat22 (127195910)
created_at: 2023-03-08T21:30:53Z
updated_at: 2023-03-15T16:53:58Z
author_association: NONE
body:

What happened?

I was trying to read multiple netCDF files opened as byte streams (which requires the h5netcdf engine) with xr.open_mfdataset and parallel=True to leverage dask.delayed capabilities (parallel=False works, though), but it failed.

The netCDF files were noaa-goes16 satellite images, but I can't tell whether that matters.

What did you expect to happen?

It should have loaded all the netCDF files into an xarray.Dataset object.

Minimal Complete Verifiable Example

```Python
import fsspec
import xarray as xr

paths = [
    's3://noaa-goes16/ABI-L2-LSTC/2022/185/03/OR_ABI-L2-LSTC-M6_G16_s20221850301180_e20221850303553_c20221850305091.nc',
    's3://noaa-goes16/ABI-L2-LSTC/2022/185/02/OR_ABI-L2-LSTC-M6_G16_s20221850201180_e20221850203553_c20221850205142.nc'
]

fs = fsspec.filesystem('s3')

xr.open_mfdataset(
    [fs.open(path, mode="rb") for path in paths],
    engine="h5netcdf",
    combine="nested",
    concat_dim="t",
    parallel=True
)
```
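For reference, the report states that the same call succeeds when parallel delayed opening is disabled. The sketch below is not part of the original example; it simply restates it with parallel=False, the working case the reporter describes.

```Python
# Working variant per the report: identical call, but without parallel opening
xr.open_mfdataset(
    [fs.open(path, mode="rb") for path in paths],
    engine="h5netcdf",
    combine="nested",
    concat_dim="t",
    parallel=False
)
```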

MVCE confirmation

  • [ ] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.

  • [X] Complete example — the example is self-contained, including all data and the text of any traceback.

  • [ ] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.

  • [X] New issue — a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

```Python
KeyError                                  Traceback (most recent call last)
File ~/miniconda3/envs/rxr/lib/python3.11/site-packages/xarray/backends/file_manager.py:210, in CachingFileManager._acquire_with_cache_info(self, needs_lock)
    209 try:
--> 210     file = self._cache[self._key]
    211 except KeyError:

File ~/miniconda3/envs/rxr/lib/python3.11/site-packages/xarray/backends/lru_cache.py:56, in LRUCache.__getitem__(self, key)
     55 with self._lock:
---> 56     value = self._cache[key]
     57     self._cache.move_to_end(key)

KeyError: [<class 'h5netcdf.core.File'>, ((b'\x89HDF\r\n', b'\x1a\n', b'\x02\x08\x08\x00\x00\x00 ... EXTREMELY STRING ... 00\x00\x00\x00\x00\x00\x0ef']

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
Cell In[9], line 11
      4 paths = [
      5     's3://noaa-goes16/ABI-L2-LSTC/2022/185/03/OR_ABI-L2-LSTC-M6_G16_s20221850301180_e20221850303553_c20221850305091.nc',
      6     's3://noaa-goes16/ABI-L2-LSTC/2022/185/02/OR_ABI-L2-LSTC-M6_G16_s20221850201180_e20221850203553_c20221850205142.nc'
      7     ]
      9 fs = fsspec.filesystem('s3')
---> 11 xr.open_mfdataset(
     12     [fs.open(path, mode="rb") for path in paths],
     13     engine="h5netcdf",
     14     combine="nested",
     15     concat_dim="t",
     16     parallel=True
     17 ).LST

File ~/miniconda3/envs/rxr/lib/python3.11/site-packages/xarray/backends/api.py:991, in open_mfdataset(paths, chunks, concat_dim, compat, preprocess, engine, data_vars, coords, combine, parallel, join, attrs_file, combine_attrs, **kwargs)
    986     datasets = [preprocess(ds) for ds in datasets]
    988 if parallel:
    989     # calling compute here will return the datasets/file_objs lists,
    990     # the underlying datasets will still be stored as dask arrays
--> 991     datasets, closers = dask.compute(datasets, closers)
    993 # Combine all datasets, closing them in case of a ValueError
    994 try:

File ~/miniconda3/envs/rxr/lib/python3.11/site-packages/dask/base.py:599, in compute(traverse, optimize_graph, scheduler, get, *args, **kwargs)
    596     keys.append(x.__dask_keys__())
    597     postcomputes.append(x.__dask_postcompute__())
--> 599 results = schedule(dsk, keys, **kwargs)
    600 return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])

File ~/miniconda3/envs/rxr/lib/python3.11/site-packages/dask/threaded.py:89, in get(dsk, keys, cache, num_workers, pool, **kwargs)
     86     elif isinstance(pool, multiprocessing.pool.Pool):
     87         pool = MultiprocessingPoolExecutor(pool)
---> 89 results = get_async(
     90     pool.submit,
     91     pool._max_workers,
     92     dsk,
     93     keys,
     94     cache=cache,
     95     get_id=_thread_get_id,
     96     pack_exception=pack_exception,
     97     **kwargs,
     98 )
    100 # Cleanup pools associated to dead threads
    101 with pools_lock:

File ~/miniconda3/envs/rxr/lib/python3.11/site-packages/dask/local.py:511, in get_async(submit, num_workers, dsk, result, cache, get_id, rerun_exceptions_locally, pack_exception, raise_exception, callbacks, dumps, loads, chunksize, **kwargs)
    509         _execute_task(task, data)  # Re-execute locally
    510     else:
--> 511         raise_exception(exc, tb)
    512 res, worker_id = loads(res_info)
    513 state["cache"][key] = res

File ~/miniconda3/envs/rxr/lib/python3.11/site-packages/dask/local.py:319, in reraise(exc, tb)
    317 if exc.__traceback__ is not tb:
    318     raise exc.with_traceback(tb)
--> 319 raise exc

File ~/miniconda3/envs/rxr/lib/python3.11/site-packages/dask/local.py:224, in execute_task(key, task_info, dumps, loads, get_id, pack_exception)
    222 try:
    223     task, data = loads(task_info)
--> 224     result = _execute_task(task, data)
    225     id = get_id()
    226     result = dumps((result, id))

File ~/miniconda3/envs/rxr/lib/python3.11/site-packages/dask/core.py:119, in _execute_task(arg, cache, dsk)
    115     func, args = arg[0], arg[1:]
    116     # Note: Don't assign the subtask results to a variable. numpy detects
    117     # temporaries by their reference count and can execute certain
    118     # operations in-place.
--> 119     return func(*(_execute_task(a, cache) for a in args))
    120 elif not ishashable(arg):
    121     return arg

File ~/miniconda3/envs/rxr/lib/python3.11/site-packages/dask/utils.py:73, in apply(func, args, kwargs)
     42 """Apply a function given its positional and keyword arguments.
     43
     44 Equivalent to ``func(*args, **kwargs)``
(...)
...
---> 19     filename = fspath(filename)
     20     if sys.platform == "win32":
     21         if isinstance(filename, str):

TypeError: expected str, bytes or os.PathLike object, not tuple
```

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS
------------------
commit: None
python: 3.11.0 | packaged by conda-forge | (main, Jan 15 2023, 05:44:48) [Clang 14.0.6 ]
python-bits: 64
OS: Darwin
OS-release: 21.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: None
LOCALE: (None, 'UTF-8')
libhdf5: 1.12.2
libnetcdf: None

xarray: 2023.2.0
pandas: 1.5.3
numpy: 1.24.2
scipy: 1.10.1
netCDF4: None
pydap: None
h5netcdf: 1.1.0
h5py: 3.8.0
Nio: None
zarr: None
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.3.6
cfgrib: None
iris: None
bottleneck: None
dask: 2023.2.1
distributed: 2023.2.1
matplotlib: 3.7.0
cartopy: 0.21.1
seaborn: 0.12.2
numbagg: None
fsspec: 2023.1.0
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 67.4.0
pip: 23.0.1
conda: None
pytest: 7.2.1
mypy: None
IPython: 8.10.0
sphinx: None

/Users/jo/miniconda3/envs/rxr/lib/python3.11/site-packages/_distutils_hack/__init__.py:33: UserWarning: Setuptools is replacing distutils.
  warnings.warn("Setuptools is replacing distutils.")

reactions:
{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
issue: xr.open_mfdataset doesn't work with fsspec and dask (1605108888)


CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);
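
Given this schema, the row selection shown on this page ("2 rows where issue = 1605108888 and user = 127195910 sorted by updated_at descending") corresponds to a simple SQL query. Below is a minimal sketch of running it with Python's sqlite3 module against a local copy of the database; the filename github.db is an assumption, not something stated on this page.

```python
import sqlite3

# Hypothetical local copy of the Datasette database (filename is an assumption)
conn = sqlite3.connect("github.db")
conn.row_factory = sqlite3.Row

# Reproduce the page's filter: comments on issue 1605108888 by user 127195910,
# newest update first (the indexes on [issue] and [user] support this lookup)
rows = conn.execute(
    """
    SELECT id, created_at, updated_at, author_association, body
    FROM issue_comments
    WHERE issue = 1605108888 AND [user] = 127195910
    ORDER BY updated_at DESC
    """
).fetchall()

for row in rows:
    print(row["id"], row["updated_at"])
```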