
issue_comments


2 rows where issue = 1605108888 and user = 127195910 sorted by updated_at descending

id: 1460907454
html_url: https://github.com/pydata/xarray/issues/7574#issuecomment-1460907454
issue_url: https://api.github.com/repos/pydata/xarray/issues/7574
node_id: IC_kwDOAMm_X85XE62-
user: Karimat22 (127195910)
created_at: 2023-03-08T21:34:49Z
updated_at: 2023-03-15T16:54:13Z
author_association: NONE
body:

@jonas-constellr

It's possible that the failure you're experiencing is due to an issue with how the h5netcdf library is interacting with Dask.

One potential solution is to try the netCDF4 library instead of h5netcdf. netCDF4 is another popular library for reading and writing netCDF files, and it works well with xarray's Dask-based parallel reading in open_mfdataset.

To use netCDF4 with xarray, you can simply pass the 'netcdf4' engine to the xr.open_mfdataset function:

```python
import xarray as xr

# Open multiple netCDF files with the netCDF4 engine and parallel I/O
ds = xr.open_mfdataset('path/to/files/*.nc', engine='netcdf4', parallel=True)
```

If you need to use h5netcdf for some reason, another potential solution is to use the dask.array.from_delayed function to manually create a Dask array from the h5netcdf data. This can be done by first reading in the data using h5netcdf, and then using dask.delayed to parallelize the data loading across multiple chunks. Here's an example:

```python
import h5netcdf
import dask.array as da
from dask import delayed

# Define function to read in a single chunk of data from the netCDF file
@delayed
def read_chunk(filename, varname, start, count):
    with h5netcdf.File(filename, 'r') as f:
        var = f[varname][start[0]:start[0] + count[0], start[1]:start[1] + count[1]]
    return var

# Define function to read in the entire dataset using dask.array.from_delayed
def read_data(files, varname):
    chunks = (1000, 1000)  # Define chunk size
    start = (0, 0)         # Offset of the chunk read from each file
    data = [read_chunk(f, varname, start, chunks) for f in files]
    data = [da.from_delayed(d, shape=chunks, dtype='float64') for d in data]
    data = da.concatenate(data, axis=0)
    return data

# Open multiple netCDF files with the h5netcdf engine and parallel I/O
files = ['path/to/files/file1.nc', 'path/to/files/file2.nc', ...]
varname = 'my_variable'
data = read_data(files, varname)
```

This code reads in the data from each file in chunks and returns a Dask array that is a concatenation of all the chunks. The read_chunk function uses h5netcdf.File to read a single chunk of data from a file and returns a delayed object that represents the loading of that chunk. The read_data function uses dask.delayed to parallelize the loading of the chunks across all the files, then uses dask.array.from_delayed to create a Dask array from the delayed objects, and finally returns the concatenated Dask array.
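For completeness, here is a minimal sketch of how the resulting Dask array might then be used with xarray; the variable names follow the example above, while the dimension labels and the call to compute() are illustrative assumptions rather than part of the original suggestion.

```python
import xarray as xr

# Hypothetical follow-up: wrap the lazy Dask array in an xarray object
# (dimension names 'y' and 'x' are assumptions) and trigger the actual
# parallel reads only when the values are needed.
lazy = xr.DataArray(data, dims=('y', 'x'), name=varname)
result = lazy.compute()  # loads the chunks in parallel via Dask
```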

reactions:
{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
issue: xr.open_mfdataset doesn't work with fsspec and dask (1605108888)
id: 1460903268
html_url: https://github.com/pydata/xarray/issues/7574#issuecomment-1460903268
issue_url: https://api.github.com/repos/pydata/xarray/issues/7574
node_id: IC_kwDOAMm_X85XE51k
user: Karimat22 (127195910)
created_at: 2023-03-08T21:30:53Z
updated_at: 2023-03-15T16:53:58Z
author_association: NONE
body:

What happened?

I was trying to read multiple netCDF files opened as byte streams (which requires the h5netcdf engine) with xr.open_mfdataset and parallel=True to leverage dask.delayed capabilities (parallel=False works, though), but it failed.

The netCDF files were noaa-goes16 satellite images, but I can't tell whether that matters.

What did you expect to happen?

It should have loaded all the netCDF files into an xarray.Dataset object.

Minimal Complete Verifiable Example

```Python
import fsspec
import xarray as xr

paths = [
    's3://noaa-goes16/ABI-L2-LSTC/2022/185/03/OR_ABI-L2-LSTC-M6_G16_s20221850301180_e20221850303553_c20221850305091.nc',
    's3://noaa-goes16/ABI-L2-LSTC/2022/185/02/OR_ABI-L2-LSTC-M6_G16_s20221850201180_e20221850203553_c20221850205142.nc'
]

fs = fsspec.filesystem('s3')

xr.open_mfdataset(
    [fs.open(path, mode="rb") for path in paths],
    engine="h5netcdf",
    combine="nested",
    concat_dim="t",
    parallel=True
)
```
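For reference, the report states that the same call succeeds when parallel delayed opening is disabled. The sketch below is not part of the original example; it simply restates it with parallel=False, the working case the reporter describes.

```Python
# Working variant per the report: identical call, but without parallel opening
xr.open_mfdataset(
    [fs.open(path, mode="rb") for path in paths],
    engine="h5netcdf",
    combine="nested",
    concat_dim="t",
    parallel=False
)
```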

MVCE confirmation

  • [ ] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.

  • [X] Complete example — the example is self-contained, including all data and the text of any traceback.

  • [ ] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.

  • [X] New issue — a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

```Python
KeyError                                  Traceback (most recent call last)
File ~/miniconda3/envs/rxr/lib/python3.11/site-packages/xarray/backends/file_manager.py:210, in CachingFileManager._acquire_with_cache_info(self, needs_lock)
    209 try:
--> 210     file = self._cache[self._key]
    211 except KeyError:

File ~/miniconda3/envs/rxr/lib/python3.11/site-packages/xarray/backends/lru_cache.py:56, in LRUCache.__getitem__(self, key)
     55 with self._lock:
---> 56     value = self._cache[key]
     57     self._cache.move_to_end(key)

KeyError: [<class 'h5netcdf.core.File'>, ((b'\x89HDF\r\n', b'\x1a\n', b'\x02\x08\x08\x00\x00\x00 ... EXTREMELY STRING ... 00\x00\x00\x00\x00\x00\x0ef']

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
Cell In[9], line 11
      4 paths = [
      5     's3://noaa-goes16/ABI-L2-LSTC/2022/185/03/OR_ABI-L2-LSTC-M6_G16_s20221850301180_e20221850303553_c20221850305091.nc',
      6     's3://noaa-goes16/ABI-L2-LSTC/2022/185/02/OR_ABI-L2-LSTC-M6_G16_s20221850201180_e20221850203553_c20221850205142.nc'
      7     ]
      9 fs = fsspec.filesystem('s3')
---> 11 xr.open_mfdataset(
     12     [fs.open(path, mode="rb") for path in paths],
     13     engine="h5netcdf",
     14     combine="nested",
     15     concat_dim="t",
     16     parallel=True
     17 ).LST

File ~/miniconda3/envs/rxr/lib/python3.11/site-packages/xarray/backends/api.py:991, in open_mfdataset(paths, chunks, concat_dim, compat, preprocess, engine, data_vars, coords, combine, parallel, join, attrs_file, combine_attrs, **kwargs)
    986     datasets = [preprocess(ds) for ds in datasets]
    988 if parallel:
    989     # calling compute here will return the datasets/file_objs lists,
    990     # the underlying datasets will still be stored as dask arrays
--> 991     datasets, closers = dask.compute(datasets, closers)
    993 # Combine all datasets, closing them in case of a ValueError
    994 try:

File ~/miniconda3/envs/rxr/lib/python3.11/site-packages/dask/base.py:599, in compute(traverse, optimize_graph, scheduler, get, *args, **kwargs)
    596     keys.append(x.__dask_keys__())
    597     postcomputes.append(x.__dask_postcompute__())
--> 599 results = schedule(dsk, keys, **kwargs)
    600 return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])

File ~/miniconda3/envs/rxr/lib/python3.11/site-packages/dask/threaded.py:89, in get(dsk, keys, cache, num_workers, pool, **kwargs)
     86     elif isinstance(pool, multiprocessing.pool.Pool):
     87         pool = MultiprocessingPoolExecutor(pool)
---> 89 results = get_async(
     90     pool.submit,
     91     pool._max_workers,
     92     dsk,
     93     keys,
     94     cache=cache,
     95     get_id=_thread_get_id,
     96     pack_exception=pack_exception,
     97     **kwargs,
     98 )
    100 # Cleanup pools associated to dead threads
    101 with pools_lock:

File ~/miniconda3/envs/rxr/lib/python3.11/site-packages/dask/local.py:511, in get_async(submit, num_workers, dsk, result, cache, get_id, rerun_exceptions_locally, pack_exception, raise_exception, callbacks, dumps, loads, chunksize, **kwargs)
    509         _execute_task(task, data)  # Re-execute locally
    510     else:
--> 511         raise_exception(exc, tb)
    512 res, worker_id = loads(res_info)
    513 state["cache"][key] = res

File ~/miniconda3/envs/rxr/lib/python3.11/site-packages/dask/local.py:319, in reraise(exc, tb)
    317 if exc.__traceback__ is not tb:
    318     raise exc.with_traceback(tb)
--> 319 raise exc

File ~/miniconda3/envs/rxr/lib/python3.11/site-packages/dask/local.py:224, in execute_task(key, task_info, dumps, loads, get_id, pack_exception)
    222 try:
    223     task, data = loads(task_info)
--> 224     result = _execute_task(task, data)
    225     id = get_id()
    226     result = dumps((result, id))

File ~/miniconda3/envs/rxr/lib/python3.11/site-packages/dask/core.py:119, in _execute_task(arg, cache, dsk)
    115     func, args = arg[0], arg[1:]
    116     # Note: Don't assign the subtask results to a variable. numpy detects
    117     # temporaries by their reference count and can execute certain
    118     # operations in-place.
--> 119     return func(*(_execute_task(a, cache) for a in args))
    120 elif not ishashable(arg):
    121     return arg

File ~/miniconda3/envs/rxr/lib/python3.11/site-packages/dask/utils.py:73, in apply(func, args, kwargs)
     42 """Apply a function given its positional and keyword arguments.
     43
     44 Equivalent to ``func(*args, **kwargs)``
(...)
...
---> 19     filename = fspath(filename)
     20     if sys.platform == "win32":
     21         if isinstance(filename, str):

TypeError: expected str, bytes or os.PathLike object, not tuple
```

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS
------------------
commit: None
python: 3.11.0 | packaged by conda-forge | (main, Jan 15 2023, 05:44:48) [Clang 14.0.6 ]
python-bits: 64
OS: Darwin
OS-release: 21.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: None
LOCALE: (None, 'UTF-8')
libhdf5: 1.12.2
libnetcdf: None

xarray: 2023.2.0
pandas: 1.5.3
numpy: 1.24.2
scipy: 1.10.1
netCDF4: None
pydap: None
h5netcdf: 1.1.0
h5py: 3.8.0
Nio: None
zarr: None
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.3.6
cfgrib: None
iris: None
bottleneck: None
dask: 2023.2.1
distributed: 2023.2.1
matplotlib: 3.7.0
cartopy: 0.21.1
seaborn: 0.12.2
numbagg: None
fsspec: 2023.1.0
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 67.4.0
pip: 23.0.1
conda: None
pytest: 7.2.1
mypy: None
IPython: 8.10.0
sphinx: None

/Users/jo/miniconda3/envs/rxr/lib/python3.11/site-packages/_distutils_hack/__init__.py:33: UserWarning: Setuptools is replacing distutils.
  warnings.warn("Setuptools is replacing distutils.")

reactions:
{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
issue: xr.open_mfdataset doesn't work with fsspec and dask (1605108888)


CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);
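
Given this schema, the row selection shown on this page ("2 rows where issue = 1605108888 and user = 127195910 sorted by updated_at descending") corresponds to a simple SQL query. Below is a minimal sketch of running it with Python's sqlite3 module against a local copy of the database; the filename github.db is an assumption, not something stated on this page.

```python
import sqlite3

# Hypothetical local copy of the Datasette database (filename is an assumption)
conn = sqlite3.connect("github.db")
conn.row_factory = sqlite3.Row

# Reproduce the page's filter: comments on issue 1605108888 by user 127195910,
# newest update first (the indexes on [issue] and [user] support this lookup)
rows = conn.execute(
    """
    SELECT id, created_at, updated_at, author_association, body
    FROM issue_comments
    WHERE issue = 1605108888 AND [user] = 127195910
    ORDER BY updated_at DESC
    """
).fetchall()

for row in rows:
    print(row["id"], row["updated_at"])
```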