issue_comments: 1460907454
field | value
---|---
html_url | https://github.com/pydata/xarray/issues/7574#issuecomment-1460907454
issue_url | https://api.github.com/repos/pydata/xarray/issues/7574
id | 1460907454
node_id | IC_kwDOAMm_X85XE62-
user | 127195910
created_at | 2023-03-08T21:34:49Z
updated_at | 2023-03-15T16:54:13Z
author_association | NONE
reactions | { "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }
performed_via_github_app | 
issue | 1605108888

body:
@jonas-constellr It's possible that the failure you're experiencing is due to how the h5netcdf library interacts with Dask.

One potential solution is to use the netCDF4 library instead of h5netcdf. netCDF4 is another popular library for reading and writing netCDF files, and xarray can read through it in parallel via Dask. To use it, pass the `'netcdf4'` engine to `xr.open_mfdataset`:

```python
import xarray as xr

# Open multiple netCDF files with the netCDF4 engine and parallel I/O
ds = xr.open_mfdataset('path/to/files/*.nc', engine='netcdf4', parallel=True)
```

If you need to use h5netcdf for some reason, another potential solution is to build the Dask array manually with `dask.array.from_delayed`: read the data with h5netcdf inside a `dask.delayed` function, then assemble the delayed reads into a single array. Here's an example:

```python
import h5netcdf
import dask.array as da
from dask import delayed

# Read a single chunk of data from one netCDF file
@delayed
def read_chunk(filename, varname, start, count):
    with h5netcdf.File(filename, 'r') as f:
        var = f[varname][start[0]:start[0] + count[0],
                         start[1]:start[1] + count[1]]
    return var

# Build one Dask array over all files using dask.array.from_delayed
def read_data(files, varname):
    chunks = (1000, 1000)  # chunk size; each file must hold at least this much
    # Read the (0, 0)-anchored chunk of each file lazily
    data = [read_chunk(f, varname, (0, 0), chunks) for f in files]
    data = [da.from_delayed(d, shape=chunks, dtype='float64') for d in data]
    return da.concatenate(data, axis=0)

# Open multiple netCDF files with the h5netcdf engine and parallel I/O
files = ['path/to/files/file1.nc', 'path/to/files/file2.nc', ...]
varname = 'my_variable'
data = read_data(files, varname)
```

This code reads one chunk from each file and returns a Dask array that is the concatenation of all the chunks. `read_chunk` uses `h5netcdf.File` to read a single chunk of data from a file and, because it is wrapped in `dask.delayed`, returns a delayed object representing the loading of that chunk. `read_data` creates one delayed read per file, uses `da.from_delayed` to turn each into a Dask array, and concatenates them along the first axis. Nothing is read from disk until the result is computed, so the per-file loads can run in parallel.
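If you want to continue in xarray from there, the resulting Dask array can be wrapped in a `DataArray`. A minimal sketch, assuming the concatenated array is 2-D; the dimension names `'y'` and `'x'` are placeholders rather than anything read from the files:

```python
import xarray as xr

# `data` is the concatenated Dask array returned by read_data above;
# the dims here are hypothetical labels, not taken from the files
arr = xr.DataArray(data, dims=('y', 'x'), name='my_variable')

# Operations stay lazy until computed; compute() triggers the
# parallel per-file reads and then the reduction
mean = arr.mean()
print(mean.compute())
```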
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
1605108888 |