issue_comments: 1460907454


html_url: https://github.com/pydata/xarray/issues/7574#issuecomment-1460907454
issue_url: https://api.github.com/repos/pydata/xarray/issues/7574
id: 1460907454
node_id: IC_kwDOAMm_X85XE62-
user: 127195910
created_at: 2023-03-08T21:34:49Z
updated_at: 2023-03-15T16:54:13Z
author_association: NONE

@jonas-constellr

It's possible that the failure you're experiencing is due to an issue with how the h5netcdf library is interacting with Dask.

One potential solution is to try the netCDF4 library instead of h5netcdf. netCDF4 is another popular library for reading and writing netCDF files, and xarray can use it as an engine for parallel, Dask-backed reads.

To use netCDF4 with xarray, you can simply pass the 'netcdf4' engine to the xr.open_mfdataset function:

```python
import xarray as xr

# Open multiple netCDF files with the netCDF4 engine and parallel I/O
ds = xr.open_mfdataset('path/to/files/*.nc', engine='netcdf4', parallel=True)
```
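Note that `parallel=True` tells `open_mfdataset` to open and preprocess each file through `dask.delayed`, and the resulting dataset is lazy, so nothing is read until a computation is triggered. As a minimal usage sketch (not part of the original suggestion; `my_variable` is a placeholder name, matching the one used further down):

```python
import xarray as xr

# Open the files lazily with the netCDF4 engine; dask coordinates the parallel reads
ds = xr.open_mfdataset('path/to/files/*.nc', engine='netcdf4', parallel=True)

# No data has been loaded yet; .compute() triggers the actual parallel I/O.
# 'my_variable' is a placeholder variable name, not something defined in this issue.
mean_value = ds['my_variable'].mean().compute()
print(mean_value)
```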

If you need to use h5netcdf for some reason, another option is to build the Dask array manually with dask.array.from_delayed: read the data with h5netcdf, wrap the reads in dask.delayed so they can run in parallel, and assemble the pieces into a single Dask array. Here's an example:

```python
import h5netcdf
import dask.array as da
from dask import delayed


# Read a single chunk of data from one netCDF file
@delayed
def read_chunk(filename, varname, start, count):
    with h5netcdf.File(filename, 'r') as f:
        var = f[varname][start[0]:start[0] + count[0], start[1]:start[1] + count[1]]
    return var


# Read the entire dataset using dask.array.from_delayed
def read_data(files, varname):
    chunks = (1000, 1000)  # chunk size
    start = (0, 0)         # offset of the chunk within each file
    data = [read_chunk(f, varname, start, chunks) for f in files]
    data = [da.from_delayed(d, shape=chunks, dtype='float64') for d in data]
    data = da.concatenate(data, axis=0)
    return data


# Open multiple netCDF files with h5netcdf and parallel reads
files = ['path/to/files/file1.nc', 'path/to/files/file2.nc', ...]
varname = 'my_variable'
data = read_data(files, varname)
```

This code reads the data from each file in chunks and returns a Dask array that is the concatenation of all the chunks. The read_chunk function uses h5netcdf.File to read a single chunk from a file and, because it is wrapped in dask.delayed, returns a delayed object representing that load. The read_data function builds one delayed read per file so the loads can run in parallel, converts each delayed object into a Dask array with dask.array.from_delayed, and returns the concatenated Dask array.
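If you want to end up back in xarray rather than working with the raw Dask array, the result of read_data can be wrapped in a DataArray. This is a minimal sketch, not part of the original suggestion: the dimension names ('y', 'x') are hypothetical, and the da.zeros call stands in for the array returned by read_data above:

```python
import dask.array as da
import xarray as xr

# Stand-in for the array returned by read_data(files, varname); the shape,
# chunks and dtype here are placeholders matching the example above.
data = da.zeros((2000, 1000), chunks=(1000, 1000), dtype='float64')

# Wrap the Dask array so downstream work stays in xarray; dimension names are
# hypothetical and should match the layout of the variable in your files.
arr = xr.DataArray(data, dims=('y', 'x'), name='my_variable')

# Everything is still lazy; .compute() runs the chunked reads in parallel
result = arr.mean(dim='x').compute()
```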

reactions:
{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
issue: 1605108888