issue_comments: 392672562

html_url: https://github.com/pydata/xarray/issues/2190#issuecomment-392672562
issue_url: https://api.github.com/repos/pydata/xarray/issues/2190
id: 392672562
node_id: MDEyOklzc3VlQ29tbWVudDM5MjY3MjU2Mg==
user: 1217238
created_at: 2018-05-29T06:59:32Z
updated_at: 2018-05-29T06:59:32Z
author_association: MEMBER

body:

Indeed, HDF5 supports parallel IO, but only via MPI. Unfortunately, that doesn't work with Dask, at least not yet.
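For reference, a minimal sketch of what MPI-based parallel HDF5 looks like through h5py (assuming an MPI-enabled h5py build and mpi4py; the file and dataset names are placeholders), run under mpiexec rather than through Dask:

```python
# Minimal sketch: MPI-parallel HDF5 via h5py's "mpio" driver.
# Run with e.g. `mpiexec -n 4 python demo.py`.
from mpi4py import MPI
import h5py

comm = MPI.COMM_WORLD
with h5py.File("parallel_test.hdf5", "w", driver="mpio", comm=comm) as f:
    dset = f.create_dataset("test", (comm.size,), dtype="i")
    dset[comm.rank] = comm.rank  # each MPI rank writes its own slot
```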

Zarr is certainly worth a try for performance. The motivation for zarr (rather than HDF5) was performance with distributed reads/writes, especially with cloud storage.
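As a rough sketch of what switching to Zarr looks like from xarray (file names and chunk sizes are placeholders; Zarr compresses chunks by default):

```python
import xarray as xr

ds = xr.open_dataset("input.nc", chunks={"time": 100})  # lazy, dask-backed read
ds.to_zarr("output.zarr")          # each chunk becomes a separate compressed object/file
ds2 = xr.open_zarr("output.zarr")  # chunked reads can then proceed in parallel
```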

On Mon, May 28, 2018 at 11:27 PM Karel van de Plassche notifications@github.com wrote:

@shoyer Thanks for your answer. Too bad. Maybe this could be documented in the 'dask' chapter? Or maybe even raise a warning when open_dataset is called with lock=False on a netCDF4 file?
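For context, the call pattern in question is roughly the following sketch (lock was accepted as a keyword by open_dataset at the time; the file name is a placeholder):

```python
import xarray as xr

# With the default netCDF4/HDF5 backend, lock=False removes the lock that
# serializes access to the HDF5 library, which is not thread-safe.
ds = xr.open_dataset("data.nc", chunks={"time": 100}, lock=False)
```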

Unfortunately, there seems to be some conflicting information floating around, which is hard to untangle for a non-expert like me. It might, of course, just be that xarray doesn't support it (yet). For example:

  • python-netcdf4 supports parallel reads: Unidata/netcdf4-python#536 (https://github.com/Unidata/netcdf4-python/issues/536)
  • python-netcdf4 MPI parallel write/read: https://github.com/Unidata/netcdf4-python/blob/master/examples/mpi_example.py and http://unidata.github.io/netcdf4-python/#section13 (see the sketch after this list)
  • Using h5py directly (not supported by xarray, I think): http://docs.h5py.org/en/latest/mpi.html
  • Seems to suggest multiple reads are fine: dask/dask#3074 (comment) (https://github.com/dask/dask/issues/3074#issuecomment-359030028)
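The mpi_example.py linked above boils down to roughly this sketch (assuming netcdf4-python built against parallel-enabled HDF5/netCDF-C, plus mpi4py):

```python
# Run with e.g. `mpiexec -n 4 python mpi_write.py`
from mpi4py import MPI
from netCDF4 import Dataset

comm = MPI.COMM_WORLD
nc = Dataset("parallel_test.nc", "w", parallel=True, comm=comm, info=MPI.Info())
nc.createDimension("dim", comm.size)
var = nc.createVariable("var", "i8", ("dim",))
var[comm.rank] = comm.rank  # each MPI rank writes its own element
nc.close()
```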

You might have better luck using dask-distributed with multiple processes, but then you'll encounter other bottlenecks with data transfer.
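A minimal sketch of that multi-process route (worker counts here are arbitrary):

```python
from dask.distributed import Client

# Several single-threaded worker processes sidestep the HDF5 thread-safety
# problem, at the cost of transferring data between processes.
client = Client(n_workers=4, threads_per_worker=1)
```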

I'll do some more experiments, thanks for this suggestion. I'm not bound to netCDF4 (although I need the compression, so no netCDF3, unfortunately), so would moving to Zarr help improve IO performance? I'd really like to keep using xarray, thanks for this awesome library! Even with the disk IO performance hit, it's still more than worth using.
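For the compression requirement, both backends can keep it; a hedged sketch (the variable name, file names, and compression parameters are placeholders):

```python
import xarray as xr
import zarr

ds = xr.open_dataset("input.nc")  # placeholder dataset with a "temperature" variable

# netCDF4: per-variable zlib compression via encoding
ds.to_netcdf("out.nc", encoding={"temperature": {"zlib": True, "complevel": 4}})

# Zarr: chunks are compressed by default; a compressor can also be set explicitly
ds.to_zarr("out.zarr", encoding={"temperature": {"compressor": zarr.Blosc(cname="zstd", clevel=3)}})
```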


reactions:
{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
issue: 327064908