html_url,issue_url,id,node_id,user,created_at,updated_at,author_association,body,reactions,performed_via_github_app,issue
https://github.com/pydata/xarray/issues/2190#issuecomment-392672562,https://api.github.com/repos/pydata/xarray/issues/2190,392672562,MDEyOklzc3VlQ29tbWVudDM5MjY3MjU2Mg==,1217238,2018-05-29T06:59:32Z,2018-05-29T06:59:32Z,MEMBER,"Indeed, HDF5 supports parallel IO, but only with MPI. Unfortunately, that doesn't work with Dask, at least not yet.
Zarr is certainly worth a try for performance. The motivation for Zarr (rather than HDF5) was better performance for distributed reads/writes, especially on cloud storage.
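For example, a rough sketch of what switching to Zarr could look like (file names and chunk sizes here are just placeholders, not from this thread):

```python
import xarray as xr

# open the existing netCDF file lazily, backed by dask
ds = xr.open_dataset('example.nc', chunks={'time': 100})

# write it out as a Zarr store; each chunk is stored separately,
# so dask can write chunks concurrently without a global HDF5 lock
ds.to_zarr('example.zarr')

# later, read it back lazily
ds_zarr = xr.open_zarr('example.zarr')
```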
On Mon, May 28, 2018 at 11:27 PM Karel van de Plassche <notifications@github.com> wrote:
> @shoyer Thanks for your answer. Too bad.
> Maybe this could be documented in the 'dask' chapter? Or maybe even raise a
> warning when using open_dataset with lock=False on a netCDF4 file?
>
> Unfortunately there seems to be some conflicting information floating
> around, which is hard to spot for a non-expert like me. It might of course
> just be that xarray doesn't support it (yet). For example:
>
> - python-netcdf4 supports parallel reads: Unidata/netcdf4-python#536
> - python-netcdf4 MPI parallel write/read:
>   https://github.com/Unidata/netcdf4-python/blob/master/examples/mpi_example.py
>   http://unidata.github.io/netcdf4-python/#section13
> - Using h5py directly (not supported by xarray, I think):
>   http://docs.h5py.org/en/latest/mpi.html
> - This seems to suggest multiple reads are fine: dask/dask#3074 (comment)
>
>
> > You might have better luck using dask-distributed multiple processes, but
> > then you'll encounter other bottlenecks with data transfer.
>
> I'll do some more experiments, thanks for this suggestion. I am not bound
> to netCDF4 (although I need the compression, so no netCDF3 unfortunately),
> so would moving to Zarr help improve IO performance? I'd really like to
> keep using xarray, thanks for this awesome library! Even with the disk IO
> performance hit, it's still more than worth using.
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,327064908
https://github.com/pydata/xarray/issues/2190#issuecomment-392649160,https://api.github.com/repos/pydata/xarray/issues/2190,392649160,MDEyOklzc3VlQ29tbWVudDM5MjY0OTE2MA==,1217238,2018-05-29T04:24:58Z,2018-05-29T04:24:58Z,MEMBER,"Maybe there's some place we could document this more clearly?
`lock=False` would still be useful if you're reading/writing netCDF3 files.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,327064908
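A minimal sketch of that netCDF3 case (the file name and chunk sizes are placeholders; this assumes the `lock` keyword discussed in this thread):

```python
import xarray as xr

# netCDF3 files are not read through the HDF5 library, so the
# global HDF5 lock is not needed and lock=False should be safe
ds = xr.open_dataset('example_netcdf3.nc', chunks={'time': 100}, lock=False)
print(ds.mean().compute())
```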
https://github.com/pydata/xarray/issues/2190#issuecomment-392647556,https://api.github.com/repos/pydata/xarray/issues/2190,392647556,MDEyOklzc3VlQ29tbWVudDM5MjY0NzU1Ng==,1217238,2018-05-29T04:11:55Z,2018-05-29T04:11:55Z,MEMBER,"Unfortunately, HDF5 doesn't support reading or writing files (even different files) in parallel from within the same process, which is why xarray by default adds a lock around all read/write operations on netCDF4/HDF5 files. So I'm afraid this is expected behavior.
You might have better luck using dask-distributed with multiple processes, but then you'll encounter other bottlenecks with data transfer.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,327064908
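A minimal sketch of the dask-distributed suggestion above (worker counts and the file name are placeholders, not from this thread):

```python
import xarray as xr
from dask.distributed import Client

# separate worker processes each open the file independently, so
# reads can run in parallel across processes; the trade-off is the
# cost of transferring data between processes
client = Client(n_workers=4, threads_per_worker=1)

ds = xr.open_dataset('example.nc', chunks={'time': 100})
result = ds.mean().compute()
```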