Comment by @shoyer (MEMBER), 2018-05-29T06:59:32Z
https://github.com/pydata/xarray/issues/2190#issuecomment-392672562

Indeed, HDF5 supports parallel IO, but only with MPI. Unfortunately that doesn't work with Dask, at least not yet.

Zarr is certainly worth a try for performance. The motivation for zarr (rather than HDF5) was performance with distributed reads/writes, especially with cloud storage.

On Mon, May 28, 2018 at 11:27 PM Karel van de Plassche <notifications@github.com> wrote:

> @shoyer Thanks for your answer. Too bad. Maybe this could be documented in the 'dask' chapter? Or maybe even raise a warning when using `open_dataset` with `lock=False` on a netCDF4 file?
>
> Unfortunately there seems to be some conflicting information floating around, which is hard to spot for a non-expert like me. It might of course just be that xarray doesn't support it (yet). For example:
>
> - python-netcdf4 supports parallel reads: Unidata/netcdf4-python#536
> - python-netcdf4 MPI parallel write/read: https://github.com/Unidata/netcdf4-python/blob/master/examples/mpi_example.py and http://unidata.github.io/netcdf4-python/#section13
> - Using h5py directly (not supported by xarray, I think): http://docs.h5py.org/en/latest/mpi.html
> - Seems to suggest multiple reads are fine: dask/dask#3074 (comment)
>
> > You might have better luck using dask-distributed multiple processes, but then you'll encounter other bottlenecks with data transfer.
>
> I'll do some more experiments, thanks for this suggestion. I am not bound to netCDF4 (although I need the compression, so no netCDF3 unfortunately), so would moving to Zarr help improve IO performance? I'd really like to keep using xarray, thanks for this awesome library! Even with the disk IO performance hit, it's still more than worth it.

Comment by @shoyer (MEMBER), 2018-05-29T04:24:58Z
https://github.com/pydata/xarray/issues/2190#issuecomment-392649160

Maybe there's some place we could document this more clearly? `lock=False` would still be useful if you're reading/writing netCDF3 files.
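As an aside, here is a minimal sketch of the Zarr route suggested above, assuming xarray with the zarr and dask packages installed; the file names, variable name, and chunk sizes are placeholders, not anything prescribed in this thread. Zarr stores compress chunks by default (Blosc), which also covers the compression requirement mentioned in the quoted reply.

```python
import xarray as xr

# One-time conversion: open the existing netCDF4 file lazily with dask chunks,
# then write it to a zarr store. Zarr keeps each chunk as a separate compressed
# object, so there is no library-wide lock like the one HDF5 requires.
ds = xr.open_dataset("input.nc", chunks={"time": 1000})
ds.to_zarr("converted.zarr", mode="w")

# Later reads go straight to the zarr store and parallelize across dask
# workers without serializing on an HDF5 lock.
ds_z = xr.open_zarr("converted.zarr")
result = ds_z["some_variable"].mean(dim="time").compute()
```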
Comment by @shoyer (MEMBER), 2018-05-29T04:11:55Z
https://github.com/pydata/xarray/issues/2190#issuecomment-392647556

Unfortunately HDF5 doesn't support reading or writing files (even different files) in parallel within the same process, which is why xarray by default adds a lock around all read/write operations on NetCDF4/HDF5 files. So I'm afraid this is expected behavior.

You might have better luck using dask-distributed with multiple processes, but then you'll encounter other bottlenecks with data transfer.
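For reference, a minimal sketch of the dask-distributed multi-process approach mentioned in the last comment; the worker counts, file pattern, and variable name are illustrative, and whether it helps depends on how much data has to move between worker processes.

```python
import xarray as xr
from dask.distributed import Client

# Several single-threaded worker processes: each process loads its own copy of
# the HDF5 library, so netCDF4 reads can run concurrently. The trade-off is
# that intermediate results have to be serialized between processes.
client = Client(n_workers=4, threads_per_worker=1)

# Lazily open many netCDF files as one dask-backed dataset and reduce it.
ds = xr.open_mfdataset("data/*.nc", chunks={"time": 1000})
mean_field = ds["some_variable"].mean(dim="time").compute()
print(mean_field)
```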