issue_comments

5 rows where issue = 327064908 sorted by updated_at descending

user (3 values)

  • shoyer 3
  • max-sixty 1
  • Karel-van-de-Plassche 1

author_association (2 values)

  • MEMBER 4
  • CONTRIBUTOR 1

issue (1 value)

  • Parallel non-locked read using dask.Client crashes 5
Columns: id · html_url · issue_url · node_id · user · created_at · updated_at (sorted descending) · author_association · body · reactions · performed_via_github_app · issue
id: 454162108 · node_id: MDEyOklzc3VlQ29tbWVudDQ1NDE2MjEwOA== · user: max-sixty (5635139) · created_at: 2019-01-14T21:09:03Z · updated_at: 2019-01-14T21:09:03Z · author_association: MEMBER
html_url: https://github.com/pydata/xarray/issues/2190#issuecomment-454162108
issue_url: https://api.github.com/repos/pydata/xarray/issues/2190

In an effort to reduce the issue backlog, I'll close this, but please reopen if you disagree

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
issue: Parallel non-locked read using dask.Client crashes (327064908)
id: 392672562 · node_id: MDEyOklzc3VlQ29tbWVudDM5MjY3MjU2Mg== · user: shoyer (1217238) · created_at: 2018-05-29T06:59:32Z · updated_at: 2018-05-29T06:59:32Z · author_association: MEMBER
html_url: https://github.com/pydata/xarray/issues/2190#issuecomment-392672562
issue_url: https://api.github.com/repos/pydata/xarray/issues/2190

Indeed, HDF5 supports parallel IO, but only with MPI. Unfortunately that doesn't work with Dask, at least not yet.

Zarr is certainly worth a try for performance. The motivation for zarr (rather than HDF5) was performance with distributed reads/writes, especially with cloud storage.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
issue: Parallel non-locked read using dask.Client crashes (327064908)
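
For concreteness, the Zarr suggestion above might look roughly like the sketch below; the file names and chunk size are illustrative placeholders, and it assumes an xarray version with Zarr support (to_zarr / open_zarr).

import xarray as xr

# Open the existing netCDF file lazily with dask chunks
# ("data.nc" and the chunk size are placeholders for illustration).
ds = xr.open_dataset("data.nc", chunks={"time": 1000})

# Write a Zarr store; Zarr compresses chunks by default, so the
# "need compression, so no netCDF3" constraint is still satisfied.
ds.to_zarr("data.zarr", mode="w")

# Reads of separate Zarr chunks don't go through HDF5, so they
# aren't serialized by a global library lock.
ds_zarr = xr.open_zarr("data.zarr")
print(ds_zarr)
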
id: 392666250 · node_id: MDEyOklzc3VlQ29tbWVudDM5MjY2NjI1MA== · user: Karel-van-de-Plassche (6404167) · created_at: 2018-05-29T06:27:52Z · updated_at: 2018-05-29T06:35:02Z · author_association: CONTRIBUTOR
html_url: https://github.com/pydata/xarray/issues/2190#issuecomment-392666250
issue_url: https://api.github.com/repos/pydata/xarray/issues/2190

@shoyer Thanks for your answer. Too bad. Maybe this could be documented in the 'dask' chapter? Or maybe even raise a warning when using open_dataset with lock=False on a netCDF4 file?

Unfortunately there seems to be some conflicting information floating around, which is hard to spot for a non-expert like me. It might of course just be that xarray doesn't support it (yet). I think MPI-style opening is a whole different beast, right? For example:

  • python-netcdf4 supports parallel read in threads: https://github.com/Unidata/netcdf4-python/issues/536
  • python-netcdf4 MPI parallel write/read: https://github.com/Unidata/netcdf4-python/blob/master/examples/mpi_example.py and http://unidata.github.io/netcdf4-python/#section13
  • Using h5py directly (not supported by xarray I think): http://docs.h5py.org/en/latest/mpi.html
  • Seems to suggest multiple reads are fine: https://github.com/dask/dask/issues/3074#issuecomment-359030028

> You might have better luck using dask-distributed with multiple processes, but then you'll encounter other bottlenecks with data transfer.

I'll do some more experiments, thanks for this suggestion. I am not bound to netCDF4 (although I need the compression, so no netCDF3 unfortunately), so would moving to Zarr help improve IO performance? I'd really like to keep using xarray, thanks for this awesome library! Even with the disk IO performance hit, it's still more than worth it to use it.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
issue: Parallel non-locked read using dask.Client crashes (327064908)
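
The netcdf4-python MPI route linked in the comment above (mpi_example.py) is independent of xarray; a minimal sketch, assuming netCDF4/HDF5 were built with parallel (MPI) support, looks roughly like this:

# Run with something like: mpirun -np 4 python write_parallel.py
from mpi4py import MPI
import numpy as np
from netCDF4 import Dataset

rank = MPI.COMM_WORLD.rank

# parallel=True only works if netCDF4/HDF5 were built against MPI.
nc = Dataset("parallel_test.nc", "w", parallel=True,
             comm=MPI.COMM_WORLD, info=MPI.Info())
nc.createDimension("dim", MPI.COMM_WORLD.size)
var = nc.createVariable("var", np.int64, ("dim",))

# Each MPI rank writes its own element of the variable.
var[rank] = rank
nc.close()
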
id: 392649160 · node_id: MDEyOklzc3VlQ29tbWVudDM5MjY0OTE2MA== · user: shoyer (1217238) · created_at: 2018-05-29T04:24:58Z · updated_at: 2018-05-29T04:24:58Z · author_association: MEMBER
html_url: https://github.com/pydata/xarray/issues/2190#issuecomment-392649160
issue_url: https://api.github.com/repos/pydata/xarray/issues/2190

Maybe there's some place we could document this more clearly?

lock=False would still be useful if you're reading/writing netCDF3 files.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
issue: Parallel non-locked read using dask.Client crashes (327064908)
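
As a rough sketch of the netCDF3 case above, assuming the 2018-era xarray API in which open_dataset still accepted a lock argument (newer releases handle locking inside the backends), with a placeholder file name and chunking:

import xarray as xr

# netCDF3 files go through the scipy backend rather than HDF5, so
# disabling xarray's global lock doesn't hit the HDF5 limitation.
# NOTE: lock= reflects the open_dataset signature at the time of this
# thread (2018); it is an assumption that your version still accepts it.
ds = xr.open_dataset("output_netcdf3.nc", engine="scipy",
                     chunks={"time": 100}, lock=False)

# dask can now read different chunks in parallel threads.
print(ds.mean().compute())
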
id: 392647556 · node_id: MDEyOklzc3VlQ29tbWVudDM5MjY0NzU1Ng== · user: shoyer (1217238) · created_at: 2018-05-29T04:11:55Z · updated_at: 2018-05-29T04:11:55Z · author_association: MEMBER
html_url: https://github.com/pydata/xarray/issues/2190#issuecomment-392647556
issue_url: https://api.github.com/repos/pydata/xarray/issues/2190

Unfortunately HDF5 doesn't support reading or writing files (even different files) in parallel from the same process, which is why xarray by default adds a lock around all read/write operations on netCDF4/HDF5 files. So I'm afraid this is expected behavior.

You might have better luck using dask-distributed with multiple processes, but then you'll encounter other bottlenecks with data transfer.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
issue: Parallel non-locked read using dask.Client crashes (327064908)
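
A minimal sketch of the "multiple processes" workaround from the comment above, using dask.distributed worker processes; the file name and chunk size are placeholders, and as noted, data transfer between processes becomes the new bottleneck.

import xarray as xr
from dask.distributed import Client

if __name__ == "__main__":
    # Single-threaded worker processes: each process has its own HDF5
    # state, so reads in different workers are not serialized the way
    # threads sharing one process's lock are.
    client = Client(n_workers=4, threads_per_worker=1)

    # Placeholder file and chunking; each chunk becomes a dask task.
    ds = xr.open_dataset("big_file.nc", chunks={"time": 500})

    # The reduction runs on the workers; chunk results are shipped
    # between processes, which is the transfer cost mentioned above.
    result = ds.mean("time").compute()
    print(result)

    client.close()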

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);