issue_comments

17 rows where author_association = "MEMBER" and issue = 91184107 sorted by updated_at descending

id html_url issue_url node_id user created_at updated_at ▲ author_association body reactions performed_via_github_app issue
120447670 https://github.com/pydata/xarray/issues/444#issuecomment-120447670 https://api.github.com/repos/pydata/xarray/issues/444 MDEyOklzc3VlQ29tbWVudDEyMDQ0NzY3MA== shoyer 1217238 2015-07-10T16:11:19Z 2015-07-10T16:11:19Z MEMBER

@razvanc87 I've gotten a few other reports of issues with multithreading (not just you), so I think we do definitely need to add our own lock when accessing these files. Misconfigured hdf5 installs may not be so uncommon.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  segmentation fault with `open_mfdataset` 91184107
118435615 https://github.com/pydata/xarray/issues/444#issuecomment-118435615 https://api.github.com/repos/pydata/xarray/issues/444 MDEyOklzc3VlQ29tbWVudDExODQzNTYxNQ== shoyer 1217238 2015-07-03T22:43:41Z 2015-07-03T22:43:41Z MEMBER

@razvanc87 netcdf4 and h5py use the same HDF5 libraries, but have different bindings from Python. H5py likely does a more careful job of using its own locks to ensure thread safety, which likely explains the difference you are seeing (the attribute encoding is a separate issue).

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  segmentation fault with `open_mfdataset` 91184107
118435484 https://github.com/pydata/xarray/issues/444#issuecomment-118435484 https://api.github.com/repos/pydata/xarray/issues/444 MDEyOklzc3VlQ29tbWVudDExODQzNTQ4NA== shoyer 1217238 2015-07-03T22:40:57Z 2015-07-03T22:40:57Z MEMBER

> The library itself is not threadsafe? What about on a per-file basis?

@andrewcollette could you comment on this for h5py/hdf5?

@mrocklin based on my reading of Andrew's comment in the h5py issue, this is indeed the case.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  segmentation fault with `open_mfdataset` 91184107
118373188 https://github.com/pydata/xarray/issues/444#issuecomment-118373188 https://api.github.com/repos/pydata/xarray/issues/444 MDEyOklzc3VlQ29tbWVudDExODM3MzE4OA== mrocklin 306380 2015-07-03T15:26:18Z 2015-07-03T15:26:18Z MEMBER

The library itself is not threadsafe? What about on a per-file basis?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  segmentation fault with `open_mfdataset` 91184107
118195247 https://github.com/pydata/xarray/issues/444#issuecomment-118195247 https://api.github.com/repos/pydata/xarray/issues/444 MDEyOklzc3VlQ29tbWVudDExODE5NTI0Nw== shoyer 1217238 2015-07-02T23:45:01Z 2015-07-02T23:45:01Z MEMBER

Ah, I think I know why the seg faults are still occurring. By default, dask.array.from_array uses a thread lock that is specific to each array variable. We need a global thread lock, because the HDF5 library is not thread safe.

@mrocklin maybe da.from_array should use a global thread lock if lock=True? Alternatively, I could just change this in xray -- but I suspect that other dask users who want a lock also probably want a global lock.
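
The difference between a per-variable lock and a global lock can be illustrated with a minimal standard-library sketch. It does not use dask or HDF5: `FakeLibrary`, `LockedArray`, and `max_concurrent_reads` are hypothetical names invented for this illustration, and the `time.sleep` call merely stands in for time spent inside a non-thread-safe library.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class FakeLibrary:
    """Hypothetical stand-in for a non-thread-safe library shared by all arrays."""

    def __init__(self):
        self._guard = threading.Lock()  # protects the counters only
        self.active = 0
        self.max_active = 0

    def read(self, value):
        with self._guard:
            self.active += 1
            self.max_active = max(self.max_active, self.active)
        time.sleep(0.01)  # simulate time spent inside the library
        with self._guard:
            self.active -= 1
        return value


class LockedArray:
    """Hypothetical wrapper: every read of this variable takes the given lock."""

    def __init__(self, library, value, lock):
        self.library = library
        self.value = value
        self.lock = lock

    def read(self):
        with self.lock:
            return self.library.read(self.value)


def max_concurrent_reads(make_lock, n_arrays=4):
    """Read n_arrays variables from threads; report peak concurrency in the library."""
    library = FakeLibrary()
    arrays = [LockedArray(library, i, make_lock()) for i in range(n_arrays)]
    with ThreadPoolExecutor(max_workers=n_arrays) as pool:
        list(pool.map(lambda arr: arr.read(), arrays))
    return library.max_active


GLOBAL_LOCK = threading.Lock()
per_array = max_concurrent_reads(threading.Lock)    # a fresh lock per variable
shared = max_concurrent_reads(lambda: GLOBAL_LOCK)  # one lock for every variable
```

With one shared lock, at most one thread is ever inside the library (`shared` is 1). With a lock per variable, threads reading different variables can overlap, so `per_array` is typically greater than 1, which is the situation that would still allow a crash in a library that is not thread safe.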

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  segmentation fault with `open_mfdataset` 91184107
118090209 https://github.com/pydata/xarray/issues/444#issuecomment-118090209 https://api.github.com/repos/pydata/xarray/issues/444 MDEyOklzc3VlQ29tbWVudDExODA5MDIwOQ== shoyer 1217238 2015-07-02T16:46:57Z 2015-07-02T16:46:57Z MEMBER

Thanks for your help debugging!

I made a new issue for ascii attributes handling: https://github.com/xray/xray/issues/451

This is one case where Python 3's insistence that bytes and strings are different is annoying. I'll probably have to decode all bytes type attributes read from h5netcdf.

How do you trigger the seg-fault with netcdf4-python? Just using open_mfdataset as before? I'm a little surprised that still happens with the thread lock.
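
Decoding bytes attributes could look roughly like the sketch below. This is only an illustration, not the code that went into xray: `decode_attrs` is a hypothetical helper, the attributes are modelled as a plain dict, and UTF-8 is an assumed encoding.

```python
def decode_attrs(attrs, encoding="utf-8"):
    """Return a copy of attrs with bytes values decoded to str.

    The encoding default is an assumption; other value types pass through.
    """
    decoded = {}
    for name, value in attrs.items():
        if isinstance(value, bytes):
            value = value.decode(encoding)
        decoded[name] = value
    return decoded


# Hypothetical attributes, mixing bytes, str and numeric values.
raw = {"units": b"K", "long_name": "temperature", "scale": 0.5}
clean = decode_attrs(raw)
```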

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  segmentation fault with `open_mfdataset` 91184107
116787098 https://github.com/pydata/xarray/issues/444#issuecomment-116787098 https://api.github.com/repos/pydata/xarray/issues/444 MDEyOklzc3VlQ29tbWVudDExNjc4NzA5OA== shoyer 1217238 2015-06-29T18:30:48Z 2015-06-29T18:30:48Z MEMBER

@razvanc87 What version of h5py were you using with h5netcdf? @andrewcollette suggests (https://github.com/h5py/h5py/issues/591#issuecomment-116785660) that h5py should already have the lock that fixes this issue if you were using h5py 2.4.0 or later.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  segmentation fault with `open_mfdataset` 91184107
116779716 https://github.com/pydata/xarray/issues/444#issuecomment-116779716 https://api.github.com/repos/pydata/xarray/issues/444 MDEyOklzc3VlQ29tbWVudDExNjc3OTcxNg== shoyer 1217238 2015-06-29T18:07:52Z 2015-06-29T18:07:52Z MEMBER

Just merged the fix to master.

@razvanc87 if you could try installing the development version, I would love to hear if this resolves your issues.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  segmentation fault with `open_mfdataset` 91184107
116189535 https://github.com/pydata/xarray/issues/444#issuecomment-116189535 https://api.github.com/repos/pydata/xarray/issues/444 MDEyOklzc3VlQ29tbWVudDExNjE4OTUzNQ== shoyer 1217238 2015-06-28T03:34:30Z 2015-06-28T03:34:30Z MEMBER

I have a tentative fix (adding the threading lock) in https://github.com/xray/xray/pull/446

Still wondering why multi-threading can't use more than one CPU -- hopefully my h5py issue (referenced above) will get us some answers.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  segmentation fault with `open_mfdataset` 91184107
116182511 https://github.com/pydata/xarray/issues/444#issuecomment-116182511 https://api.github.com/repos/pydata/xarray/issues/444 MDEyOklzc3VlQ29tbWVudDExNjE4MjUxMQ== mrocklin 306380 2015-06-28T01:55:39Z 2015-06-28T01:55:39Z MEMBER

Oh, I didn't realize that that was built in already. Sounds like you could handle this easily on the xray side.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  segmentation fault with `open_mfdataset` 91184107
116165986 https://github.com/pydata/xarray/issues/444#issuecomment-116165986 https://api.github.com/repos/pydata/xarray/issues/444 MDEyOklzc3VlQ29tbWVudDExNjE2NTk4Ng== shoyer 1217238 2015-06-27T23:40:29Z 2015-06-27T23:40:29Z MEMBER

Of course, concurrent access to HDF5 files works fine on my laptop, using Anaconda's build of HDF5 (version 1.8.14). I have no idea what special flags they invoked when building it :).

That said, I have been unable to produce any benchmarks that show improved performance when simply doing multithreaded reads without doing any computation (e.g., %time xray.open_dataset(..., chunks=...).load()). Even when I'm reading multiple independent chunks compressed on disk, CPU seems to be pegged at 100%, when using either netCDF4-python or h5py (via h5netcdf) to read the data. For non-compressed data, reads seem to be limited by disk speed, so CPU is also not relevant.

Given these considerations, it seems like we should use a lock when reading data into xray with dask. @mrocklin we could just use lock=True with da.from_array, right? If we can find use cases for multi-threaded reads, we could also add an optional lock argument to open_dataset/open_mfdataset.
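
An optional lock argument of the kind floated above might behave like this sketch. It uses only the standard library; `read_chunk` and `DEFAULT_LOCK` are hypothetical names, and the `True` / `False` / lock-object convention is an assumption made for illustration rather than the signature of `open_dataset` or `da.from_array`.

```python
import threading
from contextlib import nullcontext

# Hypothetical module-level lock shared by all readers by default.
DEFAULT_LOCK = threading.Lock()


def read_chunk(source, key, lock=True):
    """Read source[key], optionally serialising access.

    lock=True  -> use the shared default lock
    lock=False -> no locking (allow concurrent reads)
    otherwise  -> use the lock object that was passed in
    """
    if lock is True:
        lock = DEFAULT_LOCK
    elif lock is False or lock is None:
        lock = nullcontext()
    with lock:
        return source[key]


data = list(range(10))
first = read_chunk(data, slice(0, 3))                         # default: locked
second = read_chunk(data, slice(3, 6), lock=False)            # opt out of locking
third = read_chunk(data, slice(6, 9), lock=threading.Lock())  # caller-supplied lock
```

Defaulting to the locked path keeps the safe behaviour for users with HDF5 builds that are not thread safe, while still leaving a way to opt out if a use case for multi-threaded reads turns up.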

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  segmentation fault with `open_mfdataset` 91184107
116162351 https://github.com/pydata/xarray/issues/444#issuecomment-116162351 https://api.github.com/repos/pydata/xarray/issues/444 MDEyOklzc3VlQ29tbWVudDExNjE2MjM1MQ== mrocklin 306380 2015-06-27T22:12:37Z 2015-06-27T22:12:37Z MEMBER

There was a similar problem with PyTables, which didn't support concurrency well. This resulted in the from-hdf5 function in dask array which uses explicit locks to avoid concurrent access.

We could repeat this treatment more generally without much trouble, forcing single-threaded access to the files while still allowing parallelism otherwise.

On Jun 27, 2015 2:33 PM, "Răzvan Rădulescu" wrote:

> So I just tried @mrocklin's idea with using single-threaded stuff. This seems to fix the segmentation fault, but I am very curious as to why there's a problem with working in parallel. I tried two different hdf5 libraries (I think version 1.8.13 and 1.8.14) but I got the same segmentation fault. Anyway, working on a single thread is not a big deal, I'll just do that for the time being... I already tried gdb on python but I'm not experienced enough to make heads or tails of it... I have the gdb backtrace here https://gist.github.com/razvanc87/0986c4f7a591772e1778 but I don't know what to do with it...
>
> @shoyer, the files are not the issue here, they're the same ones I provided in #443 (https://github.com/xray/xray/issues/443).
>
> Question: does the hdf5 library need to be built with parallel support (mpi or something) maybe?... thanks guys

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  segmentation fault with `open_mfdataset` 91184107
115930797 https://github.com/pydata/xarray/issues/444#issuecomment-115930797 https://api.github.com/repos/pydata/xarray/issues/444 MDEyOklzc3VlQ29tbWVudDExNTkzMDc5Nw== mrocklin 306380 2015-06-27T01:09:44Z 2015-06-27T01:09:44Z MEMBER

Alternatively can we try doing the operations that xray would do manually and see if one of them triggers something?

One could also try

$ gdb python
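
As a complement to gdb (a different technique from the one suggested here), Python 3's standard-library `faulthandler` module can dump the Python-level traceback of every thread when the interpreter receives a fatal signal such as SIGSEGV. A minimal sketch, in which `enable_crash_traceback` and the log file name are hypothetical:

```python
import faulthandler
import os
import tempfile


def enable_crash_traceback(path):
    """Write Python tracebacks for all threads to `path` on a fatal signal.

    The file object is returned because it must stay open for as long as
    faulthandler is enabled.
    """
    log = open(path, "w")
    faulthandler.enable(file=log, all_threads=True)
    return log


# Hypothetical log location; any writable path would do.
log_path = os.path.join(tempfile.gettempdir(), "segfault_traceback.log")
crash_log = enable_crash_traceback(log_path)

# ... run the code that triggers the seg fault here ...
```

If the process then crashes, the log should show which Python call each thread was in, which can be easier to read than a raw gdb backtrace.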

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  segmentation fault with `open_mfdataset` 91184107
115930685 https://github.com/pydata/xarray/issues/444#issuecomment-115930685 https://api.github.com/repos/pydata/xarray/issues/444 MDEyOklzc3VlQ29tbWVudDExNTkzMDY4NQ== mrocklin 306380 2015-06-27T01:08:13Z 2015-06-27T01:08:13Z MEMBER

@shoyer asked me to chime in in case this is an issue with dask. One thing to try would be to remove multi-threading from the equation. I'm not sure how this would affect things but it's worth a shot.

```python
import dask
from dask.async import get_sync
dask.set_options(get=get_sync)  # use single-threaded scheduler by default
# ... do work as normal
```

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  segmentation fault with `open_mfdataset` 91184107
115925776 https://github.com/pydata/xarray/issues/444#issuecomment-115925776 https://api.github.com/repos/pydata/xarray/issues/444 MDEyOklzc3VlQ29tbWVudDExNTkyNTc3Ng== shoyer 1217238 2015-06-27T00:49:19Z 2015-06-27T00:49:19Z MEMBER

Do you have an example file? This might also be your HDF5 install...

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  segmentation fault with `open_mfdataset` 91184107
115902800 https://github.com/pydata/xarray/issues/444#issuecomment-115902800 https://api.github.com/repos/pydata/xarray/issues/444 MDEyOklzc3VlQ29tbWVudDExNTkwMjgwMA== shoyer 1217238 2015-06-26T22:01:41Z 2015-06-26T22:01:41Z MEMBER

Another backend to try would be engine='h5netcdf': https://github.com/shoyer/h5netcdf

That might help us identify if this is a netCDF4-python bug.

I am also baffled by how inserting isnull(arr1 & arr2) avoids the seg fault. This is a lazy computation created with dask that is immediately thrown away without accessing any of the values.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  segmentation fault with `open_mfdataset` 91184107
115887568 https://github.com/pydata/xarray/issues/444#issuecomment-115887568 https://api.github.com/repos/pydata/xarray/issues/444 MDEyOklzc3VlQ29tbWVudDExNTg4NzU2OA== shoyer 1217238 2015-06-26T21:25:50Z 2015-06-26T21:25:50Z MEMBER

Oh my, that's bad!

Can you experiment with the engine argument to open_mfdataset and see if that changes things? For example, try engine='scipy' (if these are netCDF3 files) and engine='netcdf4'.

It would also be helpful to report the dtypes of the arrays that trigger the failure in array_equiv.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  segmentation fault with `open_mfdataset` 91184107

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);
Powered by Datasette · Queries took 11.779ms · About: xarray-datasette