issues: 771127744
id | node_id | number | title | user | state | locked | assignee | milestone | comments | created_at | updated_at | closed_at | author_association | active_lock_reason | draft | pull_request | body | reactions | performed_via_github_app | state_reason | repo | type |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
771127744 | MDU6SXNzdWU3NzExMjc3NDQ= | 4710 | open_mfdataset -> to_netcdf() randomly leading to dead workers | 10050469 | closed | 0 |  |  | 4 | 2020-12-18T19:42:14Z | 2020-12-22T11:54:37Z | 2020-12-22T11:54:37Z | MEMBER |  |  |  | This is: - xarray: 0.16.2 - dask: 2.30.0 I'm not sure a GitHub issue is the right place to report this, but I'm not sure where else, so here it is. I just had two very long weeks of debugging stalled (i.e. "dead") OGGM jobs in a cluster environment, and I finally nailed it down to an `open_mfdataset` -> `to_netcdf()` call. Most of the time, this command works just fine. But in 30% of the cases, it would just... stop and stall. One or more of the workers would simply stop working, without coming back or erroring. What I can give as additional information: - changing … Is this some kind of weird interaction between our own multiprocessing and dask? Is it more an I/O problem that occurs only on the cluster? I don't know. I know this is a crappy bug report, but the fact that I lost so much time on this recently has gotten on my nerves :wink: (I'm mostly angry at myself for taking so long to find out that these two lines were the problem). To make a question out of this crappy report: how can I possibly debug this? I solved my problem now (with …). cc @TimoRoth, our cluster IT, whom I annoyed a lot before finding out that the problem was in xarray/dask. | { "url": "https://api.github.com/repos/pydata/xarray/issues/4710/reactions", "total_count": 1, "+1": 1, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |  | completed | 13221727 | issue |
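
For context, a minimal sketch of the pattern the title and body describe: many NetCDF files opened lazily with `open_mfdataset` and written back out with `to_netcdf()`. The actual two-line snippet was elided from this record, so the glob pattern, `combine` argument, chunking, and output paths below are placeholders; the synchronous-scheduler variant at the end is a generic dask debugging step (one way to approach "how can I possibly debug this?"), not the fix the reporter ended up using.

```python
# Hypothetical reconstruction of the reported pattern; paths, the combine
# strategy, and chunk sizes are placeholders, not taken from the report.
import dask
import xarray as xr

files = "run_*/output.nc"  # placeholder glob for the per-job output files

# Open many files lazily (dask-backed), then write one combined file.
# According to the report, this stalls in roughly 30% of cluster runs:
# one or more dask workers stop making progress without raising an error.
with xr.open_mfdataset(files, combine="by_coords", chunks={"time": 100}) as ds:
    ds.to_netcdf("combined.nc")

# A generic way to start narrowing this down (not the reporter's fix):
# force dask's synchronous scheduler so everything runs in the main thread,
# which rules out interactions with the caller's own multiprocessing.
with dask.config.set(scheduler="synchronous"):
    with xr.open_mfdataset(files, combine="by_coords") as ds:
        ds.to_netcdf("combined_sync.nc")
```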