issues: 771127744


id: 771127744
node_id: MDU6SXNzdWU3NzExMjc3NDQ=
number: 4710
title: open_mfdataset -> to_netcdf() randomly leading to dead workers
user: 10050469
state: closed
locked: 0
comments: 4
created_at: 2020-12-18T19:42:14Z
updated_at: 2020-12-22T11:54:37Z
closed_at: 2020-12-22T11:54:37Z
author_association: MEMBER
state_reason: completed
repo: 13221727
type: issue

This is:
- xarray: 0.16.2
- dask: 2.30.0

I'm not sure a GitHub issue is the right place to report this, but I don't know where else, so here it is.

I just spent two very long weeks debugging stalled (i.e. "dead") OGGM jobs in a cluster environment. I finally tracked the problem down to `ds.to_netcdf(path)` in this situation:

```python
with xr.open_mfdataset(tmp_paths, combine='nested', concat_dim='rgi_id') as ds:
    ds.to_netcdf(path)
```

`tmp_paths` is a list of a few netCDF files (from 2 to about 60). The combined dataset is nowhere near big (a few hundred MB at most).

Most of the time, this command works just fine. But in about 30% of cases it just... stops and stalls. One or more of the workers simply stop working, without ever coming back or erroring.

What I can give as additional information:
- changing `ds.to_netcdf(path)` to `ds.load().to_netcdf(path)` solves the problem (see the sketch after this list)
- the problem became more frequent when the files to concatenate contained more variables (the final size of the concatenated file doesn't seem to matter at all; it also occurs with files < 1 MB)
- I can't reproduce the problem locally. The files are here if someone's interested, but I don't think the files themselves are the issue.
- the files use gzip compression
- on the cluster we are dealing with 64-core nodes, which do a lot of work before arriving at these two lines. We use Python multiprocessing ourselves before that, create our own pool and use it, etc. But at the moment the job hits these two lines, no other job is running.
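
For reference, the workaround from the first bullet looks like this in full (a minimal sketch; `tmp_paths` and `path` are the same variables as above):

```python
import xarray as xr

# Workaround from above: load the combined dataset into memory first,
# so the netCDF write no longer goes through dask's lazy/parallel write path.
with xr.open_mfdataset(tmp_paths, combine='nested', concat_dim='rgi_id') as ds:
    ds.load().to_netcdf(path)
```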

Is this some kind of weird interaction between our own multiprocessing and dask? Is it more an IO problem that occurs only on the cluster? I don't know.
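
One thing I could try to narrow this down (just an idea, untested so far) is to force dask's single-threaded scheduler for this step only; if the stall goes away, the parallel scheduler / multiprocessing interaction looks more suspicious than IO:

```python
import dask
import xarray as xr

# Untested idea: run the write with dask's synchronous (single-threaded)
# scheduler. If the stall disappears, the problem is more likely in the
# parallel scheduler interacting with our multiprocessing setup than in IO.
with dask.config.set(scheduler="synchronous"):
    with xr.open_mfdataset(tmp_paths, combine='nested', concat_dim='rgi_id') as ds:
        ds.to_netcdf(path)
```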

I know this is a crappy bug report, but the fact that I lost so much time on this recently has gotten on my nerves :wink: (I'm mostly angry at myself for taking so long to find out that these two lines were the problem).

To make a question out of this crappy report: how can I possibly debug this? I have solved my problem for now (with `ds.load()`), but this is not really satisfying. Any tips are appreciated!
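
The only concrete idea I have so far (untested, and the worker numbers below are arbitrary) is to run the write under an explicit `dask.distributed` client, so the dashboard shows which tasks hang or which workers die:

```python
from dask.distributed import Client

import xarray as xr

# Untested idea: an explicit distributed client exposes the dashboard,
# which should show stuck tasks or dying workers during the write.
client = Client(n_workers=4, threads_per_worker=1)
print(client.dashboard_link)  # open this URL in a browser while the job runs

with xr.open_mfdataset(tmp_paths, combine='nested', concat_dim='rgi_id') as ds:
    ds.to_netcdf(path)

client.close()
```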

cc @TimoRoth, our cluster IT, whom I annoyed a lot before finding out that the problem was in xarray/dask.

reactions:
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/4710/reactions",
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
