issues: 304201107

id	node_id	number	title	user	state	locked	assignee	milestone	comments	created_at	updated_at	closed_at	author_association	active_lock_reason	draft	pull_request	body	reactions	performed_via_github_app	state_reason	repo	type
304201107	MDU6SXNzdWUzMDQyMDExMDc=	1981	use dask to open datasets in parallel	2443309	closed	0			5	2018-03-11T22:33:52Z	2018-04-20T12:04:23Z	2018-04-20T12:04:23Z	MEMBER				Code Sample, a copy-pastable example if possible `python xr.open_mfdataset('path/to/many/files.nc', method='parallel')` Problem description We have many issues describing the less than stellar performance of open_mfdataset (e.g. #511, #893, #1385, #1788, #1823). The problem can be broken into three pieces: 1) open each file, 2) decode/preprocess each dataset, and 3) merge/combine/concat the collection of datasets. We can perform (1) and (2) in parallel (performance improvements to (3) would be a separate task). Lately, I'm finding that for large numbers of files, it can take many seconds to many minutes just to open all the files in a multi-file dataset of mine. I'm proposing that we use something like `dask.bag` to parallelize steps (1) and (2). I've played around with this a bit and it "works" almost right out of the box, provided you are using the "autoclose=True" option. A concrete example: We could change the line: `Python datasets = [open_dataset(p, **open_kwargs) for p in paths]` to `Python import dask.bag as db paths_bag = db.from_sequence(paths) datasets = paths_bag.map(open_dataset, **open_kwargs).compute()` I'm curious what others think of this idea and what the potential pitfalls may be.	{ "url": "https://api.github.com/repos/pydata/xarray/issues/1981/reactions", "total_count": 2, "+1": 2, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }		completed	13221727	issue
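
The three-step split described in the issue body (open, decode/preprocess, then combine) can be sketched without any netCDF files. This is only an illustration under stated assumptions: it swaps in the standard library's `concurrent.futures.ThreadPoolExecutor` for `dask.bag` so that it runs anywhere, and `open_and_preprocess`, `open_many`, `combine`, and the `scale` keyword are hypothetical stand-ins, not xarray or dask functions. Steps (1) and (2) run in parallel per path; step (3) stays serial, as the issue suggests.

```python
from concurrent.futures import ThreadPoolExecutor


def open_and_preprocess(path, scale=1):
    """Hypothetical stand-in for steps (1) and (2): open one file and decode it.

    A real implementation would call something like open_dataset(path, **kwargs);
    here we just derive a small record from the path so the sketch is runnable.
    """
    return {"source": path, "values": [scale * len(path)]}


def open_many(paths, max_workers=4, **open_kwargs):
    """Run steps (1) and (2) for every path in parallel.

    ThreadPoolExecutor is an assumption of this sketch, used in place of
    dask.bag. Executor.map returns results in the same order as `paths`.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(
            pool.map(lambda p: open_and_preprocess(p, **open_kwargs), paths)
        )


def combine(datasets):
    """Hypothetical stand-in for step (3): combine the opened datasets serially."""
    merged = []
    for ds in datasets:
        merged.extend(ds["values"])
    return merged


paths = ["a.nc", "bb.nc", "ccc.nc"]  # placeholder file names
datasets = open_many(paths, scale=2)
combined = combine(datasets)
```

Note that the keyword arguments are forwarded with `**open_kwargs`, matching the list-comprehension form in the issue, and that result order follows input order, which matters when the combine step concatenates along a dimension.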

Links from other tables

0 rows from issues_id in issues_labels
5 rows from issue in issue_comments