issues: 304201107

id: 304201107
node_id: MDU6SXNzdWUzMDQyMDExMDc=
number: 1981
title: use dask to open datasets in parallel
user: 2443309
state: closed
locked: 0
assignee:
milestone:
comments: 5
created_at: 2018-03-11T22:33:52Z
updated_at: 2018-04-20T12:04:23Z
closed_at: 2018-04-20T12:04:23Z
author_association: MEMBER
active_lock_reason:
draft:
pull_request:

Code Sample, a copy-pastable example if possible

    xr.open_mfdataset('path/to/many/files*.nc', method='parallel')

Problem description

We have many issues describing the less-than-stellar performance of open_mfdataset (e.g. #511, #893, #1385, #1788, #1823). The problem can be broken into three pieces: 1) open each file, 2) decode/preprocess each dataset, and 3) merge/combine/concat the collection of datasets. We can perform (1) and (2) in parallel (performance improvements to (3) would be a separate task). Lately, I'm finding that for large numbers of files, it can take many seconds to many minutes just to open all the files in a multi-file dataset of mine.

I'm proposing that we use something like dask.bag to parallelize steps (1) and (2). I've played around with this a bit and it "works" almost right out of the box, provided you are using the "autoclose=True" option. A concrete example:

We could change the line:

    datasets = [open_dataset(p, **open_kwargs) for p in paths]

to:

    import dask.bag as db
    paths_bag = db.from_sequence(paths)
    datasets = paths_bag.map(open_dataset, **open_kwargs).compute()
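To make the split between the three pieces concrete without needing netCDF files or dask installed, here is a minimal sketch of the same map-over-paths pattern. It swaps in the standard library's thread pool for dask.bag, and `open_one`, `preprocess`, and `combine` are hypothetical stand-ins for `open_dataset`, the decode/preprocess step, and the merge/concat step. It only illustrates the shape of the idea, not how xarray would implement it.

```python
from concurrent.futures import ThreadPoolExecutor


def open_one(path):
    # Hypothetical stand-in for open_dataset: pretend each "file" holds one value.
    return {"path": path, "values": [len(path)]}


def preprocess(ds):
    # Hypothetical stand-in for decoding/preprocessing a single dataset.
    return {**ds, "values": [v * 2 for v in ds["values"]]}


def open_and_preprocess(path):
    # Steps (1) and (2) only touch one file, so each call is independent.
    return preprocess(open_one(path))


def combine(datasets):
    # Step (3) stays serial: concatenate the per-file results in list order.
    return [v for ds in datasets for v in ds["values"]]


paths = ["a.nc", "bb.nc", "ccc.nc"]

with ThreadPoolExecutor(max_workers=4) as pool:
    # Executor.map returns results in the same order as the input paths.
    datasets = list(pool.map(open_and_preprocess, paths))

combined = combine(datasets)
```

Bundling (1) and (2) into a single per-path function keeps the parallel part to one map call, and leaves the combine step untouched so its performance can be looked at separately.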

I'm curious what others think of this idea and what the potential pitfalls may be.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1981/reactions",
    "total_count": 2,
    "+1": 2,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
state_reason: completed
repo: 13221727
type: issue
