issues: 142498006

id: 142498006
node_id: MDU6SXNzdWUxNDI0OTgwMDY=
number: 798
title: Integration with dask/distributed (xarray backend design)
user: 4295853
state: closed
locked: 0
comments: 59
created_at: 2016-03-21T23:18:02Z
updated_at: 2019-01-13T04:12:32Z
closed_at: 2019-01-13T04:12:32Z
author_association: CONTRIBUTOR

Dask (https://github.com/dask/dask) currently provides on-node parallelism for medium-size data problems. However, analyzing large climate data sets is a big data problem that will require multiple-node parallelism. A likely solution to this issue is integration of distributed (https://github.com/dask/distributed) with dask. Distributed is now integrated with dask and its benefits are already starting to be realized, e.g., see http://matthewrocklin.com/blog/work/2016/02/26/dask-distributed-part-3.

Thus, this issue is designed to identify, at a high level, the steps needed to perform this integration. As stated by @shoyer, it will

"definitely require some refactoring of the xarray backend system to make this work cleanly, but that's OK -- the xarray backend system is indicated as experimental/internal API precisely because we hadn't figured out all the use cases yet."

To be honest, I've never been entirely happy with the design we took there (we use inheritance rather than composition for backend classes), but we did get it to work for our use cases. Some refactoring with an eye towards compatibility with dask distributed seems like a very worthwhile endeavor. We do have the benefit of a pretty large test suite covering existing use cases.

Thus, we have the chance to make xarray big-data capable as well as provide improvements to the backend.

To this end, I'm starting this issue to help begin the design process following the xarray mailing list discussion some of us have been having (@shoyer, @mrocklin, @rabernat).

Task To Do List:
- [x] Verify asynchronous access error for to_netcdf output is resolved (e.g., https://github.com/pydata/xarray/issues/793)
- [x] LRU-cached file IO supporting serialization to robustly support HDF/NetCDF reads
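
To make the second task item more concrete, here is a rough sketch of what "LRU-cached file IO supporting serialization" could look like. It is not xarray's implementation: the names `LRUFileCache`, `CachedFile`, and `CACHE` are hypothetical, and plain binary files stand in for HDF/NetCDF files. The idea is that a handle pickles only its path, so it can be shipped to a worker, while each process keeps a bounded set of open files and closes the least recently used one.

```python
from collections import OrderedDict


class LRUFileCache:
    """Keep at most `maxsize` files open; close the least recently used."""

    def __init__(self, maxsize=2):
        self.maxsize = maxsize
        self._open = OrderedDict()  # path -> open file object

    def acquire(self, path):
        if path in self._open:
            self._open.move_to_end(path)  # mark as most recently used
            return self._open[path]
        f = open(path, "rb")
        self._open[path] = f
        if len(self._open) > self.maxsize:
            # Evict and close the least recently used file.
            _, oldest = self._open.popitem(last=False)
            oldest.close()
        return f


# Hypothetical per-process cache; each worker process would own its own.
CACHE = LRUFileCache(maxsize=2)


class CachedFile:
    """Picklable handle: stores only the path, never the open file object."""

    def __init__(self, path):
        self.path = path

    def read(self):
        # Reopens the file transparently if it was evicted from the cache.
        f = CACHE.acquire(self.path)
        f.seek(0)
        return f.read()
```

Because a `CachedFile` carries only a path, `pickle.dumps(handle)` works even while the file is open in the cache, and the unpickled copy reopens the file on first read. A real backend would also need locking for threaded access and a way to pass open modes and keyword arguments, which this sketch leaves out.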

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/798/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
state_reason: completed
repo: 13221727
type: issue
