
Comment 255794868 on pydata/xarray#798
html_url: https://github.com/pydata/xarray/issues/798#issuecomment-255794868
issue_url: https://api.github.com/repos/pydata/xarray/issues/798
user: 1217238 (MEMBER)
created_at / updated_at: 2016-10-24T16:40:09Z

@mrocklin OK, that makes sense. In that case, we might indeed need to thread this through xarray's backends.

Currently, backends open a file (e.g., with netCDF4.Dataset) and create an OrderedDict of xarray.Variable objects with lazy arrays that load from the file on demand. To load this data with dask, we then pass these lazy arrays into dask.array.from_array.
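Roughly, the pattern looks something like this. This is just a minimal sketch, not our actual backend code: `LazilyIndexedArray` here is a simplified stand-in for xarray's internal class of the same name, and the file and variable names are made up.

```python
# A minimal sketch of the pattern described above (not xarray's actual
# backend code). `LazilyIndexedArray` is a simplified stand-in for
# xarray's internal class of the same name; the file and variable names
# are hypothetical, and the variable is assumed to be 2-dimensional.
import netCDF4
import dask.array as da


class LazilyIndexedArray:
    """Defer reading from the file until the array is actually indexed."""

    def __init__(self, variable):
        self._variable = variable  # an open netCDF4.Variable
        self.shape = variable.shape
        self.dtype = variable.dtype
        self.ndim = variable.ndim

    def __getitem__(self, key):
        # Only the requested slice is read from disk.
        return self._variable[key]


nc = netCDF4.Dataset("example.nc")  # hypothetical file
lazy = LazilyIndexedArray(nc.variables["temperature"])  # hypothetical variable

# da.from_array works with any object exposing shape/dtype/__getitem__,
# so wrapping the lazy array costs nothing until compute time.
darr = da.from_array(lazy, chunks=(100, 100))
```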

This currently doesn't use dask.delayed for three reasons:

1. Historical: we wrote this system before dask existed.
2. Performance: our LazilyIndexedArray class is still more selective than dask.array when subsetting data from large chunks, which is essential for many interactive use cases. Despite getitem fusing, dask will sometimes load complete chunks. This is particularly true if we apply some transformation to the array, of the sort that could be accomplished with dask's map_blocks; using LazilyIndexedArray ensures that the transformation only gets applied to data that is actually loaded (see the sketch after this list). There are also performance benefits to keeping files open when possible (discussed above).
3. Dependencies: dask is still an optional dependency for xarray, and I'd like to keep it that way if possible.
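To make point 2 concrete, here's a hedged continuation of the sketch above. Whether dask materializes a full chunk for a small getitem depends on its graph optimizations, so treat this as an illustration of the failure mode rather than guaranteed behavior:

```python
# Continuing the sketch above to illustrate point 2. Whether dask
# materializes whole chunks for a small getitem depends on its graph
# optimizations, so this shows the failure mode, not guaranteed
# behavior for every dask version.
transformed = darr + 273.15  # blockwise op, akin to map_blocks

# This may load and transform an entire 100x100 chunk just to extract
# one element, because the addition runs on whole blocks before the
# getitem is applied.
value = transformed[0, 0].compute()

# With lazy indexing, the subset is read from the file first, and the
# transformation touches only the small slice that was loaded.
subset_value = lazy[0:1, 0:1] + 273.15
```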

A version of xarray's backends that doesn't always open files immediately would be suitable for use with dask.distributed. So indeed, we'll need to do some serious refactoring.

One other thing that will need to be tackled eventually: xarray.merge and xarray.concat (used in open_mfdataset) still have some steps (checking arrays for equality) that are applied sequentially, once per object. This will become a performance bottleneck once we start working with very large arrays. These checks really should be refactored so that dask can evaluate them in a single step. For now, this can be avoided in concat by using the data_vars/coords options, as shown below.
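For example (a sketch with made-up file names; concat's data_vars/coords options control whether variables without the concatenation dimension get compared for equality):

```python
# A sketch of the workaround mentioned above; file names are made up.
# With data_vars='minimal' and coords='minimal', variables that lack
# the concatenation dimension are taken from the first dataset instead
# of being compared for equality across every input.
import xarray as xr

datasets = [xr.open_dataset(path) for path in ["part_0.nc", "part_1.nc"]]
combined = xr.concat(datasets, dim="time",
                     data_vars="minimal", coords="minimal")
```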
