
Comment 255794868 on pydata/xarray#798
html_url: https://github.com/pydata/xarray/issues/798#issuecomment-255794868
issue_url: https://api.github.com/repos/pydata/xarray/issues/798
user: 1217238 (MEMBER)
created_at / updated_at: 2016-10-24T16:40:09Z

@mrocklin OK, that makes sense. In that case, we might indeed need to thread this through xarray's backends.

Currently, backends open a file (e.g., with netCDF4.Dataset) and create an OrderedDict of xarray.Variable objects with lazy arrays that load from the file on demand. To load this data with dask, we then pass these lazy arrays into dask.array.from_array.
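Roughly, the pattern looks something like this. This is just a minimal sketch, not our actual backend code: `LazilyIndexedArray` here is a simplified stand-in for xarray's internal class of the same name, and the file and variable names are made up.

```python
# A minimal sketch of the pattern described above (not xarray's actual
# backend code). `LazilyIndexedArray` is a simplified stand-in for
# xarray's internal class of the same name; the file and variable names
# are hypothetical, and the variable is assumed to be 2-dimensional.
import netCDF4
import dask.array as da


class LazilyIndexedArray:
    """Defer reading from the file until the array is actually indexed."""

    def __init__(self, variable):
        self._variable = variable  # an open netCDF4.Variable
        self.shape = variable.shape
        self.dtype = variable.dtype
        self.ndim = variable.ndim

    def __getitem__(self, key):
        # Only the requested slice is read from disk.
        return self._variable[key]


nc = netCDF4.Dataset("example.nc")  # hypothetical file
lazy = LazilyIndexedArray(nc.variables["temperature"])  # hypothetical variable

# da.from_array works with any object exposing shape/dtype/__getitem__,
# so wrapping the lazy array costs nothing until compute time.
darr = da.from_array(lazy, chunks=(100, 100))
```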

This currently doesn't use dask.delayed for three reasons:

1. Historical: we wrote this system before dask existed.
2. Performance: our LazilyIndexedArray class is still more selective than dask.array when subsetting data from large chunks, which is essential for many interactive use cases. Despite getitem fusing, dask will sometimes load complete chunks. This is particularly true if we apply some transformation to the array, of the sort that could be accomplished with dask's map_blocks; using LazilyIndexedArray ensures that the transformation only gets applied to data that is actually loaded (see the sketch after this list). There are also performance benefits to keeping files open when possible (discussed above).
3. Dependencies: dask is still an optional dependency for xarray, and I'd like to keep it that way if possible.
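To make point 2 concrete, here's a hedged continuation of the sketch above. Whether dask materializes a full chunk for a small getitem depends on its graph optimizations, so treat this as an illustration of the failure mode rather than guaranteed behavior:

```python
# Continuing the sketch above to illustrate point 2. Whether dask
# materializes whole chunks for a small getitem depends on its graph
# optimizations, so this shows the failure mode, not guaranteed
# behavior for every dask version.
transformed = darr + 273.15  # blockwise op, akin to map_blocks

# This may load and transform an entire 100x100 chunk just to extract
# one element, because the addition runs on whole blocks before the
# getitem is applied.
value = transformed[0, 0].compute()

# With lazy indexing, the subset is read from the file first, and the
# transformation touches only the small slice that was loaded.
subset_value = lazy[0:1, 0:1] + 273.15
```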

A version of xarray's backends that doesn't always open files immediately would be suitable for use with dask.distributed. So indeed, we'll need to do some serious refactoring.

One other thing that will need to be tackled eventually: xarray.merge and xarray.concat (used in open_mfdataset) still have some steps (checking arrays for equality) that are applied sequentially, once per object. This will become a performance bottleneck once we start working with very large arrays. These checks really should be refactored so that dask can evaluate them in a single step. For now, this can be avoided in concat by using the data_vars/coords options, as shown below.
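For example (a sketch with made-up file names; concat's data_vars/coords options control whether variables without the concatenation dimension get compared for equality):

```python
# A sketch of the workaround mentioned above; file names are made up.
# With data_vars='minimal' and coords='minimal', variables that lack
# the concatenation dimension are taken from the first dataset instead
# of being compared for equality across every input.
import xarray as xr

datasets = [xr.open_dataset(path) for path in ["part_0.nc", "part_1.nc"]]
combined = xr.concat(datasets, dim="time",
                     data_vars="minimal", coords="minimal")
```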
