issue_comments: 255286001

html_url	issue_url	id	node_id	user	created_at	updated_at	author_association	body	reactions	performed_via_github_app	issue
https://github.com/pydata/xarray/pull/1024#issuecomment-255286001	https://api.github.com/repos/pydata/xarray/issues/1024	255286001	MDEyOklzc3VlQ29tbWVudDI1NTI4NjAwMQ==	1217238	2016-10-21T03:36:01Z	2016-10-21T03:36:01Z	MEMBER	I've been thinking about this... Maybe the simple, clean solution is to simply invoke compute() on all coords as soon as they are assigned to the DataArray / Dataset? I'm nervous about eager loading, especially for non-index coordinates. They can have more than one dimension, and thus can contain a lot of data. So eagerly loading non-index coordinates could potentially break existing use cases. On the other hand, non-index coordinates are indeed checked for equality in most xarray operations (e.g., for the coordinate merge in align). So it is indeed useful not to have to recompute them all the time. Even eagerly loading indexes is potentially problematic, if loading the index values is expensive. So I'm conflicted: - I like the current caching behavior for `coords` and `indexes` - But I also want to avoid implicit conversions from dask to numpy, which are problematic for all the reasons you pointed out earlier I'm going to start throwing out ideas for how to deal with this: Option A Add two new (public?) methods, something like `.load_coords()` and `.load_indexes()`. We would then insert calls to these methods at the start of each function that uses coordinates: - `.load_indexes()`: `reindex`, `reindex_like`, `align` and `sel` - `.load_coords()`: `merge` and anything that calls the functions in `core/merge.py` (this indirectly includes `Dataset.__init__` and `Dataset.__setitem__`) Hypothetically, we could even have options for turning this caching systematically on/off (e.g., `with xarray.set_options(cache_coords=False, cache_indexes=True): ...`). Your proposal is basically an extreme version of this, where we call `.load_coords()` immediately after constructing every new object. 
Advantages: - It's fairly predictable when caching happens (especially if we opt for calling `.load_coords()` immediately, as you propose). - Computing variables is all done at once, which is much more performant than what we currently do, e.g., loading variables as needed for `.equals()` checks in `merge_variables` one at a time. Downsides: - Caching is more aggressive than necessary -- we cache indexes even if that coord isn't actually indexed. Option B Like Option A, but somehow infer the full set of variables that need to be cached (e.g., in a `.merge()` operation) before it's actually done. This seems hard, but may be possible using a variation of `merge_variables`. This solves the downside of A, but diminishes the predictability. We're basically back to how things work now. Option C Cache dask.array in `IndexVariable` but not `Variable`. This preserves performance for repeated indexing, because the hash table behind the `pandas.Index` doesn't get thrown away. Advantages: - Much simpler and easier to implement than the alternatives. - Implicit conversions are greatly diminished. Downsides: - Non-index coordinates get thrown away after being evaluated once. If you're doing lots of operations of the form `[ds + other for ds in datasets]` where `ds` and `other` have conflicting coordinates this would probably make you unhappy. Option D Load the contents of an `IndexVariable` immediately and eagerly. They no longer cache data or use lazy loading. This has the most predictable performance, but might cause trouble for some edge use cases? I need to think about this a little more, but right now I am leaning towards Option C or D.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }		180451196
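
To make the trade-off in Option C concrete, here is a minimal sketch of "cache in `IndexVariable` but not in `Variable`". It is an illustration only, not xarray's implementation: `LazyArray` is a hypothetical stand-in for a dask array that counts how often it is computed, and the two classes below merely borrow the names `Variable` and `IndexVariable` from the comment.

```python
class LazyArray:
    """Hypothetical stand-in for a dask array; counts compute() calls."""

    def __init__(self, values):
        self._values = list(values)
        self.compute_calls = 0

    def compute(self):
        self.compute_calls += 1
        return list(self._values)


class Variable:
    """Toy variable that never caches: every access recomputes the data."""

    def __init__(self, data):
        self._data = data

    @property
    def values(self):
        return self._data.compute()


class IndexVariable(Variable):
    """Toy index variable that keeps the first computed result."""

    def __init__(self, data):
        super().__init__(data)
        self._cache = None

    @property
    def values(self):
        if self._cache is None:
            self._cache = self._data.compute()
        return self._cache


coord_data = LazyArray([10, 20, 30])
index_data = LazyArray([0, 1, 2])
coord = Variable(coord_data)
index = IndexVariable(index_data)

# Access each variable three times, as repeated operations would.
for _ in range(3):
    coord.values
    index.values

print(coord_data.compute_calls, index_data.compute_calls)  # prints "3 1"
```

The non-index coordinate is recomputed on every access, which is the downside listed for Option C, while the index is computed once and reused for later lookups.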