html_url,issue_url,id,node_id,user,created_at,updated_at,author_association,body,reactions,performed_via_github_app,issue https://github.com/pydata/xarray/pull/1024#issuecomment-260436147,https://api.github.com/repos/pydata/xarray/issues/1024,260436147,MDEyOklzc3VlQ29tbWVudDI2MDQzNjE0Nw==,6213168,2016-11-14T19:28:38Z,2016-11-14T19:28:38Z,MEMBER,"Happy to contribute! On 14 Nov 2016 16:58, ""Stephan Hoyer"" notifications@github.com wrote: > Thanks for your patience! This is a nice improvement. > > I have an idea for a variation that might make for a cleaner (less dask > specific) way to handle the remaining caching logic -- I'll add you as a > reviewer on that PR. > > — > You are receiving this because you were mentioned. > Reply to this email directly, view it on GitHub > https://github.com/pydata/xarray/pull/1024#issuecomment-260393416, or mute > the thread > https://github.com/notifications/unsubscribe-auth/AF7OML0y1xCCfg4j0o0OUxMJ8gBpEIB1ks5q-JMYgaJpZM4KLurN > . ","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,180451196 https://github.com/pydata/xarray/pull/1024#issuecomment-260202467,https://api.github.com/repos/pydata/xarray/issues/1024,260202467,MDEyOklzc3VlQ29tbWVudDI2MDIwMjQ2Nw==,6213168,2016-11-13T18:21:26Z,2016-11-13T18:21:26Z,MEMBER,"Changed to cache IndexVariable._data on __init__. Please review... 
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,180451196 https://github.com/pydata/xarray/pull/1024#issuecomment-260137073,https://api.github.com/repos/pydata/xarray/issues/1024,260137073,MDEyOklzc3VlQ29tbWVudDI2MDEzNzA3Mw==,6213168,2016-11-12T17:47:10Z,2016-11-12T17:47:10Z,MEMBER,"Finished and waiting for code review ","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,180451196 https://github.com/pydata/xarray/pull/1024#issuecomment-258647829,https://api.github.com/repos/pydata/xarray/issues/1024,258647829,MDEyOklzc3VlQ29tbWVudDI1ODY0NzgyOQ==,6213168,2016-11-05T22:56:27Z,2016-11-05T22:56:27Z,MEMBER,"roger that, getting to work :) ","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,180451196 https://github.com/pydata/xarray/pull/1024#issuecomment-256114879,https://api.github.com/repos/pydata/xarray/issues/1024,256114879,MDEyOklzc3VlQ29tbWVudDI1NjExNDg3OQ==,6213168,2016-10-25T18:11:35Z,2016-10-25T18:11:35Z,MEMBER,"Hi Stephan, thank you for your thoughts. IMHO option D is the cleanest and safest. Could you come up with any example where it may be problematic? On 21 Oct 2016 4:36 am, ""Stephan Hoyer"" notifications@github.com wrote: > I've been thinking about this... Maybe the simple, clean solution is to > simply invoke compute() on all coords as soon as they are assigned to the > DataArray / Dataset? > > I'm nervous about eager loading, especially for non-index coordinates. > They can have more than one dimension, and thus can contain a lot of data. > So potentially eagerly loading non-index coordinates could break existing > use cases. > > On the other hand, non-index coordinates are indeed checked for equality in > most xarray operations (e.g., for the coordinate merge in align). 
So it is > indeed useful not to have to recompute them all the time. > > Even eagerly loading indexes is potentially problematic, if loading the > index values is expensive. > > So I'm conflicted: > - I like the current caching behavior for coords and indexes > - But I also want to avoid implicit conversions from dask to numpy, > which is problematic for all the reasons you pointed out earlier > > I'm going to start throwing out ideas for how to deal with this: > Option A > > Add two new (public?) methods, something like .load_coords() and > .load_indexes(). We would then insert calls to these methods at the start > of each function that uses coordinates: > - .load_indexes(): reindex, reindex_like, align and sel > - .load_coords(): merge and anything that calls the functions in > core/merge.py (this indirectly includes Dataset.__init__ and > Dataset.__setitem__) > > Hypothetically, we could even have options for turning this caching > systematically on/off (e.g., with xarray.set_options(cache_coords=False, > cache_indexes=True): ...). > > Your proposal is basically an extreme version of this, where we call > .load_coords() immediately after constructing every new object. > > Advantages: > - It's fairly predictable when caching happens (especially if we opt > for calling .load_coords() immediately, as you propose). > - Computing variables is all done at once, which is much more > performant than what we currently do, e.g., loading variables as needed for > .equals() checks in merge_variables one at a time. > > Downsides: > - Caching is more aggressive than necessary -- we cache indexes even > if that coord isn't actually indexed. > > Option B > > Like Option A, but somehow infer the full set of variables that need to be > cached (e.g., in a .merge() operation) before it's actually done. This > seems hard, but maybe is possible using a variation of merge_variables. > > This solves the downside of A, but diminishes the predictability. 
We're > basically back to how things work now. > Option C > > Cache dask.array in IndexVariable but not Variable. This preserves > performance for repeated indexing, because the hash table behind the > pandas.Index doesn't get thrown away. > > Advantages: > - Much simpler and easier to implement than the alternatives. > - Implicit conversions are greatly diminished. > > Downsides: > - Non-index coordinates get thrown away after being evaluated once. If > you're doing lots of operations of the form [ds + other for ds in > datasets] where ds and other have conflicting coordinates this would > probably make you unhappy. > > Option D > > Load the contents of an IndexVariable immediately and eagerly. They no > longer cache data or use lazy loading. > > This has the most predictable performance, but might cause trouble for > some edge use cases? > > I need to think about this a little more, but right now I am leaning > towards Option C or D. > > — > You are receiving this because you authored the thread. > Reply to this email directly, view it on GitHub > https://github.com/pydata/xarray/pull/1024#issuecomment-255286001, or mute > the thread > https://github.com/notifications/unsubscribe-auth/AF7OMLBh4eDuKRNv0x5HwRie_yaGh0Yzks5q2DMjgaJpZM4KLurN > . ","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,180451196 https://github.com/pydata/xarray/pull/1024#issuecomment-255066352,https://api.github.com/repos/pydata/xarray/issues/1024,255066352,MDEyOklzc3VlQ29tbWVudDI1NTA2NjM1Mg==,6213168,2016-10-20T10:14:30Z,2016-10-20T10:14:30Z,MEMBER,"ping - how do you prefer me to proceed? 
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,180451196 https://github.com/pydata/xarray/pull/1024#issuecomment-253133290,https://api.github.com/repos/pydata/xarray/issues/1024,253133290,MDEyOklzc3VlQ29tbWVudDI1MzEzMzI5MA==,6213168,2016-10-12T06:49:23Z,2016-10-12T06:49:23Z,MEMBER,"I've been thinking about this... Maybe the simple, clean solution is to simply invoke compute() on all coords as soon as they are assigned to the DataArray / Dataset? On 12 Oct 2016 02:18, ""Stephan Hoyer"" notifications@github.com wrote: > _@shoyer_ commented on this pull request. > > Apologies for the delay here -- my comments were stuck as a ""pending"" > GitHub review. > > I am still wondering what the right behavior is for variables used as > indexes. (These can be dask arrays, too.) > > I think there is a good case for skipping these variables in .chunk(), > but we probably _do_ want to make indexes still cache as pandas.Index > objects, because otherwise repeated evaluation of dask arrays to build the > index for alignment or indexing gets expensive. > > In xarray/core/dataset.py > https://github.com/pydata/xarray/pull/1024#pullrequestreview-3794240: > > > @@ -792,13 +806,19 @@ def chunks(self): > > array. > > """""" > > chunks = {} > > - for v in self.variables.values(): > > - for v in self.data_vars.values(): > > I am concerned about skipping non-data_vars here. Coordinates could still > be chunked, e.g., if they were loaded from a file, or created directly from > dask arrays. 
> > In xarray/core/dataset.py > https://github.com/pydata/xarray/pull/1024#pullrequestreview-3794240: > > > ``` > > if v.chunks is not None: > > new_chunks = list(zip(v.dims, v.chunks)) > > if any(chunk != chunks[d] for d, chunk in new_chunks > > if d in chunks): > > raise ValueError('inconsistent chunks') > > chunks.update(new_chunks) > > ``` > > - if chunks: > > I guess this method is inconsistent with Variable.chunks, but it > currently always returns a dict. > > I would either skip this change or use something like my version. > > In xarray/core/dataset.py > https://github.com/pydata/xarray/pull/1024#pullrequestreview-3794240: > > > @@ -851,6 +871,9 @@ def selkeys(dict_, keys): > > return dict((d, dict_[d]) for d in keys if d in dict_) > > ``` > def maybe_chunk(name, var, chunks): > ``` > - if name not in self.data_vars: > > I see your point about performance, but I think that mostly holds true for > indexes. So I would be inclined to adjust this to only skip variables in > self.dims (aka indexes used for alignment). > > I am still concerned about skipping coords if they are already dask > arrays. If they are already dask arrays, then .chunk() should probably > adjust their chunks anyways. > > — > You are receiving this because you authored the thread. > Reply to this email directly, view it on GitHub > https://github.com/pydata/xarray/pull/1024#pullrequestreview-3794240, > or mute the thread > https://github.com/notifications/unsubscribe-auth/AF7OMBL7_F2IV5P04Em8NhPy-K8aNrGZks5qzDVlgaJpZM4KLurN > . 
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,180451196 https://github.com/pydata/xarray/pull/1024#issuecomment-252033741,https://api.github.com/repos/pydata/xarray/issues/1024,252033741,MDEyOklzc3VlQ29tbWVudDI1MjAzMzc0MQ==,6213168,2016-10-06T17:34:01Z,2016-10-06T17:34:01Z,MEMBER,"I can't reproduce the above failure, test.test_conventions.TestEncodeCFVariable.testMethod=test_missing_fillvalue. I suspect it might be a random failure, because - it used to succeed until my latest commit, which exclusively changed the readme - it succeeds on python 3 ","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,180451196 https://github.com/pydata/xarray/pull/1024#issuecomment-251983000,https://api.github.com/repos/pydata/xarray/issues/1024,251983000,MDEyOklzc3VlQ29tbWVudDI1MTk4MzAwMA==,6213168,2016-10-06T14:39:47Z,2016-10-06T14:39:47Z,MEMBER,"I added the disclaimer in the release notes. Is there any other outstanding issue? Thanks ","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,180451196 https://github.com/pydata/xarray/pull/1024#issuecomment-251209654,https://api.github.com/repos/pydata/xarray/issues/1024,251209654,MDEyOklzc3VlQ29tbWVudDI1MTIwOTY1NA==,6213168,2016-10-03T19:58:53Z,2016-10-03T20:31:48Z,MEMBER,"What happened before this PR was that all coords were blindly converted to dask on chunk(). Then, the first time anything invoked the values property, e.g. something as simple as `DataArray.__str__`, they were silently converted back to numpy. It wasn't _easy_ to accidentally get them in dask format; in fact no unit test noticed before my last commit. If you deliberately use a dask array as a coord, it won't be converted to numpy. However I can't think of any reason why anybody would want to do it in practice. 
I'll add it to the breaking changes because, if somebody did do the above, the performance of their program will degrade with this release, as their coord risks being evaluated multiple times. ","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,180451196 https://github.com/pydata/xarray/pull/1024#issuecomment-250922373,https://api.github.com/repos/pydata/xarray/issues/1024,250922373,MDEyOklzc3VlQ29tbWVudDI1MDkyMjM3Mw==,6213168,2016-10-01T16:41:04Z,2016-10-01T17:37:39Z,MEMBER,"Well, crud. This introduces a regression where DataArray.chunk() converts the data _and the coords_ to dask. This becomes enormously problematic later on as pretty much nothing expects a dask-based coord. [edit] fixed below ","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,180451196