
Comment [412177726](https://github.com/pydata/xarray/issues/2159#issuecomment-412177726) on [pydata/xarray#2159](https://api.github.com/repos/pydata/xarray/issues/2159), posted 2018-08-10 by a repo MEMBER.

I've been looking through the functions `open_mfdataset`, `auto_combine`, `_auto_concat` and `concat` to see how one might go about achieving this in general.

The current behaviour isn't completely explicit, and I would like to check my understanding with a few questions:

1) If you concatenate two datasets along a dimension which doesn't have a dimension coordinate, then `concat` has no way to know what order to concatenate them in, so it just uses the order in which they were provided?
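To make question 1 concrete, here is a toy sketch (plain Python lists standing in for datasets — this is just my mental model of the behaviour, not xarray's actual code): without a dimension coordinate there is no ordering information to exploit, so the result can only follow the input order.

```python
# Toy model: each "dataset" is just a list of values along the concat
# dimension. With no dimension coordinate there is nothing to sort by,
# so naive concatenation can only preserve the order the pieces were
# given in.
def naive_concat(datasets):
    result = []
    for ds in datasets:
        result.extend(ds)  # append in the order provided
    return result

piece_a = [0, 1, 2, 3]  # first half of the dimension
piece_b = [4, 5, 6, 7]  # second half

print(naive_concat([piece_a, piece_b]))  # correct order: [0, 1, 2, 3, 4, 5, 6, 7]
print(naive_concat([piece_b, piece_a]))  # wrong order:   [4, 5, 6, 7, 0, 1, 2, 3]
```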

2) Although `auto_combine` can determine the common dimension to concatenate datasets over, it doesn't know anything about insertion order! Even if the datasets have dimension coordinates, the line

```python
grouped = itertoolz.groupby(lambda ds: tuple(sorted(ds.data_vars)), datasets).values()
```

will only organise the datasets into groups according to the set of data variables they have; it doesn't order the datasets within each group according to the values in their dimension coordinates?

We can show this because the following (new) test case fails:

```python
@requires_dask
def test_auto_combine_along_coords(self):
    # drop the third dimension to keep things relatively understandable
    data = create_test_data()
    for k in list(data.variables):
        if 'dim3' in data[k].dims:
            del data[k]

    data_split1 = data.isel(dim2=slice(4))
    data_split2 = data.isel(dim2=slice(4, None))
    split_data = [data_split2, data_split1]  # deliberately arrange datasets in the wrong order
    assert_identical(data, auto_combine(split_data, 'dim2'))
```

with output

```
E   AssertionError: <xarray.Dataset>
E   Dimensions:  (dim1: 8, dim2: 9, dim3: 10, time: 20)
E   Coordinates:
E     * time     (time) datetime64[ns] 2000-01-01 2000-01-02 ... 2000-01-20
E     * dim2     (dim2) float64 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0
E   Dimensions without coordinates: dim1, dim3
E   Data variables:
E       var1     (dim1, dim2) float64 1.473 1.363 -1.192 ... 0.2341 -0.3403 0.405
E       var2     (dim1, dim2) float64 -0.7952 0.7566 0.2468 ... -0.6822 1.455 0.7314
E   <xarray.Dataset>
E   Dimensions:  (dim1: 8, dim2: 9, time: 20)
E   Coordinates:
E     * time     (time) datetime64[ns] 2000-01-01 2000-01-02 ... 2000-01-20
E     * dim2     (dim2) float64 2.0 2.5 3.0 3.5 4.0 0.0 0.5 1.0 1.5
E   Dimensions without coordinates: dim1
E   Data variables:
E       var1     (dim1, dim2) float64 1.496 -1.834 -0.6588 ... 1.326 0.6805 -0.2999
E       var2     (dim1, dim2) float64 0.7926 -1.063 0.1062 ... -0.447 -0.8955
```
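To see in miniature why the `groupby` call loses ordering information, here is a toy reimplementation (the dict "datasets" and their fields are hypothetical stand-ins for real xarray objects, and the `groupby` below just mimics `itertoolz.groupby` for illustration):

```python
from collections import defaultdict

# Minimal stand-in for itertoolz.groupby: buckets items by key, preserving
# the order in which items arrive within each bucket.
def groupby(key, seq):
    groups = defaultdict(list)
    for item in seq:
        groups[key(item)].append(item)
    return dict(groups)

# Toy "datasets": just their data variable names plus 'dim2' coordinate values.
ds_second = {'data_vars': ('var1', 'var2'), 'dim2': [2.0, 2.5, 3.0]}
ds_first = {'data_vars': ('var1', 'var2'), 'dim2': [0.0, 0.5, 1.0]}

grouped = groupby(lambda ds: tuple(sorted(ds['data_vars'])), [ds_second, ds_first])

# Within the single group, the datasets keep their input order -- the
# coordinate values are never consulted:
print([ds['dim2'][0] for group in grouped.values() for ds in group])  # [2.0, 0.0]
```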

3) So the call to `_auto_concat` just assumes that the datasets are provided in the correct order:

```python
concatenated = [_auto_concat(ds, dim=dim, data_vars=data_vars, coords=coords)
                for ds in grouped]
```

4) Therefore what needs to be done here is to replace the `groupby` call with something that actually orders the datasets according to the values in their dimension coordinates, works in N dimensions, and outputs a structure of datasets on which `_auto_concat` can be called repeatedly, once along each concatenation dimension?
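As a rough sketch of what such an ordering step might look like in 1-D (toy dicts again; a real implementation would need to handle N dimensions and actual xarray objects, so take this only as the shape of the idea):

```python
# Hypothetical 1-D ordering step: sort the datasets in each group by the
# first value of their dimension coordinate before concatenating.
def order_by_coord(datasets, dim):
    return sorted(datasets, key=lambda ds: ds[dim][0])

split_data = [
    {'dim2': [2.0, 2.5, 3.0, 3.5, 4.0]},  # second chunk, supplied first
    {'dim2': [0.0, 0.5, 1.0, 1.5]},       # first chunk, supplied second
]

ordered = order_by_coord(split_data, 'dim2')
print([ds['dim2'][0] for ds in ordered])  # [0.0, 2.0]
```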

Also, `concat` has a `positions` argument, which allows you to manually specify the concatenation order, but it isn't used at all by `auto_combine`. In the main use case imagined here (concatenating the domains of domain-decomposed parallel simulation output), the user will know the desired position of each dataset, because it corresponds to how they divided up their domain in the first place. Perhaps an easier way to provide for that use case would be to propagate the `positions` argument upwards so that the user can do something like

```python
# User specifies how they split up their domain
domain_decomposition_structure = how_was_this_parallelized('output.*.nc')

# Feed this info into open_mfdataset
full_domain = xr.open_mfdataset('output.*.nc', positions=domain_decomposition_structure)
```

This approach would be much less general, but it would dodge the issue of writing generalized N-D auto-concatenation logic.
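The reordering step a propagated `positions` argument would drive can be sketched like this (a simplified flat-index convention is assumed purely for illustration — xarray's actual `positions` argument to `concat` is richer, and the helper name is made up):

```python
# Simplified sketch: positions[i] gives the target slot of datasets[i]
# along the concatenated dimension (flat-index convention assumed here
# for illustration only).
def reorder_by_positions(datasets, positions):
    ordered = [None] * len(datasets)
    for ds, pos in zip(datasets, positions):
        ordered[pos] = ds
    return ordered

pieces = ['chunk_b', 'chunk_a', 'chunk_c']  # arrival order from the glob
positions = [1, 0, 2]                       # user knows the true layout

print(reorder_by_positions(pieces, positions))  # ['chunk_a', 'chunk_b', 'chunk_c']
```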

Final point: this common use case also has the added complexity of having ghost (or guard) cells around every dataset, which should be thrown away. Clearly some user input is required here (`ghost_cells_x=2, ghost_cells_y=2, ghost_cells_z=0, ...`), but I'm really not sure of the best way to fit that kind of logic in. Yet more arguments to `open_mfdataset`?
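The trimming itself is just slicing; a toy 1-D sketch (lists standing in for per-process chunks, with a single made-up `ghost` parameter rather than the per-axis arguments suggested above):

```python
# Toy 1-D domain: each chunk carries `ghost` overlap cells on each end,
# except at the physical boundaries of the full domain.
def trim_ghosts(chunks, ghost):
    trimmed = []
    for i, chunk in enumerate(chunks):
        start = ghost if i > 0 else 0                              # no ghosts at left boundary
        stop = -ghost if (i < len(chunks) - 1 and ghost) else None  # none at right boundary
        trimmed.append(chunk[start:stop])
    return trimmed

# Full domain 0..9 split into three chunks with one ghost cell of overlap:
chunks = [[0, 1, 2, 3, 4], [3, 4, 5, 6, 7], [6, 7, 8, 9]]

print(sum(trim_ghosts(chunks, ghost=1), []))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```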
