
Comment [412177726](https://github.com/pydata/xarray/issues/2159#issuecomment-412177726) on [pydata/xarray#2159](https://api.github.com/repos/pydata/xarray/issues/2159), posted 2018-08-10 by a repo MEMBER.

I've been looking through the functions `open_mfdataset`, `auto_combine`, `_auto_concat` and `concat` to see how one might go about achieving this in general.

The current behaviour isn't completely explicit, and I would like to check my understanding with a few questions:

1) If you concatenate two datasets along a dimension which doesn't have a dimension coordinate, then `concat` has no way to know what order to concatenate them in, so it just uses the order in which they were provided?
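To make question 1 concrete, here is a toy sketch (plain Python lists standing in for datasets — this is just my mental model of the behaviour, not xarray's actual code): without a dimension coordinate there is no ordering information to exploit, so the result can only follow the input order.

```python
# Toy model: each "dataset" is just a list of values along the concat
# dimension. With no dimension coordinate there is nothing to sort by,
# so naive concatenation can only preserve the order the pieces were
# given in.
def naive_concat(datasets):
    result = []
    for ds in datasets:
        result.extend(ds)  # append in the order provided
    return result

piece_a = [0, 1, 2, 3]  # first half of the dimension
piece_b = [4, 5, 6, 7]  # second half

print(naive_concat([piece_a, piece_b]))  # correct order: [0, 1, 2, 3, 4, 5, 6, 7]
print(naive_concat([piece_b, piece_a]))  # wrong order:   [4, 5, 6, 7, 0, 1, 2, 3]
```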

2) Although `auto_combine` can determine the common dimension to concatenate datasets over, it doesn't know anything about insertion order! Even if the datasets have dimension coordinates, the line

```python
grouped = itertoolz.groupby(lambda ds: tuple(sorted(ds.data_vars)), datasets).values()
```

will only organise the datasets into groups according to the set of data variables they have; it doesn't order the datasets within each group according to the values in their dimension coordinates?

We can show this because the following (new) test case fails:

```python
@requires_dask
def test_auto_combine_along_coords(self):
    # drop the third dimension to keep things relatively understandable
    data = create_test_data()
    for k in list(data.variables):
        if 'dim3' in data[k].dims:
            del data[k]

    data_split1 = data.isel(dim2=slice(4))
    data_split2 = data.isel(dim2=slice(4, None))
    split_data = [data_split2, data_split1]  # deliberately arrange datasets in the wrong order
    assert_identical(data, auto_combine(split_data, 'dim2'))
```

with output

```
E   AssertionError: <xarray.Dataset>
E   Dimensions:  (dim1: 8, dim2: 9, dim3: 10, time: 20)
E   Coordinates:
E     * time     (time) datetime64[ns] 2000-01-01 2000-01-02 ... 2000-01-20
E     * dim2     (dim2) float64 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0
E   Dimensions without coordinates: dim1, dim3
E   Data variables:
E       var1     (dim1, dim2) float64 1.473 1.363 -1.192 ... 0.2341 -0.3403 0.405
E       var2     (dim1, dim2) float64 -0.7952 0.7566 0.2468 ... -0.6822 1.455 0.7314
E   <xarray.Dataset>
E   Dimensions:  (dim1: 8, dim2: 9, time: 20)
E   Coordinates:
E     * time     (time) datetime64[ns] 2000-01-01 2000-01-02 ... 2000-01-20
E     * dim2     (dim2) float64 2.0 2.5 3.0 3.5 4.0 0.0 0.5 1.0 1.5
E   Dimensions without coordinates: dim1
E   Data variables:
E       var1     (dim1, dim2) float64 1.496 -1.834 -0.6588 ... 1.326 0.6805 -0.2999
E       var2     (dim1, dim2) float64 0.7926 -1.063 0.1062 ... -0.447 -0.8955
```
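To see in miniature why the `groupby` call loses ordering information, here is a toy reimplementation (the dict "datasets" and their fields are hypothetical stand-ins for real xarray objects, and the `groupby` below just mimics `itertoolz.groupby` for illustration):

```python
from collections import defaultdict

# Minimal stand-in for itertoolz.groupby: buckets items by key, preserving
# the order in which items arrive within each bucket.
def groupby(key, seq):
    groups = defaultdict(list)
    for item in seq:
        groups[key(item)].append(item)
    return dict(groups)

# Toy "datasets": just their data variable names plus 'dim2' coordinate values.
ds_second = {'data_vars': ('var1', 'var2'), 'dim2': [2.0, 2.5, 3.0]}
ds_first = {'data_vars': ('var1', 'var2'), 'dim2': [0.0, 0.5, 1.0]}

grouped = groupby(lambda ds: tuple(sorted(ds['data_vars'])), [ds_second, ds_first])

# Within the single group, the datasets keep their input order -- the
# coordinate values are never consulted:
print([ds['dim2'][0] for group in grouped.values() for ds in group])  # [2.0, 0.0]
```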

3) So the call to `_auto_concat` just assumes that the datasets are provided in the correct order:

```python
concatenated = [_auto_concat(ds, dim=dim, data_vars=data_vars, coords=coords)
                for ds in grouped]
```

4) Therefore what needs to be done here is to replace the `groupby` call with something that actually orders the datasets according to the values in their dimension coordinates, works in N dimensions, and outputs a structure of datasets on which `_auto_concat` can be called repeatedly, once along each concatenation dimension?
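As a rough sketch of what such an ordering step might look like in 1-D (toy dicts again; a real implementation would need to handle N dimensions and actual xarray objects, so take this only as the shape of the idea):

```python
# Hypothetical 1-D ordering step: sort the datasets in each group by the
# first value of their dimension coordinate before concatenating.
def order_by_coord(datasets, dim):
    return sorted(datasets, key=lambda ds: ds[dim][0])

split_data = [
    {'dim2': [2.0, 2.5, 3.0, 3.5, 4.0]},  # second chunk, supplied first
    {'dim2': [0.0, 0.5, 1.0, 1.5]},       # first chunk, supplied second
]

ordered = order_by_coord(split_data, 'dim2')
print([ds['dim2'][0] for ds in ordered])  # [0.0, 2.0]
```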

Also, `concat` has a `positions` argument, which allows you to manually specify the concatenation order, but it isn't used at all by `auto_combine`. In the main use case imagined here (concatenating the domains of domain-decomposed parallel simulation output), the user will know the desired position of each dataset, because it corresponds to how they divided up their domain in the first place. Perhaps an easier way to provide for that use case would be to propagate the `positions` argument upwards so that the user can do something like

```python
# User specifies how they split up their domain
domain_decomposition_structure = how_was_this_parallelized('output.*.nc')

# Feed this info into open_mfdataset
full_domain = xr.open_mfdataset('output.*.nc', positions=domain_decomposition_structure)
```

This approach would be much less general, but it would dodge the issue of writing generalized N-D auto-concatenation logic.
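The reordering step a propagated `positions` argument would drive can be sketched like this (a simplified flat-index convention is assumed purely for illustration — xarray's actual `positions` argument to `concat` is richer, and the helper name is made up):

```python
# Simplified sketch: positions[i] gives the target slot of datasets[i]
# along the concatenated dimension (flat-index convention assumed here
# for illustration only).
def reorder_by_positions(datasets, positions):
    ordered = [None] * len(datasets)
    for ds, pos in zip(datasets, positions):
        ordered[pos] = ds
    return ordered

pieces = ['chunk_b', 'chunk_a', 'chunk_c']  # arrival order from the glob
positions = [1, 0, 2]                       # user knows the true layout

print(reorder_by_positions(pieces, positions))  # ['chunk_a', 'chunk_b', 'chunk_c']
```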

Final point: this common use case also has the added complexity of having ghost (or guard) cells around every dataset, which should be thrown away. Clearly some user input is required here (`ghost_cells_x=2, ghost_cells_y=2, ghost_cells_z=0, ...`), but I'm really not sure of the best way to fit that kind of logic in. Yet more arguments to `open_mfdataset`?
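The trimming itself is just slicing; a toy 1-D sketch (lists standing in for per-process chunks, with a single made-up `ghost` parameter rather than the per-axis arguments suggested above):

```python
# Toy 1-D domain: each chunk carries `ghost` overlap cells on each end,
# except at the physical boundaries of the full domain.
def trim_ghosts(chunks, ghost):
    trimmed = []
    for i, chunk in enumerate(chunks):
        start = ghost if i > 0 else 0                              # no ghosts at left boundary
        stop = -ghost if (i < len(chunks) - 1 and ghost) else None  # none at right boundary
        trimmed.append(chunk[start:stop])
    return trimmed

# Full domain 0..9 split into three chunks with one ghost cell of overlap:
chunks = [[0, 1, 2, 3, 4], [3, 4, 5, 6, 7], [6, 7, 8, 9]]

print(sum(trim_ghosts(chunks, ghost=1), []))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```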
