pull_requests: 229885276
field | value
---|---
id | 229885276
node_id | MDExOlB1bGxSZXF1ZXN0MjI5ODg1Mjc2
number | 2553
state | closed
locked | 0
title | Feature: N-dimensional auto_combine
user | 35968931
created_at | 2018-11-10T11:40:48Z
updated_at | 2018-12-13T17:16:16Z
closed_at | 2018-12-13T17:15:57Z
merged_at | 2018-12-13T17:15:56Z
merge_commit_sha | 9e8707d2041cfa038c31fc2284c1fe40bc3368e9
draft | 0
head | ebbe47f450ed4407655bd9a4ed45274b140452dd
base | 0d6056e8816e3d367a64f36c7f1a5c4e1ce4ed4e
author_association | MEMBER
repo | 13221727
url | https://github.com/pydata/xarray/pull/2553

body:

### What I did

Generalised the `auto_combine()` function so that it can concatenate and merge datasets along any number of dimensions, instead of just one. This provides one solution to #2159 and is relevant to the discussion in #2039.

Currently it cannot deduce the order in which datasets should be concatenated along any one dimension from the coordinates, so it just concatenates them in the order they are supplied. This means that for an N-D concatenation the datasets have to be supplied as a list of lists, nested as many times as there are dimensions to be concatenated along.

### How it works

In `_infer_concat_order_from_nested_list()` the nested list of datasets is recursively traversed to build a dictionary of datasets, where the keys are the corresponding "tile IDs". These tile IDs are tuples serving as multidimensional indexes for the position of each dataset within the hypercube of all datasets to be combined. For example, four datasets to be combined along two dimensions would be supplied as

```python
datasets = [[ds0, ds1], [ds2, ds3]]
```

and given tile IDs, to be stored as

```python
combined_ids = {(0, 0): ds0,
                (0, 1): ds1,
                (1, 0): ds2,
                (1, 1): ds3}
```

Using this unambiguous intermediate structure means that another method could be used to organise the datasets for concatenation (e.g. reading the values of their coordinates), with a new keyword argument `infer_order_from_coords` used to choose the method.

The `_combine_nd()` function concatenates along one dimension at a time, reducing the length of the tile-ID tuple by one each time `_combine_along_first_dim()` is called. After each concatenation the different variables are merged, so the new `auto_combine()` is essentially like calling the old one once for each dimension in `concat_dims`.

### Still to do

I would like people's opinions on the method I've chosen to do this, and any feedback on the code quality would be appreciated. Assuming we're happy with the method used here, the remaining tasks include:

- [x] More tests of the final `auto_combine()` function
- [x] ~~Add option to deduce concatenation order from coords (or this could be a separate PR)~~
- [x] Integrate this all the way up to `open_mfdataset()`
- [x] Unit tests for `open_mfdataset()`
- [x] More tests that the user has inputted a valid structure of datasets
- [x] ~~Possibly parallelize the concatenation step?~~
- [x] A few other small `TODO`s which are in `combine.py`
- [x] Proper documentation showing how the input should be structured
- [x] Fix failing unit tests on Python 2.7 (though support for 2.7 is being dropped at the end of 2018?)
- [x] Fix failing unit tests on Python 3.5
- [x] Update what's new

This PR was intended to solve the common use case of collecting output from a simulation which was parallelized in multiple dimensions. I would like to write a tutorial about how to use xarray to do this, including examples of how to preprocess the data and discard processor ghost cells.
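To make the tile-ID idea above concrete, here is a minimal, self-contained sketch of how a nested list can be walked recursively to build the `combined_ids` mapping. The function name `infer_tile_ids_from_nested_list` and the use of plain strings in place of `xarray.Dataset` objects are illustrative assumptions; the PR's actual `_infer_concat_order_from_nested_list()` may differ in detail.

```python
# Illustrative sketch only -- not the PR's actual implementation.
# Walk a nested list of datasets and assign each leaf a tuple "tile ID"
# giving its position in the hypercube of datasets to be combined.

def infer_tile_ids_from_nested_list(entry, current_id=()):
    """Recursively yield (tile_id, dataset) pairs from a nested list."""
    if isinstance(entry, list):
        # Descend one level: the index at this level becomes the next
        # element of the tile ID tuple.
        for i, item in enumerate(entry):
            yield from infer_tile_ids_from_nested_list(item, current_id + (i,))
    else:
        # Leaf node: an actual dataset (strings stand in for Datasets here).
        yield current_id, entry


datasets = [["ds0", "ds1"], ["ds2", "ds3"]]
combined_ids = dict(infer_tile_ids_from_nested_list(datasets))
print(combined_ids)
# {(0, 0): 'ds0', (0, 1): 'ds1', (1, 0): 'ds2', (1, 1): 'ds3'}
```

The length of each tile ID equals the nesting depth, i.e. the number of dimensions to concatenate along, which is what lets `_combine_nd()` peel off one dimension per pass.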
Links from other tables
- 0 rows from pull_requests_id in labels_pull_requests