id,node_id,number,title,user,state,locked,assignee,milestone,comments,created_at,updated_at,closed_at,author_association,active_lock_reason,draft,pull_request,body,reactions,performed_via_github_app,state_reason,repo,type
2054280736,I_kwDOAMm_X856cdYg,8572,Track merging datatree into xarray,35968931,open,0,,,27,2023-12-22T17:37:20Z,2024-05-02T19:44:29Z,,MEMBER,,,,"### What is your issue?

Master issue to track progress of merging [xarray-datatree](https://github.com/xarray-contrib/datatree) into xarray `main`.

Would close https://github.com/pydata/xarray/issues/4118 (and many similar issues), as well as achieve one of the goals of our [development roadmap](https://docs.xarray.dev/en/stable/roadmap.html#tree-like-data-structure). Also see the [project board for DataTree integration](https://github.com/pydata/xarray/projects/9).

---

On calls in the last few [dev meetings](https://github.com/pydata/xarray/issues/4001), we decided to forget about a temporary cross-repo `from xarray import datatree` (so this issue supersedes #7418), and to just begin merging datatree into xarray `main` directly.

## Weekly meeting

See https://github.com/pydata/xarray/issues/8747

## Task list

To happen in order:

- [x] **`open_datatree` in xarray.** This doesn't need to be performant initially, and ~~it would initially return a `datatree.DataTree` object.~~ EDIT: We decided it should return an `xarray.DataTree` object, or even an `xarray.core.datatree.DataTree` object. So we can start by just copying the basic version in `datatree/io.py` right now, which just calls `open_dataset` many times (see the sketch just after this list). #8697
- [x] **Triage and fix issues**: figure out which of the issues on xarray-contrib/datatree need to be fixed *before* the merge (if any).
- [ ] **Merge in code for the `DataTree` class.** I suggest we do this by making one PR for each module, and ideally discussing and merging each before opening a PR for the next module. (Open to other workflow suggestions though.) The main aims here are lowering the bus factor on the code, confirming high-level design decisions, and improving details of the implementation as it goes in. Suggested order of modules to merge:
  - [x] `datatree/treenode.py` - defines the tree structure, without any dimensions/data attached, #8757
  - [x] `datatree/datatree.py` - adds data to the tree structure, #8789
  - [x] `datatree/iterators.py` - iterates over a single tree in various ways, currently copied from [anytree](https://github.com/c0fec0de/anytree), #8879
  - [x] `datatree/mapping.py` - implements `map_over_subtree` by iterating over N trees at once, https://github.com/pydata/xarray/pull/8948
  - [ ] `datatree/ops.py` - uses `map_over_subtree` to map methods like `.mean` over whole trees (https://github.com/pydata/xarray/pull/8976)
  - [x] `datatree/formatting_html.py` - HTML repr; works, but could do with some [optimization](https://github.com/xarray-contrib/datatree/issues/206), https://github.com/pydata/xarray/pull/8930
  - [x] `datatree/{extensions,common}.py` - miscellaneous other features, e.g. attribute-like access (#8967).
- [ ] **Expose datatree API publicly.** Actually expose `open_datatree` and `DataTree` in xarray's public API as top-level imports. The full list of things to expose is:
  - [ ] `open_datatree`
  - [ ] `DataTree`
  - [ ] `map_over_subtree`
  - [ ] `assert_isomorphic`
  - [ ] `register_datatree_accessor`
- [ ] **Refactor class inheritance** - `Dataset`/`DataArray` share some mixin classes (e.g. `DataWithCoords`), and we could probably refactor `DataTree` to use these too. This is low-priority but would reduce code duplication.
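For concreteness, a rough sketch of that naive approach (untested; the function names here are made up for illustration, it assumes the existing `datatree` package's `DataTree.from_dict`, and it uses `netCDF4` directly just to discover group names):

```python
# Rough sketch only -- assumes the current datatree package's API
# (DataTree.from_dict) and uses netCDF4 to list the group paths.
import netCDF4
import xarray as xr
from datatree import DataTree  # would become xarray.DataTree after the merge


def group_paths(group, prefix=''):
    # Recursively yield '/'-separated paths for every group in the file.
    yield prefix or '/'
    for name, child in group.groups.items():
        yield from group_paths(child, f'{prefix}/{name}')


def naive_open_datatree(filepath):
    # Open every group as an independent Dataset, then assemble the tree.
    with netCDF4.Dataset(filepath, mode='r') as ncfile:
        paths = list(group_paths(ncfile))
    # One open_dataset call per group: correct, but re-opens the file each
    # time -- which is why the backend refactor below matters for speed.
    return DataTree.from_dict(
        {path: xr.open_dataset(filepath, engine='netcdf4', group=path)
         for path in paths}
    )
```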
Can happen basically at any time, or maybe in parallel with other efforts:

- [ ] **Generalize backends to support groups.** Once a basic version of `xr.open_datatree` exists, we can start refactoring xarray's backend classes to support a general `Backend.open_datatree` method for any backend that can open multiple groups. Then we can make sure this is more performant than the naive implementation, i.e. only opening the file once. See also #8994. (A hypothetical sketch of the interface follows this list.)
- [ ] **Support backends other than netCDF and Zarr** - e.g. grib, see https://github.com/pydata/xarray/pull/7437.
- [ ] **Support dask properly** - Issue https://github.com/xarray-contrib/datatree/pull/97 and the (stale) PR https://github.com/xarray-contrib/datatree/pull/196 are about dask parallelization over separate nodes in the tree.
- [ ] **Add other new high-level API methods** - Things like [`.reorder_nodes`](https://github.com/xarray-contrib/datatree/pull/271), and ideas we've only discussed, like https://github.com/xarray-contrib/datatree/issues/79 and https://github.com/xarray-contrib/datatree/issues/254 (cc @dcherian, who has had useful ideas here).
- [ ] **Copy xarray-contrib/datatree issues over to xarray's main repository.** I think this is quite important, and worth doing as a record of why decisions were made. (@jhamman and @TomNicholas)
- [ ] Copy over any recent bug fixes from the original `datatree` repository.
- [x] **Look into merging the commit history of xarray-contrib/datatree.** I think this would be cool, but it is less important than keeping the issues. (@jhamman suggested we could do this using some git wizardry that I hadn't heard of before.)
- [ ] **`xarray.tutorial.open_datatree`** - I've been meaning to make a tutorial datatree object for ages. There's an [issue about it](https://github.com/xarray-contrib/datatree/issues/100), but actually now I think something close to the CMIP6 ensemble data that @jbusecke and I used in our [pangeo blog post](https://medium.com/pangeo/easy-ipcc-part-1-multi-model-datatree-469b87cf9114) would already be pretty good. Once we have this, it becomes much easier to write docs about some advanced features.
- [ ] **Merge Docs** - I've tried to write these pages so that they should slot neatly into xarray's existing docs structure. Careful reading, additions and improvements would be great though. A summary of what docs exist is in https://github.com/xarray-contrib/datatree/issues/61.
- [ ] Write a blog post on the [xarray blog](https://xarray.dev/blog) highlighting xarray's new functionality, and explicitly thanking the NASA team for their work. It doesn't have to be long; it can just point to the documentation.
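To make the backend interface concrete, a purely hypothetical sketch (designing the real `Backend.open_datatree` signature is the point of this task, so names and details will differ; `group_paths` is the helper from the sketch above):

```python
# Hypothetical interface only -- names and signatures are not settled.
import netCDF4
import xarray as xr
from datatree import DataTree
from xarray.backends import BackendEntrypoint


class GroupAwareBackend(BackendEntrypoint):
    def open_dataset(self, filename_or_obj, *, drop_variables=None, group=None):
        # Single-group case: delegate to the existing netcdf4 engine.
        return xr.open_dataset(filename_or_obj, engine='netcdf4',
                               drop_variables=drop_variables, group=group)

    def open_datatree(self, filename_or_obj, **kwargs):
        # Walk the group hierarchy in one pass over the file. (A genuinely
        # performant version would also reuse this single file handle to
        # read each group's variables, rather than re-opening per group.)
        with netCDF4.Dataset(filename_or_obj, mode='r') as ncfile:
            paths = list(group_paths(ncfile))  # helper defined above
        return DataTree.from_dict(
            {path: self.open_dataset(filename_or_obj, group=path, **kwargs)
             for path in paths}
        )
```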
---

Anyone is welcome to help with any of this, including but not limited to @owenlittlejohns, @eni-awowale, @flamingbear (@etienneschalk maybe?).

cc also @shoyer @keewis for any thoughts as to the process.","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/8572/reactions"", ""total_count"": 7, ""+1"": 6, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 1}",,,13221727,issue
324350248,MDU6SXNzdWUzMjQzNTAyNDg=,2159,Concatenate across multiple dimensions with open_mfdataset,35968931,closed,0,,,27,2018-05-18T10:10:49Z,2019-09-16T18:54:39Z,2019-06-25T15:50:33Z,MEMBER,,,,"#### Code Sample

```python
import numpy as np
import xarray as xr

# Create 4 datasets containing sections of contiguous (x, y) data
for i, x in enumerate([1, 3]):
    for j, y in enumerate([10, 40]):
        ds = xr.Dataset({'foo': (('x', 'y'), np.ones((2, 3)))},
                        coords={'x': [x, x+1],
                                'y': [y, y+10, y+20]})
        ds.to_netcdf('ds.' + str(i) + str(j) + '.nc')

# Try to open them all in one go
ds_read = xr.open_mfdataset('ds.*.nc')
print(ds_read)
```

#### Problem description

Currently ``xr.open_mfdataset`` will detect a single common dimension and concatenate datasets along that dimension. However, a common use case is a set of netCDF files which have two or more common dimensions that need to be concatenated along simultaneously (for example, collecting the output of any large-scale simulation which parallelizes in more than one dimension at once). For the behaviour of ``xr.open_mfdataset`` to be n-dimensional, it should automatically recognise and concatenate along all common dimensions.

#### Expected Output

```
Dimensions:  (x: 4, y: 6)
Coordinates:
  * x        (x) int64 1 2 3 4
  * y        (y) int64 10 20 30 40 50 60
Data variables:
    foo      (x, y) float64 dask.array
```

#### Current output of ``xr.open_mfdataset()``

```
Dimensions:  (x: 4, y: 12)
Coordinates:
  * x        (x) int64 1 2 3 4
  * y        (y) int64 10 20 30 40 50 60 10 20 30 40 50 60
Data variables:
    foo      (x, y) float64 dask.array
```
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/2159/reactions"", ""total_count"": 4, ""+1"": 4, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue
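The multi-dimensional `combine` machinery that eventually closed #2159 supports exactly the layout requested above. A minimal sketch, assuming a post-resolution xarray (the `combine` keyword, roughly 0.12.2 onward) and the four files written by the issue's code sample:

```python
import xarray as xr

# Let xarray infer the 2-D (x, y) block layout from the coordinates:
ds = xr.open_mfdataset('ds.*.nc', combine='by_coords')

# Or spell out the nesting explicitly: outer list along x, inner along y.
ds = xr.open_mfdataset(
    [['ds.00.nc', 'ds.01.nc'], ['ds.10.nc', 'ds.11.nc']],
    combine='nested',
    concat_dim=['x', 'y'],
)
```

`combine='by_coords'` infers the arrangement from the monotonic `x` and `y` coordinates, while `combine='nested'` trusts the explicit list nesting instead.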