html_url,issue_url,id,node_id,user,created_at,updated_at,author_association,body,reactions,performed_via_github_app,issue https://github.com/pydata/xarray/issues/1092#issuecomment-873227866,https://api.github.com/repos/pydata/xarray/issues/1092,873227866,MDEyOklzc3VlQ29tbWVudDg3MzIyNzg2Ng==,1217238,2021-07-02T19:56:49Z,2021-07-02T19:56:49Z,MEMBER,There's a parallel discussion about hierarchical storage going on over in https://github.com/pydata/xarray/issues/4118. I'm going to close this issue in favor of the other one just to keep the ongoing discussion in one place.,"{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,187859705 https://github.com/pydata/xarray/issues/1092#issuecomment-868324949,https://api.github.com/repos/pydata/xarray/issues/1092,868324949,MDEyOklzc3VlQ29tbWVudDg2ODMyNDk0OQ==,7611856,2021-06-25T08:36:03Z,2021-06-25T08:45:23Z,NONE,"Hey Folks, I stumbled upon this discussion having a similar use case as described in some comments above: A `DataSet` with a bunch of arrays called `count_a, test_count_a, train_count_a, count_b, ... , controlled_test_mean, controlled_train_mean, ... controlled_test_sigma, ...` Obviously a hierarchical structure would help to arrange this. However, one point I didn't see in the discussion is the following: Hierarchical structures often force a user to come up with some arbitrary order of hierarchy levels. The classical example is document filing: do you put your health insurance documents under `/insurance/health/2021`, `2021/health/insurance`, ...? One solution to that is tagging documents instead of putting them into a hierarchy. This would give the full flexibility to retrieve any flat `DataSet` out of a `TaggedDataSet` by specifying the set of tags that the individual `DataArrays` must be listed under. 
Back to the above example, one could think of stuff like:
```python
# get a flat view (DataSet-like object) on all arrays of tagged that have the 'count' tag
ds: DataSet(View) = tagged.tag_select(""count"")
bar1 = ds.mean(dim=""foo"")
# get a flat view (DataSet-like object) on all arrays of tagged that have the ""train"" and ""controlled"" tags
bar2 = tagged.tag_select(""train"", ""controlled"").mean(dim=""foo"")
# order of arguments to `tag_select` is irrelevant!
```
I hope it is clear what I mean; I know that there are, e.g., some awesome [file system plugins](https://amoffat.github.io/supertag/index.html) (the author has incredibly nice high-level documentation on the topic) that use such a data model. Just wanted to add that aspect to the discussion even if it might collide with the hierarchical approach! One side note: if every array in the tagged container has exactly one tag, and tags do not repeat, then the whole thing should be semantically identical to a `DataSet` because every `tag_select` will yield a single `DataArray` - i.e., it might be possible to integrate such functionality directly into `DataSet`!?! Regards, Martin ","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,187859705 https://github.com/pydata/xarray/issues/1092#issuecomment-683824965,https://api.github.com/repos/pydata/xarray/issues/1092,683824965,MDEyOklzc3VlQ29tbWVudDY4MzgyNDk2NQ==,4441338,2020-08-31T14:45:22Z,2020-08-31T15:15:20Z,NONE,"I did a ctrl-f for zarr in this issue, found nothing, so here's my two cents: it should be possible to write a Datagroup with either zarr or netcdf. I wonder if @emilbiju (posted https://github.com/pydata/xarray/issues/4118 ) has any of that code lying around, could be a starting point. In general, a tree structure to which I can broadcast operations in the same dimensions to different datasets that do not necessarily share dimension lengths would solve my use case. 
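Roughly, the kind of tree mapping meant here can be sketched in pure Python (hypothetical names; plain lists stand in for datasets of differing lengths):

```python
# Sketch only: `map_tree` is a hypothetical helper, not an existing xarray
# function, and plain lists stand in for datasets with unequal lengths.
def map_tree(func, tree):
    # Recursively apply `func` to every leaf of a nested dict tree.
    if isinstance(tree, dict):
        return {name: map_tree(func, child) for name, child in tree.items()}
    return func(tree)

experiments = {
    'run_a': {'temperature': [1.0, 2.0, 3.0]},       # length 3
    'run_b': {'temperature': [2.0, 4.0, 6.0, 8.0]},  # length 4, no shared dim length
}

# broadcast the same reduction over every dataset in the tree
means = map_tree(lambda data: sum(data) / len(data), experiments)
```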
This corresponds to bullet point number 3 in https://github.com/pydata/xarray/issues/1092#issuecomment-290159834. My use case is a set of experiments that have: the same parameter variables, with different values; the same dimensions with different lengths for their data. The parameters and data would benefit from having a hierarchical naming structure. Currently I build a master dataset containing experiment datasets, with a coordinate for each parameter. Then I map functions over it.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,187859705 https://github.com/pydata/xarray/issues/1092#issuecomment-571157096,https://api.github.com/repos/pydata/xarray/issues/1092,571157096,MDEyOklzc3VlQ29tbWVudDU3MTE1NzA5Ng==,26384082,2020-01-06T14:26:43Z,2020-01-06T14:26:43Z,NONE,"In order to maintain a list of currently relevant issues, we mark issues as stale after a period of inactivity. If this issue remains relevant, please comment here or remove the `stale` label; otherwise it will be marked as closed automatically. ","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,187859705 https://github.com/pydata/xarray/issues/1092#issuecomment-363090775,https://api.github.com/repos/pydata/xarray/issues/1092,363090775,MDEyOklzc3VlQ29tbWVudDM2MzA5MDc3NQ==,601177,2018-02-05T13:53:55Z,2018-02-05T13:53:55Z,NONE,"I'm late to the discussion and may be repeating some things essentially already said, but I'd still like to add a further voice. @shoyer said on 8 Nov 2016: > I am reluctant to add the additional complexity of groups directly into the `xarray.Dataset` data model. For example, how do groups get updated when you slice, aggregate or concatenate datasets? The rules for coordinates are already pretty complex. 
If you prepend the paths to all the names (of dimensions, coordinate variables, and variables) and use the resulting strings as names, don't you just get a collection that would fit right into an `xarray.Dataset`? (Perhaps I'm just repeating what @lamorton said on 28 Mar 2017.) My feeling is that the only thing missing would be coordinate variables defined in a group closer to the root of the hierarchy, but that needs to be dealt with anyway if you want to read from netCDF4 files correctly (see my #1888). My guess would be that having groups inherit all the dimensions/coordinates of their parent that they do not redefine should be the way to go. My use case is data from a single metmast over time. There are various instruments measuring all kinds of variables of which 10-minute statistics are recorded. I use groups to keep an overview. (I use something like `/[wind|air|prec]//`, `//`, or `///statistic` as a hierarchy.) Slicing along the time axis for the whole hierarchy would make perfect sense. @shoyer said on 30 Mar 2017: > We would possibly want to make another `Dataset` subclass for the sub-datasets to ensure that their variables are linked to the parent, e.g., `xarray.LinkedDataset`. This would enable `ds['flux']['poloidal'] = subset`. > > But I'm also not convinced this is actually worth the trouble given how easy it is to write `ds['flux', 'poloidal']`. I would prefer the former option, as it more clearly shows the hierarchical nature. If one also copies the netCDF4 path-separator convention, then `ds['flux/poloidal']` is shorter than `ds['flux', 'poloidal']`. 
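The path-prefixing idea can be sketched as follows (a toy illustration; `flatten` is hypothetical and plain lists stand in for variables):

```python
# Sketch only: turn a nested dict of variables into a flat dict whose keys
# are '/'-joined paths, mirroring netCDF4 group paths.
def flatten(tree, prefix=''):
    flat = {}
    for name, value in tree.items():
        path = prefix + '/' + name if prefix else name
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat

nested = {'flux': {'poloidal': [1, 2], 'toroidal': [3, 4]}}
flat = flatten(nested)
# keys are now 'flux/poloidal' and 'flux/toroidal'
```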
(Allowing `Dataset` or `DataArray` to have names that include '/' would be dangerous anyway in view of netCDF4 serialization.)","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,187859705 https://github.com/pydata/xarray/issues/1092#issuecomment-290256766,https://api.github.com/repos/pydata/xarray/issues/1092,290256766,MDEyOklzc3VlQ29tbWVudDI5MDI1Njc2Ng==,4160723,2017-03-29T23:26:50Z,2017-03-29T23:26:50Z,MEMBER,"> How to handle dimensions and coordinate names when assigning groups is clearly one of the important design decisions here. It's obvious that data variables should be grouped but less clear how to handle dimensions/coordinates. I would be +1 for allowing tuples for data variable names but not for dimension/coordinate names. It indeed looks like using tuples for the latter would be a greater source of confusion and would add too much complexity for only little (or no real?) benefit. I'd be fine with raising an error when loading a netCDF4 file which has groups with conflicting dimensions or when assigning an incompatible Dataset as a new group (e.g., `ds['flux'] = incompatible_ds`). For groups that share common dimensions/coordinates with some differences, a data structure built on top of `Dataset` (like `DatasetGroup` or `DatasetNode`) would be more appropriate I think.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,187859705 https://github.com/pydata/xarray/issues/1092#issuecomment-290250866,https://api.github.com/repos/pydata/xarray/issues/1092,290250866,MDEyOklzc3VlQ29tbWVudDI5MDI1MDg2Ng==,1217238,2017-03-29T22:54:17Z,2017-03-29T22:54:17Z,MEMBER,"> I'm also having thoughts about the attribute access: if ds['flux']['poloidal'] = subset does not work, then neither does ds.flux.poloidal = subset, correct? 
> If so, it is almost pointless to have the attribute access in the first place. Yes, this is correct. But note that `ds.flux = array` is also not supported -- attribute access in xarray only works for getting, not setting. If you try it, you get an error message, e.g.,
```
AttributeError: cannot set attribute 'bar' on a 'DataArray' object. Use __setitem__ style assignment (e.g., `ds['name'] = ...`) instead to assign variables.
```
> A big difference between xarray and netCDF4-python datasets is that the children datasets in xarray can go have a life of their own, independent of their parent & the file it represents. Yes, this is true. We would possibly want to make another Dataset subclass for the sub-datasets to ensure that their variables are linked to the parent, e.g., `xarray.LinkedDataset`. This would enable `ds['flux']['poloidal'] = subset`. But I'm also not convinced this is actually worth the trouble given how easy it is to write `ds['flux', 'poloidal']`. NumPy has similar issues, e.g., `x[i, j] = y` works but `x[i][j] = y` does not.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,187859705 https://github.com/pydata/xarray/issues/1092#issuecomment-290224441,https://api.github.com/repos/pydata/xarray/issues/1092,290224441,MDEyOklzc3VlQ29tbWVudDI5MDIyNDQ0MQ==,23484003,2017-03-29T21:00:42Z,2017-03-29T21:04:08Z,NONE,"@shoyer I see your point about the string manipulation. On the other hand, this is exactly how [h5py](http://docs.h5py.org/en/latest/high/group.html) and [netCDF4-python](http://unidata.github.io/netcdf4-python/#section2) implement the group/subgroup access syntax: just like a filepath. I'm also having thoughts about the attribute access: if `ds['flux']['poloidal'] = subset` does not work, then neither does `ds.flux.poloidal = subset`, correct? If so, it is almost pointless to have the attribute access in the first place. 
I suppose that is the price to pay for merely making it _appear_ as though there is attribute-access. For my own understanding, I tried to translate between `xarray` and `netCDF4-python`:
- `nc.Variable` <--> `xr.Variable`
- `nc.?????` <--> `xr.DataArray` (netCDF doesn't distinguish vars/coords, so no analog is possible)
- `nc.Group` <--> `xr.NestableDataset`
- `nc.Dataset` <--> `xr.NestableDataset`

From [netCDF4-python](http://unidata.github.io/netcdf4-python/#netCDF4.Group):
> Groups define a hierarchical namespace within a netCDF file. They are analogous to directories in a unix filesystem. Each Group behaves like a Dataset within a Dataset, and can contain it's own variables, dimensions and attributes (and other Groups). Group inherits from Dataset, so all the Dataset class methods and variables are available to a Group instance (except the close method).

It appears that the only things special about a `nc.Dataset` as compared to an `nc.Group` are:
1. The file access is tied to the `nc.Dataset`.
2. The `nc.Dataset` group has children but no parent.

A big difference between `xarray` and `netCDF4-python` `datasets` is that the children `datasets` in `xarray` can go have a life of their own, independent of their parent & the file it represents. It makes sense to me to have just a single `xarray` type (modified version of `xarray.Dataset`) to deal with both of these cases. The `nc.Group` instances have an attribute `groups` that lists all the subgroups. So one option I suppose would be to follow that route and actually have Datasets that contain other datasets alongside everything else. 
As an aside, it seems that ragged arrays are now supported in [netCDF4-python:VLen](http://unidata.github.io/netcdf4-python/#section11).","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,187859705 https://github.com/pydata/xarray/issues/1092#issuecomment-290165548,https://api.github.com/repos/pydata/xarray/issues/1092,290165548,MDEyOklzc3VlQ29tbWVudDI5MDE2NTU0OA==,1217238,2017-03-29T17:38:03Z,2017-03-29T17:38:49Z,MEMBER,"> The background flux is defined to be uniform in some coordinates, so it is lower-dimensionality than the total flux. It doesn't make sense to turn a 1-D variable into a 3-D variable just to match the others so I can put it into an array. Yes, totally agreed, and I've encountered similar cases in my own work. These sorts of ""ragged"" arrays are a great use case for groups. > Accessing via ds['flux','poloidal'] is a bit confusing because ds[] is (I think) a dictionary, but supplying multiple names is suggestive of either array indexing or getting a list with two things inside, flux and poloidal. That is, the syntax doesn't reflect the semantics very well. Yes, it's a little confusing because it looks similar to `ds[['flux','poloidal']]`, which has a different meaning. But otherwise programmatic access starts turning into a mess of string manipulation, e.g., `ds['flux', subgroup]` rather than `ds['flux/' + subgroup]`. > If I am at the console, and I start typing ds.flux and use the tab-completion, does that end up creating a new dataset just so I can see what is inside ds.flux? Is that an expensive operation? Yes, it would create a new dataset, which could take ~1 ms. 
That's slow for inner loops (though we could add caching to help), but plenty fast for interactive use.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,187859705 https://github.com/pydata/xarray/issues/1092#issuecomment-290159834,https://api.github.com/repos/pydata/xarray/issues/1092,290159834,MDEyOklzc3VlQ29tbWVudDI5MDE1OTgzNA==,23484003,2017-03-29T17:18:23Z,2017-03-29T17:19:19Z,NONE,"@darothen: Hmm, are your coordinate grids identical for each simulation (i.e., `any(ds1.x != ds2.x)` evaluates as false)?
- If so, then it really does make sense to do what you described and create new dimensions for the experimental factors, on top of the spatial dimensions of the simulations.
- If not, but the length of all the dimensions is the same, one could still keep all the simulations in the same dataset, one would just need to index the coordinates with the experimental factors as well.
- Finally, if the shape of the coordinate arrays varies with the experimental factor (for instance, doing convergence studies with finer meshes), that violates the xarray data model for having a single set of dimensions, each of which has a fixed length throughout the dataset, in order to enable smart broadcasting by dimension name. If (and only if) the dimensions are changing length, it would be better to keep a collection of datasets in some other type of data structure.

It might work for my case to convert my 'tags' to indexes for new dimensions (i.e., `ds.sel(quantity='flux', direction='poloidal', variation='perturbed')`). However, there are two issues:
1. The background flux is defined to be uniform in some coordinates, so it is lower-dimensionality than the total flux. It doesn't make sense to turn a 1-D variable into a 3-D variable just to match the others so I can put it into an array. This goes especially for scalars and metadata that really should not be turned into arrays, but do belong with the subsets.
2. During my processing sequence, I may want to add something like `ds.flux.helical.background`. In order to do this, however, I'd be forced to define the 'perturbed' and 'total' helical fluxes at that time. But often I don't want or need to compute these.

There is still a good reason to have a flexible data model for lumping more heterogeneous collections together under some headings, with the potential for recursion. I suppose my question is, what is the most natural data model & corresponding access syntax?
- Attribute-style access is convenient and idiomatic; it implies a tree-like structure. This probably makes the most sense.
- An alternative data model would be sets with subsets, which could be accessed by something similar to `ds.sel` but accepting set names as `*args` rather than `**kwargs`. Then requesting members of some set could return a dataset with those members, and the new dataset would lack the membership flag for variables, much the way slicing reduces dimensionality. In fact, one could even keep a record of the applied set requests much like point axes. A variable's key in `data_vars` would essentially just be a list/tuple of sets of which it is a member. Assignment would be tricky because it could create new sets, and the membership of existing elements in a new set would probably require user intervention to clarify...

@shoyer: Your approach is quite clever, and 'smells' much better than parsing strings. I do have two quibbles though.
- Accessing via `ds['flux','poloidal']` is a bit confusing because `ds[]` is (I think) a dictionary, but supplying multiple names is suggestive of either array indexing or getting a list with two things inside, `flux` and `poloidal`. That is, the syntax doesn't reflect the semantics very well. 
- If I am at the console, and I start typing `ds.flux` and use the tab-completion, does that end up creating a new dataset just so I can see what is inside `ds.flux`? Is that an expensive operation? [Edited for formatting] ","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,187859705 https://github.com/pydata/xarray/issues/1092#issuecomment-290142369,https://api.github.com/repos/pydata/xarray/issues/1092,290142369,MDEyOklzc3VlQ29tbWVudDI5MDE0MjM2OQ==,1217238,2017-03-29T16:18:12Z,2017-03-29T16:18:12Z,MEMBER,"> @shoyer do you have an idea on how it would work with serialization to netCDF? With netCDF4, we could potentially just use groups. Or we could use some sort of naming convention for strings, e.g., joining together the parts of the tuple with `.`. One challenge here is that unless we also let dimensions be group specific, not every netCDF4 file with groups corresponds to a valid xarray Dataset: you can have conflicting sizes on dimensions for netCDF4 files in different groups. In principle, it could be OK to use tuples for dimension names, but we already have lots of logic that distinguishes between single and multiple dimensions by looking for non-strings or tuples. So you would probably have to write `ds.sum(dim=[('flux', 'time')])` if you wanted to sum over the 'time' dimension of the 'flux' group. https://github.com/pydata/xarray/issues/1231 would help here (e.g., to enable `ds.sel({('flux', 'time'): time})`), but cases like `ds.sum(dim=('flux', 'time'))` would still be a source of confusion. How to handle dimensions and coordinate names when assigning groups is clearly one of the important design decisions here. It's obvious that data variables should be grouped but less clear how to handle dimensions/coordinates. > We would also have to decide how to display groups in the repr of the flat dataset... 
Some sort of further indentation seems natural, possibly with truncation like `...` for cases when the number of variables is very large (>10), e.g.,
```
Data variables:
    flux
        poloidal
            perturbed
```
This is another case where an HTML repr could be powerful, allowing for clearer visual links and potentially interactive expanding/contracting of the tree. > Would the domain for this just be to simulate the tree-like structure that NetCDF permits, or could it extend to multiple datasets on disk? From xarray's perspective, there isn't really a distinction between multiple files and groups in one netCDF file -- it's just a matter of creating a Dataset with data organized in a different way. Presumably we could write helper methods for converting a dimension into a group level (and vice-versa). But it's worth noting that there are still limitations to opening large numbers of files in a single dataset, even with groups, because xarray reads all the metadata for every variable into memory at once, and that metadata is copied in every xarray operation. For this reason, you will still probably want a different data structure (convertible into an xarray.Dataset) when navigating very large datasets like CMIP, which consists of many thousands of files.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,187859705 https://github.com/pydata/xarray/issues/1092#issuecomment-290133148,https://api.github.com/repos/pydata/xarray/issues/1092,290133148,MDEyOklzc3VlQ29tbWVudDI5MDEzMzE0OA==,4992424,2017-03-29T15:47:57Z,2017-03-29T15:48:17Z,NONE,"Ah, thanks for the heads-up @benbovy! I see the difference now, and I agree both approaches could co-exist. I may play around with building some of your proposed `DatasetNode` functionality into my `Experiment` tool. 
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,187859705 https://github.com/pydata/xarray/issues/1092#issuecomment-290130241,https://api.github.com/repos/pydata/xarray/issues/1092,290130241,MDEyOklzc3VlQ29tbWVudDI5MDEzMDI0MQ==,4160723,2017-03-29T15:38:46Z,2017-03-29T15:38:46Z,MEMBER,"@darothen you might be interested in the discussion we had [here](https://github.com/pydata/xarray/issues/1077#issuecomment-260162320), although it doesn't solve anything related to selection across similar Dataset objects. I think that the collection of `Dataset` objects with like-dimensions that you suggest is indeed different from the tree-like structure within a dataset that is proposed here (the latter still using a unique set of dimensions and coordinates). Both approaches may co-exist, though. I can imagine the case where we have (1) a set of, e.g., grid-search or Monte Carlo model runs and (2) for each model run we have diagnostic variables defined in different places on the grid (e.g., nodes, edges...). The tuple-defined groups within a Dataset are useful for (2) and the collection of Dataset objects is useful for (1). As [pointed out](https://github.com/pydata/xarray/issues/1077#issuecomment-260686932) by @shoyer, such a collection of Dataset objects might be (preferably) implemented outside of xarray. ","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,187859705 https://github.com/pydata/xarray/issues/1092#issuecomment-290106782,https://api.github.com/repos/pydata/xarray/issues/1092,290106782,MDEyOklzc3VlQ29tbWVudDI5MDEwNjc4Mg==,4992424,2017-03-29T14:26:15Z,2017-03-29T14:26:15Z,NONE,"Would the domain for this just be to simulate the tree-like structure that NetCDF permits, or could it extend to multiple datasets on disk? 
One of the ideas that we had [during the aospy hackathon](https://aospy.hackpad.com/Data-StorageDiscovery-Design-Document-fM6LgfwrJ2K) involved some sort of idiom based on xarray for packing multiple, similar datasets together. For instance, it's very common in climate science to re-run a model multiple times nearly identically, but changing a parameter or boundary condition. So you end up with large archives of data on disk which are identical in shape and metadata, and you want to be able to quickly analyze across them. As an example, I built [a helper tool](https://github.com/darothen/experiment/blob/master/experiment/experiment.py) during my dissertation to automate much of this, allowing you to dump your processed output in some sort of directory structure and consistent naming scheme, and then easily ingest what you need for a given analysis. It's actually working great for a much larger, Monte Carlo set of model simulations right now (3 factor levels with 3-5 values at each level, for a total of 1500 years of simulation). My tool works by concatenating each experimental factor as a new dimension, which lets you use xarray's selection tools to perform analyses across the ensemble. You can pre-process things before concatenating too, if the data ends up being too big to fit in memory (e.g. for every simulation in the experiment, compute time-zonal averages before concatenation). Going back to @shoyer's [comment](https://github.com/pydata/xarray/issues/1092#issuecomment-259206339), it still seems as though there is room to build some sort of collection of `Dataset`s, in the same way that a `Dataset` is a collection of `DataArray`s. Maybe this is different than @lamorton's grouping example, but it would be really, really cool if you could use the same sort of syntactic sugar to select across multiple `Dataset`s with like-dimensions just as you could slice into groups inside a `Dataset` as proposed here. 
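In spirit (a toy sketch: plain dicts stand in for the per-simulation `Dataset`s and all names are hypothetical), the concatenate-then-select pattern amounts to keying each run by its experimental factors and filtering on them:

```python
# Sketch only: in practice these would be xarray Datasets combined with
# xr.concat along new factor dimensions, then sliced with .sel().
runs = {
    ('factor_a', 1): {'result': 1.0},
    ('factor_a', 2): {'result': 2.0},
    ('factor_b', 1): {'result': 3.0},
}

def select(runs, level):
    # pick out every run matching one experimental factor level
    return {key: ds for key, ds in runs.items() if key[0] == level}

factor_a_runs = select(runs, 'factor_a')
```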
It would certainly make things much more manageable than concatenating huge combinations of `Dataset`s in memory!","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,187859705 https://github.com/pydata/xarray/issues/1092#issuecomment-290065632,https://api.github.com/repos/pydata/xarray/issues/1092,290065632,MDEyOklzc3VlQ29tbWVudDI5MDA2NTYzMg==,4160723,2017-03-29T11:48:12Z,2017-03-29T11:48:12Z,MEMBER,"Just want to say that I'm very enthusiastic about this! Like @lamorton, I also find myself having a lot of variables with names containing the name(s) of their ""group(s)"". My initial idea was also to keep flat datasets and add some logic to get/set groups, but it wasn't very clear and well explained. > One important reason to keep the tree-like structure within a dataset is that it provides some assurance to the recipient of the dataset that all the variables 'belong' in the same coordinate space. Makes perfect sense! I also find the idea of using tuples very clever! @shoyer do you have an idea on how it would work with serialization to netCDF? We would also have to decide how to display groups in the repr of the flat dataset... @lamorton @shoyer unless you want to open a PR, I'd be willing to start working on this. ","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,187859705 https://github.com/pydata/xarray/issues/1092#issuecomment-289923078,https://api.github.com/repos/pydata/xarray/issues/1092,289923078,MDEyOklzc3VlQ29tbWVudDI4OTkyMzA3OA==,1217238,2017-03-28T22:22:31Z,2017-03-28T22:24:31Z,MEMBER,"@lamorton Thanks for explaining the use case here. This makes more sense to me now. I like your idea of groups as syntactic sugar around flat datasets with named keys. With an appropriate naming convention, we might even be able to put this into `xarray.Dataset` proper. 
Tuples like `('flux', 'poloidal', 'perturbed')` would be more appropriate than a string-based convention, because they are easier to use programmatically.
- In Python syntax, `ds[x, y]` is equivalent to `ds[(x, y)]`. Thus `ds['flux', 'poloidal', 'perturbed']` works to pull out a named variable.
- If no variable with the name `flux` is found, `ds['flux']` or `ds.flux` would return a Dataset with all variables with names given by tuples starting with `'flux'`, removing the prefix `'flux'` from each name (e.g., `('poloidal', 'perturbed')` would be a variable in `ds.flux`). This means that `ds.flux.poloidal.perturbed` and `ds['flux']['poloidal']['perturbed']` should automatically work.
- `ds['flux', 'poloidal', 'perturbed'] = data_array` would assign the variable `('flux', 'poloidal', 'perturbed')`, and implicitly create the `'flux'` group (which in turn contains the `'poloidal'` group). Note that it's *not* possible to make assignments like `ds['flux']['poloidal']['perturbed'] = data_array` work, so we should discourage this syntax.
- `ds['flux'] = poloidal_ds` would become valid, and work by assigning all variables in `poloidal_ds` into `ds` by prefixing their names with `'flux'`. 
- Similarly, nested arguments could also be supported in the `Dataset` constructor, e.g., `xarray.Dataset({'flux': {'poloidal': {'perturbed': data_array}}})` becomes syntactic sugar for `xarray.Dataset({('flux', 'poloidal', 'perturbed'): data_array})`.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,187859705 https://github.com/pydata/xarray/issues/1092#issuecomment-289916013,https://api.github.com/repos/pydata/xarray/issues/1092,289916013,MDEyOklzc3VlQ29tbWVudDI4OTkxNjAxMw==,23484003,2017-03-28T21:51:30Z,2017-03-28T21:51:30Z,NONE,"One important reason to keep the tree-like structure within a dataset is that it provides some assurance to the recipient of the dataset that all the variables 'belong' in the same coordinate space. Constructing a tree (from a nested dictionary, say) whose leaves are datasets or dataArrays doesn't guarantee that the coordinates/dimensions in all the leaves are compatible, whereas a tree within the dataset does make a guarantee about the leaves. As far as motivation for making trees, I find myself with several dozen variable names such as `ds.fluxPoloidalPerturbation` and `ds.fieldToroidalBackground` and various permutations, so it would be logical to be able to write `ds.flux.poloidal` and get a sub-dataset that contains dataArrays named `perturbation` and `background`. As far as implementation, the `DataGroup` could really just be syntactic sugar around a flat dataset that is hidden from the user, and has keys like `'flux.poloidal.perturbed'` so that `dg.flux.poloidal.perturbed` would be an alias to `dg.__hiddenDataset__['flux.poloidal.perturbed']`, and `dg.flux.poloidal` would be an alias to `dg.__hiddenDataset__[['flux.poloidal.perturbed','flux.poloidal.background']]`. 
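A minimal sketch of that virtual-tree idea (a plain dict plays the role of the hidden flat dataset; real leaf values would be `DataArray`s, and the class name is hypothetical):

```python
# Sketch only: attribute access filters the hidden flat mapping by dotted
# prefix, returning either a leaf value or a smaller virtual group.
class DataGroup:
    def __init__(self, flat):
        self._flat = flat  # e.g. {'flux.poloidal.perturbed': ...}

    def __getattr__(self, name):
        if name in self._flat:          # exact key: return the leaf
            return self._flat[name]
        prefix = name + '.'
        subset = {key[len(prefix):]: value
                  for key, value in self._flat.items()
                  if key.startswith(prefix)}
        if not subset:
            raise AttributeError(name)
        return DataGroup(subset)        # branch: return a smaller group

dg = DataGroup({'flux.poloidal.perturbed': 1, 'flux.poloidal.background': 2})
```

With this, `dg.flux.poloidal.perturbed` resolves through two intermediate virtual groups down to the stored leaf.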
Seems like it would require mucking with `dg.__getattr__`, `dg.__setattr__`, and `dg.__dir__` at a minimum to get it off the ground, but by making the tree virtual, one avoids the difficulties with slicing, etc. The return type of `dg.__getattr__` should be another `DataGroup` as long as there are branches in the output, but it should fall back to a `Dataset` when there are only leaves.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,187859705 https://github.com/pydata/xarray/issues/1092#issuecomment-259390660,https://api.github.com/repos/pydata/xarray/issues/1092,259390660,MDEyOklzc3VlQ29tbWVudDI1OTM5MDY2MA==,4160723,2016-11-09T11:15:01Z,2016-11-09T11:24:51Z,MEMBER,"> For example, how do groups get updated when you slice, aggregate or concatenate datasets? Yep once again I haven't thought about all the implications this would have! This would indeed add much complexity in the end. I'll try to follow your suggestion of building another data structure, for example - correct me if it's a wrong approach too - a `DatasetGroup` class which would be very similar to `netCDF4.Group` or `h5py.Group` but which would here contain a single `Dataset`. 
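Such a `DatasetGroup` might look roughly like this (hypothetical sketch; a plain dict stands in for the wrapped `Dataset`):

```python
# Sketch only: each node holds one payload plus named child groups,
# much like netCDF4.Group / h5py.Group.
class DatasetGroup:
    def __init__(self, name='/', dataset=None, parent=None):
        self.name = name
        self.dataset = dataset  # stands in for an xarray.Dataset
        self.parent = parent
        self.groups = {}        # child groups, as in netCDF4

    def create_group(self, name, dataset=None):
        child = DatasetGroup(name, dataset, parent=self)
        self.groups[name] = child
        return child

    @property
    def path(self):
        # '/'-joined path from the root, like netCDF4 group paths
        if self.parent is None:
            return '/'
        base = self.parent.path
        return base + self.name if base == '/' else base + '/' + self.name

root = DatasetGroup()
poloidal = root.create_group('flux').create_group('poloidal',
                                                  dataset={'perturbed': [1, 2]})
```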
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,187859705 https://github.com/pydata/xarray/issues/1092#issuecomment-259208431,https://api.github.com/repos/pydata/xarray/issues/1092,259208431,MDEyOklzc3VlQ29tbWVudDI1OTIwODQzMQ==,1197350,2016-11-08T17:51:00Z,2016-11-08T17:51:00Z,MEMBER,"This suggestion has some significant overlap with the data store / data discovery discussion from last weekend: https://aospy.hackpad.com/Data-StorageDiscovery-Design-Document-fM6LgfwrJ2K ","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,187859705 https://github.com/pydata/xarray/issues/1092#issuecomment-259206339,https://api.github.com/repos/pydata/xarray/issues/1092,259206339,MDEyOklzc3VlQ29tbWVudDI1OTIwNjMzOQ==,1217238,2016-11-08T17:43:22Z,2016-11-08T17:43:22Z,MEMBER,"I am reluctant to add the additional complexity of groups directly into the `xarray.Dataset` data model. For example, how do groups get updated when you slice, aggregate or concatenate datasets? The rules for coordinates are already pretty complex. I would rather see this living in another data structure built on top of `xarray.Dataset`, either in xarray or in a separate library. ","{""total_count"": 2, ""+1"": 2, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,187859705