Example of an ideal data structure
The data structure most useful for this kind of analysis is an arbitrary graph of n-dimensional arrays; forcing hierarchical access onto the graph provides optional organization, and the graph itself can exist as Python objects for nodes and references for edges.
If the tree structure is not required, everything can be placed at the first level, as is done in a Dataset.
# Example:
## Notation
- `a:b` : value `a` has type `b`
- `t[...,n,...]` : type of data array of values of type `t`, with an axis of length `n`
- `D(n(,l))` : dimension of size `n` with optional labels `l`
- `A(t,*dims:tuple[D])` : type of data array of values of type `t`, with dimensions `dims`
- a tree node `T` is either:
- a dict from hashables to tree nodes, `dict[Hashable,T]`
- a dimension `D`
- a data array `A`
- `a[*tags] := a[tags[0]][tags[1]]...[tags[len(tags)-1]]`
- `map(f,*args:A,dims:tuple[D])` : maps `f` over `args`, broadcasting over `dims`
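To make the notation concrete, here is a minimal sketch of it as plain Python objects; the dataclasses below are hypothetical (they only mirror the notation and are not an existing xarray API), with nodes as objects and edges as ordinary references:
```python
# Hypothetical sketch of the notation: D, A and tree nodes as plain Python objects.
from dataclasses import dataclass
from typing import Optional, Union

import numpy as np


@dataclass
class D:
    """Dimension of size n with optional labels l (length n)."""
    n: int
    l: Optional[np.ndarray] = None


@dataclass
class A:
    """Data array of values of type t over the dimensions dims."""
    t: type
    dims: tuple                       # tuple of D objects, shared by reference across arrays
    data: Optional[np.ndarray] = None


# A tree node is either a dict of tree nodes, a dimension D, or a data array A.
TreeNode = Union[dict, D, A]
```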
Start with a 2-dimensional DataArray:
```
d0
(
Graph : (
x->D(x_n,float[x_n])
y->D(y_n)
v->A(float,x,y)
)
Tree : (
{
'x':x,
'y':y,
'v':v,
}
)
)
```
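For comparison only, `d0` corresponds roughly to an ordinary `xarray.Dataset` holding a 2-D array `v` with a labelled dimension `x` and an unlabelled dimension `y`; the sizes below are made up for the sketch:
```python
import numpy as np
import xarray as xr

x_n, y_n = 4, 3  # made-up sizes
d0 = xr.Dataset(
    {"v": (("x", "y"), np.random.rand(x_n, y_n))},
    coords={"x": np.linspace(0.0, 1.0, x_n)},  # x carries float labels
)
# y is a dimension of length y_n with no labels attached to it
```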
Map a function `f` that introduces a new dimension `w` with constant labels `f_w_l:int[f_w_n]` (through `map_blocks` or `apply_ufunc`) and add the result to `d0`:
```
f : x:float->(
Graph:
f_w->D(f_w_n,f_w_l)
a->A(float,f_w)
b->A(float)
Tree:
{
'w':f_w,
'a':a,
'b':b,
})
d1=d0.copy()
d1['f']=map(
f,
d0['v'],
(d0['x'],d0['y'])
)
d1
(
Graph :
x->D(x_n,float[x_n])
y->D(y_n)
v->A(float,x,y)
f_w->D(f_w_n,f_w_l)
f_a->A(float,x,y,f_w)
f_b->A(float,x,y)
Tree :
{
'x':x,
'y':y,
'v':v,
'f':{
'w':f_w,
'a':f_a,
'b':f_b,
}
}
)
```
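In current xarray, a step like `f` can be approximated with `apply_ufunc` and `output_core_dims`. The sketch below assumes the toy `d0` from the earlier sketch, returns only the `a` output for brevity, and names the new dimension `f_w` directly to sidestep the `w` name clash discussed next:
```python
import numpy as np
import xarray as xr

f_w_n = 5
f_w_l = np.arange(f_w_n)  # constant integer labels for the new dimension

def f_a(v):
    # hypothetical per-element computation returning a vector along the new axis
    return v * f_w_l

d1 = d0.copy()
d1["f_a"] = xr.apply_ufunc(
    f_a,
    d0["v"],
    output_core_dims=[["f_w"]],  # the dimension introduced by f
    vectorize=True,              # apply f_a element-wise over the x and y dims
).assign_coords(f_w=f_w_l)
# In a flat Dataset the tree level 'f' is flattened into the variable name 'f_a'.
```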
Map a function `g` that outputs a dimension with the same name `w` but a different meaning, and therefore a possibly different length `g_w_n` and labels `g_w_l`:
```
g : x:float->(
Graph:
g_w->D(g_w_n,g_w_l)
a->A(float,g_w)
b->A(float)
Tree:
{
'w':g_w,
'a':a,
'b':b,
})
d2=d1.copy()
d2['g']=map(
g,
d1['v'],
(d1['x'],d1['y'])
)
d2
(
Graph :
x->D(x_n,float[x_n])
y->D(y_n)
v->A(float,x,y)
f_w->D(f_w_n,f_w_l)
f_a->A(float,x,y,f_w)
f_b->A(float,x,y)
g_w->D(g_w_n,g_w_l)
g_a->A(float,x,y,g_w)
g_b->A(float,x,y)
Tree :
{
'x':x,
'y':y,
'v':v,
'f':{
'w':f_w,
'a':f_a,
'b':f_b,
},
'g':{
'w':g_w,
'a':g_a,
'b':g_b,
}
}
)
```
Notice that both `f` and `g` output a dimension named 'w' but that they have different lengths and possibly different meanings.
Suppose I now want to run an analysis on `f`'s and `g`'s outputs, with a function `h` that takes two `a`'s and outputs a float.
Then `d3` looks like:
```
h : a1:float,a2:float->(
Graph:
r->A(float)
Tree:
  r
 )
d3=d2.copy()
d3['f_g_aa']=map(
h,
d2['f','a'],d2['g','a'],
(d2['x'],d2['y'],d2['f','w'],d2['g','w'])
)
d3
(
Graph :
x->D(x_n,float[x_n])
y->D(y_n)
v->A(float,x,y)
f_w->D(f_w_n,f_w_l)
f_a->A(float,x,y,f_w)
f_b->A(float,x,y)
g_w->D(g_w_n,g_w_l)
g_a->A(float,x,y,g_w)
g_b->A(float,x,y)
f_g_aa->A(float,x,y,f_w,g_w)
Tree :
{
'x':x,
'y':y,
'v':v,
'f':{
'w':f_w,
'a':f_a,
'b':f_b,
},
'g':{
'w':g_w,
'a':g_a,
'b':g_b,
},
'f_g_aa': f_g_aa
}
)
```
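The name clash is exactly what makes this awkward in a single flat `Dataset`: both `f` and `g` produce a dimension literally named `w`, with different lengths, so they cannot coexist until one (or both) is renamed. A sketch, assuming hypothetical DataArrays `f_a` and `g_a` that are the raw outputs of `f` and `g`, each carrying its own `w` dimension, and using multiplication as a stand-in for `h`:
```python
# Rename the colliding dimension before placing both results in one flat Dataset.
f_a_flat = f_a.rename({"w": "f_w"})
g_a_flat = g_a.rename({"w": "g_w"})

# With distinct names, plain broadcasting produces the (x, y, f_w, g_w) result
# that the tree calls f_g_aa; the multiplication stands in for h.
f_g_aa = f_a_flat * g_a_flat
```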
Compared to what I posted before, I dropped resolving an array's dimension by its position in the hierarchy, since that would be inapplicable when a variable refers to dimensions in a different branch of the tree.
https://github.com/pydata/xarray/issues/4118#issuecomment-1047944213 (2022-02-22, MEMBER):
Also thanks @OriolAbril, it's useful to have an ArviZ perspective.
> I was also wondering what changes (if any) would each option imply when using `apply_ufunc`
I see `apply_ufunc` as a `Variable`-level operation, i.e. it doesn't know about the relationship between different Variables unless you explicitly feed it multiple variables. Therefore, whether we choose model 1 or 2 probably doesn't affect `apply_ufunc` much.
In either case I imagine all we might need to do is slightly extend `apply_ufunc` to also map over the variables in a group of a tree if given one, and provide examples of using [`map_over_subtree`](https://github.com/TomNicholas/datatree/blob/3beff56f653ba430b41b0aab971571072ff30334/datatree/mapping.py#L106) or similar to map an `apply_ufunc` operation over multiple groups in a tree. If the user is trying to do something more complicated (like getting one variable from one level of a tree and another variable from another level, then feeding both into `apply_ufunc`), I would just make the user responsible for fetching the variables and for putting the results back into the intended place in the tree.
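A rough sketch of that pattern, assuming the prototype `map_over_subtree` decorator linked above and a made-up per-group function; it simply lifts a Dataset-to-Dataset operation (here built on `apply_ufunc`) so that it runs on every group of a tree:
```python
import xarray as xr
from datatree import map_over_subtree  # prototype decorator from the datatree package

@map_over_subtree
def demean(ds: xr.Dataset) -> xr.Dataset:
    # runs independently on the Dataset of each group in the tree
    return xr.apply_ufunc(lambda arr: arr - arr.mean(), ds)

# result_tree = demean(tree)  # tree: a DataTree; every group's variables get demeaned
```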
https://github.com/pydata/xarray/issues/4118#issuecomment-1047932340 (2022-02-22, MEMBER):
Hi @LunarLanding, thanks for your ideas!
> For this, it would make more sense to be able to have dimensions (with optional labels and coordinates) assigned to nodes (and these would be inherited by any descendants).
It sounds a bit like what you are suggesting is essentially a model in which dimensions are explicit objects, which can be referred to from other groups, like in netCDF. (NetCDF has "dimension IDs".)
This would be a bit of a departure from the model that `xarray.Dataset` currently uses, because right now dimensions aren't really unique entities; they are just a collective label for a shared dimension of a set of `Variable` objects.
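For reference, this is roughly how explicit dimension objects behave in the netCDF4-python API, where a variable in a subgroup can refer by name to a dimension defined in a parent group (file, group, and variable names made up for the sketch):
```python
from netCDF4 import Dataset

nc = Dataset("example.nc", "w")   # made-up file name
nc.createDimension("time", 10)    # a dimension object defined at the root group
grp = nc.createGroup("model_a")
# The subgroup variable refers to the parent group's "time" dimension by name.
temp = grp.createVariable("temperature", "f4", ("time",))
nc.close()
```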
> Often I run a function over a dataset, with each call outputting a hierarchical data structure, containing fixed dimensions in the best cases and variable length in the worst.
By "variable" length, do you mean that the length of dimensions differs between variables in the same group, or just that you don't know the length of the dimension in advance?
Is there a specific use case which you think would require explicit dimensions to solve?
https://github.com/pydata/xarray/issues/4118#issuecomment-1047915016 (2022-02-22, NONE):
Often I run a function over a dataset, with each call outputting a hierarchical data structure, containing fixed dimensions in the best cases and variable length in the worst.
For this, it would make more sense to be able to have dimensions (with optional labels and coordinates) assigned to nodes (and these would be inherited by any descendants). Leaf nodes would hold data.
On merge, dimensions could be bubbled up as long as their lengths (and labels) matched.
Operations with dimensions would then go down to the corresponding dimension level before applying the operator, i.e. `container['A/B'].mean('time')` would be different from `container['A'].mean('time')['B']`.
DataGroup and DataTree are subcases of this general structure, which could be enforced via flags/checks.
Option 1 is where the extremities of the tree are a node with two sets of child nodes: dimension labels and n-dimensional arrays.
Option 2 is where the extremities of the tree are a node with a child node for an n-dimensional array A, and a sibling node for each dimension of A, containing the corresponding labels.
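Purely to illustrate the two layouts, here is a sketch using plain dicts with made-up names:
```python
import numpy as np

time_labels = np.arange(5)
a_data = np.random.rand(5)

# Option 1: a leaf node has two sets of children, dimension labels and arrays.
option_1 = {
    "leaf": {
        "dims": {"time": time_labels},
        "arrays": {"A": a_data},
    }
}

# Option 2: the array A is a child node, and each of A's dimensions has a
# sibling node holding the corresponding labels.
option_2 = {
    "leaf": {
        "A": a_data,
        "time": time_labels,
    }
}
```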
I'm sure I'm missing some big issue with the mental model I have; for instance, I haven't thought about transformations at all, nor about coordinates. But for clarity I tried to write it down below.
The most general structure for a dataset I can think of is a directed graph.
Each node A is an n-dimensional (sparse) array, where each dimension D optionally points to a one-dimensional node B with the same length.
To get a hierarchical structure, we:
- add edges of a different color, each with a label
- restrict the graph they form to a tree T
- add labels to each dimension D
We can resolve D's target by (A) checking for a sibling in T with the same name; if none is found, going up one level and repeating (A).
Multi-indexes (multi-dimensional (sparse) labels) generalize this model, but require tuple labels in T's edges, i.e.:
h/j/a[x,y,z] has a sibling h/j/(x,y)[x,y], with z's labels being one level above, i.e. h/z[z] (the notation a[b] means a map from index b to value a).
https://github.com/pydata/xarray/issues/4118#issuecomment-1044853795 (2022-02-18, CONTRIBUTOR):
I am not sure I completely understand option 2, but option 1 seems a better fit for what we are doing at ArviZ (so far we are managing quite well with the InferenceData mentioned above, which is a collection of independent xarray datasets). In our case, well-defined selection for multiple variables at the same time (i.e. at the dataset level) is very useful.
I was also wondering what changes (if any) would each option imply when using `apply_ufunc`
https://github.com/pydata/xarray/issues/4118#issuecomment-1043638105 (2022-02-17, MEMBER):
> This is only true for flat netCDF files; once you introduce groups in a netCDF AND accept CF conventions, the DataGroup approach can map 100% of the files, while the DataTree approach fails on an (admittedly small) class of them.
@alexamici can you expand on the role of the CF conventions in this statement? Are you talking about CF conventions allowing one variable in one group to refer to a dimension present in another group, or something else?
https://github.com/pydata/xarray/issues/4118#issuecomment-1042769595 (2022-02-17, MEMBER):
> in the representation I use the fully qualified name for the dimension / coordinate, but the corresponding `DataArray` will use the basename, e.g. both arrays will have `lat` as a coordinate. Sorry for the confusion, I need to add more context to the README.
Thanks for clarifying. I'm wondering if that can be a source of misunderstanding. How should the user differentiate them? After all, those dimensions that share the name `lat` are different entities, and it should be possible to tell them apart somehow. I think I'm slowly getting to the bottom of this (representation in dictionaries, duplicate keys), and I really need to look into the implementation. I'll open an issue over at xarray-datagroup if I have more questions, to avoid cluttering the discussion here.
https://github.com/pydata/xarray/issues/4118#issuecomment-1042753800 (2022-02-17, MEMBER):
@kmuehlbauer in the representation I use the fully qualified name for the dimension / coordinate, but the corresponding `DataArray` will use the basename, e.g. both arrays will have `lat` as a coordinate. Sorry for the confusion, I need to add more context to the README.
https://github.com/pydata/xarray/issues/4118#issuecomment-1042731962 (2022-02-17, MEMBER):
@alexamici
> * in [netCDF following the CF conventions for groups](https://cfconventions.org/Data/cf-conventions/cf-conventions-1.9/cf-conventions.html#groups), it is legal for an array to refer to a dimension or a coordinate in a different group, so arrays in the same group may have dimensions with the same name but different sizes / coordinate values (this was the original motivation to explore the DataGroup approach)
I'm having difficulties understanding your point above with respect to the scoping rules from the CF document. Shouldn't it be impossible to create two arrays (in the same group) having dimensions with exactly the same name from different groups? Looking at the example here https://github.com/alexamici/xarray-datagroup there are coordinates with the names "/lat" vs "lat". Aren't those two different names? Maybe I'm missing something essential here.
https://github.com/pydata/xarray/issues/4118#issuecomment-1042656377 (2022-02-17, MEMBER):
@TomNicholas (cc @mraspaud)
> Do you have use cases which one of these designs could handle but the other couldn't?
The two main classes of on-disk formats I know of that cannot always be represented in the "group is a Dataset" approach are:
- in [netCDF following the CF conventions for groups](https://cfconventions.org/Data/cf-conventions/cf-conventions-1.9/cf-conventions.html#groups), it is legal for an array to refer to a dimension or a coordinate in a different group, so arrays in the same group may have dimensions with the same name but different sizes / coordinate values (this was the original motivation to explore the DataGroup approach)
- the current spec for the [Next-generation file formats (NGFF)](https://ngff.openmicroscopy.org) for bio-imaging has all scales of the same 5D data in the same group. (cc @joshmoore)
I don't have an example at hand, but my impression is that satellite products that use the HDF5 file format also place arrays with inconsistent dimensions / coordinates in the same group.
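The constraint these formats run into is that a single `xarray.Dataset` (and therefore a single group under the "group is a Dataset" model) cannot hold two arrays whose shared dimension name has different lengths; a minimal sketch with made-up names:
```python
import numpy as np
import xarray as xr

ds = xr.Dataset({"image": (("scale",), np.zeros(4))})
# Adding a second variable with a different length for "scale" is rejected:
# ds["image_lowres"] = (("scale",), np.zeros(8))  # raises ValueError (conflicting sizes for dimension 'scale')
```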
https://github.com/pydata/xarray/issues/4118#issuecomment-1042664227 (2022-02-17, MEMBER):
@TomNicholas I also have a few comments on the comparison:
> * **Option (1) - Each group is a Dataset**
>
> * Model maps more directly onto netCDF (though still not exactly, because netCDF has dimensions as separate objects)
This is only true for flat netCDF files; once you introduce groups in a netCDF AND accept CF conventions, the DataGroup approach can map 100% of the files, while the DataTree approach fails on an (admittedly small) class of them.
> * Enforcing consistency between variables guarantees certain operations are always well-defined (in particular selection via an integer index like in `.isel`).
> * Guarantees that all valid operations on a Dataset are also valid operations on a single group of a DataTree - so API can be essentially identical to Dataset.
Both points are only true for the DataArrays in a single group; once you broadcast an operation to subgroups, the two implementations would share the same limitations (dimensions in subgroups can be inconsistent in both cases).
In my opinion the advantage for the DataTree is minimal.
> * Metadata (i.e. `.attrs`) are arguably most useful when set at this level
The two approaches are identical in this respect: group attributes are mapped in the same way to DataTree and DataGroup.
I share your views on all other points.
https://github.com/pydata/xarray/issues/4118#issuecomment-1042660100 (2022-02-17, MEMBER):
One thing that came up in our discussion about this in the developer meeting today is that we could also pretty easily expose a "low level" API for IO using dictionaries of xarray.Variable objects. This intermediate representation could be useful for cleaning up data into a form suitable for conversion into Dataset objects.
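A sketch of what such an intermediate representation could look like; the paths and variable names are invented for illustration, while `xarray.Variable` is the existing low-level building block:
```python
import numpy as np
import xarray as xr

# Hypothetical "low level" form: a flat dict mapping paths to Variable objects.
raw = {
    "/group_a/time": xr.Variable(("time",), np.arange(10)),
    "/group_a/temperature": xr.Variable(("time",), 20.0 + np.random.rand(10)),
}

# After any cleaning, the variables of one group can be promoted to a Dataset.
ds = xr.Dataset({path.rsplit("/", 1)[-1]: var for path, var in raw.items()})
```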