Example of ideal datastructure
The data structure most useful for this kind of analysis is an arbitrary graph of n-dimensional arrays. Imposing hierarchical access on that graph allows optional organization, and the graph itself can exist as Python objects for nodes and references for edges.
If the tree is not needed, everything can be placed on the first level, as is done in a Dataset.
# Example:
## Notation
- `a:b` value `a` has type `b`
- `t[...,n,...]` : type of data array of values of type `t`, with axis of length `n`
- `D(n(,l))` dimension of size `n` with optional labels `l`
- `A(t,*(dims:tuple[D]))` : type of data array of values of type `t`, with dimensions `dims`
- a tree node `T` is either:
  - a dict from hashables to tree nodes, `dict[Hashable,T]`
  - a dimension `D`
  - a data array `A`
- `a[*tags]:=a[tags[0]][tags[1]]...[tags[len(tags)-1]]`
- `map(f,*args:A,dims:tuple[D])` maps `f` over `args` broadcasting over `dims`
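The notation above can be sketched in plain Python. This is only an illustration under my own assumptions: the names `Dim`, `Arr` and `get` are hypothetical and are not part of xarray. Nodes are ordinary objects, graph edges are references held in `dims`, and the tree is a nested dict.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(eq=False)  # compare by identity, so two dimensions named 'w' stay distinct
class Dim:
    size: int
    labels: Optional[Sequence] = None  # optional labels, one per position

    def __post_init__(self):
        if self.labels is not None and len(self.labels) != self.size:
            raise ValueError('labels must have one entry per position')


@dataclass(eq=False)
class Arr:
    dtype: type
    dims: Tuple[Dim, ...]  # graph edges: plain references to Dim objects


def get(node, *tags):
    # a[*tags] := a[tags[0]][tags[1]]... : walk the tree one key at a time
    for tag in tags:
        node = node[tag]
    return node


# d0 from the example below: x has labels, y does not, v spans both
x = Dim(3, labels=[0.0, 0.5, 1.0])
y = Dim(2)
v = Arr(float, (x, y))
d0 = {'x': x, 'y': y, 'v': v}
```

Here `get(d0, 'v').dims[0] is d0['x']` holds, which is the sense in which the tree only organizes access to a graph that exists independently of it.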
Start with a 2-dimensional DataArray:
```
d0
(
Graph : (
x->D(x_n,float[x_n])
y->D(y_n)
v->A(float,x,y)
)
Tree : (
{
'x':x,
'y':y,
'v':v,
}
)
)
```
Map a function `f` that introduces a new dimension `w` with constant labels `f_w_l:int[f_w_n]` (through map_blocks or apply_ufunc) and add it to d0:
```
f : x:float->(
Graph:
f_w->D(f_w_n,f_w_l)
a->A(float,f_w)
b->A(float)
Tree:
{
'w':f_w,
'a':a,
'b':b,
})
d1=d0.copy()
d1['f']=map(
f,
d0['v'],
(d0['x'],d0['y'])
)
d1
(
Graph :
x->D(x_n,float[x_n])
y->D(y_n)
v->A(float,x,y)
f_w->D(f_w_n,f_w_l)
f_a->A(float,x,y,f_w)
f_b->A(float,x,y)
Tree :
{
'x':x,
'y':y,
'v':v,
'f':{
'w':f_w,
'a':f_a,
'b':f_b,
}
}
)
```
Map a function `g` whose output has a dimension of the same name but a different meaning, and therefore possibly a different length `g_w_n` and different labels `g_w_l`:
```
g : x:float->(
Graph:
g_w->D(g_w_n,g_w_l)
a->A(float,g_w)
b->A(float)
Tree:
{
'w':g_w,
'a':a,
'b':b,
})
d2=d1.copy()
d2['g']=map(
g,
d1['v'],
(d1['x'],d1['y'])
)
d2
(
Graph :
x->D(x_n,float[x_n])
y->D(y_n)
v->A(float,x,y)
f_w->D(f_w_n,f_w_l)
f_a->A(float,x,y,f_w)
f_b->A(float,x,y)
g_w->D(g_w_n,g_w_l)
g_a->A(float,x,y,g_w)
g_b->A(float,x,y)
Tree :
{
'x':x,
'y':y,
'v':v,
'f':{
'w':f_w,
'a':f_a,
'b':f_b,
},
'g':{
'w':g_w,
'a':g_a,
'b':g_b,
}
}
)
```
Notice that both `f` and `g` output a dimension named 'w' but that they have different lengths and possibly different meanings.
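A shape-only sketch of this step, under my own assumptions: `Dim`, `Arr`, `map_over` and `output_tree` are hypothetical names, and no real computation is performed. It only shows how the mapped dimensions get prepended to every output array while each branch keeps its own `w`.

```python
class Dim:
    def __init__(self, size, labels=None):
        self.size, self.labels = size, labels


class Arr:
    def __init__(self, dtype, *dims):
        self.dtype, self.dims = dtype, tuple(dims)


def map_over(outputs, dims):
    # Prepend the mapped dimensions to every array in the output tree.
    if isinstance(outputs, dict):
        return {key: map_over(value, dims) for key, value in outputs.items()}
    if isinstance(outputs, Arr):
        return Arr(outputs.dtype, *dims, *outputs.dims)
    return outputs  # a Dim is shared by reference, not broadcast


def output_tree(w_size):
    # Stand-in for what f or g returns for one scalar input.
    w = Dim(w_size, labels=list(range(w_size)))
    return {'w': w, 'a': Arr(float, w), 'b': Arr(float)}


def shape(arr):
    return tuple(d.size for d in arr.dims)


x, y = Dim(4), Dim(3)
d2 = {'x': x, 'y': y, 'v': Arr(float, x, y)}
d2['f'] = map_over(output_tree(5), (x, y))  # f_w has length 5
d2['g'] = map_over(output_tree(7), (x, y))  # g_w has length 7

print(shape(d2['f']['a']), shape(d2['g']['a']))  # (4, 3, 5) (4, 3, 7)
```

Both branches expose a key `'w'`, but `d2['f']['w']` and `d2['g']['w']` are different objects, so no length conflict arises.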
Suppose I now want to analyze the outputs of `f` and `g` with a function `h` that takes two `a` values and outputs a float.
Then d3 looks like:
```
h : a1:float,a2:float->(
Graph:
r->A(float)
Tree:
r
)
d3=d2.copy()
d3['f_g_aa']=map(
h,
d2['f','a'],d2['g','a'],
(d2['x'],d2['y'],d2['f','w'],d2['g','w'])
)
d3
{
Graph :
x->D(x_n,float[x_n])
y->D(y_n)
v->A(float,x,y)
f_w->D(f_w_n,f_w_l)
f_a->A(float,x,y,f_w)
f_b->A(float,x,y)
g_w->D(g_w_n,g_w_l)
g_a->A(float,x,y,g_w)
g_b->A(float,x,y)
f_g_aa->A(float,x,y,f_w,g_w)
Tree :
{
'x':x,
'y':y,
'v':v,
'f':{
'w':f_w,
'a':f_a,
'b':f_b,
},
'g':{
'w':g_w,
'a':g_a,
'b':g_b,
},
'f_g_aa': f_g_aa
}
}
```
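The dimensions of `f_g_aa` follow from broadcasting the two inputs by dimension identity rather than by name. A minimal sketch, assuming hypothetical names `Dim` and `broadcast_dims` that are not an existing xarray API:

```python
class Dim:
    def __init__(self, name, size):
        self.name, self.size = name, size


def broadcast_dims(*dim_tuples):
    # Union of dimensions in first-seen order, compared by identity, not by name.
    out = []
    for dims in dim_tuples:
        for d in dims:
            if not any(d is seen for seen in out):
                out.append(d)
    return tuple(out)


x, y = Dim('x', 4), Dim('y', 3)
f_w, g_w = Dim('w', 5), Dim('w', 7)  # same name, different dimensions

f_a_dims = (x, y, f_w)
g_a_dims = (x, y, g_w)

result = broadcast_dims(f_a_dims, g_a_dims)
print([(d.name, d.size) for d in result])
# [('x', 4), ('y', 3), ('w', 5), ('w', 7)]
```

A name-based broadcast would try to align the two `w` dimensions and fail on the length mismatch; resolving by reference keeps them as two separate axes.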
Compared to what I posted before, I dropped resolving the dimensions of an array by its position in the hierarchy, since that would be inapplicable when a variable refers to dimensions in a different branch of the tree.
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,628719058
https://github.com/pydata/xarray/issues/4118#issuecomment-1047915016,https://api.github.com/repos/pydata/xarray/issues/4118,1047915016,IC_kwDOAMm_X84-deoI,4441338,2022-02-22T15:30:00Z,2022-02-22T15:38:52Z,NONE,"Often I run a function over a dataset, with each call outputting a hierarchical data structure that contains fixed-length dimensions in the best case and variable-length ones in the worst.
For this, it would make more sense to have dimensions (with optional labels and coordinates) assigned to nodes, where they would be inherited by any descendants. Leaf nodes would hold data.
On merge, dimensions could be bubbled up as long as their lengths (and labels) matched.
Operations involving dimensions would then go down to the corresponding dimension level before applying the operator, i.e. `container['A/B'].mean('time')` would be different from `container['A'].mean('time')['B']`.
Datagroup and Datatree are subcases of this general structure, which could be enforced via flags/checks.
Option 1 is where the extremities of the tree are a node with two sets of child nodes, dimension labels and n-dimensional arrays.
Option 2 is where the extremities of the tree are a node with a child node for a n-dimensional array A, and a sibling node for each dimension of A, containing the corresponding labels.
I'm sure I'm missing some big issue with the mental model I have; for instance, I haven't thought about transformations or coordinates at all. But for clarity I tried to write it down below.
The most general structure for a dataset I can think of is a directed graph.
Each node A is an n-dimensional (sparse) array, where each dimension D optionally points to a one-dimensional node B with the same length.
To get a hierarchical structure, we:
- add edges of a different color, each with a label
- restrict their graph to a tree T
- add labels to each dimension D
We can resolve D's target by (A) checking for a sibling in T with the same name and, if none is found, going up one level and repeating from (A).
MultiIndexes (multi-dimensional (sparse) labels) generalize this model, but require tuple labels on T's edges, e.g.:
h/j/a[x,y,z] has a sibling h/j/(x,y)[x,y], with z's labels one level above, i.e. h/z[z] (the notation a[b] means a map from index b to value a).
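The sibling-then-parent resolution rule could look roughly like the following. This is a sketch under my own assumptions: the tree is a nested dict, paths are tuples of keys, and `resolve` is a hypothetical helper. It does not cover the tuple-labelled MultiIndex case.

```python
def resolve(tree, path, dim_name):
    # Return the path of the nearest node named dim_name, searching the
    # siblings of `path` first and then each ancestor level; None if absent.
    parent = tuple(path[:-1])
    while True:
        level = tree
        for key in parent:
            level = level[key]
        if dim_name in level:
            return parent + (dim_name,)
        if not parent:
            return None  # reached the root without a match
        parent = parent[:-1]


tree = {
    'h': {
        'z': [10, 20],  # labels for z, one level above a
        'j': {
            'x': [0, 1, 2],  # labels for x, sibling of a
            'a': 'array with dimensions (x, z)',
        },
    },
}

print(resolve(tree, ('h', 'j', 'a'), 'x'))  # ('h', 'j', 'x')
print(resolve(tree, ('h', 'j', 'a'), 'z'))  # ('h', 'z')
print(resolve(tree, ('h', 'j', 'a'), 't'))  # None
```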
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,628719058
https://github.com/pydata/xarray/issues/4628#issuecomment-979412822,https://api.github.com/repos/pydata/xarray/issues/4628,979412822,IC_kwDOAMm_X846YKdW,4441338,2021-11-25T18:23:28Z,2021-11-25T18:23:28Z,NONE,Any pointers regarding where to start / modules involved to implement this? I would like to have a try.,"{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,753852119
https://github.com/pydata/xarray/pull/6028#issuecomment-979320921,https://api.github.com/repos/pydata/xarray/issues/6028,979320921,IC_kwDOAMm_X846X0BZ,4441338,2021-11-25T15:54:40Z,2021-11-25T15:54:40Z,NONE,"@keewis thanks, it is a duplicate. it must work in my specific case because I was testing with just one file.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1063661422
https://github.com/pydata/xarray/issues/5228#issuecomment-828355557,https://api.github.com/repos/pydata/xarray/issues/5228,828355557,MDEyOklzc3VlQ29tbWVudDgyODM1NTU1Nw==,4441338,2021-04-28T10:46:59Z,2021-04-28T10:46:59Z,NONE,"I think I should not have expected a simple slice to make the choices that are necessary for evenly sampling a possibly irregular index. Assuming most users won't, I'm closing this.
Below is what I'm using now.
```python
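import numpy as np

def make_intervals(a, b, c):
    # Evenly spaced points a, a+c, a+2c, ... that do not exceed b.
    return a + np.arange(0, 1 + int(np.floor((b - a) / c))) * c

# `d` is the DataArray being sampled; take the nearest label within the tolerance.
print(d.sel(dim0=make_intervals(0, 1, .1), method='nearest', tolerance=.01))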
```
```