issue_comments: 1059382908


Comment by user 4441338 on pydata/xarray#4118: https://github.com/pydata/xarray/issues/4118#issuecomment-1059382908 (created 2022-03-04T17:46:03Z, updated 2022-03-04T18:14:19Z, author association: NONE)

> > Often I run a function over a dataset, with each call outputting a hierarchical data structure, containing fixed dimensions in the best cases and variable length in the worst.

By "variable" length, do you mean that the length of dimensions differs between variables in the same group, or just that you don't know the length of the dimension in advance?

I mean that I might have, for instance, a map from 2 variables to data, i.e. (x,y)->c, that I can write as a DataArray XY with two dimensions x and y and values c. Then I have a function f such that f(c)->d[g(c)], i.e. it yields an array whose length depends on c. I wish I could say: apply f to XY, building a variable-length array as you get the output. It could be stored as a sparse matrix (X,Y,G). This is a bit out of scope for this discussion, but it is related, since creating a differently named group per dimension length is often mentioned as a workaround (which does not scale when you have 1000 × (variable-length dimension) data).
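For concreteness, here is a minimal sketch of the padding workaround in plain numpy/xarray. `f` and its output length are made-up stand-ins; the ragged outputs are padded with NaN into a dense `(x, y, g)` array, which could equally be stored as a sparse array.

```python
import numpy as np
import xarray as xr

# Hypothetical f: returns an array whose length depends on the input value c.
def f(c):
    n = int(abs(c) * 3) + 1          # g(c): output length depends on c
    return np.arange(n, dtype=float) * c

# A small 2-D DataArray XY with dimensions x and y and values c.
XY = xr.DataArray(np.random.rand(4, 5), dims=("x", "y"), name="c")

# Workaround today: apply f pointwise, pad every result to the longest
# output, and store the ragged results as a dense (x, y, g) array.
results = [[f(c) for c in row] for row in XY.values]
g_max = max(len(r) for row in results for r in row)
padded = np.full(XY.shape + (g_max,), np.nan)
for i, row in enumerate(results):
    for j, r in enumerate(row):
        padded[i, j, : len(r)] = r

D = xr.DataArray(padded, dims=("x", "y", "g"), name="d")
```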

> Is there a specific use case which you think would require explicit dimensions to solve?

The use case is iteratively adding values to a dataset by mapping functions over multiple variables/dimensions in arbitrary compositions. This happens in the context of data analysis, where you start with some source data, iteratively create analysis functions, and then want to query, display, or run statistics/reductions on the set of original data plus analysis results. Explicit hierarchical dimensions allow merging and referring to data without collisions in a single datatree/group.
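As a rough illustration of the collision problem (not of the proposal itself): two analysis outputs that both use a dimension named `w` with different lengths cannot live in one flat Dataset, but separate groups scope each `w` independently. This sketch assumes a recent xarray that ships `DataTree` (at the time of this comment it lived in the separate xarray-datatree package).

```python
import numpy as np
import xarray as xr

# Two analysis outputs that both use a dimension named "w",
# but with different lengths (3 vs 5).
ds_f = xr.Dataset({"a": (("x", "w"), np.zeros((4, 3)))})
ds_g = xr.Dataset({"a": (("x", "w"), np.zeros((4, 5)))})

# A single flat Dataset cannot hold both: conflicting sizes for dimension "w".
try:
    xr.Dataset({"f_a": ds_f["a"], "g_a": ds_g["a"]})
except ValueError as err:
    print(err)

# Separate groups give each "w" its own scope, so nothing collides.
tree = xr.DataTree.from_dict({"f": ds_f, "g": ds_g})
print(tree)
```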

PS: in netCDF-4, dimensions are seen by child groups, which matches what I previously posted; in HDF5, nodes are hard links to the actual data, which might be exactly the xarray-datagroup posted above.
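A quick way to see the netCDF-4 scoping rule is with the netCDF4-python package: a dimension created in the root group is visible to variables created in child groups (the filename below is just an example).

```python
from netCDF4 import Dataset

# A dimension created in the root group is visible to child groups,
# so a variable in /analysis can be defined over the root-level "x".
nc = Dataset("scoping_demo.nc", "w")           # example filename
nc.createDimension("x", 10)
child = nc.createGroup("analysis")
var = child.createVariable("a", "f8", ("x",))  # uses the parent dimension "x"
print(var.shape)                               # (10,)
nc.close()
```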

Example of ideal datastructure

The data structure that is most useful for this kind of analysis is an arbitrary graph of n-dimensional arrays; forcing the graph to have hierarchical access allows optional organization; the graph itself can exist as python objects for nodes and references for edges. If the tree is not necessary/required, everything can be placed on the first level, as it is done on a Dataset.

# Example:

## Notation

- `a:b` value `a` has type `b`
- `t[...,n,...]` : type of data array of values of type `t`, with an axis of length `n`
- `D(n(,l))` dimension of size `n` with optional labels `l`
- `A(t,*(dims:tuple[D]))` : type of data array of values of type `t`, with dimensions `dims`
- a tree node `T` is either:
  - a dict from hashables to tree nodes, `dict[Hashable,T]`
  - a dimension `D`
  - a data array `A`
- `a[*tags]:=a[tag[0]][tag[1]]...[tag[len(tag)-1]]`
- `map(f,*args:A,dims:tuple[D])` maps `f` over `args`, broadcasting over `dims`

Start with a 2-dimensional DataArray:

```
d0
(
Graph :
    (
    x->D(x_n,float[x_n])
    y->D(y_n)
    v->A(float,x,y)
    )
Tree :
    (
    {
    'x':x,
    'y':y,
    'v':v,
    }
    )
)
```

Map a function `f` that introduces a new dimension `w` with constant labels `f_w_l:int[f_w_n]` (through map_blocks or apply_ufunc) and add it to d0:

```
f : x:float->(
    Graph:
        f_w->D(f_w_n,f_w_l)
        a->A(float,f_w)
        b->A(float)
    Tree:
        {
        'w':f_w,
        'a':a,
        'b':b,
        })

d1=d0.copy()
d1['f']=map(
    f,
    d0['v'],
    (d0['x'],d0['y'])
)

d1
(
Graph :
    x->D(x_n,float[x_n])
    y->D(y_n)
    v->A(float,x,y)
    f_w->D(f_w_n,f_w_l)
    f_a->A(float,x,y,f_w)
    f_b->A(float,x,y)
Tree :
    {
    'x':x,
    'y':y,
    'v':v,
    'f':{
        'w':f_w,
        'a':f_a,
        'b':f_b,
        }
    }
)
```

Map a function `g` that has a dimension of the same name but different meaning, and therefore possibly different length `g_w_n` and labels `g_w_l`:

```
g : x:float->(
    Graph:
        g_w->D(g_w_n,g_w_l)
        a->A(float,g_w)
        b->A(float)
    Tree:
        {
        'w':g_w,
        'a':a,
        'b':b,
        })

d2=d1.copy()
d2['g']=map(
    g,
    d1['v'],
    (d1['x'],d1['y'])
)

d2
(
Graph :
    x->D(x_n,float[x_n])
    y->D(y_n)
    v->A(float,x,y)
    f_w->D(f_w_n,f_w_l)
    f_a->A(float,x,y,f_w)
    f_b->A(float,x,y)
    g_w->D(g_w_n,g_w_l)
    g_a->A(float,x,y,g_w)
    g_b->A(float,x,y)
Tree :
    {
    'x':x,
    'y':y,
    'v':v,
    'f':{
        'w':f_w,
        'a':f_a,
        'b':f_b,
        },
    'g':{
        'w':g_w,
        'a':g_a,
        'b':g_b,
        }
    }
)
```

Notice that both `f` and `g` output a dimension named 'w', but that these dimensions have different lengths and possibly different meanings. Suppose I now want to run analysis on f's and g's output, with a function `h` that takes two a's and outputs a float. Then d3 looks like:

```
h : a1:float,a2:float->(
    Graph:
        r->A(float)
    Tree:
        r
    )

d3=d2.copy()
d3['f_g_aa']=map(
    h,
    d2['f','a'],d2['g','a'],
    (d2['x'],d2['y'],d2['f','w'],d2['g','w'])
)

d3
{
Graph :
    x->D(x_n,float[x_n])
    y->D(y_n)
    v->A(float,x,y)
    f_w->D(f_w_n,f_w_l)
    f_a->A(float,x,y,f_w)
    f_b->A(float,x,y)
    g_w->D(g_w_n,g_w_l)
    g_a->A(float,x,y,g_w)
    g_b->A(float,x,y)
    f_g_aa->A(float,x,y,f_w,g_w)
Tree :
    {
    'x':x,
    'y':y,
    'v':v,
    'f':{
        'w':f_w,
        'a':f_a,
        'b':f_b,
        },
    'g':{
        'w':g_w,
        'a':g_a,
        'b':g_b,
        },
    'f_g_aa': f_g_aa
    }
}
```

Compared to what I posted before, I dropped resolving the dimension for an array by its position in the hierarchy, since it would be inapplicable when a variable refers to dimensions in a different branch of the tree.
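The `d1` step above (a function that adds a fixed-length dimension `w`) can already be approximated with `xarray.apply_ufunc`, at the cost of flattening the would-be group `f` into prefixed variable names. This is only a rough sketch of that workaround; `f`, `f_w_n`, `f_a`, and `f_b` are made-up stand-ins.

```python
import numpy as np
import xarray as xr

f_w_n = 4  # made-up fixed length of f's new dimension "w"

# Hypothetical f: maps one float c to an array of length f_w_n plus a scalar.
def f(c):
    return np.linspace(0.0, c, f_w_n), float(c) ** 2

d0 = xr.Dataset({"v": (("x", "y"), np.random.rand(3, 5))})

# Vectorize f over every (x, y) element of v; "w" is a new output core dim.
a, b = xr.apply_ufunc(
    f,
    d0["v"],
    output_core_dims=[["w"], []],
    vectorize=True,
)

# Flat, prefixed names stand in for the group 'f' of the example above.
d1 = d0.assign(f_a=a, f_b=b)
```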
