issue_comments


12 rows where author_association = "NONE" and issue = 628719058 sorted by updated_at descending


user 7

  • joshmoore 3
  • tacaswell 3
  • LunarLanding 2
  • jakirkham 1
  • nbercher 1
  • martinitus 1
  • emilbiju 1

issue 1

  • Feature Request: Hierarchical storage and processing in xarray · 12 ✖

author_association 1

  • NONE · 12 ✖
id html_url issue_url node_id user created_at updated_at ▲ author_association body reactions performed_via_github_app issue
1198743015 https://github.com/pydata/xarray/issues/4118#issuecomment-1198743015 https://api.github.com/repos/pydata/xarray/issues/4118 IC_kwDOAMm_X85Hc13n jakirkham 3019665 2022-07-29T00:14:46Z 2022-07-29T00:14:46Z NONE

Wanted to note issue ( https://github.com/carbonplan/ndpyramid/issues/10 ) here, which may be of interest to people here.

Also we are thinking about a Dask blogpost in this space if people have thoughts on what we should include and/or are interested in being involved. Details in issue ( https://github.com/dask/dask-blog/issues/141 ).

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Feature Request: Hierarchical storage and processing in xarray 628719058
1059499222 https://github.com/pydata/xarray/issues/4118#issuecomment-1059499222 https://api.github.com/repos/pydata/xarray/issues/4118 IC_kwDOAMm_X84_JqzW tacaswell 199813 2022-03-04T20:25:47Z 2022-03-04T20:25:47Z NONE

@LunarLanding You may also be interested in awkward array.

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Feature Request: Hierarchical storage and processing in xarray 628719058
1059382908 https://github.com/pydata/xarray/issues/4118#issuecomment-1059382908 https://api.github.com/repos/pydata/xarray/issues/4118 IC_kwDOAMm_X84_JOZ8 LunarLanding 4441338 2022-03-04T17:46:03Z 2022-03-04T18:14:19Z NONE

Often I run a function over a dataset, with each call outputting a hierarchical data structure, containing fixed dimensions in the best cases and variable length in the worst.

By "variable" length, do you mean that the length of dimensions differs between variables in the same group, or just that you don't know the length of the dimension in advance?

I mean that I might have, for instance, a map from two variables to data, i.e. (x,y)->c, that I can write as a DataArray XY with two dimensions x and y and the values being c. Then I have a function f so that f(c)->d[g(c)], i.e. it yields an array whose length depends on c. I wish I could say: apply f to XY, building a variable-length array as you get the output. It could be stored as a sparse matrix (X,Y,G). This is a bit out of scope for this discussion, but it is related, since creating a differently named group per dimension length is often mentioned as a workaround (which does not scale when you have 1000 x (variable-length dimension) data).
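A minimal sketch of the sparse (X,Y,G) storage idea, using only plain Python dictionaries. The function `f` and the sample values in `XY` are hypothetical stand-ins chosen for illustration; they are not taken from the discussion.

```python
def f(c):
    """Hypothetical analysis function whose output length depends on c."""
    return [c * k for k in range(c)]

# XY maps a pair of coordinates (x, y) to a value c.
XY = {(0, 0): 1, (0, 1): 3, (1, 0): 0, (1, 1): 2}

# Sparse storage: one entry per (x, y, g), where g indexes the
# variable-length output of f. Positions that f never produces are
# simply absent, so no padding values are stored.
sparse = {
    (x, y, g): d
    for (x, y), c in XY.items()
    for g, d in enumerate(f(c))
}

# Length of the variable dimension at each (x, y).
lengths = {xy: len(f(c)) for xy, c in XY.items()}
```

Only the entries that exist are stored, which is the property that a dense array padded to the longest output would lose.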

Is there a specific use case which you think would require explicit dimensions to solve?

The use-case is iteratively adding values to a dataset by mapping functions over multiple variables / dimensions in arbitrary compositions. This happens in the context of data analysis, where you start with some source data and then iteratively create analysis functions, and then want to query / display / do statistics/reductions on the set of original data + analysis. Explicit hierarchical dimensions allow for merging and referring to data with no collisions in a single datatree/group.

PS: in netCDF-4, dimensions are seen by children, which matches what I previously posted; in HDF5, nodes are hard links to the actual data, which might be exactly the xarray-datagroup posted above.

Example of ideal datastructure

The data structure that is most useful for this kind of analysis is an arbitrary graph of n-dimensional arrays; forcing the graph to have hierarchical access allows optional organization; the graph itself can exist as Python objects for nodes and references for edges. If the tree is not necessary/required, everything can be placed on the first level, as is done in a Dataset.

# Example:

## Notation

- `a:b` value `a` has type `b`
- `t[...,n,...]` : type of data array of values of type `t`, with axis of length `n`
- `D(n(,l))` dimension of size `n` with optional labels `l`
- `A(t,*(dims:tuple[D]))` : type of data array of values of type `t`, with dimension `dims`
- a tree node `T` is either:
  - a dict from hashables to tree nodes, `dict[Hashable,T]`
  - a dimension `D`
  - a data array `A`
- `a[*tags]:=a[tag[0]][tag[1]]...[tag[len(tag)-1]]`
- `map(f,*args:A,dims:tuple[D])` maps `f` over `args` broadcasting over `dims`

Start with a two-dimensional DataArray:

```
d0
(
  Graph : (
    x->D(x_n,float[x_n])
    y->D(y_n)
    v->A(float,x,y)
  )
  Tree : (
    {
      'x':x,
      'y':y,
      'v':v,
    }
  )
)
```

Map a function `f` that introduces a new dimension `w` with constant labels `f_w_l:int[f_w_n]` (through map_blocks or apply_ufunc) and add it to d0:

```
f : x:float->(
  Graph:
    f_w->D(f_w_n,f_w_l)
    a->A(float,f_w)
    b->A(float)
  Tree:
    {
      'w':f_w,
      'a':a,
      'b':b,
    })

d1=d0.copy()
d1['f']=map(
  f,
  d0['v'],
  (d0['x'],d0['y'])
)
d1
(
  Graph :
    x->D(x_n,float[x_n])
    y->D(y_n)
    v->A(float,x,y)
    f_w->D(f_w_n,f_w_l)
    f_a->A(float,x,y,f_w)
    f_b->A(float,x,y)
  Tree :
    {
      'x':x,
      'y':y,
      'v':v,
      'f':{
        'w':f_w,
        'a':f_a,
        'b':f_b,
      }
    }
)
```

Map a function `g` that has a dimension of the same name but different meaning, and therefore possibly different length `g_w_n` and labels `g_w_l`:

```
g : x:float->(
  Graph:
    g_w->D(g_w_n,g_w_l)
    a->A(float,g_w)
    b->A(float)
  Tree:
    {
      'w':g_w,
      'a':a,
      'b':b,
    })

d2=d1.copy()
d2['g']=map(
  g,
  d1['v'],
  (d1['x'],d1['y'])
)
d2
(
  Graph :
    x->D(x_n,float[x_n])
    y->D(y_n)
    v->A(float,x,y)
    f_w->D(f_w_n,f_w_l)
    f_a->A(float,x,y,f_w)
    f_b->A(float,x,y)
    g_w->D(g_w_n,g_w_l)
    g_a->A(float,x,y,g_w)
    g_b->A(float,x,y)
  Tree :
    {
      'x':x,
      'y':y,
      'v':v,
      'f':{
        'w':f_w,
        'a':f_a,
        'b':f_b,
      },
      'g':{
        'w':g_w,
        'a':g_a,
        'b':g_b,
      }
    }
)
```

Notice that both `f` and `g` output a dimension named 'w' but that they have different lengths and possibly different meanings.

Suppose I now want to run analysis on f's and g's output, with a function that takes two a's and outputs a float. Then d3 looks like:

```
h : a1:float,a2:float->(
  Graph:
    r->A(float)
  Tree:
    r
)

d3=d2.copy()
d3['f_g_aa']=map(
  h,
  d2['f','a'],d2['g','a'],
  (d2['x'],d2['y'],d2['f','w'],d2['g','w'])
)
d3
(
  Graph :
    x->D(x_n,float[x_n])
    y->D(y_n)
    v->A(float,x,y)
    f_w->D(f_w_n,f_w_l)
    f_a->A(float,x,y,f_w)
    f_b->A(float,x,y)
    g_w->D(g_w_n,g_w_l)
    g_a->A(float,x,y,g_w)
    g_b->A(float,x,y)
    f_g_aa->A(float,x,y,f_w,g_w)
  Tree :
    {
      'x':x,
      'y':y,
      'v':v,
      'f':{
        'w':f_w,
        'a':f_a,
        'b':f_b,
      },
      'g':{
        'w':g_w,
        'a':g_a,
        'b':g_b,
      },
      'f_g_aa': f_g_aa
    }
)
```

Compared to what I posted before, I dropped resolving the dimension for an array by its position in the hierarchy, since it would be inapplicable when a variable refers to dimensions in a different branch of the tree.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Feature Request: Hierarchical storage and processing in xarray 628719058
1047915016 https://github.com/pydata/xarray/issues/4118#issuecomment-1047915016 https://api.github.com/repos/pydata/xarray/issues/4118 IC_kwDOAMm_X84-deoI LunarLanding 4441338 2022-02-22T15:30:00Z 2022-02-22T15:38:52Z NONE

Often I run a function over a dataset, with each call outputting a hierarchical data structure, containing fixed dimensions in the best cases and variable length in the worst. For this, it would make more sense to be able to have dimensions (with optional labels and coordinates) assigned to nodes (and these would be inherited by any descendants). Leaf nodes would hold data. On merge, dimensions could be bubbled up as long as length (and labels) matched. Operations with dimensions would then go down to the corresponding dimension level before applying the operator, i.e. container['A/B'].mean('time') would be different from container['A'].mean('time')['B'].

Datagroup and Datatree are subcases of this general structure, which could be enforced via flags/checks. Option 1 is where the extremities of the tree are a node with two sets of child nodes, dimension labels and n-dimensional arrays. Option 2 is where the extremities of the tree are a node with a child node for a n-dimensional array A, and a sibling node for each dimension of A, containing the corresponding labels.

I'm sure I'm missing some big issue with the mental model I have, for instance I haven't thought of transformations at all and about coordinates. But for clarity I tried to write it down below.

The most general structure for a dataset I can think of is a directed graph. Each node A is a n-dimensional (sparse) array, where each dimension D points optionally to a one-dimensional node B with the same length.

To get a hierarchical structure, we:

  • add edges of a different color, each with a label
  • restrict their graph to a tree T
  • add labels to each dimension D

We can resolve D's target by (A) checking for a sibling in T with the same name and, if none is found, going up one level and repeating (A).
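The lookup rule above can be sketched as follows, assuming the tree T is represented as nested Python dicts whose leaves are arrays or dimension labels. The function name `resolve_dimension` and the sample tree are hypothetical and only illustrate the "check siblings, then go up one level" search.

```python
# Hypothetical tree: nested dicts; any non-dict value is a leaf.
tree = {
    "z": "labels for z",
    "h": {
        "x": "labels for x (h level)",
        "j": {
            "x": "labels for x (j level)",
            "a": "array with dims (x, z)",
        },
    },
}

def resolve_dimension(tree, path, dim):
    """Find the leaf that labels `dim` for the array located at `path`.

    Look for a sibling named `dim` at the array's own level; if there is
    none, go up one level and repeat. Return None if the root is passed
    without finding a match.
    """
    # Chain of containers from the root down to the array's parent.
    levels = [tree]
    for key in path[:-1]:
        levels.append(levels[-1][key])
    # Search from the innermost container outward.
    for level in reversed(levels):
        if dim in level and not isinstance(level[dim], dict):
            return level[dim]
    return None
```

With this tree, the array at `h/j/a` picks up the `x` labels defined next to it rather than the ones one level up, and falls back to the root for `z`.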

Multi-indexes (multi-dimensional (sparse) labels) generalize this model, but require tuple labels in T's edges, i.e. h/j/a[x,y,z] has a sibling h/j/(x,y)[x,y], with z's labels being one level above, i.e. h/z[z] (the notation a[b] means a map from index b to value a).

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Feature Request: Hierarchical storage and processing in xarray 628719058
876397215 https://github.com/pydata/xarray/issues/4118#issuecomment-876397215 https://api.github.com/repos/pydata/xarray/issues/4118 MDEyOklzc3VlQ29tbWVudDg3NjM5NzIxNQ== martinitus 7611856 2021-07-08T12:27:58Z 2021-07-08T12:27:58Z NONE

As a user who (so far) does not use any netCDF or HDF5 features of xarray, I obviously would not like to have an otherwise potentially useful feature blocked by restrictions imposed by netCDF or HDF5 ;-).

That said, I think @tacaswell's comment about round trips is very reasonable and such invariants should be maintained! It would be extremely confusing for users if netcdf -> xarray -> netcdf is not a "no-op". The same obviously holds true for any other storage format. As a user I would generally expect something like the following:

    a1 = xarray.load("foo.myformat")
    xarray.save(a1, "bar.myformat")
    a2 = xarray.load("bar.myformat")
    assert a1 == a2, "Why should they not be exactly equal?!?"

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Feature Request: Hierarchical storage and processing in xarray 628719058
875121115 https://github.com/pydata/xarray/issues/4118#issuecomment-875121115 https://api.github.com/repos/pydata/xarray/issues/4118 MDEyOklzc3VlQ29tbWVudDg3NTEyMTExNQ== tacaswell 199813 2021-07-06T22:21:02Z 2021-07-06T22:21:02Z NONE

That sounds right to me -- a downside of tags is that they can't be (uniquely) expressed in a hierarchical arrangement like those found in HDF5/netCDF4 files.

HDF5 allows for internal links, so datasets and groups can appear in multiple places in the tree. You can even make cycles where groups are in themselves (or their children). The NeXuS format (the xray/neutron one) makes heavy use of this to let data appear both where it "makes sense" from a science point of view and from an instrumentation point of view.

I think it is reasonable to expect that netcdf -> xarray -> netcdf always works; however, I think it is unreasonable to ask that xarray -> netcdf -> xarray will always work. I think it is OK if xarray can express more complex relationships and structures than you can in netcdf (or hdf5 or any existing at-rest format). In an extreme case, consider an interface to a database that returns xarrays 😈 .

{
    "total_count": 4,
    "+1": 4,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Feature Request: Hierarchical storage and processing in xarray 628719058
845845472 https://github.com/pydata/xarray/issues/4118#issuecomment-845845472 https://api.github.com/repos/pydata/xarray/issues/4118 MDEyOklzc3VlQ29tbWVudDg0NTg0NTQ3Mg== nbercher 6772352 2021-05-21T10:15:13Z 2021-05-21T10:17:00Z NONE

A simple comment/question:

In xarray.Dataset, why not just use the Unix-path notation into a "flat" dict model?

Actually, netCDF4 implements this Unix-like path access to groups and variables: /path/to/group/variable.

All of the hierarchical stuff (e.g., getting a sub-Dataset from a random group) and conventions (e.g., the dimension scoping rule) would then be driven by the parsing of strings only. It's all about symbolic names (like in a file system, right?) and there would no longer be any hierarchical data in memory.
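A rough sketch of what this "flat" model could look like, using a plain Python dict keyed by Unix-style paths. The helper names `subgroup` and `children` and the sample keys are hypothetical; this is not how xarray.Dataset is implemented.

```python
# Flat storage: every variable is keyed by its full path.
flat = {
    "/time": [0, 1, 2],
    "/weather/temperature": [280.0, 281.5, 279.9],
    "/weather/station/elevation": [12.0],
    "/ocean/salinity": [35.1, 35.0, 34.9],
}

def subgroup(flat, group):
    """Return the entries under `group`, with paths re-rooted at that group."""
    prefix = group.rstrip("/") + "/"
    return {
        "/" + key[len(prefix):]: value
        for key, value in flat.items()
        if key.startswith(prefix)
    }

def children(flat, group):
    """Return the names of the immediate children (variables or groups) of `group`."""
    prefix = group.rstrip("/") + "/"
    return sorted(
        {key[len(prefix):].split("/", 1)[0] for key in flat if key.startswith(prefix)}
    )
```

Here the hierarchy exists only in the key strings: selecting a group is a prefix filter, and listing a group is a split on the next path separator.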

My question is then: Are there some tricky points for xarray.Dataset not to go this simple way?

Some related remarks:

- About the attribute access to variables: I don't really know why this exists at all, since it mixes unrelated namespaces: (1) the class internals and (2) the user's variables. Mixing namespaces seems very bad to me: it makes some variable names forbidden in order to avoid any collision between the two namespaces, and it usually implies unnecessarily complex code with corner cases to deal with.
- About netCDF4 being a self-described format: the xarray API has open_dataset(filepath), but this function is unable to read the whole file into memory without help from an a priori description of the file content, i.e., the names of the groups, if you follow me. Considering xarray for simple tasks like geographical-selection-cropping, it seems to ignore the self-describing nature of the netCDF4 format. As far as I can understand the situation, a "flat" model could be a good way to go.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Feature Request: Hierarchical storage and processing in xarray 628719058
833574864 https://github.com/pydata/xarray/issues/4118#issuecomment-833574864 https://api.github.com/repos/pydata/xarray/issues/4118 MDEyOklzc3VlQ29tbWVudDgzMzU3NDg2NA== joshmoore 88113 2021-05-06T14:36:37Z 2021-05-06T14:36:37Z NONE

Picking up on @dcherian's https://github.com/pydata/xarray/issues/4118#issuecomment-806954634 and @rabernat's https://github.com/ome/ngff/issues/48#issuecomment-833456889, Zarr was also accepted to the second round and certainly references this issue in case we want to sync up. (Apologies if I missed where that discussion moved.)

{
    "total_count": 2,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 2,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Feature Request: Hierarchical storage and processing in xarray 628719058
806494079 https://github.com/pydata/xarray/issues/4118#issuecomment-806494079 https://api.github.com/repos/pydata/xarray/issues/4118 MDEyOklzc3VlQ29tbWVudDgwNjQ5NDA3OQ== joshmoore 88113 2021-03-25T09:21:47Z 2021-03-25T09:21:47Z NONE

Happy to provide assistance on the image pyramid (i.e. "multiscale") use case.

{
    "total_count": 2,
    "+1": 2,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Feature Request: Hierarchical storage and processing in xarray 628719058
802863863 https://github.com/pydata/xarray/issues/4118#issuecomment-802863863 https://api.github.com/repos/pydata/xarray/issues/4118 MDEyOklzc3VlQ29tbWVudDgwMjg2Mzg2Mw== tacaswell 199813 2021-03-19T14:14:13Z 2021-03-19T14:14:13Z NONE

This is related to some very recent work we have been doing at NSLS-II, primarily led by @danielballan.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Feature Request: Hierarchical storage and processing in xarray 628719058
756204582 https://github.com/pydata/xarray/issues/4118#issuecomment-756204582 https://api.github.com/repos/pydata/xarray/issues/4118 MDEyOklzc3VlQ29tbWVudDc1NjIwNDU4Mg== joshmoore 88113 2021-01-07T15:57:03Z 2021-01-07T15:57:03Z NONE

Thanks for the link, @jhamman. The most immediate issue I ran into when trying to use xarray with OME-Zarr data does seem similar. A rough representation of one multiscale image is:

    image_pyramid:
      |_ zyx_array_high_res
      |_ zyx_array_mid_res
      |_ zyx_array_low_res

but of course the x, y and z dimensions are of different sizes in each volume.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Feature Request: Hierarchical storage and processing in xarray 628719058
637382925 https://github.com/pydata/xarray/issues/4118#issuecomment-637382925 https://api.github.com/repos/pydata/xarray/issues/4118 MDEyOklzc3VlQ29tbWVudDYzNzM4MjkyNQ== emilbiju 39640592 2020-06-02T08:33:42Z 2020-06-02T08:33:42Z NONE

Thanks @jhamman for sharing the link. Here are my thoughts on the same:

For use-cases similar to the one I have mentioned, I think it would be more meaningful to allow the tree structure (calling it Datatree further) to exist as a separate data structure instead of residing within the Dataset. From what I understand, the xarray Dataset would enforce all its component variables to share the same coordinate set for a given dimension name. This would again result in memory wastage with nan values when the value corresponding to a coordinate is unknown.

Besides, xarray only allows attribute access for getting (and not setting) values, but a separate data structure can allow attribute access for setting values as well. For example, the data structure that I have implemented would allow something like dt.weather = dt.weather.mean('time') to alter all the data arrays under the weather node.

I am currently using attribute-based access for accessing child nodes/data arrays in the Datatree as it appears to reflect the tree structure better, but as @shoyer has pointed out, tuple-based access might be easier to use programmatically.

Instead of using netCDF4 groups for encoding the Datatree, I am currently following a simple 3-step process:

- Combine all the data arrays at the leaves of a Datatree object into a dataset.
- Add an additional data array to the dataset that would contain an ancestor matrix (or any other array-like representation) that can encode the hierarchical structure with a coordinate set containing names of the tree nodes.
- Use the xarray.Dataset.to_netcdf method to store it in a netCDF file.

Therefore, within the netCDF file, it would exist just as a Dataset. A specially implemented Datatree.open_datatree method can open the dataset, detect this additional array and recreate the tree structure to instantiate the object. I would like to know whether using netCDF4 groups instead provides any advantages over this approach.
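The encoding step described above can be illustrated with a small sketch in plain Python, assuming the tree is a nested dict with unique node names. The functions `ancestor_matrix` and `rebuild` are hypothetical names for illustration and are not part of xarray or of the implementation mentioned in the comment.

```python
def flatten(tree, path=()):
    """Yield the path of every node in a nested-dict tree, parents first."""
    for name, child in tree.items():
        node = path + (name,)
        yield node
        if isinstance(child, dict):
            yield from flatten(child, node)

def ancestor_matrix(tree):
    """Return (names, matrix) where matrix[i][j] == 1 if node i is an ancestor of node j."""
    paths = list(flatten(tree))
    names = [p[-1] for p in paths]
    matrix = [
        [int(len(a) < len(b) and b[:len(a)] == a) for b in paths]
        for a in paths
    ]
    return names, matrix

def rebuild(names, matrix):
    """Recreate the nested structure from the matrix; every node becomes a dict."""
    n = len(names)
    # Depth of a node is the number of its ancestors.
    depth = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    nodes = [{} for _ in range(n)]
    root = {}
    for j in range(n):
        ancestors = [i for i in range(n) if matrix[i][j]]
        # The parent is the deepest ancestor; nodes without ancestors hang off the root.
        parent = max(ancestors, key=lambda i: depth[i]) if ancestors else None
        (root if parent is None else nodes[parent])[names[j]] = nodes[j]
    return root

tree = {
    "weather": {"temp": None, "wind": {"u": None, "v": None}},
    "ocean": {"salinity": None},
}
names, matrix = ancestor_matrix(tree)
restored = rebuild(names, matrix)
```

The matrix and the list of names are both plain arrays, so they could be stored alongside the leaf data in a single flat dataset and used to restore the hierarchy on load.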

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Feature Request: Hierarchical storage and processing in xarray 628719058


CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);