
issue_comments


14 rows where author_association = "MEMBER", issue = 628719058 and user = 35968931 sorted by updated_at descending

id html_url issue_url node_id user created_at updated_at ▲ author_association body reactions performed_via_github_app issue
904817641 https://github.com/pydata/xarray/issues/4118#issuecomment-904817641 https://api.github.com/repos/pydata/xarray/issues/4118 IC_kwDOAMm_X8417mvp TomNicholas 35968931 2021-08-24T17:00:24Z 2022-05-19T16:33:26Z MEMBER

So I had a crack at making a full DataTree class - you can find it in this repo.

It's based on @benbovy's DatasetNode example - the basic idea is that each tree node wraps a single Dataset. The differences are that this effort:

  • Uses a NodeMixin from anytree for the tree structure,
  • Implements path-like and tag-like getting and setting,
  • Has functions for mapping user-supplied functions over every node in the tree,
  • Automatically dispatches xarray.Dataset's API over every node in the tree (such as .isel or __add__),
  • Has a bunch of tests,
  • Has a printable representation that currently looks like this:

Some limitations of the approach I used are:

  • Each dataset in the tree is entirely separate, so doing something like dt.sel(time=50) would require each Dataset in that subtree to have its own coordinate called 'time'. (That's normally useful though, because then 'time' can have a different resolution on each ds.)
  • While you can access nodes via tags, the underlying implementation is in terms of paths, so ('folder1', 'folder2') points to a different node than ('folder2', 'folder1').
  • There's no support for symbolic nodes yet, and I'm unsure whether this design can allow for loops or not.

You can create a DataTree object in 3 ways:

  1) Load from a netCDF file that has groups via open_datatree(),
  2) Use the __init__ method of DataTree, which accepts a nested dictionary of Datasets,
  3) Manually create individual nodes with DataNode() and specify their relationships to each other, either by setting .parent and .children attributes, or through __get/setitem__ access, e.g. dt['path/to/node'] = DataNode('node_name', data=xr.Dataset()).
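
As a rough illustration of options 2 and 3, here is a minimal stdlib-only sketch of path-like getting and setting on a tree of nodes. It is not the DataTree implementation: the `Node` class, its `from_dict` constructor, and the use of plain dicts in place of `xarray.Dataset` objects are all hypothetical, and `from_dict` takes a flat dictionary keyed by path rather than a truly nested one, purely for brevity.

```python
from typing import Dict, Optional


class Node:
    """Hypothetical tree node holding one payload (a stand-in for a Dataset)."""

    def __init__(self, name: str, data: Optional[dict] = None) -> None:
        self.name = name
        self.data = data if data is not None else {}
        self.parent: Optional["Node"] = None
        self.children: Dict[str, "Node"] = {}

    def __setitem__(self, path: str, node: "Node") -> None:
        # Walk the path, creating empty intermediate nodes where needed.
        *parents, last = path.strip("/").split("/")
        current = self
        for part in parents:
            if part not in current.children:
                child = Node(part)
                child.parent = current
                current.children[part] = child
            current = current.children[part]
        node.parent = current
        current.children[last] = node

    def __getitem__(self, path: str) -> "Node":
        current = self
        for part in path.strip("/").split("/"):
            current = current.children[part]
        return current

    @classmethod
    def from_dict(cls, name: str, by_path: Dict[str, dict]) -> "Node":
        # Assumption: keys are paths and parents are listed before their children.
        root = cls(name)
        for path, data in by_path.items():
            root[path] = cls(path.rsplit("/", 1)[-1], data)
        return root


# Build from a dictionary, then attach one more node by path.
root = Node.from_dict("root", {"a": {"x": 1}, "a/b": {"y": 2}})
root["path/to/node"] = Node("node_name", data={"z": 3})
```

Setting by path creates the intermediate `path` and `path/to` nodes automatically, which is one plausible reading of the `dt['path/to/node'] = ...` syntax above.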

It's about 70% working, but some things I could use help with are:

  1) ~Fundamental design questions about the class structure, such as whether DataTree should be a subclass of Dataset?~
  2) ~Getting arithmetic and ufuncs to act properly on the whole tree~,
  3) ~Saving a tree to a single netCDF file~, (thanks Joe!)
  4) ~Setting up CI and all that jazz~, (thanks Joe again!)
  5) ~Setting up basic docs.~

There will definitely be many bugs, but any thoughts or input appreciated!

{
    "total_count": 8,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 8,
    "rocket": 0,
    "eyes": 0
}
  Feature Request: Hierarchical storage and processing in xarray 628719058
1047944213 https://github.com/pydata/xarray/issues/4118#issuecomment-1047944213 https://api.github.com/repos/pydata/xarray/issues/4118 IC_kwDOAMm_X84-dlwV TomNicholas 35968931 2022-02-22T15:58:48Z 2022-02-22T15:58:48Z MEMBER

Also thanks @OriolAbril , it's useful to have an ArViz perspective.

I was also wondering what changes (if any) would each option imply when using apply_ufunc

I see apply_ufunc as a Variable-level operation - i.e. it doesn't know about the relationship between different Variables unless you explicitly feed it multiple variables. So whether we choose model 1 or 2 probably doesn't affect apply_ufunc much.

In either case I imagine all we might need to do is slightly extend apply_ufunc to also map over variables in a group of a tree if given one, and provide examples of using map_over_subtree or similar to map your apply_ufunc operation over multiple groups in a tree. If the user is trying to do something more complicated (like getting one variable from one level of a tree and another variable from another level, then feeding both into apply_ufunc) then I would just make the user responsible for fetching the variables in that case, and also for putting the results back into the intended place in the tree.
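
To make the "map over groups" idea concrete, here is a minimal sketch using nested dictionaries instead of real tree objects. The tree layout (`"data"` and `"children"` keys) and both function bodies are assumptions made for illustration; `double_all` merely stands in for a per-group `apply_ufunc` call and does not use xarray.

```python
from typing import Any, Callable, Dict

# Hypothetical tree layout: {"data": {var_name: values}, "children": {name: Tree}}
Tree = Dict[str, Any]


def map_over_subtree(func: Callable[[dict], dict], tree: Tree) -> Tree:
    """Apply func to the data of every node, returning a new tree of the same shape."""
    return {
        "data": func(tree["data"]),
        "children": {
            name: map_over_subtree(func, child)
            for name, child in tree.get("children", {}).items()
        },
    }


def double_all(variables: dict) -> dict:
    # Stand-in for a Variable-level operation: it handles one variable at a
    # time and knows nothing about the tree structure around it.
    return {name: [2 * v for v in values] for name, values in variables.items()}


tree = {
    "data": {"t": [1, 2]},
    "children": {"fine": {"data": {"t": [1, 2, 3, 4]}, "children": {}}},
}
doubled = map_over_subtree(double_all, tree)
```

Because the function only ever sees one group's variables, it behaves the same whichever group model is chosen; combining variables from different levels would still be the caller's job.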

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Feature Request: Hierarchical storage and processing in xarray 628719058
1047932340 https://github.com/pydata/xarray/issues/4118#issuecomment-1047932340 https://api.github.com/repos/pydata/xarray/issues/4118 IC_kwDOAMm_X84-di20 TomNicholas 35968931 2022-02-22T15:47:15Z 2022-02-22T15:50:41Z MEMBER

Hi @LunarLanding , thanks for your ideas!

For this, it would make more sense to be able to have dimensions ( with optional labels and coordinates ) assigned to nodes (and these would be inherited by any descendants).

It sounds a bit like what you are suggesting is essentially a model in which dimensions are explicit objects, which can be referred to from other groups, like in netCDF. (NetCDF has "dimension IDs".)

This would be a bit of a departure from the model that xarray.Dataset currently uses, because right now dimensions aren't really unique entities, they are just a collective label for a shared dimension of a set of Variable objects.

Often I run a function over a dataset, with each call outputting a hierarchical data structure, containing fixed dimensions in the best cases and variable length in the worst.

By "variable" length, do you mean that the length of dimensions differs between variables in the same group, or just that you don't know the length of the dimension in advance?

Is there a specific use case which you think would require explicit dimensions to solve?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Feature Request: Hierarchical storage and processing in xarray 628719058
1043638105 https://github.com/pydata/xarray/issues/4118#issuecomment-1043638105 https://api.github.com/repos/pydata/xarray/issues/4118 IC_kwDOAMm_X84-NKdZ TomNicholas 35968931 2022-02-17T23:47:44Z 2022-02-17T23:47:44Z MEMBER

This is only true for flat netCDF files, once you introduce groups in a netCDF AND accept CF conventions the DataGroup approach can map 100% of the files, while the DataTree approach fails on a (admittedly small) class of them.

@alexamici can you expand on the role of the CF conventions in this statement? Are you talking about CF conventions allowing one variable in one group to refer to dimension present in another group, or something else?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Feature Request: Hierarchical storage and processing in xarray 628719058
1039572760 https://github.com/pydata/xarray/issues/4118#issuecomment-1039572760 https://api.github.com/repos/pydata/xarray/issues/4118 IC_kwDOAMm_X8499p8Y TomNicholas 35968931 2022-02-14T21:19:56Z 2022-02-14T21:40:21Z MEMBER

We would like some opinions from the community on two different possible models for a tree-like structure in xarray.

A tree contains many groups, but the question is what constraints should be imposed on the contents of those groups.

  • Option (1) - Each group is a Dataset

    • Means that within each group the same restrictions apply as currently do within a single dataset, i.e. each dimension name is only associated with a single length, so there is effectively a common set of dimensions which variables can depend on.
    • Can't represent all files, in particular can't represent a filetype where groups are allowed to have variables with inconsistent length dimensions (e.g. Zarr stores allow this as all arrays are independent.)
    • Model maps more directly onto netCDF (though still not exactly, because netCDF has dimensions as separate objects)
    • This means that sometimes you might need to put variables in adjacent groups at the same level of the tree, when you might rather want them together in the same group.
    • Enforcing consistency between variables guarantees certain operations are always well-defined (in particular selection via an integer index like in .isel).
    • Guarantees that all valid operations on a Dataset are also valid operations on a single group of a DataTree - so API can be essentially identical to Dataset.
    • Metadata (i.e. .attrs) are arguably most useful when set at this level
    • Mental model is a (nested) dict of Datasets
    • Prototype is DataTree
  • Option (2) - Variables within groups are unconstrained

    • Means that within a single group each Variable can have any dimensions, of any length. There is no requirement that two variables which both depend on a dimension called "x" have to have the same length, one variable can have .sizes['x']=10 and the other have .sizes['x']=20.
    • The main advantage of this is that it can represent a wider set of files (including all Zarr stores and a wider set of GRIB files)
    • Model maps more directly onto HDF5
    • Doesn't enforce the (arguably fairly arbitrary) constraint that if variables have a dimension of the same name, that dimension must also be the same length
    • Without consistency selection becomes ill-defined, but many other operations are fine (e.g. taking .mean())
    • Mental model is a (nested) dict of dicts of DataArrays
    • Prototype is xarray-DataGroups
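
The difference between the two options comes down to one check. The sketch below is illustrative only: the `Variables` representation (name mapped to dimension names and shape) and the `common_sizes` function are hypothetical and not taken from either prototype.

```python
from typing import Dict, Tuple

# Hypothetical representation: variable name -> (dimension names, shape).
Variables = Dict[str, Tuple[Tuple[str, ...], Tuple[int, ...]]]


def common_sizes(variables: Variables) -> Dict[str, int]:
    """Return the shared dimension sizes of a group, or raise if they conflict."""
    sizes: Dict[str, int] = {}
    for name, (dims, shape) in variables.items():
        for dim, length in zip(dims, shape):
            # setdefault records the first length seen for each dimension name.
            if sizes.setdefault(dim, length) != length:
                raise ValueError(
                    f"dimension {dim!r} has length {length} on {name!r} "
                    f"but length {sizes[dim]} elsewhere in the group"
                )
    return sizes


consistent = {"a": (("x", "y"), (10, 3)), "b": (("x",), (10,))}
inconsistent = {"a": (("x",), (10,)), "b": (("x",), (20,))}

print(common_sizes(consistent))  # {'x': 10, 'y': 3}
try:
    common_sizes(inconsistent)
except ValueError as err:
    print("rejected under Option 1:", err)
```

Under Option (1) every group would have to pass a check like this, which is what makes integer selection along "x" well-defined. Under Option (2) the second group is allowed, at the cost of there being no single length for "x".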

This is by no means the only question, and we have various choices to make within these options.

The questions for the potential users here are:

  • Do you have use cases which one of these designs could handle but the other couldn't?
  • How important to you is being able to support all valid files of these certain formats?
  • Which of these designs is clearer/more intuitive/more appealing to you?

(@alexamici , @shoyer, @jhamman, @aurghs please edit this comment to add anything I've missed)

{
    "total_count": 2,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 2,
    "rocket": 0,
    "eyes": 0
}
  Feature Request: Hierarchical storage and processing in xarray 628719058
905472692 https://github.com/pydata/xarray/issues/4118#issuecomment-905472692 https://api.github.com/repos/pydata/xarray/issues/4118 IC_kwDOAMm_X841-Gq0 TomNicholas 35968931 2021-08-25T12:50:04Z 2021-08-25T13:02:10Z MEMBER

Thanks @benbovy !

For rich/html reprs, I think that we could take much inspiration from some of the dask reprs shown in this blog post.

I don't know much about HTML, but graphs where you can mouseover nodes to see node information sound awesome!

what is the rationale of having two separate classes DataTree and DataNode?

They aren't separate: DataNode is merely a (perhaps badly-named) pointer to a second init method for the same DataTree class.

The idea was that creating a single node of a tree by specifying only its (name, dataset, parent, children) attributes was conceptually different to "I have loads of datasets, and I want to arrange them all into one big tree using path-like addresses", so I made two different init methods on DataTree to cover that. The idea was from the xarray.Dataset._construct_direct() classmethod, which creates a new instance of a Dataset by directly setting attributes like (variables, coord_names, dims, attrs). That is an internal classmethod though, and isn't externally exposed like DataNode().

We could just merge the two signatures into one __init__ method though, or use a less confusing name (I just didn't want DataTree._init_single_node(name, data, parent) everywhere in my tests.) Also internally it's nice to have a separate ._init_single_node() method because that's (a) closer to the super().__init__() defined by TreeNode, and (b) doesn't require calling the fairly complex getting and setting methods.

Could those classes be merged somehow?

They were originally separate (I had DataTree and DatasetNode, where the former was a subclass of the latter), but then I merged them together in Condense DatasetNode and DataTree into a single DataTree class #11.

Zarr (abstract data store) has no such separate class and uses a regular zarr.hierarchy.Group as the root.

Good to know that other nested structures took a similar approach. I think that as we want to be able to save and load any subtree even after changing parents etc. then we ideally don't want to treat any one node as special.

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Feature Request: Hierarchical storage and processing in xarray 628719058
904987705 https://github.com/pydata/xarray/issues/4118#issuecomment-904987705 https://api.github.com/repos/pydata/xarray/issues/4118 IC_kwDOAMm_X8418QQ5 TomNicholas 35968931 2021-08-24T21:25:17Z 2021-08-24T21:25:37Z MEMBER

Thanks @jhamman - expect things to break as I keep realizing certain methods have to be defined differently from in Dataset for things to work.

Help with 3 would be especially appreciated, as at the moment whilst I can open and alter a file with groups, I can't save my resulting tree :sweat_smile:

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Feature Request: Hierarchical storage and processing in xarray 628719058
901954045 https://github.com/pydata/xarray/issues/4118#issuecomment-901954045 https://api.github.com/repos/pydata/xarray/issues/4118 IC_kwDOAMm_X841wrn9 TomNicholas 35968931 2021-08-19T14:16:45Z 2021-08-19T14:16:45Z MEMBER

Oh excellent, thanks for the clarification Stephan!

On Thu, 19 Aug 2021, 00:23 Stephan Hoyer wrote:

However, if one of the variables has the same name as one of the groups (which I think is permitted in the netCDF format), then there is no easy way to access all the elements whilst retaining the nice syntax.

NetCDF does not allow variables and groups with the same name, e.g.,

    import netCDF4
    nc = netCDF4.Dataset('testing.nc', 'w')
    nc.createVariable('foo', float)
    nc.createGroup('foo')
    # RuntimeError: NetCDF: String match to name in use

I'm pretty sure this is also prohibited for all HDF5 files, just like how you can't have a directory and file with the same name on most filesystems.


{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Feature Request: Hierarchical storage and processing in xarray 628719058
901594249 https://github.com/pydata/xarray/issues/4118#issuecomment-901594249 https://api.github.com/repos/pydata/xarray/issues/4118 IC_kwDOAMm_X841vTyJ TomNicholas 35968931 2021-08-19T04:10:30Z 2021-08-19T04:10:30Z MEMBER

I think that xarray's current use of both dict-like access and attribute-like access for variables makes representing a general netCDF file in a single DataTree incompatible with the nice syntax that @emilbiju originally suggested.

Consider a tree with a node structure for a hypothetical DataTree object dt that looks something like

    DataTree("root")
    |-- DatasetNode("weather")
    |   |-- DatasetNode("temperature")
    |   |   |-- DataArrayNode("sea_surface_temperature")
    |   |   |-- DataArrayNode("dew_point_temperature")
    |   |-- DataArrayNode("wind_speed")
    |-- DataArrayNode("population")

We ideally want to be able to seamlessly access both subtrees and individual variables via chains of keys, e.g. weather_subtree = dt['weather'], and wind_speed_da = dt['weather']['wind_speed']. (We want that so that each subtree behaves as much like an xarray.Dataset as possible, with respect to mapping functions over all its child nodes and so on.)

This particular example is fine, and would correspond to a netCDF file with groups "root", "root/weather", and "root/weather/temperature", plus the four stored DataArray variables.

However, if one of the variables has the same name as one of the groups (which I think is permitted in the netCDF format), then there is no easy way to access all the elements whilst retaining the nice syntax. For example consider

    DataTree("root")
    |-- DatasetNode("A")
    |   |-- DatasetNode("B")
    |   |   |-- DataArrayNode("foo")
    |   |   |-- DataArrayNode("bar")
    |   |-- DataArrayNode("B")
    |-- DataArrayNode("C")

Now we have a key collision between the group named "B" and the DataArray named "B", i.e. dt['A']['B'] is ambiguous.

We can't just forbid this type of tree because then there would be netCDF files that we couldn't represent as a DataTree, so we would not have the property netCDF -> xarray.DataTree -> netCDF in general.

We can't use different types of access (e.g. subtree = dt.A.B for the subtree and da = dt.A['B'] for the variable), because we've already given up the .B namespace to also point to the variable (i.e. same location as ['B']). If we break that convention it's going to be very confusing for users who are expecting the root of the DataTree to behave like xarray.Dataset currently does.

(We could divide access through __call__ like dt['A']('B'), but that wouldn't be very Pythonic.)

The only way I can see around this is to hide a node's data variables behind a .ds property (i.e. da = dt['A'].ds['B']), or get groups via a dedicated method (i.e. subtree = dt.get_child('A')), but those are so much more ugly and less intuitive that it feels like a shame to have to do that.
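
Here is a small sketch of the collision and of the two workarounds just mentioned. The `GroupNode` class is hypothetical and uses plain dicts and lists in place of Datasets and DataArrays; only the names `.ds` and `get_child` come from the text above.

```python
class GroupNode:
    """Hypothetical node that keeps child groups and variables in separate namespaces."""

    def __init__(self, variables=None, children=None):
        self.ds = dict(variables or {})       # stand-in for the node's Dataset
        self.children = dict(children or {})  # child groups, keyed by name

    def get_child(self, name):
        return self.children[name]

    def __getitem__(self, key):
        # A single flat namespace only works when the key is unambiguous.
        in_vars, in_children = key in self.ds, key in self.children
        if in_vars and in_children:
            raise KeyError(f"{key!r} names both a variable and a child group")
        if in_children:
            return self.children[key]
        return self.ds[key]


# The second example tree: group "B" and variable "B" both live under "A".
b_group = GroupNode(variables={"foo": [1], "bar": [2]})
a = GroupNode(variables={"B": [0.5]}, children={"B": b_group})
dt = GroupNode(variables={"C": [3]}, children={"A": a})

variable_b = dt["A"].ds["B"]                    # explicit variable access
subtree_b = dt.get_child("A").get_child("B")    # explicit group access
```

In this sketch `dt["A"]["B"]` raises a `KeyError`, while the explicit forms still reach both objects, which shows the trade-off: the nice chained syntax is lost exactly where names collide.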

It sounds like @emilbiju avoided this by not satisfying netCDF -> xarray.DataTree -> netCDF:

(Instead of using netCDF4 groups for encoding the Datatree ... within the netCDF file, it would exist just as a Dataset)

so I'm wondering if anyone else has other suggestions or thoughts?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Feature Request: Hierarchical storage and processing in xarray 628719058
873492892 https://github.com/pydata/xarray/issues/4118#issuecomment-873492892 https://api.github.com/repos/pydata/xarray/issues/4118 MDEyOklzc3VlQ29tbWVudDg3MzQ5Mjg5Mg== TomNicholas 35968931 2021-07-04T00:51:19Z 2021-07-04T00:51:19Z MEMBER

Some other thoughts about tags:

1) Does the definition of tags include variable names of DataArrays? I think it should.

2) As @martinitus mentioned, a DataTree containing only leaves with only 1 tag each is effectively a Dataset. I wonder if Dataset could be refactored to be a special case of a more general DataTree, possibly as a subclass?

3) Selecting via tags would need to allow a distinction between "get me all leaves with these exact tags" and "get me all leaves whose tags include these ones". Maybe dt.choose_only(tags) and dt.choose_all(tags)?

4) The latter type of tag-based access would make plotting different leaves against one another easier too - given a multi-resolution (or multi-model) datatree like this:

    dt
    |-- high_res
    |   |-- temperature
    |   |-- CO2
    |-- medium_res
    |   |-- temperature
    |   |-- CO2
    |-- low_res
    |   |-- temperature
    |   |-- CO2

then assuming that the definition of tags included the DataArray variable names, then

dt.choose_all('temperature').plot.line(x='time')

would select all leaves with a temperature tag, check that the temperature DataArrays had the same dimensions (but no need for any time coordinates to share size or values), and then plot them against one another on the same axes. This would be so useful - I would say this use case is 90% of the reason users iterate over dictionaries of datasets currently.

5) With a tag-based system you can create cycles of tags, like A&B, B&C, C&A, which you can't really do with hierarchical trees. I don't think that actually causes any problems though...
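
The two kinds of selection from point 3 can be sketched with sets. Everything here is hypothetical: the `leaves` store, the string payloads, and the function bodies are illustrative, and only the names `choose_only` and `choose_all` come from the suggestion above. Tags are passed as lists, since a bare string would be iterated character by character.

```python
from typing import Dict, FrozenSet, Iterable, List

# Hypothetical store: each leaf is identified by an unordered set of tags,
# with the variable name counted as one of the tags.
leaves: Dict[FrozenSet[str], str] = {
    frozenset({"high_res", "temperature"}): "T_high",
    frozenset({"high_res", "CO2"}): "CO2_high",
    frozenset({"low_res", "temperature"}): "T_low",
    frozenset({"low_res", "CO2"}): "CO2_low",
}


def choose_only(tags: Iterable[str]) -> List[str]:
    """Leaves whose tags are exactly the given ones."""
    wanted = frozenset(tags)
    return [leaf for leaf_tags, leaf in leaves.items() if leaf_tags == wanted]


def choose_all(tags: Iterable[str]) -> List[str]:
    """Leaves whose tags include all of the given ones."""
    wanted = frozenset(tags)
    return [leaf for leaf_tags, leaf in leaves.items() if wanted <= leaf_tags]


all_temperature = choose_all(["temperature"])       # one leaf per resolution
exact_match = choose_only(["high_res", "CO2"])      # a single leaf
```

The subset test (`<=`) is what would let a call like `choose_all` gather the temperature leaves across resolutions for plotting, whereas `choose_only(["temperature"])` returns nothing here because no leaf carries that tag alone.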

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Feature Request: Hierarchical storage and processing in xarray 628719058
873307873 https://github.com/pydata/xarray/issues/4118#issuecomment-873307873 https://api.github.com/repos/pydata/xarray/issues/4118 MDEyOklzc3VlQ29tbWVudDg3MzMwNzg3Mw== TomNicholas 35968931 2021-07-02T23:54:09Z 2021-07-02T23:54:09Z MEMBER

@shoyer if you used tags wouldn't you lose the ability to round-trip a netCDF file with groups? When you read in the groups from the file you would be throwing information away by going from a hierarchy A/B to simply tags A&B, and there wouldn't be a way to restore that before calling .to_netcdf() would there?
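
A tiny sketch of the information loss, assuming tags are modelled as an unordered set of path components (the `path_to_tags` helper is hypothetical):

```python
def path_to_tags(path: str) -> frozenset:
    # Converting a group path to tags discards the order of its components.
    return frozenset(path.strip("/").split("/"))


# Two different hierarchies collapse to the same tag set.
print(path_to_tags("A/B") == path_to_tags("B/A"))  # True
```

Since both paths map to the same tags, nothing in the tags alone says which group was the parent when writing the file back out.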

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Feature Request: Hierarchical storage and processing in xarray 628719058
873231425 https://github.com/pydata/xarray/issues/4118#issuecomment-873231425 https://api.github.com/repos/pydata/xarray/issues/4118 MDEyOklzc3VlQ29tbWVudDg3MzIzMTQyNQ== TomNicholas 35968931 2021-07-02T20:05:06Z 2021-07-02T20:05:06Z MEMBER

I think using tags is a really interesting alternative to hierarchies. I don't have a clear sense of the overall tradeoffs, though.

That is interesting. I think there is an argument for using a hierarchical model to map onto the full netCDF data model with groups, but perhaps methods to select elements via tags could be included too, for the best of both?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Feature Request: Hierarchical storage and processing in xarray 628719058
873179375 https://github.com/pydata/xarray/issues/4118#issuecomment-873179375 https://api.github.com/repos/pydata/xarray/issues/4118 MDEyOklzc3VlQ29tbWVudDg3MzE3OTM3NQ== TomNicholas 35968931 2021-07-02T18:22:49Z 2021-07-02T18:22:49Z MEMBER

Flagging another possible use case, this time in Magnetic Confinement Fusion: representing the IMAS data model.

IMAS is currently closed-source (being part of the ITER project), but there is a big push to make it open-source and the standard data model for tokamak plasma data.

I'm not very familiar with IMAS (@smithsp and @orso82 are more so), but it is hierarchical. There is some more information in appendix A3 of this paper, which talks about "taking advantage of the homogeneity of grid sizes that is commonly found across arrays of structures", which sounds very closely related to the DataTree proposal.

This might allow the xarray.DataTree to do more of the heavy-lifting within OMAS (which already uses xarray, and is intended to be compatible with IMAS).

{
    "total_count": 2,
    "+1": 2,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Feature Request: Hierarchical storage and processing in xarray 628719058
808366093 https://github.com/pydata/xarray/issues/4118#issuecomment-808366093 https://api.github.com/repos/pydata/xarray/issues/4118 MDEyOklzc3VlQ29tbWVudDgwODM2NjA5Mw== TomNicholas 35968931 2021-03-26T16:47:53Z 2021-03-26T16:47:53Z MEMBER

This sounds like an interesting project - I'm also about to be able to work on xarray much more directly (thanks @rabernat ).

Should I add this as another xarray project board alongside explicit indexes and so on?

I wonder if this could find another domain use case in plasmapy as part of the overall plasma object @StanczakDominik? At the very least this would allow you to store all the various equilibrium and diagnostics information that goes in an EFIT file.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Feature Request: Hierarchical storage and processing in xarray 628719058


CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);