issue_comments: 1039572760

html_url issue_url id node_id user created_at updated_at author_association body reactions performed_via_github_app issue
https://github.com/pydata/xarray/issues/4118#issuecomment-1039572760 https://api.github.com/repos/pydata/xarray/issues/4118 1039572760 IC_kwDOAMm_X8499p8Y 35968931 2022-02-14T21:19:56Z 2022-02-14T21:40:21Z MEMBER

We would like some opinions from the community on two different possible models for a tree-like structure in xarray.

A tree contains many groups, but the question is what constraints should be imposed on the contents of those groups.

  • Option (1) - Each group is a Dataset

    • Means that within each group the same restrictions apply as currently do within a single dataset, i.e. each dimension name is only associated with a single length, so there is effectively a common set of dimensions which variables can depend on.
    • Can't represent all files, in particular can't represent a filetype where groups are allowed to have variables with inconsistent length dimensions (e.g. Zarr stores allow this as all arrays are independent.)
    • Model maps more directly onto netCDF (though still not exactly, because netCDF has dimensions as separate objects)
    • This means that sometimes you might need to put variables in adjacent groups at the same level of the tree, when you might rather want them together in the same group.
    • Enforcing consistency between variables guarantees certain operations are always well-defined (in particular selection via an integer index like in .isel).
    • Guarantees that all valid operations on a Dataset are also valid operations on a single group of a DataTree - so API can be essentially identical to Dataset.
    • Metadata (i.e. .attrs) are arguably most useful when set at this level
    • Mental model is a (nested) dict of Datasets
    • Prototype is DataTree
  • Option (2) - Variables within groups are unconstrained

    • Means that within a single group each Variable can have any dimensions, of any length. There is no requirement that two variables which both depend on a dimension called "x" have the same length: one variable can have .sizes['x']=10 and the other .sizes['x']=20.
    • The main advantage of this is that it can represent a wider set of files (including all Zarr stores and a wider set of GRIB files)
    • Model maps more directly onto HDF5
    • Doesn't enforce the (arguably fairly arbitrary) constraint that if variables have a dimension of the same name, that dimension must also be the same length
    • Without consistency, selection becomes ill-defined, but many other operations are fine (e.g. taking .mean())
    • Mental model is a (nested) dict of dicts of DataArrays
    • Prototype is xarray-DataGroups
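
To make the difference between the two options concrete, here is a minimal pure-Python sketch. It is a toy model under stated assumptions, not xarray's API and not either prototype: each variable is represented only by its dimension names and shape, and `Var`, `group_sizes`, and the variable names are hypothetical names introduced for illustration.

```python
from typing import Dict, Tuple

# Toy stand-in for a variable: (dimension names, shape).
# Hypothetical; a real implementation would hold xarray objects.
Var = Tuple[Tuple[str, ...], Tuple[int, ...]]


def group_sizes(group: Dict[str, Var]) -> Dict[str, int]:
    """Return the shared dimension sizes of a group, as Option (1) requires.

    Raises ValueError if two variables disagree on the length of a dimension.
    """
    sizes: Dict[str, int] = {}
    for name, (dims, shape) in group.items():
        for dim, length in zip(dims, shape):
            # setdefault records the first length seen for this dimension
            if sizes.setdefault(dim, length) != length:
                raise ValueError(
                    f"conflicting sizes for dimension {dim!r}: "
                    f"{sizes[dim]} vs {length} (variable {name!r})"
                )
    return sizes


# A group that satisfies the Dataset-like constraint of Option (1).
consistent = {"temperature": (("x", "y"), (10, 5)), "pressure": (("x",), (10,))}

# A group that only Option (2) accepts: "x" has two different lengths.
inconsistent = {"coarse": (("x",), (10,)), "fine": (("x",), (20,))}

shared = group_sizes(consistent)

try:
    group_sizes(inconsistent)
    option1_ok = True
except ValueError as err:
    option1_ok = False
    print("Not a valid Option (1) group:", err)

# Under Option (1) the same variables would go in adjacent groups instead.
split_tree = {"coarse_group": {"coarse": inconsistent["coarse"]},
              "fine_group": {"fine": inconsistent["fine"]}}
split_sizes = {name: group_sizes(g) for name, g in split_tree.items()}

# Under Option (2) the group is accepted as-is, but a single integer index
# along "x" is not valid for every variable, so selection is ill-defined.
index = 15
valid_for = {
    name: index < dict(zip(dims, shape))["x"]
    for name, (dims, shape) in inconsistent.items()
}
```

In this sketch the index 15 along "x" is in range for `fine` but out of range for `coarse`, which is the ambiguity Option (1) rules out by construction. The cost is that the two variables have to live in separate groups.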

This is by no means the only question, and we have various choices to make within these options.

The questions for the potential users here are:

  • Do you have use cases which one of these designs could handle but the other couldn't?
  • How important to you is being able to support all valid files of these certain formats?
  • Which of these designs is clearer/more intuitive/more appealing to you?

(@alexamici , @shoyer, @jhamman, @aurghs please edit this comment to add anything I've missed)

{
    "total_count": 2,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 2,
    "rocket": 0,
    "eyes": 0
}
  628719058