issue_comments: 901594249

html_url issue_url id node_id user created_at updated_at author_association body reactions performed_via_github_app issue
https://github.com/pydata/xarray/issues/4118#issuecomment-901594249 https://api.github.com/repos/pydata/xarray/issues/4118 901594249 IC_kwDOAMm_X841vTyJ 35968931 2021-08-19T04:10:30Z 2021-08-19T04:10:30Z MEMBER

I think that xarray's current use of both dict-like access and attribute-like access for variables makes representing a general netCDF file in a single DataTree incompatible with the nice syntax that @emilbiju originally suggested.

Consider a tree with a node structure for a hypothetical DataTree object dt that looks something like

    DataTree("root")
    |-- DatasetNode("weather")
    |   |-- DatasetNode("temperature")
    |   |   |-- DataArrayNode("sea_surface_temperature")
    |   |   |-- DataArrayNode("dew_point_temperature")
    |   |-- DataArrayNode("wind_speed")
    |-- DataArrayNode("population")

We ideally want to be able to seamlessly access both subtrees and individual variables via chains of keys, e.g. weather_subtree = dt['weather'], and wind_speed_da = dt['weather']['wind_speed']. (We want that so that each subtree behaves as much like an xarray.Dataset as possible, with respect to mapping functions over all its child nodes and so on.)

This particular example is fine, and would correspond to a netCDF file with groups "root", "root/weather", and "root/weather/temperature", plus the four stored DataArray variables.
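
As a rough illustration of that correspondence, here is a sketch that models the tree as a plain nested dict and lists the group paths and variable paths it implies. The dict layout and the `walk` helper are assumptions made for this example only; they are not part of xarray or of any proposed DataTree API.

```python
# Hypothetical stand-in for the tree: dicts are groups, None marks a variable.
tree = {
    "weather": {
        "temperature": {
            "sea_surface_temperature": None,
            "dew_point_temperature": None,
        },
        "wind_speed": None,
    },
    "population": None,
}


def walk(node, path="root"):
    """Return (group_paths, variable_paths) found under `node`."""
    groups, variables = [path], []
    for name, child in node.items():
        child_path = f"{path}/{name}"
        if isinstance(child, dict):
            # A nested dict is a child group: recurse into it.
            sub_groups, sub_variables = walk(child, child_path)
            groups += sub_groups
            variables += sub_variables
        else:
            variables.append(child_path)
    return groups, variables


groups, variables = walk(tree)

# Chained key access works because no group shares a name with a variable.
weather_subtree = tree["weather"]
temperature_subtree = tree["weather"]["temperature"]
```

Here `groups` holds the three group paths named above and `variables` holds the four variable paths.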

However, if one of the variables has the same name as one of the groups (which I think is permitted in the netCDF format), then there is no easy way to access all the elements whilst retaining the nice syntax. For example, consider

    DataTree("root")
    |-- DatasetNode("A")
    |   |-- DatasetNode("B")
    |   |   |-- DataArrayNode("foo")
    |   |   |-- DataArrayNode("bar")
    |   |-- DataArrayNode("B")
    |-- DataArrayNode("C")

Now we have a key collision between the group named "B" and the DataArray named "B", i.e. dt['A']['B'] is ambiguous.
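
A minimal sketch of the collision, assuming a hypothetical `Node` class (not xarray's API) that keeps child groups and variables in one shared mapping, which is what a single `[...]` syntax for both implies:

```python
class Node:
    """Hypothetical tree node with ONE namespace for groups and variables."""

    def __init__(self, name):
        self.name = name
        self._items = {}  # child groups and variables share these keys

    def __setitem__(self, key, value):
        # Refuse to silently overwrite: a second "B" has nowhere to go.
        if key in self._items:
            raise KeyError(f"{key!r} already exists in node {self.name!r}")
        self._items[key] = value

    def __getitem__(self, key):
        return self._items[key]


dt = Node("root")
dt["A"] = Node("A")
dt["A"]["B"] = Node("B")  # the group "B"

try:
    # The variable "B"; a list stands in for a DataArray here.
    dt["A"]["B"] = [1.0, 2.0]
    collided = False
except KeyError:
    collided = True
```

With a shared namespace, the second assignment either has to fail (as in this sketch) or overwrite the group, so one of the two "B" entries is always unreachable.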

We can't just forbid this type of tree, because then there would be netCDF files that we couldn't represent as a DataTree, so the round trip netCDF -> xarray.DataTree -> netCDF would not hold in general.

We can't use different types of access (e.g. subtree = dt.A.B for the subtree and da = dt.A['B'] for the variable), because we've already given up the .B namespace to also point to the variable (i.e. the same location as ['B']). If we break that convention, it's going to be very confusing for users who expect the root of the DataTree to behave like xarray.Dataset currently does.
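
To make the shared namespace concrete, here is a small sketch. `DatasetLike` is a hypothetical stand-in written for this example, not xarray code; it only mimics the convention that attribute-style and key-style access resolve to the same variable.

```python
class DatasetLike:
    """Hypothetical stand-in: variables reachable as obj['B'] and obj.B."""

    def __init__(self, variables):
        self._variables = dict(variables)

    def __getitem__(self, key):
        return self._variables[key]

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails, so real
        # attributes and methods still take precedence over variables.
        variables = self.__dict__.get("_variables", {})
        try:
            return variables[name]
        except KeyError:
            raise AttributeError(name) from None


ds = DatasetLike({"B": [1.0, 2.0]})

# Both spellings return the very same object, so .B is not free
# to mean "the child group named B".
same_object = ds.B is ds["B"]
```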

(We could divide access through __call__, like dt['A']('B'), but that wouldn't be very pythonic.)

The only ways I can see around this are to hide a node's data variables behind a .ds property (i.e. da = dt['A'].ds['B']), or to get groups via a dedicated method (i.e. subtree = dt.get_child('A')), but both are so much uglier and less intuitive that it feels like a shame to have to do that.
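
A rough sketch of what those workarounds could look like, assuming a hypothetical `TreeNode` class with separate mappings for child groups and variables. The names `TreeNode` and `add_child` are invented for illustration, and a plain dict stands in for the Dataset behind `.ds`.

```python
class TreeNode:
    """Hypothetical node with separate namespaces for groups and variables."""

    def __init__(self, name):
        self.name = name
        self._children = {}   # child groups only
        self._variables = {}  # data variables only

    def add_child(self, name):
        child = TreeNode(name)
        self._children[name] = child
        return child

    def get_child(self, name):
        """Dedicated method for reaching a child group."""
        return self._children[name]

    # In this sketch [] addresses groups, and variables live behind .ds.
    __getitem__ = get_child

    @property
    def ds(self):
        # A plain dict stands in for an xarray.Dataset here.
        return self._variables


dt = TreeNode("root")
a = dt.add_child("A")
group_b = a.add_child("B")  # the group "B"
a.ds["B"] = [1.0, 2.0]      # the variable "B", no collision

subtree = dt["A"]["B"]      # reaches the group
da = dt["A"].ds["B"]        # reaches the variable
```

Both "B" entries are now reachable, at the cost of the extra `.ds` hop that makes a node stop behaving like a Dataset.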

It sounds like @emilbiju avoided this by not satisfying netCDF -> xarray.DataTree -> netCDF:

(Instead of using netCDF4 groups for encoding the Datatree ... within the netCDF file, it would exist just as a Dataset)

so I'm wondering if anyone else has other suggestions or thoughts?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  628719058