issue_comments: 290142369

html_url issue_url id node_id user created_at updated_at author_association body reactions performed_via_github_app issue
https://github.com/pydata/xarray/issues/1092#issuecomment-290142369 https://api.github.com/repos/pydata/xarray/issues/1092 290142369 MDEyOklzc3VlQ29tbWVudDI5MDE0MjM2OQ== 1217238 2017-03-29T16:18:12Z 2017-03-29T16:18:12Z MEMBER

@shoyer do you have an idea on how it would work with serialization to netCDF?

With netCDF4, we could potentially just use groups. Or we could use some sort of naming convention for strings, e.g., joining together the parts of the tuple with a dot (.).
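A minimal sketch of the naming-convention option, in plain Python. The helper names `flatten_name` and `unflatten_name` are hypothetical and not part of xarray; the sketch assumes that no part of a tuple name contains the separator itself, otherwise the round trip would be ambiguous.

```python
def flatten_name(parts, sep="."):
    """Join the parts of a tuple name into a single string (hypothetical helper)."""
    for part in parts:
        if sep in part:
            # A separator inside a part would make the name impossible to split back.
            raise ValueError(f"name part {part!r} contains the separator {sep!r}")
    return sep.join(parts)


def unflatten_name(name, sep="."):
    """Split a flattened string name back into its tuple of parts."""
    return tuple(name.split(sep))


flat = flatten_name(("flux", "poloidal"))
original = unflatten_name(flat)
```

The check for the separator is the main design cost of this convention: any variable name that already contains a dot would need escaping or a different separator.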

One challenge here is that unless we also let dimensions be group specific, not every netCDF4 file with groups corresponds to a valid xarray Dataset: you can have conflicting sizes on dimensions for netCDF4 files in different groups.
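To illustrate the conflict, here is a small sketch that models each group as a plain mapping from dimension name to size and reports dimensions whose sizes disagree across groups. The function `conflicting_dims`, the group names, and the sizes are all made up for illustration; nothing here reads a real netCDF4 file.

```python
def conflicting_dims(group_dims):
    """Return {dim: {group: size}} for dimensions whose size differs between groups."""
    sizes = {}
    for group, dims in group_dims.items():
        for dim, size in dims.items():
            sizes.setdefault(dim, {})[group] = size
    # Keep only dimensions that appear with more than one distinct size.
    return {
        dim: by_group
        for dim, by_group in sizes.items()
        if len(set(by_group.values())) > 1
    }


# Hypothetical file layout: both groups define 'time', but with different lengths.
groups = {
    "flux": {"time": 10, "x": 5},
    "temperature": {"time": 20, "x": 5},
}
conflicts = conflicting_dims(groups)
```

A file for which this returns a non-empty result could not be mapped onto a single flat Dataset with shared dimensions, unless dimensions were made group specific.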

In principle, it could be OK to use tuples for dimension names, but we already have lots of logic that distinguishes between single and multiple dimensions by looking for non-strings or tuples. So you would probably have to write ds.sum(dim=[('flux', 'time')]) if you wanted to sum over the 'time' dimension of the 'flux' group. https://github.com/pydata/xarray/issues/1231 would help here (e.g., to enable ds.sel({('flux', 'time'): time})), but cases like ds.sum(dim=('flux', 'time')) would still be a source of confusion.
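The ambiguity can be sketched with a simplified stand-in for the kind of logic described above. The function `normalize_dims` is hypothetical and is not xarray's actual implementation; it only assumes that a string means one dimension and any other iterable means several.

```python
def normalize_dims(dim):
    """Hypothetical helper: turn a `dim` argument into a list of dimension names."""
    if isinstance(dim, str):
        # A single string is a single dimension name.
        return [dim]
    # Any other iterable (list, tuple, ...) is read as several dimension names.
    return list(dim)


# A bare tuple is read as two separate dimensions, 'flux' and 'time'.
as_two_dims = normalize_dims(("flux", "time"))

# Wrapping the tuple in a list keeps it as one (group, dimension) name.
as_one_dim = normalize_dims([("flux", "time")])
```

Under this rule the two spellings differ only by a pair of brackets yet mean different things, which is the source of confusion mentioned above.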

How to handle dimensions and coordinate names when assigning groups is clearly one of the important design decisions here. It's obvious that data variables should be grouped but less clear how to handle dimensions/coordinates.

We would also have to decide how to display groups in the repr of the flat dataset...

Some sort of further indentation seems natural, possibly with truncation like ... for cases when the number of variables is very large (>10), e.g., Data variables: flux poloidal perturbed

This is another case where an HTML repr could be powerful, allowing for clearer visual links and potentially interactive expanding/contracting of the tree.

Would the domain for this just be to simulate the tree-like structure that NetCDF permits, or could it extend to multiple datasets on disk?

From xarray's perspective, there isn't really a distinction between multiple files and groups in one netCDF file -- it's just a matter of creating a Dataset with data organized in a different way. Presumably we could write helper methods for converting a dimension into a group level (and vice-versa).

But it's worth noting that there are still limitations to opening large numbers of files in a single dataset, even with groups, because xarray reads all the metadata for every variable into memory at once, and that metadata is copied in every xarray operation. For this reason, you will still probably want a different data structure (convertible into an xarray.Dataset) when navigating very large datasets like CMIP, which consists of many thousands of files.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  187859705