issue_comments


5 rows where author_association = "MEMBER" and issue = 1108138101 sorted by updated_at descending


id html_url issue_url node_id user created_at updated_at ▲ author_association body reactions performed_via_github_app issue
1028136906 https://github.com/pydata/xarray/issues/6174#issuecomment-1028136906 https://api.github.com/repos/pydata/xarray/issues/6174 IC_kwDOAMm_X849SB_K shoyer 1217238 2022-02-02T16:46:24Z 2022-02-02T17:20:50Z MEMBER

Have you seen xarray.save_mfdataset?

In principle, it was designed for exactly this sort of thing.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  [FEATURE]: Read from/write to several NetCDF4 groups with a single file open/close operation 1108138101
1019311097 https://github.com/pydata/xarray/issues/6174#issuecomment-1019311097 https://api.github.com/repos/pydata/xarray/issues/6174 IC_kwDOAMm_X848wXP5 Illviljan 14371165 2022-01-22T17:09:30Z 2022-01-22T17:09:30Z MEMBER

Is it that difficult to get a list of groups though? I've been testing a backend engine that merges many groups into 1 dataset (dims/coords/variables renamed slightly to avoid duplicate names until they've been interpolated together) using h5py.

Getting the groups is pretty much the first thing you have to do. The code would look something like this:

```python
import h5py

f = h5py.File('foo.hdf5', 'w')
f.name            # '/'
list(f.keys())    # []
```

https://docs.h5py.org/en/stable/high/group.html

Sure, it can be quite tiresome to navigate the backend engines and 3rd party modules in xarray to add this. But most of them use h5py or something quite similar at their core, so it shouldn't be THAT bad.

For example one could add another method here that retrieves them in a quick and easy way: https://github.com/pydata/xarray/blob/c54123772817875678ec7ad769e6d4d6612aeb92/xarray/backends/common.py#L356-L360

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  [FEATURE]: Read from/write to several NetCDF4 groups with a single file open/close operation 1108138101
1018681263 https://github.com/pydata/xarray/issues/6174#issuecomment-1018681263 https://api.github.com/repos/pydata/xarray/issues/6174 IC_kwDOAMm_X848t9ev TomNicholas 35968931 2022-01-21T16:48:42Z 2022-01-21T16:48:42Z MEMBER

> I don't think our project would add DataTree as a new dependency just for this as long as we have a very easy and viable solution of ourselves.

FYI the plan with DataTree is to eventually integrate the work upstream into xarray, so no new dependency would be required at that point. That might take a while however.

> If this would be communicated more transparently in the docstrings, it would bring us a big step closer to the solution of this issue

That's good at least! Do you have any suggestions for where the docs should be improved? PRs are of course always welcome too :grin:

> one problem left: Getting a full list of all groups contained in a NetCDF4 file so that we can read them all in.
>
> I would insist that xarray should be able to do this. Maybe we need a open_datasets_from_groups function for that, or rather a function list_datasets. But it should somehow be solvable within the xarray API without requiring a two-year debate about the management and representation of hierarchical data structures.

I agree, and would be open to a function like this (even if eventually DataTree renders it redundant). It's definitely an omission on our part that xarray still doesn't provide an easy way to do this - I've found myself wanting to easily see all the groups multiple times. However, my understanding is that it's slightly tricky to implement, though suggestions/corrections are welcome!
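
A minimal sketch of what such a group-listing helper might look like, assuming the file is opened with netCDF4-python, whose `Dataset` and `Group` objects expose a `groups` mapping of name to child group. The names `list_group_paths` and `FakeGroup` are hypothetical and not part of xarray; `FakeGroup` is only a stand-in so the example runs without a file on disk.

```python
def list_group_paths(node, prefix=""):
    """Return the paths of all groups below ``node``, depth first.

    ``node`` only needs a ``groups`` mapping of name -> child node,
    which netCDF4.Dataset and netCDF4.Group objects provide.
    """
    paths = []
    for name, child in node.groups.items():
        path = f"{prefix}/{name}"
        paths.append(path)
        # Recurse so that groups nested within groups are listed too.
        paths.extend(list_group_paths(child, prefix=path))
    return paths


class FakeGroup:
    """Hypothetical stand-in for a netCDF4 group (illustration only)."""

    def __init__(self, **groups):
        self.groups = dict(groups)


root = FakeGroup(a=FakeGroup(b=FakeGroup()), c=FakeGroup())
group_paths = list_group_paths(root)
print(group_paths)
```

With a real file, the same function could be called on an open `netCDF4.Dataset`, and each returned path passed as the `group` argument of `xr.open_dataset`. That would still open the file once per group, so it addresses the listing problem only, not the I/O cost.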

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  [FEATURE]: Read from/write to several NetCDF4 groups with a single file open/close operation 1108138101
1017782089 https://github.com/pydata/xarray/issues/6174#issuecomment-1017782089 https://api.github.com/repos/pydata/xarray/issues/6174 IC_kwDOAMm_X848qh9J TomNicholas 35968931 2022-01-20T18:11:26Z 2022-01-20T18:12:32Z MEMBER

> In my case, we are talking about a very unusual application of the NetCDF4 groups feature: We store literally thousands of very small NetCDF datasets in a single file. A file containing 3000 datasets is typically not larger than 100 MB.

Ah - thanks for the clarification as to the context @tovogt !

> So, my request is really about the I/O performance, and I don't need a full-fledged hierarchical data management API in xarray for that.

That's fair enough.

> On our cluster this means that writing that 100 MB file takes 10 hours with your DataTree implementation, and 30 minutes with my helper functions. For reading, the effect is smaller, but still noticeable.

So are you asking if:

a) We should add a function to xarray which uses the same trick your helper functions do, for when people have a similar problem to you?

b) We should use the same trick your helper functions do to rewrite the I/O implementation of DataTree to only require one open/close? (It seems to me that this could be the best of both worlds, once implemented.)

c) Whether there is some other way to do this even faster than your helper functions?

EDIT: Tagging @alexamici / @aurghs for their backends expertise + interest in DataTree

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  [FEATURE]: Read from/write to several NetCDF4 groups with a single file open/close operation 1108138101
1016705107 https://github.com/pydata/xarray/issues/6174#issuecomment-1016705107 https://api.github.com/repos/pydata/xarray/issues/6174 IC_kwDOAMm_X848mbBT TomNicholas 35968931 2022-01-19T17:37:12Z 2022-01-19T18:05:07Z MEMBER

> I would like to have a function xr.to_netcdf that writes a list (or a dictionary) of datasets to a single NetCDF4 file.

If you've read through all of #4118 you will have seen that there is a prototype package providing a nested data structure which can handle groups. Using DataTree we can easily write a dictionary of datasets to a single netCDF file as groups:

```python
from datatree import DataTree

# ds_dict maps group paths to xarray.Dataset objects
dt = DataTree.from_dict(ds_dict)
dt.to_netcdf('filepath.nc')
```

(Here if you want groups within groups then the keys in the dictionary should be specified like filepaths, e.g. /group1/group2/ds_name.)
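
To illustrate how path-like keys describe a hierarchy, here is a small standard-library sketch that turns a flat dictionary into nested groups. It is not how DataTree stores its nodes; `nest_by_path` is a hypothetical helper, the strings stand in for datasets, and every key is assumed to have at least one path component.

```python
def nest_by_path(flat):
    """Turn {'/g1/g2/name': value} into nested dicts keyed by path part."""
    tree = {}
    for path, value in flat.items():
        # Split on "/" and drop the empty part produced by the leading slash.
        parts = [p for p in path.split("/") if p]
        node = tree
        for part in parts[:-1]:
            # Create intermediate groups on demand.
            node = node.setdefault(part, {})
        node[parts[-1]] = value
    return tree


# Strings stand in for xarray.Dataset objects in this illustration.
ds_dict = {
    "/group1/group2/ds_name": "dataset A",
    "/group1/other": "dataset B",
    "/top": "dataset C",
}
nested = nest_by_path(ds_dict)
print(nested)
```

Here `ds_name` ends up inside `group2`, which itself sits inside `group1`, mirroring the groups-within-groups layout of the netCDF file.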

> Ideally there should also be a way to read many datasets at once from a single NetCDF4 file using xr.open_dataset.

Again DataTree allows you to open all the groups at once, returning a tree-like structure which contains all the groups:

```python
from datatree import open_datatree

dt = open_datatree('filepath.nc')
```

To extract all the groups as individual datasets you can do this to recreate the dictionary of datasets:

```python
ds_dict = {node.pathstr: node.ds for node in dt.subtree}
```

> However, this is really slow when you have many (hundreds or thousands of) small datasets because the file is opened and closed in every iteration.
>
> Currently, I'm using the following read/write functions to achieve the same:

Is your solution noticeably faster? I don't think we (@jhamman and I) have really thought about the speed of DataTree I/O yet; we preferred to just make something simple that works for now. The current I/O code for DataTree is here.

Despite that project only being a prototype, it is still probably the best solution to your problem that we currently have (at least the neatest). If you are interested in trying it out and reporting any problems then that would be greatly appreciated!

EDIT: The idea discussed here might also be of interest to you.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  [FEATURE]: Read from/write to several NetCDF4 groups with a single file open/close operation 1108138101


CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);