issue_comments: 1017782089

html_url: https://github.com/pydata/xarray/issues/6174#issuecomment-1017782089
issue_url: https://api.github.com/repos/pydata/xarray/issues/6174
id: 1017782089
node_id: IC_kwDOAMm_X848qh9J
user: 35968931
created_at: 2022-01-20T18:11:26Z
updated_at: 2022-01-20T18:12:32Z
author_association: MEMBER
issue: 1108138101

> In my case, we are talking about a very unusual application of the NetCDF4 groups feature: We store literally thousands of very small NetCDF datasets in a single file. A file containing 3000 datasets is typically not larger than 100 MB.

Ah - thanks for the clarification as to the context, @tovogt!
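
For anyone following along, the usual way to build such a file with plain xarray is one `to_netcdf` call per group, and each call opens and closes the file again - which is where the overhead comes from. A minimal sketch (the file name and toy datasets here are made up):

```python
import xarray as xr

# Hypothetical stand-in for thousands of small datasets
datasets = {f"group_{i}": xr.Dataset({"x": ("t", [float(i)])}) for i in range(3000)}

# One to_netcdf call per group: every iteration reopens and
# closes "combined.nc", so the cost scales with the group count.
for i, (name, ds) in enumerate(datasets.items()):
    ds.to_netcdf("combined.nc", group=name, mode="w" if i == 0 else "a")
```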

> So, my request is really about the I/O performance, and I don't need a full-fledged hierarchical data management API in xarray for that.

That's fair enough.

> On our cluster this means that writing that 100 MB file takes 10 hours with your DataTree implementation, and 30 minutes with my helper functions. For reading, the effect is smaller, but still noticeable.
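
For concreteness, the kind of trick I imagine your helpers use is to open the file once and write every group through the already-open handle. This is only a sketch, not your actual code - and it leans on xarray's semi-internal `NetCDF4DataStore` and `dump_to_store`, so no stability promises:

```python
import netCDF4
import xarray as xr
from xarray.backends import NetCDF4DataStore
from xarray.backends.api import dump_to_store

def write_groups_single_open(datasets, path):
    """Write many small in-memory datasets as groups of one netCDF file,
    opening and closing the file only once."""
    with netCDF4.Dataset(path, mode="w") as nc:
        for name, ds in datasets.items():
            # Wrap the already-open group so xarray writes into it
            # instead of reopening the file for every group.
            store = NetCDF4DataStore(nc.createGroup(name))
            dump_to_store(ds, store)
```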

So are you asking whether:

a) we should add a function to xarray which uses the same trick your helper functions do, for when people have a similar problem to yours?
b) we should use the same trick to rewrite the I/O implementation of DataTree so that it only requires one open/close? (It seems to me that this could be the best of both worlds, once implemented - see the read-side sketch below.)
c) there is some other way to do this even faster than your helper functions?
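
The read side of (b) could use the same single-open pattern - again just a sketch under the same assumptions:

```python
import netCDF4
import xarray as xr
from xarray.backends import NetCDF4DataStore

def read_groups_single_open(path):
    """Read every top-level group of a netCDF file with a single open."""
    out = {}
    with netCDF4.Dataset(path, mode="r") as nc:
        for name, group in nc.groups.items():
            store = NetCDF4DataStore(group)
            # .load() pulls the data into memory before the file is closed
            out[name] = xr.open_dataset(store).load()
    return out
```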

EDIT: Tagging @alexamici / @aurghs for their backends expertise + interest in DataTree
