issue_comments: 1340701570

html_url: https://github.com/pydata/xarray/issues/7363#issuecomment-1340701570
issue_url: https://api.github.com/repos/pydata/xarray/issues/7363
id: 1340701570
node_id: IC_kwDOAMm_X85P6XuC
user: 8382834
created_at: 2022-12-07T10:12:07Z
updated_at: 2022-12-07T10:12:20Z
author_association: CONTRIBUTOR

Regarding your questions, @kmuehlbauer:

  • the dataset does fit in memory; dumped as netCDF to the hard drive, it takes about 2.7GB, as you say :)
  • I am not using dask-backed arrays
  • I think you are right: the typical RAM consumption due to this is probably around the 6GB you mention :)
  • I did a bit of testing, and it looks like VS Code (which I am using to run the notebooks) wastes a lot of RAM, in particular when there are large cell outputs, so that may be the biggest culprit here...

Regarding your question, @keewis: nothing special here:

```
In [] repr(timestamps_extended_basis)
Out [] 'array([-315619200, -315618600, -315618000, ..., 1667259000, 1667259600,\n 1667260200])'
```

but it is quite big:

```
In [] len(timestamps_extended_basis)
Out [] 3304800
```

Regarding the points of discussion / suggestions:

  • I think that the suggestion of @keewis to use concat is nice. This is probably how I will solve things for now :) (see the sketch right after this list).
  • @kmuehlbauer is it so surprising that the call to reindex is really slow? :) I am not sure how reindex matches the new indexes against the previous ones, but if it does a lookup for each new index ("does this new index exist in the old indexes?") by simply walking through them, that is a heavy computational cost: roughly 3.5e6 membership tests against 3e6 elements. I do not know how this is implemented in practice (for example, reindex may first sort the previous indexes (which requires an ordering relation on them) and then use a binary search rather than a naive scan, which would cut the complexity down quite a bit). But in any case, when just appending new indexes after the existing ones and keeping the old indexes unchanged, this will always be quite a lot more work than simply concat / extending the arrays, right? :)
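
To make the concat route concrete, here is a minimal sketch of extending a dataset along "time" by concatenating a pre-filled block, instead of reindexing the full axis. The dataset, the variable name ("signal") and the 600 s step are made up for illustration; only xr.concat and standard numpy calls are assumed.

```python
import numpy as np
import xarray as xr

# small stand-in dataset with a "time" coordinate (names and sizes are illustrative)
time = np.arange(0, 10 * 600, 600)  # 10 timestamps, 600 s apart
ds = xr.Dataset(
    {"signal": ("time", np.random.rand(time.size))},
    coords={"time": time},
)

# timestamps to append at the end of the existing axis
new_time = np.arange(time[-1] + 600, time[-1] + 6 * 600, 600)  # 5 new steps
filler = xr.Dataset(
    {"signal": ("time", np.full(new_time.size, 1.0e37))},  # default fill value
    coords={"time": new_time},
)

# concat simply stacks the two blocks along "time": no per-label lookup is
# needed, so the cost stays proportional to the data copied, even with
# millions of existing timestamps
ds_extended = xr.concat([ds, filler], dim="time")
```

The same pattern extends to datasets with more variables, as long as each variable gets a matching block in the fill dataset.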

My feeling is that, while concat works, this operation may be common enough that there could be interest in implementing a "grow_coordinate" function that grows / reallocates larger arrays, copying the previous chunk along a coordinate, as a usability / convenience feature. Something like:

```python
xr_dataset.grow_coordinate(coordinate_grow_along="time", location_extra_capacity="end", default_fill=1.0e37)
```

which would grow the coordinate "time" itself and all data variables that depend on it, adding default-filled extra entries at the end. I am not sure whether this should operate on coordinates or on dimensions; I am a bit of a n00b on this and always confused about coordinates vs. dimensions. A rough sketch of what such a helper could do is below.
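
For reference, here is a hypothetical sketch of what such a helper could look like if written outside xarray on top of xr.concat. grow_coordinate is not an existing xarray method; the n_extra and step parameters, and the assumption that every data variable uses the grown dimension and that it is the only coordinate, are illustrative choices, not part of any API.

```python
import numpy as np
import xarray as xr

def grow_coordinate(ds, coordinate_grow_along="time",
                    location_extra_capacity="end",
                    n_extra=1000, step=600, default_fill=1.0e37):
    """Hypothetical helper, not an xarray API: extend `ds` along one
    dimension coordinate, filling the new entries with `default_fill`.

    Sketch-only assumptions: the grown coordinate is evenly spaced with
    spacing `step`, every data variable uses that dimension, and it is the
    only coordinate of the dataset.
    """
    dim = coordinate_grow_along
    old = ds[dim].values

    if location_extra_capacity == "end":
        new_labels = old[-1] + step * np.arange(1, n_extra + 1)
    else:  # "start"
        new_labels = old[0] - step * np.arange(n_extra, 0, -1)

    # build an all-fill block with the same data variables, sized like `ds`
    # except along the grown dimension
    filler = xr.Dataset(
        {
            name: (
                var.dims,
                np.full(
                    tuple(new_labels.size if d == dim else n
                          for d, n in zip(var.dims, var.shape)),
                    default_fill,
                ),
            )
            for name, var in ds.data_vars.items()
        },
        coords={dim: new_labels},
    )

    pieces = [ds, filler] if location_extra_capacity == "end" else [filler, ds]
    return xr.concat(pieces, dim=dim)

# usage roughly matching the call above (as a function rather than a method):
# extended = grow_coordinate(xr_dataset, coordinate_grow_along="time",
#                            location_extra_capacity="end", default_fill=1.0e37)
```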

reactions:
{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
issue: 1479121713