issues: 267628781


id: 267628781
node_id: MDU6SXNzdWUyNjc2Mjg3ODE=
number: 1650
title: Low memory/out-of-core index?
user: 703554
state: open
locked: 0
comments: 17
created_at: 2017-10-23T11:13:06Z
updated_at: 2023-08-29T11:43:19Z
author_association: CONTRIBUTOR
repo: 13221727
type: issue

Has anyone considered implementing an index for monotonic data that does not require loading all values into main memory?

Motivation: We have data where the first dimension can have length ~100,000,000, and the coordinates for this dimension are stored as 32-bit integers. Currently, if we used a pandas Index, the values would be cast to 64-bit integers and the index would require ~1 GB of RAM. This isn't enormous, but it isn't negligible for people working on modest computers. Our use cases are simple: typically we only ever need to locate a slice of this dimension from a pair of coordinates, i.e., we only need to do a binary search (bisect) on the coordinates. To do a binary search there is in fact no need to load the coordinate values into memory at all; they could be left on disk (e.g., in an HDF5 or Zarr dataset) and still give perfectly adequate performance for our needs.
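
As a rough illustration of the idea (not an existing xarray or pandas API): a minimal sketch assuming the monotonic coordinates live in a hypothetical on-disk Zarr array at coords.zarr, wrapped in a tiny sequence adapter so Python's bisect only ever reads the individual values it probes.

    import bisect
    import zarr

    # Hypothetical path: a monotonically increasing coordinate array written
    # elsewhere, e.g. shape (100_000_000,), dtype int32.
    coords = zarr.open("coords.zarr", mode="r")

    class OnDiskCoords:
        # Expose __len__/__getitem__ so bisect can index the array lazily.
        # Each probe reads one element (from its containing chunk) off disk,
        # and a binary search over 100 million values needs only ~27 probes.
        def __init__(self, arr):
            self.arr = arr

        def __len__(self):
            return self.arr.shape[0]

        def __getitem__(self, i):
            return self.arr[i]

    def slice_for(arr, start, stop):
        # Positional slice covering coordinate values in [start, stop).
        view = OnDiskCoords(arr)
        lo = bisect.bisect_left(view, start)
        hi = bisect.bisect_left(view, stop)
        return slice(lo, hi)

    # Hypothetical usage: ds.isel(dim_0=slice_for(coords, 1_000_000, 2_000_000))

The point is that the index never needs to materialise the coordinate values in RAM; only O(log n) values are read per lookup.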

This is of course also relevant to pandas, but I thought I'd post here as I know there have been some discussions about how to handle indexes when working with larger datasets via dask.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1650/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}

Links from other tables

  • 1 row from issues_id in issues_labels
  • 8 rows from issue in issue_comments