Issue #5081: Lazy indexing arrays as a stand-alone package

Opened 2021-03-27 by user 1217238 (MEMBER) · state: open · 6 comments · last updated 2023-12-15

From @rabernat on Twitter:

"Xarray has some secret private classes for lazily indexing / wrapping arrays that are so useful I think they should be broken out into a standalone package. https://github.com/pydata/xarray/blob/master/xarray/core/indexing.py#L516"

The idea here is to create a first-class "duck array" library for lazy indexing that could replace xarray's internal classes for lazy indexing. This would be in some ways similar to dask.array, but much simpler, because it doesn't have to worry about parallel computing.
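As a rough illustration of what such a duck array might look like, here is a minimal sketch of a lazy-indexing wrapper. The class name and its representation of pending selections (one integer index array per axis, composed on each `__getitem__`) are assumptions for illustration, not xarray's actual internal design:

```python
import numpy as np

class LazilyIndexedArray:
    """Hypothetical sketch: wraps an array-like object and defers indexing.

    Each __getitem__ only composes the requested key with the stored one;
    the underlying array is touched only when values are materialized.
    """

    def __init__(self, array, key=None):
        self.array = array
        if key is None:
            # Default key selects everything, one index array per axis.
            key = tuple(np.arange(n) for n in array.shape)
        self.key = key

    @property
    def shape(self):
        return tuple(len(k) for k in self.key)

    def __getitem__(self, indexers):
        # Compose the new key with the stored one instead of reading data.
        new_key = tuple(k[i] for k, i in zip(self.key, indexers))
        return LazilyIndexedArray(self.array, new_key)

    def __array__(self, dtype=None, copy=None):
        # Materialize only on explicit conversion, via outer indexing.
        return np.asarray(self.array[np.ix_(*self.key)], dtype=dtype)

a = np.arange(24).reshape(4, 6)
lazy = LazilyIndexedArray(a)
sub = lazy[(slice(1, 3), slice(0, 4))]       # no data touched yet
sub = sub[(slice(0, 1), np.array([1, 3]))]   # still lazy
print(np.asarray(sub))                       # reads data only now
```

A real implementation would also need to handle basic, outer, and vectorized indexing modes separately, which is much of the complexity in xarray's existing `indexing.py`.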

Desired features:

  • Lazy indexing
  • Lazy transposes
  • Lazy concatenation (#4628) and stacking
  • Lazy vectorized operations (e.g., unary and binary arithmetic)
    • needed for decoding variables from disk (xarray.encoding) and
    • building lazy multi-dimensional coordinate arrays corresponding to map projections (#3620)
  • Maybe: lazy reshapes (#4113)
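The lazy unary/binary operations in the list above are what the decoding use case needs. A minimal sketch, assuming a hypothetical `LazyScaleOffset` wrapper modeled loosely on CF-style scale/offset decoding (the class and its API are illustrative, not xarray's):

```python
import numpy as np

class LazyScaleOffset:
    """Hypothetical lazy unary op: decode scale/offset on access."""

    def __init__(self, array, scale=1.0, offset=0.0):
        self.array = array
        self.scale = scale
        self.offset = offset
        self.shape = array.shape
        self.dtype = np.dtype("float64")

    def __getitem__(self, key):
        # Index first, decode after: the arithmetic runs only on the
        # elements actually selected, never on the full backing array.
        return self.array[key] * self.scale + self.offset

packed = np.array([0, 2, 4, 6], dtype="int16")   # values as stored on disk
decoded = LazyScaleOffset(packed, scale=0.5, offset=10.0)
print(decoded[1:3])   # only two elements are decoded
```

The same index-then-compute pattern extends to binary operations, as long as both operands can be indexed with the same key.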

A common feature of these operations is that they can (and almost always should) be fused with indexing: if N elements are selected via indexing, only O(N) compute and memory is required to produce them, regardless of the size of the original arrays, as long as the number of applied operations can be treated as a constant. Memory access is significantly slower than compute on modern hardware, so recomputing these operations on the fly is almost always a good idea.
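The O(N) claim can be checked directly with a small counting wrapper (a hypothetical helper for this sketch, not part of any library): when a deferred operation is applied after indexing, selecting 3 elements from a million-element array reads only 3 elements.

```python
import numpy as np

class CountingArray:
    """Counts how many elements are actually read from the backing array."""

    def __init__(self, array):
        self.array = array
        self.shape = array.shape
        self.elements_read = 0

    def __getitem__(self, key):
        result = self.array[key]
        self.elements_read += np.size(result)
        return result

backing = CountingArray(np.arange(1_000_000))
# Fusion in action: index first, then apply the deferred computation,
# so only the selected elements are ever read or computed.
decoded_elem = backing[np.array([5, 10, 15])] * 2.0
print(backing.elements_read)  # 3, not 1_000_000
```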

Out of scope: lazy computation where indexing could require access to many more elements than are returned in order to compute the desired values. For example, mean() probably should not be lazy, because it could involve computing a very large number of elements whose result one might want to cache.

This is valuable functionality for Xarray for two reasons:

  1. It allows for "previewing" small bits of data loaded from disk or remote storage, even if that data needs some form of cheap "decoding" from its form on disk.
  2. It allows for xarray to decode data in a lazy fashion that is compatible with full-featured systems for lazy computation (e.g., Dask), without requiring the user to choose dask when reading the data.

Related issues:

  • [Proposal] Expose Variable without Pandas dependency #3981
  • Lazy concatenation of arrays #4628
Reactions: 6 (❤️ 6)
