
issues: 902009258


id: 902009258
node_id: MDU6SXNzdWU5MDIwMDkyNTg=
number: 5376
title: Multi-scale datasets and custom indexes
user: 4160723
state: open
locked: 0
comments: 6
created_at: 2021-05-26T08:38:00Z
updated_at: 2021-06-02T08:07:38Z
author_association: MEMBER

I've been wondering if:

  • multi-scale datasets are generic enough to implement some related functionality in Xarray, e.g., as new Dataset and/or DataArray method(s)
  • we could leverage custom indexes for that (see the design notes)

I'm thinking of an API that would look like this:

```python
import numpy as np

# lazily load a big n-d image (full resolution) as a xarray.Dataset
xyz_dataset = ...

# set a new index for the x/y/z coordinates
# (reduction and pre_compute_scales are optional and passed
# as arguments to ImagePyramidIndex)
xyz_dataset.set_index(
    ('x', 'y', 'z'),
    ImagePyramidIndex,
    reduction=np.mean,
    pre_compute_scales=(2, 2),
)

# get a slice (ImagePyramidIndex will be used to dynamically scale the data
# or load the right pre-computed dataset)
xyz_slice = xyz_dataset.sel_and_rescale(x=slice(...), y=slice(...), z=slice(...))
```

where ImagePyramidIndex is not a "common" index, i.e., it cannot be used directly with Xarray's .sel() or for data alignment. Using an index here might still make sense for this kind of data extraction and resampling operation IMHO. We could extend the xarray.Index API to handle multi-scale datasets, so that ImagePyramidIndex could either do the scaling dynamically (maybe using a cache) or just lazily load pre-computed data, e.g., from an NGFF / OME-Zarr dataset... Both the implementation and functionality can be pretty flexible. Custom options may be passed through the Xarray API either when creating the index or when extracting a data slice.
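To make the "scale dynamically, maybe using a cache" option a bit more concrete, here is a minimal sketch. It is not the xarray.Index API: PyramidScaler, at_scale and the integer factor argument are hypothetical names made up for illustration, and the only real Xarray calls used are Dataset.coarsen() and Dataset.sizes.

```python
import numpy as np
import xarray as xr


class PyramidScaler:
    """Hypothetical helper (not part of Xarray): downscale on demand and cache."""

    def __init__(self, dataset, reduction="mean"):
        self._dataset = dataset
        self._reduction = reduction  # name of a coarsen reduction, e.g. "mean"
        self._cache = {}  # maps scale factor -> coarsened Dataset

    def at_scale(self, factor):
        """Return the dataset block-reduced by `factor` along every dimension."""
        if factor == 1:
            return self._dataset
        if factor not in self._cache:
            windows = {dim: factor for dim in self._dataset.sizes}
            coarse = self._dataset.coarsen(windows, boundary="trim")
            self._cache[factor] = getattr(coarse, self._reduction)()
        return self._cache[factor]


# Small stand-in for a "big n-d image".
ds = xr.Dataset(
    {"image": (("y", "x"), np.arange(16.0).reshape(4, 4))},
    coords={"x": np.arange(4), "y": np.arange(4)},
)
scaler = PyramidScaler(ds)
half = scaler.at_scale(2)  # computed once, then served from the cache
```

A real index would presumably key the cache on more than a single factor (per-dimension scales, reduction function), but the lookup-or-compute pattern would stay the same.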

A hierarchical structure of xarray.Dataset objects is already discussed in #4118 for multi-scale datasets, but I'm wondering if using indexes could be an alternative approach (it could also be complementary, i.e., ImagePyramidIndex could rely on such a hierarchical structure under the hood).
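As a rough illustration of the "complementary" case, the sketch below stands in for the hierarchical structure with a plain dict of pre-computed levels and picks the best one for a requested scale. build_levels and pick_level are hypothetical helpers, not anything proposed in #4118; the dict is only an assumption used to keep the example self-contained.

```python
import numpy as np
import xarray as xr


def build_levels(dataset, factors):
    """Pre-compute block-averaged copies, keyed by their scale factor."""
    levels = {1: dataset}
    for factor in factors:
        windows = {dim: factor for dim in dataset.sizes}
        levels[factor] = dataset.coarsen(windows, boundary="trim").mean()
    return levels


def pick_level(levels, requested_factor):
    """Return the coarsest stored factor that does not exceed the request."""
    if requested_factor < 1:
        raise ValueError("requested_factor must be >= 1")
    return max(f for f in levels if f <= requested_factor)


ds = xr.Dataset(
    {"image": (("y", "x"), np.arange(64.0).reshape(8, 8))},
    coords={"x": np.arange(8), "y": np.arange(8)},
)
levels = build_levels(ds, factors=(2, 4))
chosen = pick_level(levels, 3)  # no level 3 is stored, so fall back to level 2
```

An index built on top of this could then rescale the chosen level dynamically to reach the exact requested factor.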

I'd see some advantages of the index approach, although this is the perspective of a naive user who is not working with multi-scale datasets:

  • it is flexible: the scaling may be done dynamically without having to store the results in a hierarchical collection with some predefined discrete levels
  • we don't need to expose anything other than a simple xarray.Dataset + a "black-box" index in which we abstract away all the implementation details. The API example shown above seems more intuitive to me than having to deal directly with Dataset groups.
  • Xarray will provide a plugin system for 3rd party indexes, allowing for more ImagePyramidIndex variants. Xarray already provides an extension mechanism (accessors) for methods like sel_and_rescale in the example above...
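For the accessor part, a minimal sketch using the existing xarray.register_dataset_accessor decorator could look like the following. The accessor name "pyramid" and the scale keyword are assumptions made for this example; with an accessor the call becomes ds.pyramid.sel_and_rescale(...) rather than a method directly on the Dataset, and the rescaling here is a plain coarsen().mean() instead of anything index-driven.

```python
import numpy as np
import xarray as xr


@xr.register_dataset_accessor("pyramid")  # "pyramid" is a made-up accessor name
class PyramidAccessor:
    def __init__(self, dataset):
        self._obj = dataset

    def sel_and_rescale(self, scale=1, **indexers):
        """Select by label, then block-average by `scale` along the selected dims.

        Assumes every indexer is a slice, so the selected dims are preserved.
        """
        subset = self._obj.sel(**indexers)
        if scale == 1:
            return subset
        windows = {dim: scale for dim in indexers}
        return subset.coarsen(windows, boundary="trim").mean()


ds = xr.Dataset(
    {"image": (("y", "x"), np.arange(64.0).reshape(8, 8))},
    coords={"x": np.arange(8), "y": np.arange(8)},
)
# Label-based slices include both endpoints, so this selects a 4x4 window.
patch = ds.pyramid.sel_and_rescale(scale=2, x=slice(0, 3), y=slice(0, 3))
```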

That said, I'd also see the benefits of exposing Dataset groups more transparently to users (in case those are loaded from a store that supports it).

cc @thewtex @joshmoore @d-v-b

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/5376/reactions",
    "total_count": 1,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 1
}
repo: 13221727
type: issue
