home / github / issues

Menu
  • Search all tables
  • GraphQL API

issues: 597475005

This data as json

id node_id number title user state locked assignee milestone comments created_at updated_at closed_at author_association active_lock_reason draft pull_request body reactions performed_via_github_app state_reason repo type
597475005 MDU6SXNzdWU1OTc0NzUwMDU= 3959 Extending Xarray for domain-specific toolkits 6130352 closed 0     10 2020-04-09T18:34:34Z 2020-04-13T16:36:33Z 2020-04-13T16:36:32Z NONE      

Hi, I have a question about how to design an API over Xarray for a domain-specific use case (in genetics). Having seen the following now:

  • Extending xarray
  • subclassing DataSet?
  • Subclassing Dataset and DataArray (issue #706)
  • Decorators for registering custom accessors in xarray (PR #806)

I wanted to reach out and seek some advice on what I'd like to do given that I don't think any of the solutions there are what I'm looking for.

More specifically, I would like to model the datasets we work with as xr.Dataset subtypes but I'd like to enforce certain preconditions for those types as well as support conversions between them. An example would be that I may have a domain-specific type GenotypeDataset that should always contain 3 DataArrays and each of those arrays should meet different dtype and dimensionality constraints. That type may be converted to another type, say HaplotypeDataset, where the underlying data goes through some kind of transformation to produce a lower dimensional form more amenable to a specific class of algorithms.

One API I envision around these models consists of functions that enforce nominal typing on Xarray classes, so in that case I don't actually care if my subtypes are preserved by Xarray when operations are run. It would be nice if that subtyping wasn't lost but I can understand that it's a limitation for now. Here's an example of what I mean:

```python from genetics import api

arr1 = ??? # some 3D integer DataArray of allele indices arr2 = ??? # A missing data boolean DataArray arr3 = ??? # Some other domain-specific stuff like variant phasing ds = api.GenotypeDataset(arr1, arr2, arr3)

A function that would be in the API would look like:

def analyze_haplotype(ds: xr.Dataset) -> xr.Dataset: # Do stuff assuming that the user has supplied a dataset compliant with # the "HaplotypeDataset" constraints pass

analyze_haplotype(ds.to_haplotype_dataset()) ```

I like the idea of trying to avoid requiring API-specific data structures for all functionality in favor of conventions over Xarray data structures. I think conveniences like these subtypes would be great for enforcing those conventions (rather than checking at the beginning of each function) as well as making it easier to go between representations, but I'm certainly open to suggestion. I think something akin to structural subtyping that extends to what arrays are contained in the Dataset, how coordinates are named, what datatypes are used, etc. would be great but I have no idea if that's possible.

All that said, is it still a bad idea to try to subclass Xarray data structures even if the intent was never to touch any part of the internal APIs? I noticed Xarray does some stuff like type(array)(...) internally but that's the only catch I've found so far (which I worked around by dispatching to constructors based on the arguments given).

cc: @alimanfoo - Alistair raised some concerns about trying this to me, so he may have some thoughts here too

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/3959/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed 13221727 issue

Links from other tables

  • 1 row from issues_id in issues_labels
  • 10 rows from issue in issue_comments
Powered by Datasette · Queries took 0.845ms · About: xarray-datasette