id,node_id,number,title,user,state,locked,assignee,milestone,comments,created_at,updated_at,closed_at,author_association,active_lock_reason,draft,pull_request,body,reactions,performed_via_github_app,state_reason,repo,type
1198668507,I_kwDOAMm_X85Hcjrb,6462,Provide protocols for creating structural subtypes of DataArray/Dataset,29104956,open,0,,,5,2022-04-09T15:09:40Z,2023-09-16T19:55:59Z,,NONE,,,,"### Is your feature request related to a problem?

I frequently find myself wanting to annotate functions in terms of xarray objects that adhere to a particular schema. Given that a dataset's adherence to a schema is a matter of its structure/contents, it is unnatural to try to describe a schema as a subtype of `xr.Dataset` (or `DataArray`) (i.e. a type-checker ought not care that a dataset is an instance of a specific subclass of `Dataset`).

### Describe the solution you'd like

Instead, it would be ideal to define a schema as a [Protocol (structural subtype)](https://peps.python.org/pep-0544/) of `xr.Dataset`. Unfortunately, one cannot [subclass a normal class to create a protocol](https://peps.python.org/pep-0544/#protocols-subclassing-normal-classes). Thus, I am proposing that `xarray` provide Protocol-based descriptions of `DataArray` and `Dataset` so that users can describe schemas as **structural subtypes** of these classes. E.g.

```python
from typing import Protocol

from xarray import DataArray
from xarray.typing import DatasetProtocol


class ClimateData(DatasetProtocol, Protocol):
    lat: DataArray
    lon: DataArray
    temp: DataArray
    precip: DataArray


def process_climate_data(ds: ClimateData):
    ds.banana  # type checker flags as unknown attribute
    ds.temp  # type checker sees ""DataArray"" (as informed by ClimateData)
    ds.sel(lat=1.0)  # type checker sees `Dataset` (as informed by `DatasetProtocol`)
```

The contents of `DatasetProtocol` would essentially look like a modified type stub for `xarray.Dataset` so the implementation details are relatively simple, I believe.
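For illustration, here is a rough, standard-library-only sketch of the structural-subtyping idea. `DatasetLike` is a hypothetical stand-in for the proposed `DatasetProtocol` (it is not an existing xarray API), `FakeDataset` is a toy class unrelated to xarray, and `Any` stands in for `DataArray`. The runtime `isinstance` check is only there to make the structural matching visible; the proposal itself is about static checking.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class DatasetLike(Protocol):
    # Hypothetical stand-in for the proposed DatasetProtocol; only one method is sketched.
    def sel(self, **indexers: Any) -> 'DatasetLike': ...


@runtime_checkable
class ClimateData(DatasetLike, Protocol):
    # Schema: the data variables a conforming dataset must expose as attributes.
    temp: Any
    precip: Any


class FakeDataset:
    # Toy class that does not inherit from either protocol.
    def __init__(self, **variables: Any) -> None:
        for name, value in variables.items():
            setattr(self, name, value)

    def sel(self, **indexers: Any) -> 'FakeDataset':
        return self


def mean_temp(ds: ClimateData) -> float:
    # Only attributes declared by the schema are used here.
    return sum(ds.temp) / len(ds.temp)


good = FakeDataset(temp=[1.0, 3.0], precip=[0.0, 2.0])
bad = FakeDataset(temp=[1.0, 3.0])  # missing the precip variable

print(isinstance(good, ClimateData))
print(isinstance(bad, ClimateData))
print(mean_temp(good))
```

Neither object derives from `ClimateData`; conformance depends only on which attributes and methods are present, which is the property the proposed protocols would let a static type checker verify.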
### Describe alternatives you've considered

Creating a strict subtype of `Dataset` is not ideal for a few reasons:

1. Static type checkers would then expect datasets to derive from that particular subclass, which is generally not the case.
2. The annotations / design of `xarray.Dataset` are too broad for describing a schema. E.g. the presence of `__getattr__` prevents type checkers from flagging access to non-existent data variables and coordinates during static analysis. `DatasetProtocol` would need to be designed to be less permissive than this.

### Additional context

Hopefully this could be leveraged by the likes of [xarray-schema](https://github.com/carbonplan/xarray-schema) so that xarray schemas can be used to provide both runtime *and* static validation capabilities.

I'd love to get feedback on this, and would be happy to open a PR if xarray devs are willing to weigh in on the design of these protocols.","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/6462/reactions"", ""total_count"": 11, ""+1"": 11, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,issue