id,node_id,number,title,user,state,locked,assignee,milestone,comments,created_at,updated_at,closed_at,author_association,active_lock_reason,draft,pull_request,body,reactions,performed_via_github_app,state_reason,repo,type 1198668507,I_kwDOAMm_X85Hcjrb,6462,Provide protocols for creating structural subtypes of DataArray/Dataset,29104956,open,0,,,5,2022-04-09T15:09:40Z,2023-09-16T19:55:59Z,,NONE,,,,"### Is your feature request related to a problem? I frequently find myself wanting to annotate functions in terms of xarray objects that adhere to a particular schema. Given that a dataset's adherence to a schema is a matter of its structure/contents, it is unnatural to try to describe a schema as a subtype of `xr.Dataset` (or `DataArray`) (i.e. a type-checker ought not care that a dataset is an instance of a specific subclass of `Dataset`). ### Describe the solution you'd like Instead, it would be ideal to define a schema as a [Protocol (structural subtype)](https://peps.python.org/pep-0544/) of `xr.Dataset`. Unfortunately, one cannot [subclass a normal class to create a protocol](https://peps.python.org/pep-0544/#protocols-subclassing-normal-classes). Thus, I am proposing that `xarray` provide Protocol-based descriptions of `DataArray` and `Dataset` so that users can describe schemas as **structural subtypes** of these classes. E.g. ```python from typing import Protocol from xarray import DataArray from xarray.typing import DatasetProtocol class ClimateData(DatasetProtocol, Protocol): lat: DataArray lon: DataArray temp: DataArray precip: DataArray def process_climate_data(ds: ClimateData): ds.banana # type checker flags as unknown attribute ds.temp # type checker sees ""DataArray"" (as informed by ClimateData) ds.sel(lat=1.0) # type checker sees `Dataset` (as informed by `DatasetProtocol`) ``` The contents of `DatasetProtocol` would essentially look like a modified type stub for `xarray.Dataset` so the implementation details are relatively simple, I believe. 
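As a very rough sketch of what `DatasetProtocol` might contain (everything here is hypothetical -- neither `xarray.typing` nor a `DatasetProtocol` class exists in xarray today):

```python
from typing import Any, Hashable, Mapping, Optional, Protocol, runtime_checkable

# Hypothetical sketch only -- nothing below exists in xarray itself. A real
# DatasetProtocol would mirror the type stubs for xarray.Dataset, minus
# overly-permissive members such as __getattr__.
@runtime_checkable
class DatasetProtocol(Protocol):
    @property
    def dims(self) -> Mapping[Hashable, int]: ...

    def sel(
        self,
        indexers: Optional[Mapping[Any, Any]] = None,
        **indexers_kwargs: Any,
    ) -> 'DatasetProtocol': ...
```

Since the protocol only describes structure, any object exposing these members satisfies it, which is exactly what would let user schemas extend it without inheriting from `Dataset`.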
### Describe alternatives you've considered Creating a strict subtype of `Dataset` is not ideal for a few reasons: 1. Static type checkers would then expect to see that datasets must derive from that particular subclass, which is generally not the case. 2. The annotations / design of `xarray.Dataset` are too broad for describing a schema. E.g. the presence of `__getattr__` prevents type checkers from flagging access to non-existent data variables and coordinates during static analysis. `DatasetProtocol` would need to be designed to be less permissive than this. ### Additional context Hopefully this could be leveraged by the likes of [xarray-schema](https://github.com/carbonplan/xarray-schema) so that xarray schemas can be used to provide both runtime *and* static validation capabilities. I'd love to get feedback on this, and would be happy to open a PR if xarray devs are willing to weigh in on the design of these protocols.","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/6462/reactions"", ""total_count"": 11, ""+1"": 11, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,issue 1226931933,I_kwDOAMm_X85JIX7d,6576,Basic examples for creating data structures fail type-checking,29104956,closed,0,,,2,2022-05-05T16:42:00Z,2022-05-27T18:01:33Z,2022-05-27T18:01:33Z,NONE,,,,"### What happened? The examples provided by [this documentation](https://docs.xarray.dev/en/stable/user-guide/data-structures.html) reveal issues with the type-annotations for `DataArray` and `Dataset`. Running mypy and pyright on these basic use-cases, only slightly modified, produces type-checking errors. ### What did you expect to happen? The annotations for these classes should accommodate these common use-cases. 
### Minimal Complete Verifiable Example ```Python # run mypy or pyright on the following file to reproduce the errors import numpy as np import xarray as xr import pandas as pd data = np.random.rand(4, 3) locs = [""IA"", ""IL"", ""IN""] times = pd.date_range(""2000-01-01"", periods=4) foo = xr.DataArray( data, coords=[times, locs], # error: List item 1 has incompatible type ""List[str]""; expected ""Tuple[Any, ...]"" dims=[""time"", ""space""], ) temp = 15 + 8 * np.random.randn(2, 2, 3) precip = 10 * np.random.rand(2, 2, 3) lon = [[-99.83, -99.32], [-99.79, -99.23]] lat = [[42.25, 42.21], [42.63, 42.59]] A = { ""temperature"": ([""x"", ""y"", ""time""], temp), ""precipitation"": ([""x"", ""y"", ""time""], precip), } C = { ""lon"": ([""x"", ""y""], lon), ""lat"": ([""x"", ""y""], lat), ""time"": pd.date_range(""2014-09-06"", periods=3), ""reference_time"": pd.Timestamp(""2014-09-05""), } ds = xr.Dataset( A, # error: Argument 1 to ""Dataset"" has incompatible type ""Dict[str, Tuple[List[str], Any]]""; expected ""Optional[Mapping[Hashable, Any]]"" coords=C, # error: Argument ""coords"" to ""Dataset"" has incompatible type ""Dict[str, Any]""; expected ""Optional[Mapping[Hashable, Any]]"" ) ``` ### MVCE confirmation - [x] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray. - [x] Complete example — the example is self-contained, including all data and the text of any traceback. - [x] Verifiable example — the example copy & pastes into an IPython prompt or [Binder notebook](https://mybinder.org/v2/gh/pydata/xarray/main?urlpath=lab/tree/doc/examples/blank_template.ipynb), returning the result. - [x] New issue — a search of GitHub Issues suggests this is not a duplicate. ### Relevant log output _No response_ ### Anything else we need to know? 
Some of these errors are circumvented when one provides a literal inline, and thus exploits [bidirectional inference](https://github.com/microsoft/pyright/blob/main/docs/type-inference.md#bidirectional-type-inference-expected-types), which may be why the current mypy tests run in your CI miss these. E.g. ```python from typing import Dict, Hashable, Any def f(x: Dict[Hashable, Any]): ... f({""hi"": 1}) # this is ok -- uses bidirectional inference to see Dict[Hashable, Any] x = {""hi"": 1} f(x) # error: Dict[Hashable, Any] is invariant in Hashable, and is incompatible with str ``` This is a sticky situation, as the key type is invariant even in `Mapping`: https://github.com/python/typing/issues/445. IMHO it would be great to tweak these annotations, e.g. `Hashable -> Hashable | str | ` to ensure that users don't face such false positives. ### Environment
INSTALLED VERSIONS ------------------ commit: None python: 3.8.13 | packaged by conda-forge | (default, Mar 25 2022, 06:04:18) [GCC 10.3.0] python-bits: 64 OS: Linux OS-release: 4.15.0-153-generic machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: ('en_US', 'UTF-8') libhdf5: None libnetcdf: None xarray: 0.19.0 pandas: 1.3.3 numpy: 1.20.3 scipy: 1.7.1 netCDF4: None pydap: None h5netcdf: None h5py: None Nio: None zarr: None cftime: None nc_time_axis: None PseudoNetCDF: None rasterio: None cfgrib: None iris: None bottleneck: 1.3.2 dask: None distributed: None matplotlib: 3.5.2 cartopy: None seaborn: None numbagg: None pint: None setuptools: 59.5.0 pip: 21.3 conda: None pytest: 6.2.5 IPython: 7.28.0 sphinx: 4.5.0
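To illustrate the trade-off (this is only a sketch of one possible direction, not a claim about what xarray's signatures should become), erasing the key type to `Any` sidesteps the invariance entirely:

```python
from typing import Any, Dict, Hashable, Mapping

def takes_hashable_keys(x: Mapping[Hashable, Any]) -> int:
    # Invariant key type: mypy/pyright reject a Dict[str, Any] argument
    # here, even though the call is perfectly fine at runtime.
    return len(x)

def takes_any_keys(x: Mapping[Any, Any]) -> int:
    # Key type erased to Any: Dict[str, Any], Dict[int, Any], etc. all
    # type-check, at the cost of losing any checking on the key type.
    return len(x)

d: Dict[str, Any] = {'hi': 1, 'bye': 2}
takes_hashable_keys(d)  # flagged statically; runs fine
takes_any_keys(d)       # accepted both statically and at runtime
```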
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/6576/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue 634869703,MDU6SXNzdWU2MzQ4Njk3MDM=,4131,Why am I able to load data from a closed dataset?,29104956,closed,0,,,10,2020-06-08T19:14:46Z,2022-04-05T18:35:06Z,2022-04-05T18:35:06Z,NONE,,,," I don't understand why I am able to open and close a dataset, but then proceed to read data from said dataset. I can open a 4 GB dataset and promptly close is, and then still access the data within, which appears to still be loading lazily. Does querying a closed dataset automatically reopen it? #### MCVE Code Sample ```python import numpy as np import xarray as xr ds = xr.Dataset({""foo"": ((""x"",), np.random.rand(4,))}, coords={""x"": [10, 20, 30, 40]}) ds.to_netcdf(""tmp_example.nc"") ``` ```python >>> data = xr.open_dataset(""tmp_example.nc"") >>> data.close() >>> data.foo array([0.894788, 0.017935, 0.696086, 0.827004]) Coordinates: * x (x) int64 10 20 30 40 ``` #### Expected Output Because netCDF data sets are loaded lazily, I would imagine that, having not been touched when opened, that closing the data set would render it inaccessible #### Versions
Output of xr.show_versions() INSTALLED VERSIONS ------------------ commit: None python: 3.7.3 (default, Mar 27 2019, 22:11:17) [GCC 7.3.0] python-bits: 64 OS: Linux OS-release: 4.4.0-166-generic machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: en_US.UTF-8 libhdf5: 1.10.2 libnetcdf: 4.6.1 xarray: 0.15.0 pandas: 1.0.3 numpy: 1.16.3 scipy: 1.4.1 netCDF4: 1.4.1 pydap: None h5netcdf: None h5py: 2.8.0 Nio: None zarr: None cftime: 1.1.3 nc_time_axis: None PseudoNetCDF: None rasterio: None cfgrib: None iris: None bottleneck: None dask: 2.14.0 distributed: None matplotlib: 3.0.3 cartopy: None seaborn: None numbagg: None setuptools: 46.1.3.post20200330 pip: 20.0.2 conda: None pytest: 5.4.1 IPython: 7.5.0 sphinx: 2.4.4
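For concreteness, the behavior I'm observing looks like the reopen-on-access pattern below. This sketch is purely illustrative (the class and all names are mine, not xarray's actual internals):

```python
class ReopeningHandle:
    # Illustrative only: a file-like object that closes its underlying
    # handle on close(), but silently reopens the file the next time data
    # is requested -- the same observable behavior as in the MCVE above.
    def __init__(self, path):
        self._path = path
        self._handle = open(path, 'rb')

    def close(self):
        if self._handle is not None:
            self._handle.close()
            self._handle = None

    def read(self, n=-1):
        if self._handle is None:
            self._handle = open(self._path, 'rb')  # lazy reopen
        return self._handle.read(n)
```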
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/4131/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue