issues: 3 rows where user = 29104956, sorted by updated_at descending
#6462: Provide protocols for creating structural subtypes of DataArray/Dataset

rsokl (29104956) · state: open · comments: 5 · created: 2022-04-09T15:09:40Z · updated: 2023-09-16T19:55:59Z · author association: NONE · repo: xarray · type: issue · id: 1198668507 · node_id: I_kwDOAMm_X85Hcjrb

Is your feature request related to a problem?

I frequently find myself wanting to annotate functions in terms of xarray objects that adhere to a particular schema. Given that a dataset's adherence to a schema is a matter of its structure/contents, it is unnatural to try to describe a schema as a subtype of xr.Dataset (or DataArray); i.e., a type checker ought not to care that a dataset is an instance of a specific subclass of Dataset.

Describe the solution you'd like

Instead, it would be ideal to define a schema as a Protocol (structural subtype) of xr.Dataset. Unfortunately, one cannot subclass a normal (non-protocol) class to create a protocol, as the sketch below illustrates.
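For instance (a minimal sketch; the ClimateData name here is only illustrative), both type checkers and the runtime reject mixing a concrete class into a protocol:

```python
from typing import Protocol

import xarray as xr

# Flagged statically ("Protocol class cannot inherit from non-protocol") and
# rejected at runtime with
# TypeError: Protocols can only inherit from other protocols
class ClimateData(xr.Dataset, Protocol):
    temp: xr.DataArray
```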

Thus, I am proposing that xarray provide Protocol-based descriptions of DataArray and Dataset so that users can describe schemas as structural subtypes of these classes. E.g.

```python
from typing import Protocol

from xarray import DataArray
from xarray.typing import DatasetProtocol

class ClimateData(DatasetProtocol, Protocol):
    lat: DataArray
    lon: DataArray
    temp: DataArray
    precip: DataArray

def process_climate_data(ds: ClimateData):
    ds.banana        # type checker flags as unknown attribute
    ds.temp          # type checker sees "DataArray" (as informed by ClimateData)
    ds.sel(lat=1.0)  # type checker sees Dataset (as informed by DatasetProtocol)
```

The contents of DatasetProtocol would essentially look like a modified type stub for xarray.Dataset, so the implementation details are relatively simple, I believe (a rough sketch follows).
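For concreteness, here is a rough sketch of what a small slice of DatasetProtocol could look like. This is entirely hypothetical: the member set and signatures are simplified stand-ins, not xarray's actual API.

```python
from typing import Any, Hashable, Mapping, Optional, Protocol

from xarray import Dataset

class DatasetProtocol(Protocol):
    # Mirrors a slice of the Dataset API, but deliberately omits the permissive
    # __getattr__ so that unknown attribute access is a static error.
    @property
    def dims(self) -> Mapping[Hashable, int]: ...

    @property
    def attrs(self) -> dict: ...

    # Simplified signature for illustration only.
    def sel(
        self, indexers: Optional[Mapping[Any, Any]] = None, **indexers_kwargs: Any
    ) -> Dataset: ...
```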

Describe alternatives you've considered

Creating a strict subtype of Dataset is not ideal for a few reasons:

  1. Static type checkers would then expect to see that datasets must derive from that particular subclass, which is generally not the case.
  2. The annotations/design of xarray.Dataset are too broad for describing a schema. E.g. the presence of __getattr__ prevents type checkers from flagging access to non-existent data variables and coordinates during static analysis (see the sketch below). DatasetProtocol would need to be designed to be less permissive than this.
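A minimal sketch of how __getattr__ masks typos during static analysis (the dataset and variable names are illustrative):

```python
import xarray as xr

ds = xr.Dataset({"temp": (("x",), [1.0, 2.0])})

ds.temp  # works at runtime, but type checkers only see the Any from __getattr__
ds.tmep  # typo: raises AttributeError at runtime, yet no static error is flagged
```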

Additional context

Hopefully this could be leveraged by the likes of xarray-schema so that xarray schemas can be used to provide both runtime and static validation capabilities.

I'd love to get feedback on this, and would be happy to open a PR if xarray devs are willing to weigh in on the design of these protocols.

Reactions: total 11 (+1: 11; all others 0)
#6576: Basic examples for creating data structures fail type-checking

rsokl (29104956) · state: closed · comments: 2 · created: 2022-05-05T16:42:00Z · updated: 2022-05-27T18:01:33Z · closed: 2022-05-27T18:01:33Z · author association: NONE · repo: xarray · type: issue · id: 1226931933 · node_id: I_kwDOAMm_X85JIX7d

What happened?

The examples provided by this documentation reveal issues with the type annotations for DataArray and Dataset. Running mypy and pyright on these basic use-cases, only slightly modified, produces type-checking errors.

What did you expect to happen?

The annotations for these classes should accommodate these common use-cases.

Minimal Complete Verifiable Example

```Python
# run mypy or pyright on the following file to reproduce the errors
import numpy as np
import pandas as pd
import xarray as xr

data = np.random.rand(4, 3)
locs = ["IA", "IL", "IN"]
times = pd.date_range("2000-01-01", periods=4)

foo = xr.DataArray(
    data,
    coords=[times, locs],  # error: List item 1 has incompatible type "List[str]"; expected "Tuple[Any, ...]"
    dims=["time", "space"],
)

temp = 15 + 8 * np.random.randn(2, 2, 3)
precip = 10 * np.random.rand(2, 2, 3)
lon = [[-99.83, -99.32], [-99.79, -99.23]]
lat = [[42.25, 42.21], [42.63, 42.59]]

A = {"temperature": (["x", "y", "time"], temp),
     "precipitation": (["x", "y", "time"], precip)}

C = {"lon": (["x", "y"], lon), "lat": (["x", "y"], lat),
     "time": pd.date_range("2014-09-06", periods=3),
     "reference_time": pd.Timestamp("2014-09-05")}

ds = xr.Dataset(
    A,  # error: Argument 1 to "Dataset" has incompatible type "Dict[str, Tuple[List[str], Any]]"; expected "Optional[Mapping[Hashable, Any]]"
    coords=C,  # error: Argument "coords" to "Dataset" has incompatible type "Dict[str, Any]"; expected "Optional[Mapping[Hashable, Any]]"
)
```

MVCE confirmation

  • [x] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • [x] Complete example — the example is self-contained, including all data and the text of any traceback.
  • [x] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • [x] New issue — a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

No response

Anything else we need to know?

Some of these errors are circumvented when one provides a literal inline, which exploits bidirectional inference; this may be why the mypy tests currently run in your CI miss these.

E.g.

```python
from typing import Any, Dict, Hashable

def f(x: Dict[Hashable, Any]): ...

f({"hi": 1})  # this is ok -- uses bidirectional inference to see Dict[Hashable, Any]

x = {"hi": 1}
f(x)  # error: Dict[Hashable, Any] is invariant in Hashable, and is incompatible with str
```

This is a sticky situation, as the key type is invariant even in Mapping: https://github.com/python/typing/issues/445. IMHO it would be great to tweak these annotations, e.g. Hashable -> Hashable | str | <other common coord types>, to ensure that users don't face such false positives.
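For reference, a workaround available to users today is to annotate the variable explicitly, so that the dict literal is checked against the target type at the assignment; a minimal sketch:

```python
from typing import Any, Dict, Hashable

def f(x: Dict[Hashable, Any]): ...

# Annotating the variable makes the dict literal check directly against
# Dict[Hashable, Any], so its inferred key type is Hashable rather than str.
x: Dict[Hashable, Any] = {"hi": 1}
f(x)  # ok
```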

Environment

```
INSTALLED VERSIONS
------------------
commit: None
python: 3.8.13 | packaged by conda-forge | (default, Mar 25 2022, 06:04:18) [GCC 10.3.0]
python-bits: 64
OS: Linux
OS-release: 4.15.0-153-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: None
libnetcdf: None
xarray: 0.19.0
pandas: 1.3.3
numpy: 1.20.3
scipy: 1.7.1
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: 1.3.2
dask: None
distributed: None
matplotlib: 3.5.2
cartopy: None
seaborn: None
numbagg: None
pint: None
setuptools: 59.5.0
pip: 21.3
conda: None
pytest: 6.2.5
IPython: 7.28.0
sphinx: 4.5.0
```
Reactions: none · state reason: completed
#4131: Why am I able to load data from a closed dataset?

rsokl (29104956) · state: closed · comments: 10 · created: 2020-06-08T19:14:46Z · updated: 2022-04-05T18:35:06Z · closed: 2022-04-05T18:35:06Z · author association: NONE · repo: xarray · type: issue · id: 634869703 · node_id: MDU6SXNzdWU2MzQ4Njk3MDM=

I don't understand why I am able to open and close a dataset, but then proceed to read data from said dataset.

I can open a 4 GB dataset and promptly close it, and then still access the data within, which still appears to load lazily. Does querying a closed dataset automatically reopen it?

MCVE Code Sample

```python
import numpy as np
import xarray as xr

ds = xr.Dataset({"foo": (("x",), np.random.rand(4,))}, coords={"x": [10, 20, 30, 40]})
ds.to_netcdf("tmp_example.nc")
```

```python
>>> data = xr.open_dataset("tmp_example.nc")
>>> data.close()
>>> data.foo
<xarray.DataArray 'foo' (x: 4)>
array([0.894788, 0.017935, 0.696086, 0.827004])
Coordinates:
  * x        (x) int64 10 20 30 40
```

Expected Output

Because netCDF datasets are loaded lazily, I would imagine that, since the data was never touched while the dataset was open, closing it would render the data inaccessible.
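For what it's worth, a pattern that makes the intended lifetime explicit is to load the data eagerly while the file is open; a minimal sketch reusing tmp_example.nc from above:

```python
import xarray as xr

# Load the variable into memory while the file handle is open, then close it;
# `foo` remains usable afterwards because it no longer references the file.
with xr.open_dataset("tmp_example.nc") as data:
    foo = data.foo.load()

print(foo)  # fully in-memory DataArray; no further file access needed
```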

Versions

Output of `xr.show_versions()`:

```
INSTALLED VERSIONS
------------------
commit: None
python: 3.7.3 (default, Mar 27 2019, 22:11:17) [GCC 7.3.0]
python-bits: 64
OS: Linux
OS-release: 4.4.0-166-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.2
libnetcdf: 4.6.1
xarray: 0.15.0
pandas: 1.0.3
numpy: 1.16.3
scipy: 1.4.1
netCDF4: 1.4.1
pydap: None
h5netcdf: None
h5py: 2.8.0
Nio: None
zarr: None
cftime: 1.1.3
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2.14.0
distributed: None
matplotlib: 3.0.3
cartopy: None
seaborn: None
numbagg: None
setuptools: 46.1.3.post20200330
pip: 20.0.2
conda: None
pytest: 5.4.1
IPython: 7.5.0
sphinx: 2.4.4
```
Reactions: none · state reason: completed
