
issues


4 rows where state = "open", type = "issue" and user = 8382834 sorted by updated_at descending

id node_id number title user state locked assignee milestone comments created_at updated_at ▲ closed_at author_association active_lock_reason draft pull_request body reactions performed_via_github_app state_reason repo type
1951543761 I_kwDOAMm_X850UjHR 8335 ```DataArray.sel``` can silently pick up the nearest point, even if it is far away and the query is out of bounds jerabaul29 8382834 open 0     13 2023-10-19T08:02:44Z 2024-04-29T23:02:31Z   CONTRIBUTOR      

What is your issue?

@paulina-t (who found a bug caused by the behavior we report here in a codebase, where it was badly messing things up).

See the example notebook at https://github.com/jerabaul29/public_bug_reports/blob/main/xarray/2023_10_18/interp.ipynb .


Problem

Interpolating, or finding the nearest neighbor of a query point, is always a bit risky, because bad things can happen when the query point lies outside the area that the data represents. Fortunately, xarray returns NaN when interp is called outside the bounds of a dataset:

```python
import xarray as xr
import numpy as np

xr.__version__
# '2023.9.0'

data = np.array([[1, 2, 3], [4, 5, 6]])
lat = [10, 20]
lon = [120, 130, 140]

data_xr = xr.DataArray(data, coords={'lat': lat, 'lon': lon}, dims=['lat', 'lon'])

data_xr
# <xarray.DataArray (lat: 2, lon: 3)>
# array([[1, 2, 3],
#        [4, 5, 6]])
# Coordinates:
#   * lat      (lat) int64 10 20
#   * lon      (lon) int64 120 130 140

# interp is civilized: rather than wildly extrapolating, it returns NaN
data_xr.interp(lat=15, lon=125)
# <xarray.DataArray ()>
# array(3.)
# Coordinates:
#     lat      int64 15
#     lon      int64 125

data_xr.interp(lat=5, lon=125)
# <xarray.DataArray ()>
# array(nan)
# Coordinates:
#     lat      int64 5
#     lon      int64 125
```

Unfortunately, .sel will happily find the nearest neighbor of a point, even if the input point is outside of the dataset range:

```python
# sel is not as civilized: it happily finds the nearest neighbor,
# even if the query lies entirely to one side of the example data
data_xr.sel(lat=5, lon=125, method='nearest')
# <xarray.DataArray ()>
# array(2)
# Coordinates:
#     lat      int64 10
#     lon      int64 130
```

This can easily cause tricky bugs.


Discussion

Would it be possible for .sel to have a behavior that makes the user aware of such issues? I.e. either:

  • print a warning on stderr
  • return NaN
  • raise an exception

when performing a .sel query that is outside of the dataset range / not between two dataset points?

I understand that finding the nearest neighbor may still be useful / wanted in some cases even when the query is outside of the bounds of the dataset, but the fact that this happens silently by default has been causing bugs for us. Could this default behavior either be changed, or be controlled by a flag (allow_extrapolate=False by default for example, so users have to consciously opt in)?
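To make the request concrete, here is a minimal sketch of the "raise an exception" option, written as a user-side wrapper. The helper name `sel_nearest_within_bounds` is hypothetical and not part of xarray; it assumes that "in bounds" simply means between the minimum and maximum of each queried coordinate.

```python
import numpy as np
import xarray as xr


def sel_nearest_within_bounds(da, **queries):
    """Nearest-neighbor selection that refuses out-of-range queries.

    Hypothetical helper (not part of xarray): raises ValueError when a query
    value lies outside the [min, max] range of the matching coordinate.
    """
    for dim, value in queries.items():
        coord = da[dim].values
        lo, hi = coord.min(), coord.max()
        if not (lo <= value <= hi):
            raise ValueError(
                f"{dim}={value} is outside the coordinate range [{lo}, {hi}]"
            )
    return da.sel(method="nearest", **queries)


# Same example data as in the report
data_xr = xr.DataArray(
    np.array([[1, 2, 3], [4, 5, 6]]),
    coords={"lat": [10, 20], "lon": [120, 130, 140]},
    dims=["lat", "lon"],
)

# In-bounds query: the nearest node is lat=10, lon=120
inside = sel_nearest_within_bounds(data_xr, lat=12, lon=121)

# Out-of-bounds query: lat=5 is below the smallest latitude, so this raises
try:
    sel_nearest_within_bounds(data_xr, lat=5, lon=125)
    out_of_bounds_rejected = False
except ValueError:
    out_of_bounds_rejected = True
```

Note that `sel` also accepts a `tolerance` argument that limits how far the matched label may be from the query, which may be a lighter alternative when a maximum acceptable distance is known in advance.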

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/8335/reactions",
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
2098252325 I_kwDOAMm_X859EMol 8653 xarray v 2023.9.0: ```ValueError: unable to infer dtype on variable 'time'; xarray cannot serialize arbitrary Python objects``` jerabaul29 8382834 open 0     1 2024-01-24T13:18:55Z 2024-02-05T12:50:34Z   CONTRIBUTOR      

What happened?

I tried to save an xarray dataset with datetimes as data for its time dimension to a nc file with to_netcdf and got the error ValueError: unable to infer dtype on variable 'time'; xarray cannot serialize arbitrary Python objects.

What did you expect to happen?

I expected xarray to automatically detect these were datetimes, and convert them to whatever format xarray likes to work with internally to dump it into a CF compatible file, following what is described at https://github.com/pydata/xarray/issues/2512 .

Minimal Complete Verifiable Example

```python
import xarray as xr
import datetime

times = [
    datetime.datetime(2024, 1, 1, 1, 1, 1, tzinfo=datetime.timezone.utc),
    datetime.datetime(2024, 1, 1, 1, 1, 2, tzinfo=datetime.timezone.utc),
]

data = [1, 2]

xr_result = xr.Dataset(
    {
        'time': xr.DataArray(dims=["time"], data=times, attrs={
            "standard_name": "time",
        }),
        #
        'data': xr.DataArray(dims=["time"], data=data, attrs={
            "_FillValue": "NaN",
            "standard_name": "some_data",
        }),
    }
)

xr_result.to_netcdf("test.nc")
```

MVCE confirmation

  • [ ] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • [X] Complete example — the example is self-contained, including all data and the text of any traceback.
  • [ ] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • [ ] New issue — a search of GitHub Issues suggests this is not a duplicate.
  • [ ] Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Relevant log output

No response

Anything else we need to know?

The example is available as a notebook viewable at:

https://github.com/jerabaul29/public_bug_reports/blob/main/xarray/2024_01_24/xarray_and_datetimes.ipynb
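A possible workaround, sketched under an assumption that the report does not confirm: that the time zone information on the datetimes is what keeps them from being stored as a native datetime64 array. The sketch converts the values to naive UTC datetimes and then to `datetime64[ns]` with NumPy; the names `naive_utc` and `times_ns` are illustrative only.

```python
import datetime

import numpy as np

times = [
    datetime.datetime(2024, 1, 1, 1, 1, 1, tzinfo=datetime.timezone.utc),
    datetime.datetime(2024, 1, 1, 1, 1, 2, tzinfo=datetime.timezone.utc),
]

# Normalize to UTC, then drop the tzinfo so NumPy can store the values natively
naive_utc = [
    t.astimezone(datetime.timezone.utc).replace(tzinfo=None) for t in times
]

# A real datetime64 array instead of an array of Python objects
times_ns = np.array(naive_utc, dtype="datetime64[ns]")
```

`times_ns` could then be passed as `data=` for the `time` variable in place of the list of Python datetimes.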

Environment

    INSTALLED VERSIONS
    ------------------
    commit: None
    python: 3.11.5 (main, Sep 11 2023, 13:54:46) [GCC 11.2.0]
    python-bits: 64
    OS: Linux
    OS-release: 6.5.0-14-generic
    machine: x86_64
    processor: x86_64
    byteorder: little
    LC_ALL: None
    LANG: en_US.UTF-8
    LOCALE: ('en_US', 'UTF-8')
    libhdf5: 1.12.1
    libnetcdf: 4.8.1
    xarray: 2023.9.0
    pandas: 2.0.3
    numpy: 1.25.2
    scipy: 1.11.3
    netCDF4: 1.6.2
    pydap: None
    h5netcdf: None
    h5py: 3.10.0
    Nio: None
    zarr: None
    cftime: 1.6.3
    nc_time_axis: None
    PseudoNetCDF: None
    iris: None
    bottleneck: 1.3.5
    dask: 2023.9.2
    distributed: 2023.9.2
    matplotlib: 3.7.2
    cartopy: 0.21.1
    seaborn: 0.13.0
    numbagg: None
    fsspec: 2023.9.2
    cupy: None
    pint: None
    sparse: None
    flox: None
    numpy_groupies: None
    setuptools: 68.0.0
    pip: 23.2.1
    conda: None
    pytest: None
    mypy: None
    IPython: 8.15.0
    sphinx: None
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/8653/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
1853356670 I_kwDOAMm_X85ud_p- 8074 Add an ```only_variables``` or similar option to ```xarray.open_dataset``` and ```xarray.open_mfdataset``` jerabaul29 8382834 open 0     7 2023-08-16T14:23:43Z 2023-08-21T06:55:17Z   CONTRIBUTOR      

Is your feature request related to a problem?

Sometimes a variable in a nc file is corrupted or not "xarray friendly" and crashes opening the file (see for example https://github.com/pydata/xarray/issues/8072 ; in practice I solved this on my machine by passing the problematic variables to drop_variables). In other cases, reading and parsing the full file or mf-file may be expensive and time consuming, while only a couple of variables are needed.

Describe the solution you'd like

We can already exclude variables with the drop_variables argument of open_dataset (note: this is not present for now in open_mfdataset; should it be added there?). Could we also, instead of saying "read all the variables except those in this list", be able to say "read only these variables"? In most cases this would be equivalent to using drop_variables=list(set(all_vars)-set(list_interesting_vars)), but if some (or many) variables are corrupted, just getting the file opened to list all_vars may itself be problematic.

Describe alternatives you've considered

drop_variables=list(set(all_vars)-set(list_interesting_vars)), but this is a lot more verbose.
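For reference, the verbose alternative can be wrapped in a small helper. This is only a sketch: `variables_to_drop` is a hypothetical name, and it assumes the full list of variable names is already known, which is exactly the step that can fail when the file is hard to open.

```python
def variables_to_drop(all_vars, only_variables):
    """Hypothetical helper: names to pass as drop_variables so that only
    the variables listed in only_variables are read."""
    missing = set(only_variables) - set(all_vars)
    if missing:
        raise KeyError(f"requested variables not found: {sorted(missing)}")
    return sorted(set(all_vars) - set(only_variables))


# Illustrative variable names, not taken from a real file
all_vars = ["TLAT", "TLON", "Tair_h", "corrupted_var"]
to_drop = variables_to_drop(all_vars, ["TLAT", "TLON", "Tair_h"])
```

The resulting list can then be given to `open_dataset` through its existing `drop_variables` argument.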

Additional context

No response

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/8074/reactions",
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
1552701403 I_kwDOAMm_X85cjFfb 7468 Provide default APIs and functions for getting variable at a given location, based on some criteria / extrema conditions on other variables jerabaul29 8382834 open 0     0 2023-01-23T08:35:43Z 2023-01-23T08:35:43Z   CONTRIBUTOR      

Is your feature request related to a problem?

No, this is related to a need that comes regularly when working with netCDF files in geosciences.

Describe the solution you'd like

what is needed

There are many cases with netcdf files when one wants to find some location, or get variable(s) at some location, where the location is determined by a condition on some variables. A classical example, around which there are many Stack Overflow questions, online discussions, suggested "hacky" solutions, snippets, etc., is something like the following. Given a file that looks like this:

    dimensions:
        nj = 949 ;
        ni = 739 ;
        nc = 5 ;
        time = UNLIMITED ; // (24 currently)
    variables:
        float TLAT(nj, ni) ;
            TLAT:long_name = "T grid center latitude" ;
            TLAT:units = "degrees_north" ;
            TLAT:missing_value = 1.e+30f ;
            TLAT:_FillValue = 1.e+30f ;
        float TLON(nj, ni) ;
            TLON:long_name = "T grid center longitude" ;
            TLON:units = "degrees_east" ;
            TLON:missing_value = 1.e+30f ;
            TLON:_FillValue = 1.e+30f ;
        float Tair_h(time, nj, ni) ;
            Tair_h:units = "C" ;
            Tair_h:long_name = "air temperature" ;
            Tair_h:coordinates = "TLON TLAT time" ;
            Tair_h:cell_measures = "area: tarea" ;
            Tair_h:missing_value = 1.e+30f ;
            Tair_h:_FillValue = 1.e+30f ;
            Tair_h:time_rep = "instantaneous" ;

answer a question like:

  • find the mesh point (ni, nj) closest to the location (TLAT=latval, TLON=lonval)?
  • give the nearest / interpolated value of Tair_h at latitude and longitude (latval, lonval)
  • do the same as above for lists / arrays of coordinates.

I do not think there is a recommended, standard, simple / one-liner way to do this with xarray in general (in particular if (latval, lonval) falls outside the discrete set of mesh nodes). This means that there are plenty of ad hoc hacked solutions getting shared around to solve this. Having a default recommended way would likely help users quite a bit and save quite some work.

the existing ways to solve the need

As soon as TLAT and TLON are not "aligned" with the ni and nj coordinates (if they exactly match a mesh point, then likely some .where(TLAT=latval, TLON=lonval) can do), this is a bit of work. One typically has to:

  • build the 2D (dependent on (ni, nj) ) field representing the function (ni, nj) -> distance(node(ni, nj), point(latval, lonval) )
  • find the smallest value on this field to get the nearest coordinate and the value there, or the few smallest values and use some interpolation to interpolate
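The two steps above can be sketched with plain NumPy. This is a minimal, brute-force illustration and not a recommended xarray API: the function name `nearest_node` is hypothetical, the distance is taken to be the great-circle (haversine) angle, and fill values are assumed to have already been replaced by NaN.

```python
import numpy as np


def nearest_node(tlat, tlon, latval, lonval):
    """Return the (nj, ni) index of the mesh node closest to (latval, lonval).

    tlat and tlon are 2D arrays in degrees, with NaN for missing nodes.
    """
    lat1, lon1 = np.radians(tlat), np.radians(tlon)
    lat2, lon2 = np.radians(latval), np.radians(lonval)

    # Step 1: 2D field (nj, ni) -> angular distance to the query point (haversine)
    a = (
        np.sin((lat1 - lat2) / 2) ** 2
        + np.cos(lat1) * np.cos(lat2) * np.sin((lon1 - lon2) / 2) ** 2
    )
    dist = 2 * np.arcsin(np.sqrt(a))

    # Step 2: index of the smallest distance, ignoring NaN nodes
    return np.unravel_index(np.nanargmin(dist), dist.shape)


# Tiny illustrative mesh with 2 rows (nj) and 3 columns (ni)
tlat = np.array([[60.0, 60.0, 60.0], [61.0, 61.0, 61.0]])
tlon = np.array([[10.0, 11.0, 12.0], [10.0, 11.0, 12.0]])

nj_idx, ni_idx = nearest_node(tlat, tlon, latval=60.9, lonval=11.2)
```

This scans every node for each query, so it is simple but not efficient for many query points; tree-based indexes are the usual way to speed that up.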

There are many more examples of questions that revolve around this kind of "query", and the answers are usually ad hoc, though a lot of the logic repeats itself, which makes me believe a general high quality / standard solution would be useful:

  • https://stackoverflow.com/questions/58758480/xarray-select-nearest-lat-lon-with-multi-dimension-coordinates
  • https://gis.stackexchange.com/questions/357026/indexing-coordinates-in-order-to-get-a-value-by-coordinates-from-xarray (but what in the case where the point looked for "falls between" mesh nodes?)

Also note that most of these answers use simple / relatively naive / inefficient algorithms, but I wonder if there are some examples of code that could be used to build this in an efficient way, see the discussions in:

  • https://github.com/xarray-contrib/xoak
  • https://stackoverflow.com/questions/10818546/finding-index-of-nearest-point-in-numpy-arrays-of-x-and-y-coordinates
  • https://stackoverflow.com/questions/2566412/find-nearest-value-in-numpy-array

It looks like there are some snippets available that can be used to do this more or less exactly, when the netcdf file follows some conventions:

  • https://gist.github.com/blaylockbk/0ac5427b09fbae8d367a691ff90cdb4e

It looks like there is no dedicated / recommended / default xarray solution to do this though. It would be great if xarray could offer a (set of) well tested, well implemented, efficient way(s) to address this kind of need. I guess this is such a common need that providing a default solution with a default API, even if it is not optimal for all use cases, would be better than providing nothing at all and having users hack their own helper functions.

what xarray could implement

It would be great if xarray could offer support for this built in. A few thoughts of how this could be done:

  • compute a user-specified function over the whole array
  • find the closest node, or interpolate between the nearest nodes
  • provide a few default "assemblies" of these functions to support common file kinds
  • provide some ways to check that the request is reasonable / well formulated (for example, some functions in the kind of check_that_convex, that would check that taking a minimum is more or less reasonable).

I wonder if thinking about a few APIs and agreeing on these would be helpful before implementing anything. Just for the sake of brainstorming, maybe some functions with this kind of "API pseudocode" on datasets could make sense / would be a nice standardization to offer to users? Any thoughts / ideas for a better solution?

    def compute_function_on_mesh(self, function_to_compute_on_nodes(arg1, ..., argn), list_args_to_use_in_funcion[var1, ..., varn]) -> numpy_2d_array:
        """compute function_to_compute_on_nodes at each "node point" of the dataset, using as arguments to the function the value from var1, ..., varn at each corresponding node."""

    def find_node_with_lowest_value(self, function_to_compute_on_nodes(arg1, ..., argn), list_args_to_use_in_funcion[var1, ..., varn]) -> Tuple(dim1, ..., dimn):
        """compute function_to_compute_on_nodes at each "node point" of the dataset, using as arguments to the function the value from var1, ..., varn at each corresponding node, and return the node coordinates that minimize the function."""

    def get_variable_at_node_with_lowest_value(self, variable_to_use, function_to_compute_on_nodes(arg1, ..., argn), list_args_to_use_in_funcion[var1, ..., varn]) -> float:
        """compute function_to_compute_on_nodes at each "node point" of the dataset, using as arguments to the function the value from var1, ..., varn at each corresponding node, and return the variable_to_use value at the node coordinates that minimize the function."""

(note: for this last function, consider also providing a variant that performs interpolation outside of mesh points?)

Maybe providing a few specializations for working with finding specific points in space would be useful? Like:

    def get_variable_at_closest_location(self, variable_to_use, variable_lat, variable_lon, latvalue, lonvalue) -> float:
        """get variable_to_use at the mesh point closest to (latvalue, lonvalue), using the variables variable_lat, variable_lon as the lat and lon axis."""
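As a rough illustration of what this last specialization could look like, here is a sketch written as a free function taking the dataset as its first argument. It is not an existing xarray function; it assumes that the latitude and longitude variables share the same dimensions, that distances are measured as great-circle (haversine) angles, and it returns a DataArray (keeping any remaining dimensions such as time) rather than a float.

```python
import numpy as np
import xarray as xr


def get_variable_at_closest_location(
    ds, variable_to_use, variable_lat, variable_lon, latvalue, lonvalue
):
    """Sketch: value of variable_to_use at the mesh point closest to
    (latvalue, lonvalue), using variable_lat / variable_lon in degrees."""
    lat1, lon1 = np.radians(ds[variable_lat]), np.radians(ds[variable_lon])
    lat2, lon2 = np.radians(latvalue), np.radians(lonvalue)

    # Haversine central angle between every node and the query point
    a = (
        np.sin((lat1 - lat2) / 2) ** 2
        + np.cos(lat1) * np.cos(lat2) * np.sin((lon1 - lon2) / 2) ** 2
    )
    dist = 2 * np.arcsin(np.sqrt(a))

    # argmin over several dimensions returns a dict {dim: index}, usable in isel
    nearest = dist.argmin(dim=list(dist.dims))
    return ds[variable_to_use].isel(nearest)


# Tiny illustrative dataset mimicking the layout of the file header above
ds = xr.Dataset(
    {
        "TLAT": (("nj", "ni"), np.array([[60.0, 60.0, 60.0], [61.0, 61.0, 61.0]])),
        "TLON": (("nj", "ni"), np.array([[10.0, 11.0, 12.0], [10.0, 11.0, 12.0]])),
        "Tair_h": (("time", "nj", "ni"), np.arange(12.0).reshape(2, 2, 3)),
    }
)

# Closest node to (60.9, 11.2) is nj=1, ni=1; the result keeps the time dimension
result = get_variable_at_closest_location(ds, "Tair_h", "TLAT", "TLON", 60.9, 11.2)
```

Interpolating between the few nearest nodes, instead of picking the single closest one, would need a different approach and is left out of this sketch.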

Describe alternatives you've considered

Writing my own small function, or re-using some snippet circulating on the internet.

Additional context

No response

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/7468/reactions",
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue


CREATE TABLE [issues] (
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [number] INTEGER,
   [title] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [state] TEXT,
   [locked] INTEGER,
   [assignee] INTEGER REFERENCES [users]([id]),
   [milestone] INTEGER REFERENCES [milestones]([id]),
   [comments] INTEGER,
   [created_at] TEXT,
   [updated_at] TEXT,
   [closed_at] TEXT,
   [author_association] TEXT,
   [active_lock_reason] TEXT,
   [draft] INTEGER,
   [pull_request] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [state_reason] TEXT,
   [repo] INTEGER REFERENCES [repos]([id]),
   [type] TEXT
);
CREATE INDEX [idx_issues_repo]
    ON [issues] ([repo]);
CREATE INDEX [idx_issues_milestone]
    ON [issues] ([milestone]);
CREATE INDEX [idx_issues_assignee]
    ON [issues] ([assignee]);
CREATE INDEX [idx_issues_user]
    ON [issues] ([user]);
Powered by Datasette · Queries took 21.161ms · About: xarray-datasette