
issues


2 rows where state = "open", type = "issue" and user = 16925278 sorted by updated_at descending

Issue 8728: Lingering memory connections when extracting underlying `np.arrays` from datasets

  • id: 2127671156 · node_id: I_kwDOAMm_X85-0a90
  • user: ks905383 (16925278) · state: open · locked: 0 · comments: 6
  • created_at: 2024-02-09T18:39:34Z · updated_at: 2024-02-26T06:02:15Z
  • author_association: CONTRIBUTOR · reactions: 0 · repo: xarray (13221727) · type: issue

What is your issue?

I know that, generally, `ds2 = ds` connects the two objects in memory, and changes in one will also cause changes in the other.

However, I generally assume that certain operations should break this connection, for example:

  • extracting the underlying `np.array` from a dataset (which changes its type and discards a lot of the xarray-specific information: index, dimensions, etc.)
  • using that underlying `np.array` to build a new dataset

In other words, I would expect that using `ds['var'].values` would behave like `copy.deepcopy(ds['var'].values)`.

Here's an example (apologies for the somewhat hokey one) that illustrates how, in these cases, the objects are still linked in memory:

```python
import xarray as xr
import numpy as np

# Create a dataset
ds = xr.Dataset(coords={'lon': (['lon'], np.array([178.2, 179.2, -179.8, -178.8, -177.8, -176.8]))})
print('\nds: ')
print(ds)

# Create a new dataset that uses the values of the first dataset
ds2 = xr.Dataset({'lon1': (['lon'], ds.lon.values)},
                 coords={'lon': (['lon'], ds.lon.values)})
print('\nds2: ')
print(ds2)

# Change ds2's 'lon1' variable
ds2['lon1'][ds2['lon1'] < 0] = 360 + ds2['lon1'][ds2['lon1'] < 0]

# ds2 is changed, as expected
print('\nds2 (should be modified): ')
print(ds2)

# ds is changed too, which is not expected
print('\nds (should not be modified): ')
print(ds)
```

The question is: am I right (from a UX perspective) to expect these kinds of operations to disconnect the objects in memory? If so, I might try to update the docs to be a bit clearer on this (or, alternatively, if these kinds of operations really should disconnect the objects in memory, maybe it's better to have `.values` also call `.copy(deep=True).values`).
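As a side note, here is a minimal sketch of the explicit workaround (not from the original report; whether the un-copied case actually shares memory can depend on the xarray/pandas versions in use). Copying the array before reusing it breaks the link, and `np.shares_memory` checks whether two arrays still view the same buffer:

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(coords={'lon': (['lon'], np.array([178.2, 179.2, -179.8]))})

# Reusing ds.lon.values directly: the new variable typically shares its
# buffer with ds, so in-place edits can propagate back.
ds2 = xr.Dataset({'lon1': (['lon'], ds.lon.values)})
print(np.shares_memory(ds.lon.values, ds2.lon1.values))   # expected: True (still linked)

# Copying first gives the new dataset its own buffer.
ds3 = xr.Dataset({'lon1': (['lon'], ds.lon.values.copy())})
print(np.shares_memory(ds.lon.values, ds3.lon1.values))   # expected: False (independent)
```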

Appreciate y'all's thoughts on this!

Issue 8058: `ds.interp()` breaks if (non-interpolating) dimension is not numeric

  • id: 1842072960 · node_id: I_kwDOAMm_X85ty82A
  • user: ks905383 (16925278) · state: open · locked: 0 · comments: 1
  • created_at: 2023-08-08T20:43:08Z · updated_at: 2023-08-08T20:52:15Z
  • author_association: CONTRIBUTOR · reactions: 0 · repo: xarray (13221727) · type: issue

What happened?

I'm running `ds.interp()` with multi-dimensional new coordinates, using xarray's broadcasting to expand the original dataset to new dimensions. In this case, I'm only interpolating along one dimension, but broadcasting out to others.

If the dimensions are all numeric (or, presumably, able to be coerced to numeric), then this works without an issue. However, if one of the other dimensions is, e.g., populated with string indices (weather station names, model run ids, etc.), then the process fails, even if the dimension along which the interpolation is conducted is purely numeric.

What did you expect to happen?

Here is an example with only numeric dimensions that works as expected:

```python
import xarray as xr
import numpy as np

da1 = xr.DataArray(np.reshape(np.arange(0, 12), (3, 4)),
                   coords={'dim0': np.arange(0, 3), 'dim1': np.arange(0, 4)})

da2 = xr.DataArray(np.random.normal(loc=1, size=(2, 4), scale=0.5),
                   coords={'dim2': np.arange(0, 2), 'dim1': np.arange(0, 4)})

da1.interp(dim0=da2)
```

This produces the interpolated DataArray, broadcast out over the new dimensions, as expected.

Minimal Complete Verifiable Example

```python
import xarray as xr
import numpy as np

da1 = xr.DataArray(np.reshape(np.arange(0, 12), (3, 4)),
                   coords={'dim0': np.arange(0, 3), 'dim1': np.arange(0, 4).astype(str)})

da2 = xr.DataArray(np.random.normal(loc=1, size=(2, 4), scale=0.5),
                   coords={'dim2': np.arange(0, 2), 'dim1': np.arange(0, 4).astype(str)})

da1.interp(dim0=da2)
```

MVCE confirmation

  • [X] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • [X] Complete example — the example is self-contained, including all data and the text of any traceback.
  • [X] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • [X] New issue — a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

```python
TypeError                                 Traceback (most recent call last)
Cell In[48], line 9
      1 da1 = xr.DataArray(np.reshape(np.arange(0,12),(3,4)),
      2                    coords = {'dim0':np.arange(0,3),
      3                              'dim1':np.arange(0,4).astype(str)})
      5 da2 = xr.DataArray(np.random.normal(loc=1,size=(2,4),scale=0.5),
      6                    coords = {'dim2':np.arange(0,2),
      7                              'dim1':np.arange(0,4).astype(str)})
----> 9 da1.interp(dim0=da2)

File ~/.conda/envs/climate/lib/python3.10/site-packages/xarray/core/dataarray.py:2204, in DataArray.interp(self, coords, method, assume_sorted, kwargs, **coords_kwargs)
   2199 if self.dtype.kind not in "uifc":
   2200     raise TypeError(
   2201         "interp only works for a numeric type array. "
   2202         "Given {}.".format(self.dtype)
   2203     )
-> 2204 ds = self._to_temp_dataset().interp(
   2205     coords,
   2206     method=method,
   2207     kwargs=kwargs,
   2208     assume_sorted=assume_sorted,
   2209     **coords_kwargs,
   2210 )
   2211 return self._from_temp_dataset(ds)

File ~/.conda/envs/climate/lib/python3.10/site-packages/xarray/core/dataset.py:3666, in Dataset.interp(self, coords, method, assume_sorted, kwargs, method_non_numeric, **coords_kwargs)
   3664 if method in ["linear", "nearest"]:
   3665     for k, v in validated_indexers.items():
-> 3666         obj, newidx = missing._localize(obj, {k: v})
   3667         validated_indexers[k] = newidx[k]
   3669 # optimization: create dask coordinate arrays once per Dataset
   3670 # rather than once per Variable when dask.array.unify_chunks is called later
   3671 # GH4739

File ~/.conda/envs/climate/lib/python3.10/site-packages/xarray/core/missing.py:562, in _localize(var, indexes_coords)
    560 indexes = {}
    561 for dim, [x, new_x] in indexes_coords.items():
--> 562     minval = np.nanmin(new_x.values)
    563     maxval = np.nanmax(new_x.values)
    564     index = x.to_index()

File <__array_function__ internals>:5, in nanmin(*args, **kwargs)

File ~/.conda/envs/climate/lib/python3.10/site-packages/numpy/lib/nanfunctions.py:319, in nanmin(a, axis, out, keepdims)
    315     kwargs['keepdims'] = keepdims
    316 if type(a) is np.ndarray and a.dtype != np.object_:
    317     # Fast, but not safe for subclasses of ndarray, or object arrays,
    318     # which do not implement isnan (gh-9009), or fmin correctly (gh-8975)
--> 319     res = np.fmin.reduce(a, axis=axis, out=out, **kwargs)
    320     if np.isnan(res).any():
    321         warnings.warn("All-NaN slice encountered", RuntimeWarning,
    322                       stacklevel=3)

TypeError: cannot perform reduce with flexible type
```

Anything else we need to know?

I'm pretty sure the issue is in this optimization step.

It calls `_localize()` from `missing.py`, which calls `np.nanmin()` and `np.nanmax()` on all the coordinates, including the ones that aren't used in the interpolation itself, only in the broadcasting.
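To see the failure in isolation (a small illustration, not part of the original report): `np.nanmin` falls through to `np.fmin.reduce`, which has no ordering defined for fixed-width string dtypes, so any string-labelled coordinate reaching `_localize()` triggers the same error as in the traceback above.

```python
import numpy as np

labels = np.arange(0, 4).astype(str)   # fixed-width unicode dtype, like the 'dim1' coordinate above
try:
    np.nanmin(labels)
except TypeError as err:
    print(err)                         # cannot perform reduce with flexible type
```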

Perhaps a way to fix this would be to add a check in `_localize()` for numeric indices, and then only subset the numeric dimensions? (I could see that generalizing `_localize()` to other data types may be more trouble than it's worth, especially for unsorted string dimensions...) Or only subset the dimensions used in the interpolation itself? Or, alternatively, provide a way to turn off optimizations like this?
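As a rough sketch of the first option (hypothetical code, not xarray's actual implementation): a dtype check along these lines could gate the `np.nanmin`/`np.nanmax` calls, so that localization only happens for numeric indexers and string-labelled dimensions are simply left un-subset.

```python
import numpy as np

def is_numeric(values) -> bool:
    """Dtype guard a _localize()-style helper could apply before calling
    np.nanmin / np.nanmax (hypothetical sketch, not xarray's real code)."""
    # ints, unsigned ints, floats, complex, timedeltas, datetimes
    return np.asarray(values).dtype.kind in "uifcmM"

new_coords = {"dim0": np.array([0.5, 1.5]),        # numeric: safe to localize
              "dim1": np.array(["st_a", "st_b"])}  # strings: skip localization

for dim, values in new_coords.items():
    if is_numeric(values):
        print(dim, "-> localize to", (np.nanmin(values), np.nanmax(values)))
    else:
        print(dim, "-> left un-subset (non-numeric labels)")
```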

Environment

```
INSTALLED VERSIONS
------------------
commit: None
python: 3.10.8 | packaged by conda-forge | (main, Nov 22 2022, 08:23:14) [GCC 10.4.0]
python-bits: 64
OS: Linux
OS-release: 3.10.0-1160.76.1.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: (None, None)
libhdf5: 1.12.1
libnetcdf: 4.8.1
xarray: 2023.7.0
pandas: 1.4.1
numpy: 1.21.6
scipy: 1.11.1
netCDF4: 1.5.8
pydap: None
h5netcdf: None
h5py: None
Nio: 1.5.5
zarr: 2.13.2
cftime: 1.6.2
nc_time_axis: 1.4.1
PseudoNetCDF: None
iris: None
bottleneck: 1.3.7
dask: 2023.3.0
distributed: 2023.3.0
matplotlib: 3.5.1
cartopy: 0.20.2
seaborn: 0.11.2
numbagg: None
fsspec: 2022.5.0
cupy: None
pint: 0.22
sparse: 0.14.0
flox: None
numpy_groupies: None
setuptools: 68.0.0
pip: 23.2.1
conda: None
pytest: 7.0.1
mypy: None
IPython: 8.14.0
sphinx: None
```

