issues: 1611701140


id: 1611701140
node_id: I_kwDOAMm_X85gEJuU
number: 7588
title: xr.merge with compat="minimal" returns corrupted Dataset and causes __len__ to return wrong and possibly negative values.
user: 2466330
state: closed
locked: 0
comments: 4
created_at: 2023-03-06T15:47:40Z
updated_at: 2023-08-30T09:14:19Z
closed_at: 2023-08-30T07:57:37Z
author_association: CONTRIBUTOR

What happened?

When merging multiple datasets with the compat="minimal" option, coordinates whose variables are dropped due to incompatibility are still saved in the dataset's _coord_names. I believe the cause originates in line 752 of merge_core, where the coordinate names are derived from the datasets in coerced, which are not affected by the dropping of (coordinate) variables/indexes in the merge_collected function.
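
The mechanism described above can be illustrated with a toy model. This is only a sketch and not xarray code: plain dictionaries stand in for datasets, and `toy_merge_minimal` is a hypothetical helper that mimics the described bookkeeping, where names are collected from the inputs while conflicting variables are dropped from the result. The last step shows one possible repair (filtering the names against the surviving variables), which is an assumption and not necessarily how xarray resolves it.

```python
# Toy model only: plain dicts stand in for datasets; none of this is xarray code.

def toy_merge_minimal(datasets):
    """Merge dicts of coordinates, dropping names whose values conflict."""
    merged = {}
    conflicting = set()
    for ds in datasets:
        for name, value in ds.items():
            if name in merged and merged[name] != value:
                conflicting.add(name)
            merged.setdefault(name, value)
    variables = {k: v for k, v in merged.items() if k not in conflicting}

    # Bookkeeping as described in the report: names are gathered from the
    # inputs, so they never learn that a conflicting name was dropped.
    coord_names_from_inputs = set().union(*(ds.keys() for ds in datasets))

    # One possible repair (assumption): keep only names that survived.
    coord_names_filtered = coord_names_from_inputs.intersection(variables)
    return variables, coord_names_from_inputs, coord_names_filtered


ds1 = {"foo": [1, 2, 3], "bar": 4}
ds2 = {"foo": [1, 2, 3], "bar": 5}
variables, stale_names, filtered_names = toy_merge_minimal([ds1, ds2])
print(sorted(variables))       # ['foo']
print(sorted(stale_names))     # ['bar', 'foo']
print(sorted(filtered_names))  # ['foo']
```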

This is directly related to the bug described in issue 7405. As seen there, one result is that a dropped coordinate still evaluates as being contained in the resulting dataset's coords. The effects of this bug are more widespread, and this issue attempts to dive into them.

At least one other (perhaps more severe) result of this bug is connected to the fact that the __len__ function of a DataVariable is implemented as follows: return len(self._dataset._variables) - len(self._dataset._coord_names)

If a coordinate was dropped as a result of the merge, it is no longer part of the _variables, but is still listed in the _coord_names, and as such the result of len() will be off by 1 for each such coordinate. This also means that the result of len() can become negative, which causes Python to raise ValueError: __len__() should return >= 0.
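
The arithmetic can be reproduced without xarray. In the sketch below, `ToyDataVariables` is a hypothetical stand-in that only copies the subtraction quoted above; the variable name "temp" is made up for illustration.

```python
# Toy stand-in for the data-variables mapping; not xarray's class.
class ToyDataVariables:
    def __init__(self, variables, coord_names):
        self._variables = variables
        self._coord_names = coord_names

    def __len__(self):
        # Same arithmetic as quoted in the report.
        return len(self._variables) - len(self._coord_names)


# Consistent state: one coordinate ("foo") and one data variable ("temp").
consistent = ToyDataVariables({"foo": 0, "temp": 0}, {"foo"})
print(len(consistent))  # 1

# Stale "bar": the data variable "temp" exists, yet the count is 0.
off_by_one = ToyDataVariables({"foo": 0, "temp": 0}, {"foo", "bar"})
print(len(off_by_one))  # 0

# Stale "bar" and no data variables: 1 - 2 = -1, which len() rejects.
corrupted = ToyDataVariables({"foo": 0}, {"foo", "bar"})
try:
    len(corrupted)
except ValueError as err:
    message = str(err)
    print(message)
```

The middle case is arguably the more dangerous one, since it silently reports a wrong count instead of raising.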

One instance where this causes immediate errors is when trying to print the resulting dataset. As part of the __repr__ of a Dataset, a boolean evaluation of the DataVariable is performed (if mapping: in xarray/core/formatting.py in _mapping_repr), calling __len__ to check the truth value and triggering the ValueError.
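
The reason a plain `if mapping:` check is enough to trigger the error is that Python falls back to `__len__` for truth testing when an object defines no `__bool__`. A minimal sketch, using a hypothetical `NegativeLength` class and `summarize` function in place of the real objects:

```python
class NegativeLength:
    """Hypothetical object with no __bool__, so truth testing uses __len__."""

    def __len__(self):
        return -1


def summarize(mapping):
    # Same style of emptiness check as `if mapping:`.
    if mapping:
        return "has entries"
    return "empty"


print(summarize({}))          # empty
print(summarize({"a": 1}))    # has entries

try:
    outcome = summarize(NegativeLength())
except ValueError as err:
    # The error surfaces from the truth test, not from an explicit len() call.
    outcome = f"ValueError: {err}"
print(outcome)
```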

While this is undoubtedly only one of many places where the incorrect __len__ causes issues, it is a rather pressing one, as it even stops one from inspecting the Dataset in the most common way (printing it). The ValueError it produces is also very hard to trace back to the actual cause, which likely leaves users unable to work out what to fix in their code.

What did you expect to happen?

To get a Dataset with the correct _coord_names property, and under no circumstances to get a Dataset that reports a negative length.
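
The expectation can be phrased as an invariant: every name in _coord_names should also be a key of _variables. The helper below is a hypothetical sketch that works on plain containers, so it does not depend on xarray; `stale_coord_names` is not an xarray function.

```python
def stale_coord_names(variables, coord_names):
    """Return coordinate names that have no matching variable."""
    return set(coord_names) - set(variables)


# Mirrors the state reported for the merged result.
print(stale_coord_names({"foo": [1, 2, 3]}, {"foo", "bar"}))  # {'bar'}

# A consistent state leaves nothing behind.
print(stale_coord_names({"foo": [1, 2, 3]}, {"foo"}))  # set()
```

On an affected version, passing the private attributes named in this report (`res._variables` and `res._coord_names`) should, under that assumption, point directly at "bar".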

Minimal Complete Verifiable Example

```Python
import xarray as xr

ds1 = xr.Dataset(coords={"foo": [1, 2, 3], "bar": 4})
ds2 = xr.Dataset(coords={"foo": [1, 2, 3], "bar": 5})

# If the result is not captured in res, this will cause a ValueError
# as the interpreter attempts to print the result
res = xr.merge([ds1, ds2], compat="minimal")

res.coords
# Coordinates:
#   * foo      (foo) int64 1 2 3

res._coord_names
# {'foo', 'bar'}

# As shown in issue #7405. Note "bar" is not printed in res.coords, revealing
# an interesting disconnect in behaviors of different functions targeting a
# dataset's coordinates
"bar" in res.coords
# True

res
# ValueError: __len__() should return >= 0
```

MVCE confirmation

  • [x] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • [x] Complete example — the example is self-contained, including all data and the text of any traceback.
  • [x] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • [x] New issue — a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

```Python
import xarray as xr
ds1 = xr.Dataset(coords={"foo": [1, 2, 3], "bar": 4})
ds2 = xr.Dataset(coords={"foo": [1, 2, 3], "bar": 5})
res = xr.merge([ds1, ds2], compat="minimal")
res.coords
Coordinates:
  * foo      (foo) int64 1 2 3
res._coord_names
{'bar', 'foo'}
"bar" in res.coords
True
res
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/redacted/.venv/lib/python3.10/site-packages/xarray/core/dataset.py", line 2116, in __repr__
    return formatting.dataset_repr(self)
  File "/usr/lib/python3.10/reprlib.py", line 21, in wrapper
    result = user_function(self)
  File "/home/redacted/.venv/lib/python3.10/site-packages/xarray/core/formatting.py", line 673, in dataset_repr
    summary.append(data_vars_repr(ds.data_vars, col_width=col_width, max_rows=max_rows))
  File "/home/redacted/.lvenv/lib/python3.10/site-packages/xarray/core/formatting.py", line 357, in _mapping_repr
    if mapping:
ValueError: __len__() should return >= 0
```

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS
------------------
commit: None
python: 3.10.6 (main, Nov 14 2022, 16:10:14) [GCC 11.3.0]
python-bits: 64
OS: Linux
OS-release: 5.10.16.3-microsoft-standard-WSL2
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: None
libnetcdf: None
xarray: 2023.2.0
pandas: 1.5.1
numpy: 1.24.2
scipy: 1.10.0
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: 2.13.6
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: 0.9.10.3
iris: None
bottleneck: 1.3.6
dask: None
distributed: None
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
fsspec: 2023.1.0
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 59.6.0
pip: 23.0.1
conda: None
pytest: 7.2.1
mypy: 1.0.1
IPython: 7.34.0
sphinx: None
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/7588/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
state_reason: completed
repo: 13221727
type: issue
