home / github / issues

Menu
  • Search all tables
  • GraphQL API

issues: 1090229430

This data as json

id node_id number title user state locked assignee milestone comments created_at updated_at closed_at author_association active_lock_reason draft pull_request body reactions performed_via_github_app state_reason repo type
1090229430 I_kwDOAMm_X85A-5S2 6124 bool(ds) should raise a "the truth value of a Dataset is ambiguous" error 3698640 open 0     15 2021-12-29T02:35:39Z 2023-03-12T15:47:01Z   CONTRIBUTOR      

Throwing this out there - happy to be shot down if people are opposed.

Current behavior / griping

Currently, coercing a dataset to a boolean invokes ds.__bool__, which in turn calls bool(ds.data_vars):

python class Dataset(DataWithCoords, DatasetArithmetic, Mapping): ... def __bool__(self) -> bool: return bool(self.data_vars)

This has the unfortunate property of returning True as long as there is at least one data_variable, regardless of the contents.

Currently, the behavior of Dataset.__bool__ is, at least as far as I've seen, never helpful but frequently unhelpful. I've seen (and written) tests written for DataArrays being passed a Dataset and suddenly the tests are meaningless so many times. Conversely, I've never found a legitimate use case for bool(ds). As far as I can tell, this is essentially the same as len(ds.data_vars) > 0.

In fact, while testing out my proposed changes below on a fork, I found two tests in the xarray test suite that had succumbed to this issue: see https://github.com/pydata/xarray/pull/6122 and https://github.com/pydata/xarray/pull/6123.

This has been discussed before - see https://github.com/pydata/xarray/issues/4290. This discussion focused on the question "should bool(xr.Dataset({'a': False})) return False?". I agree that it's not clear when it should be false and picking a behavior which deviates from Mapping feels arbitrary and gross.

Proposed behavior

I'm proposing that the API be changed, so that bool(xr.Dataset({'a': False})) raise an error, similar to the implementation in pd.Series.

In this implementation in pandas, attempting to evaluate even a single-element series as a boolean raises an error:

```python In [14]: bool(pd.Series([False]))


ValueError Traceback (most recent call last) <ipython-input-14-b0ad7f4d9277> in <module> ----> 1 bool(pd.Series([False]))

~/miniconda3/envs/rhodium-env/lib/python3.9/site-packages/pandas/core/generic.py in nonzero(self) 1532 @final 1533 def nonzero(self): -> 1534 raise ValueError( 1535 f"The truth value of a {type(self).name} is ambiguous. " 1536 "Use a.empty, a.bool(), a.item(), a.any() or a.all()."

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all(). ```

I understand hesitancy around changing the core API. That said, if anyone can find an important, correct use of bool(ds) in the wild I'll eat my hat :)

Implementation

This could be as simple as raising an error on ds.__bool__, something like:

python class Dataset(DataWithCoords, DatasetArithmetic, Mapping): ... def __bool__(self) -> bool: raise ValueError( "The truth value of a Dataset is ambiguous. Reduce the data " "variables to a scalar value with any(ds.values()) or " "all(ds.values())." )

The only other change that would be needed is an assertion that directly calls bool(ds) in test_dataset::TestDataset.test_properties, which checks for the exact behavior I'm changing:

python assert bool(ds)

This would need to be changed to:

python with pytest.raises(ValueError): bool(ds)

If this sounds good, I can submit a PR with these changes.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/6124/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    13221727 issue

Links from other tables

  • 2 rows from issues_id in issues_labels
  • 15 rows from issue in issue_comments
Powered by Datasette · Queries took 0.797ms · About: xarray-datasette