issues: 1090229430

This data as json

id	node_id	number	title	user	state	locked	assignee	milestone	comments	created_at	updated_at	closed_at	author_association	active_lock_reason	draft	pull_request	body	reactions	performed_via_github_app	state_reason	repo	type
1090229430	I_kwDOAMm_X85A-5S2	6124	bool(ds) should raise a "the truth value of a Dataset is ambiguous" error	3698640	open	0			15	2021-12-29T02:35:39Z	2023-03-12T15:47:01Z		CONTRIBUTOR				Throwing this out there - happy to be shot down if people are opposed. Current behavior / griping Currently, coercing a dataset to a boolean invokes `ds.__bool__`, which in turn calls `bool(ds.data_vars)`: `python class Dataset(DataWithCoords, DatasetArithmetic, Mapping): ... def __bool__(self) -> bool: return bool(self.data_vars)` This has the unfortunate property of returning True as long as there is at least one data_variable, regardless of the contents. Currently, the behavior of `Dataset.__bool__` is, at least as far as I've seen, never helpful but frequently unhelpful. I've seen (and written) tests written for DataArrays being passed a Dataset and suddenly the tests are meaningless so many times. Conversely, I've never found a legitimate use case for `bool(ds)`. As far as I can tell, this is essentially the same as `len(ds.data_vars) > 0`. In fact, while testing out my proposed changes below on a fork, I found two tests in the xarray test suite that had succumbed to this issue: see https://github.com/pydata/xarray/pull/6122 and https://github.com/pydata/xarray/pull/6123. This has been discussed before - see https://github.com/pydata/xarray/issues/4290. This discussion focused on the question "should `bool(xr.Dataset({'a': False}))` return False?". I agree that it's not clear when it should be false and picking a behavior which deviates from Mapping feels arbitrary and gross. Proposed behavior I'm proposing that the API be changed, so that `bool(xr.Dataset({'a': False}))` raise an error, similar to the implementation in `pd.Series`. In this implementation in pandas, attempting to evaluate even a single-element series as a boolean raises an error: ```python In [14]: bool(pd.Series([False])) ValueError Traceback (most recent call last) <ipython-input-14-b0ad7f4d9277> in <module> ----> 1 bool(pd.Series([False])) ~/miniconda3/envs/rhodium-env/lib/python3.9/site-packages/pandas/core/generic.py in nonzero(self) 1532 @final 1533 def nonzero(self): -> 1534 raise ValueError( 1535 f"The truth value of a {type(self).name} is ambiguous. " 1536 "Use a.empty, a.bool(), a.item(), a.any() or a.all()." ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all(). ``` I understand hesitancy around changing the core API. That said, if anyone can find an important, correct use of `bool(ds)` in the wild I'll eat my hat :) Implementation This could be as simple as raising an error on `ds.__bool__`, something like: `python class Dataset(DataWithCoords, DatasetArithmetic, Mapping): ... def __bool__(self) -> bool: raise ValueError( "The truth value of a Dataset is ambiguous. Reduce the data " "variables to a scalar value with any(ds.values()) or " "all(ds.values())." )` The only other change that would be needed is an assertion that directly calls `bool(ds)` in test_dataset::TestDataset.test_properties, which checks for the exact behavior I'm changing: `python assert bool(ds)` This would need to be changed to: `python with pytest.raises(ValueError): bool(ds)` If this sounds good, I can submit a PR with these changes.	{ "url": "https://api.github.com/repos/pydata/xarray/issues/6124/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }			13221727	issue

Links from other tables

2 rows from issues_id in issues_labels
15 rows from issue in issue_comments