issues: 1337337135
id | node_id | number | title | user | state | locked | assignee | milestone | comments | created_at | updated_at | closed_at | author_association | active_lock_reason | draft | pull_request | body | reactions | performed_via_github_app | state_reason | repo | type |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1337337135 | I_kwDOAMm_X85PtiUv | 6911 | Public hypothesis strategies for generating xarray data | 35968931 | open | 0 | 0 | 2022-08-12T15:17:40Z | 2022-08-12T17:46:48Z | MEMBER |

Proposal

We should expose a public set of hypothesis strategies for use in testing xarray code. It could be useful for downstream users, but also for our own internal test suite. It should live in
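To give a flavour of what such public strategies might look like, here is a minimal hedged sketch built on `hypothesis` and `hypothesis.extra.numpy`. The strategy name `dataarrays` and its parameters are hypothetical, not an agreed-upon API:

```python
# Hedged sketch only: `dataarrays` and its parameters are illustrative,
# not a proposed public API. Built on hypothesis + hypothesis.extra.numpy.
import hypothesis.extra.numpy as npst
import hypothesis.strategies as st
import xarray as xr
from hypothesis import given


@st.composite
def dataarrays(draw, max_dims=3, max_side=4):
    """Generate small xr.DataArray objects with named dimensions."""
    data = draw(
        npst.arrays(
            dtype=npst.floating_dtypes(),
            shape=npst.array_shapes(max_dims=max_dims, max_side=max_side),
        )
    )
    dims = [f"dim_{i}" for i in range(data.ndim)]
    return xr.DataArray(data, dims=dims)


# Downstream users (and our own test suite) could then write property tests like:
@given(dataarrays())
def test_reduction_drops_the_reduced_dim(da):
    reduced = da.mean(dim=da.dims[0])
    assert da.dims[0] not in reduced.dims
```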
This issue is different from #1846 because that issue describes how we could use such strategies in our own testing code, whereas this issue is about how we create general strategies that we could use in many places (including exposing publicly). I've become interested in this as part of wanting to see #6894 happen. #6908 would effectively close this issue, but is itself just a pulled-out section of all the work @keewis did in #4972. (Also xref https://github.com/pydata/xarray/issues/2686. Also also @max-sixty didn't you have an issue somewhere about creating better and public test fixtures?)

Previous work

I was pretty surprised to see this comment by @Zac-HD in #1846
given that we might have just used that instead of writing new ones in #4972! (@keewis had you already seen that extension?) We could literally just include that extension in xarray and call this issue solved...

Shrinking performance of strategies

However, I was also reading yesterday about strategies that shrink, and I think we should make some effort to come up with strategies for producing xarray objects that shrink in a performant and well-motivated manner. In particular, by pooling the knowledge of the @xarray-dev core team we could try to create strategies that search for many of the edge cases that we are collectively aware of. My understanding of that guide is that our strategies ideally should:

1) Quickly include or exclude complexity
2) Deliberately generate known edge cases
3) Be very modular internally, to help with "keeping things local". Each sub-strategy should be in its own function, so that hypothesis' decision tree can cut branches off as soon as possible (see the sketch below).
4) Avoid obvious inefficiencies e.g. not

Perhaps the solutions implemented in #6894 or this hypothesis xarray extension already meet these criteria - I'm not sure. I just wanted a dedicated place to discuss building the strategies specifically, without it getting mixed in with complicated discussions about whatever we're trying to use the strategies for! |
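To make points 2) and 3) above concrete, here is a hedged sketch of splitting the generation into small sub-strategies, with known edge cases such as zero-length dimensions and awkward dimension names deliberately mixed in. Every function name below is hypothetical and this is not the extension or PR discussed above:

```python
# Illustrative only: each building block is its own small strategy so that
# hypothesis' decision tree can discard or shrink it independently.
import hypothesis.extra.numpy as npst
import hypothesis.strategies as st
import xarray as xr


def dimension_names(min_dims=1, max_dims=3):
    """Unique dimension names, deliberately including awkward ones."""
    names = st.one_of(
        st.sampled_from(["x", "time", "dim_0", "with space", "ünïcode"]),
        st.text(min_size=1, max_size=5),
    )
    return st.lists(names, min_size=min_dims, max_size=max_dims, unique=True)


def dimension_sizes():
    """Per-dimension sizes, including the zero-length edge case."""
    return st.integers(min_value=0, max_value=4)


def small_attrs():
    """Small attribute dicts, including the empty dict."""
    return st.dictionaries(st.text(max_size=5), st.integers(), max_size=3)


@st.composite
def variables(draw):
    """Compose the sub-strategies into an xr.Variable."""
    dims = draw(dimension_names())
    shape = tuple(draw(dimension_sizes()) for _ in dims)
    data = draw(npst.arrays(dtype=npst.scalar_dtypes(), shape=shape))
    return xr.Variable(dims, data, attrs=draw(small_attrs()))
```

Because each piece is its own function, a failing example involving, say, a zero-sized dimension can shrink the sizes without disturbing the dtype or attrs choices.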
{ "url": "https://api.github.com/repos/pydata/xarray/issues/6911/reactions", "total_count": 1, "+1": 1, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
13221727 | issue |