issue_comments: 765561903

html_url	issue_url	id	node_id	user	created_at	updated_at	author_association	body	reactions	performed_via_github_app	issue
https://github.com/pydata/xarray/pull/4700#issuecomment-765561903	https://api.github.com/repos/pydata/xarray/issues/4700	765561903	MDEyOklzc3VlQ29tbWVudDc2NTU2MTkwMw==	13301940	2021-01-22T17:13:39Z	2021-01-22T17:14:52Z	MEMBER	Yes, I'd say go ahead. (I just hope it's not too big of a performance hit for normal use cases.) @mathause, I am noticing a performance hit even for the special use cases. Here's how I am doing the sampling `python sample_indices = np.random.choice(array.size, size=min(20, array.size), replace=False) native_dtypes = set(np.vectorize(type, otypes=[object])(array.ravel()[sample_indices]))` and here's the code snippet I tested this on: ```python In [1]: import xarray as xr, numpy as np In [2]: x = np.asarray(list("abcdefghijklmnopqrstuvwxyz"), dtype="object") In [3]: array = np.repeat(x, 5_000_000) In [4]: array.size Out[4]: 130000000 In [5]: array.dtype Out[5]: dtype('O') ``` Without sampling `python In [6]: %timeit xr.conventions._infer_dtype(array, "test") 7.63 s ± 515 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)` With sampling `python In [15]: %timeit xr.conventions._infer_dtype(array, "test") 8.31 s ± 395 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)` I could be wrong, but the sampling doesn't seem to be worth it.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }		768981497
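
The sampling idea described in the comment body can be sketched as a standalone helper. This is only an illustration under stated assumptions, not the code from the pull request: the function name `sample_native_types` and its `max_samples` and `seed` parameters are hypothetical, and it swaps the `np.random.choice` call quoted in the comment for NumPy's `Generator.choice`. It is not wired into `xr.conventions._infer_dtype`, so it says nothing about the timings reported above.

```python
import numpy as np


def sample_native_types(array, max_samples=20, seed=None):
    """Return the set of Python types seen in a random sample of `array`.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    flat = np.asarray(array).ravel()
    if flat.size == 0:
        return set()

    # Draw at most `max_samples` distinct positions from the flattened array.
    rng = np.random.default_rng(seed)
    n_samples = min(max_samples, flat.size)
    sample_indices = rng.choice(flat.size, size=n_samples, replace=False)

    # Collect the type of each sampled element.
    return {type(value) for value in flat[sample_indices]}


if __name__ == "__main__":
    x = np.asarray(list("abcdefghijklmnopqrstuvwxyz"), dtype="object")
    array = np.repeat(x, 1_000)  # much smaller than the array in the comment
    print(sample_native_types(array, seed=0))
```

Because only a handful of elements are inspected, a sample can miss a rare type in a mixed object array, so any check built on it would be a heuristic rather than a guarantee.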