
issue_comments: 765561903


html_url: https://github.com/pydata/xarray/pull/4700#issuecomment-765561903
issue_url: https://api.github.com/repos/pydata/xarray/issues/4700
id: 765561903
node_id: MDEyOklzc3VlQ29tbWVudDc2NTU2MTkwMw==
user: 13301940
created_at: 2021-01-22T17:13:39Z
updated_at: 2021-01-22T17:14:52Z
author_association: MEMBER

> Yes, I'd say go ahead. (I just hope it's not too big of a performance hit for normal use cases.)

@mathause, I am noticing a performance hit even for the special use cases. Here's how I am doing the sampling:

```python
sample_indices = np.random.choice(array.size, size=min(20, array.size), replace=False)
native_dtypes = set(np.vectorize(type, otypes=[object])(array.ravel()[sample_indices]))
```
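For anyone who wants to try the sampling step outside xarray, here is a minimal self-contained sketch. The helper name `sample_native_types`, its `n_samples` and `seed` parameters, and the use of `np.random.default_rng` in place of `np.random.choice` are assumptions for illustration and are not part of the PR:

```python
import numpy as np


def sample_native_types(array, n_samples=20, seed=None):
    """Return the set of Python types found in a random sample of `array`.

    Hypothetical helper: it only inspects up to `n_samples` elements, so it
    can miss a type that occurs rarely in a large array.
    """
    array = np.asarray(array)
    if array.size == 0:
        return set()

    rng = np.random.default_rng(seed)
    # Draw distinct flat indices; the sample is capped at the array size.
    sample_indices = rng.choice(
        array.size, size=min(n_samples, array.size), replace=False
    )

    flat = array.ravel()
    # Indexing an object array returns the stored Python object itself.
    return {type(flat[i]) for i in sample_indices}
```

The switch to `Generator.choice` is deliberate: with `replace=False`, the legacy `np.random.choice` builds a permutation of the whole population before slicing out the sample, so its cost grows with `array.size` rather than with the sample size. That may account for some of the overhead in the timings below, though I have not confirmed it.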

and here's the code snippet I tested this on:

```python
In [1]: import xarray as xr, numpy as np

In [2]: x = np.asarray(list("abcdefghijklmnopqrstuvwxyz"), dtype="object")

In [3]: array = np.repeat(x, 5_000_000)

In [4]: array.size
Out[4]: 130000000

In [5]: array.dtype
Out[5]: dtype('O')
```

Without sampling:

```python
In [6]: %timeit xr.conventions._infer_dtype(array, "test")
7.63 s ± 515 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

With sampling:

```python
In [15]: %timeit xr.conventions._infer_dtype(array, "test")
8.31 s ± 395 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

I could be wrong, but the sampling doesn't seem to be worth it.

issue: 768981497