issue_comments: 609471635

html_url: https://github.com/pydata/xarray/pull/3925#issuecomment-609471635
issue_url: https://api.github.com/repos/pydata/xarray/issues/3925
id: 609471635
node_id: MDEyOklzc3VlQ29tbWVudDYwOTQ3MTYzNQ==
user: 1217238
created_at: 2020-04-05T19:45:07Z
updated_at: 2020-04-05T19:45:07Z
author_association: MEMBER

I think this is generally a good idea!

In the future, creating an index explicitly would just mean:

1. Repeated lookups are more efficient, due to caching in a hash table.
2. The coordinate values are immutable, to ensure that the cached index values stay valid.

One minor concern I have here is about efficiency: building a pd.Index and its hash table from scratch can be quite expensive. If we're only doing a single lookup this is fine, but if it's done in a loop this could be surprisingly slow, and we would do far better sticking with pure NumPy operations.

Here's a microbenchmark that hopefully illustrates the issue:
```
import pandas as pd
import numpy as np


def lookup_preindexed(needle, index):
    return index.get_loc(needle)


def lookup_newindex(needle, haystack):
    return lookup_preindexed(needle, pd.Index(haystack))


def lookup_numpy(needle, haystack):
    return (haystack == needle).argmax()


haystack = np.random.permutation(np.arange(1000000))
index = pd.Index(haystack)

%timeit lookup_newindex(0, haystack)   # 56.1 ms per loop
%timeit lookup_preindexed(0, index)    # 696 ns per loop
%timeit lookup_numpy(0, haystack)      # 517 µs per loop
```

pandas is roughly 1000x faster than NumPy when the index already exists, but roughly 100x slower when the index has to be built first. That's about a 1e5-fold difference!
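The practical upshot for repeated lookups is to hoist the pd.Index construction out of the loop. A minimal sketch (names are illustrative, not from xarray):

```python
import numpy as np
import pandas as pd

haystack = np.random.permutation(np.arange(1_000_000))
needles = [0, 1, 2, 3, 4]

# Slow: rebuilds the Index (and its hash table) on every iteration.
slow = [pd.Index(haystack).get_loc(n) for n in needles]

# Fast: build the Index once, then reuse its cached hash table.
index = pd.Index(haystack)
fast = [index.get_loc(n) for n in needles]

assert slow == fast  # same positions either way, very different cost
```

Both loops return the same positions; only the amortized cost of building the hash table differs.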

I think users will appreciate the flexibility, but if there's some way to warn users that they really should set the index ahead of time when they're doing repeated indexing, that would also be welcome. Figuring out where to store the state for counting how many times a new index is created could be pretty messy, though. I guess we could stuff it into Variable.encoding and issue a warning whenever the same variable has been converted into an index at least 100 times.
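A minimal sketch of that warning idea (not xarray's actual implementation; a plain dict stands in for Variable.encoding, and the counter key and threshold are hypothetical):

```python
import warnings

import numpy as np
import pandas as pd

INDEX_CREATION_WARNING_THRESHOLD = 100  # hypothetical cutoff


def get_index(encoding, values):
    """Build a pd.Index from values, counting re-creations in `encoding`
    (a stand-in for Variable.encoding) and warning past a threshold."""
    count = encoding.get("_index_creation_count", 0) + 1
    encoding["_index_creation_count"] = count
    if count >= INDEX_CREATION_WARNING_THRESHOLD:
        warnings.warn(
            f"this variable has been converted to an index {count} times; "
            "consider setting an explicit index to avoid rebuilding "
            "the hash table on every lookup",
            RuntimeWarning,
        )
    return pd.Index(values)


encoding = {}  # stands in for Variable.encoding
haystack = np.arange(10)
for needle in range(5):
    idx = get_index(encoding, haystack)
    assert idx.get_loc(needle) == needle
```

The messy part this glosses over is exactly what the comment notes: encoding travels with the variable, so the counter has to survive copies and operations without leaking into saved files.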

reactions:
{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}