issue_comments: 538570946

https://github.com/pydata/xarray/issues/2799#issuecomment-538570946 · user 6213168 · MEMBER · created 2019-10-04T21:48:18Z · updated 2019-10-06T21:56:58Z

I simplified the benchmark:

```python
from itertools import product

import numpy as np
import xarray as xr

shape = (10, 10, 10, 10)
index = (0, 0, 0, 0)
np_arr = np.ones(shape)
arr = xr.DataArray(np_arr)
named_index = dict(zip(arr.dims, index))

print(index)
print(named_index)

%timeit -n 1000 arr[index]
%timeit -n 1000 arr.isel(**named_index)
%timeit -n 1000 np_arr[index]
```
```
(0, 0, 0, 0)
{'dim_0': 0, 'dim_1': 0, 'dim_2': 0, 'dim_3': 0}
90.8 µs ± 5.12 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
88.5 µs ± 2.74 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
115 ns ± 6.71 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```
```python
%%prun -s cumulative
for _ in range(10000):
    arr[index]
```
```
5680003 function calls (5630003 primitive calls) in 1.890 seconds

Ordered by: cumulative time

         ncalls  tottime  percall  cumtime  percall filename:lineno(function)
              1    0.000    0.000    1.890    1.890 {built-in method builtins.exec}
              1    0.009    0.009    1.890    1.890 <string>:1(<module>)
          10000    0.011    0.000    1.881    0.000 dataarray.py:629(__getitem__)
          10000    0.030    0.000    1.801    0.000 dataarray.py:988(isel)
          10000    0.084    0.000    1.567    0.000 dataset.py:1842(isel)
          10000    0.094    0.000    0.570    0.000 dataset.py:1746(_validate_indexers)
          10000    0.029    0.000    0.375    0.000 variable.py:960(isel)
          10000    0.013    0.000    0.319    0.000 variable.py:666(__getitem__)
          20000    0.014    0.000    0.251    0.000 dataset.py:918(_replace_with_new_dims)
          50000    0.028    0.000    0.245    0.000 variable.py:272(__init__)
          10000    0.035    0.000    0.211    0.000 variable.py:487(_broadcast_indexes)
1140000/1100000    0.100    0.000    0.168    0.000 {built-in method builtins.isinstance}
          10000    0.050    0.000    0.157    0.000 dataset.py:1802(_get_indexers_coords_and_indexes)
          20000    0.025    0.000    0.153    0.000 dataset.py:868(_replace)
          50000    0.085    0.000    0.152    0.000 variable.py:154(as_compatible_data)
```
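For reference, the same comparison runs outside IPython too; a minimal plain-`timeit` sketch (exact numbers will of course differ from the `%timeit` runs above):

```python
import timeit

import numpy as np
import xarray as xr

shape = (10, 10, 10, 10)
index = (0, 0, 0, 0)
np_arr = np.ones(shape)
arr = xr.DataArray(np_arr)
named_index = dict(zip(arr.dims, index))

# One (label, callable) pair per case benchmarked with %timeit above.
cases = [
    ("arr[index]", lambda: arr[index]),
    ("arr.isel(**named_index)", lambda: arr.isel(**named_index)),
    ("np_arr[index]", lambda: np_arr[index]),
]
for label, func in cases:
    per_call = timeit.timeit(func, number=1000) / 1000
    print(f"{label:25s} {per_call * 1e6:8.2f} µs per call")
```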

Time breakdown (exclusive time per layer, in seconds):

| Total | 1.881 |
| -- | -- |
| `DataArray.__getitem__` | 0.080 |
| `DataArray.isel` (`_to_temp_dataset` round trip) | 0.234 |
| `Dataset.isel` | 0.622 |
| `Dataset._validate_indexers` | 0.570 |
| `Variable.isel` | 0.056 |
| `Variable.__getitem__` | 0.319 |
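(These per-layer figures are just the cumtimes from the `%%prun` table with the next layer's cumtime subtracted out; a quick check of the arithmetic:)

```python
# Exclusive per-layer times, derived from the cumtimes in the %%prun output.
cum = {
    "DataArray.__getitem__": 1.881,
    "DataArray.isel": 1.801,
    "Dataset.isel": 1.567,
    "Dataset._validate_indexers": 0.570,
    "Variable.isel": 0.375,
    "Variable.__getitem__": 0.319,
}
print(round(cum["DataArray.__getitem__"] - cum["DataArray.isel"], 3))  # 0.08
print(round(cum["DataArray.isel"] - cum["Dataset.isel"], 3))           # 0.234
# Dataset.isel delegates to both _validate_indexers and Variable.isel:
print(round(cum["Dataset.isel"] - cum["Dataset._validate_indexers"]
            - cum["Variable.isel"], 3))                                # 0.622
print(round(cum["Variable.isel"] - cum["Variable.__getitem__"], 3))    # 0.056
```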

I can spot a few low-hanging fruits there:

- a huge amount of time is spent in `_validate_indexers`
- why is `Variable.__init__` being called 5 times per indexing operation?!? I expected 0.
- the bench strongly hints that we're creating dummy IndexVariables on the fly
- we're casting the DataArray to a Dataset, converting the positional index to a dict, and then converting it back to positional for each variable. Maybe it's a good idea to rewrite `DataArray.sel`/`isel` so that they don't use `_to_temp_dataset` (see the sketch below)?
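To make the last bullet concrete, here is a rough sketch of the kind of fast path such a rewrite could take. This is illustrative only, not xarray's actual implementation: `isel_direct` is a hypothetical name, and rebuilding coords/indexes is deliberately omitted.

```python
import numpy as np
import xarray as xr

def isel_direct(arr: xr.DataArray, indexers: dict) -> xr.Variable:
    """Hypothetical fast path: index the underlying Variable directly,
    skipping the DataArray -> Dataset -> DataArray round trip."""
    # Convert the name-based indexers to a positional key exactly once.
    key = tuple(indexers.get(dim, slice(None)) for dim in arr.dims)
    # .variable is public API; indexing it avoids the Dataset machinery.
    # A real rewrite would also have to rebuild coords/attrs, omitted here.
    return arr.variable[key]

arr = xr.DataArray(np.ones((10, 10, 10, 10)))
print(isel_direct(arr, {"dim_0": 0, "dim_3": 0}).dims)  # ('dim_1', 'dim_2')
```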

So, in short: while I don't think we can feasibly close the order-of-magnitude gap (800x) with numpy, I suspect we could get at least a 5x speedup here.

Reactions: +1 × 5