home / github

Menu
  • GraphQL API
  • Search all tables

issue_comments

Table actions
  • GraphQL API for issue_comments

5 rows where author_association = "MEMBER", issue = 416962458 and user = 1217238 sorted by updated_at descending

✎ View and edit SQL

This data as json, CSV (advanced)

user 1

  • shoyer · 5 ✖

issue 1

  • Performance: numpy indexes small amounts of data 1000 faster than xarray · 5 ✖

author_association 1

  • MEMBER · 5 ✖
id html_url issue_url node_id user created_at updated_at ▲ author_association body reactions performed_via_github_app issue
786800631 https://github.com/pydata/xarray/issues/2799#issuecomment-786800631 https://api.github.com/repos/pydata/xarray/issues/2799 MDEyOklzc3VlQ29tbWVudDc4NjgwMDYzMQ== shoyer 1217238 2021-02-26T17:56:07Z 2021-02-26T17:56:07Z MEMBER

I agree, I think a "xarray lite" package with only named dimensions could indeed be a valuable contribution.

I'd love to optimize xarray further, but I suspect you would probably have to write the core in a language like C++ to achieve similar performance to NumPy.

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Performance: numpy indexes small amounts of data 1000 faster than xarray 416962458
552655149 https://github.com/pydata/xarray/issues/2799#issuecomment-552655149 https://api.github.com/repos/pydata/xarray/issues/2799 MDEyOklzc3VlQ29tbWVudDU1MjY1NTE0OQ== shoyer 1217238 2019-11-11T22:57:55Z 2019-11-11T22:57:55Z MEMBER

Sure, I just wanted to make the note that this operation should be more or less constant time, as opposed to dependent on the size of the array.

Yes, I think this is still the case for slicing in xarray. There's just much larger constant overhead than in NumPy. (And this is difficult to fix short of rewriting xarray's core in C.)

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Performance: numpy indexes small amounts of data 1000 faster than xarray 416962458
469869298 https://github.com/pydata/xarray/issues/2799#issuecomment-469869298 https://api.github.com/repos/pydata/xarray/issues/2799 MDEyOklzc3VlQ29tbWVudDQ2OTg2OTI5OA== shoyer 1217238 2019-03-05T21:43:18Z 2019-03-05T21:43:32Z MEMBER

Cython + memoryviews isn't quite the right comparison here. I'm sure ordering here is correct, but relative magnitude of the performance difference should be smaller.

Xarray's core is bottlenecked on: 1. Overhead of abstraction with normal Python operations (e.g., function calls) in non-numeric code (all the heavy numerics is offloaded to NumPy or pandas). 2. The dynamic nature of our APIs, which means we need to do lots of type checking. Notice how high up builtins.isinstance appears in that performance profile!

C++ offers very low-cost abstraction but dynamism is still slow. Even then, compilers are much better at speeding up tight numeric loops than complex domain logic.

As a point of reference, it would be interesting to see these performance numbers running pypy, which I think should be able to handle everything in xarray. You'll note that pypy is something like 7x faster than CPython in their benchmark suite, which I suspect is closer to what we'd see if we wrote xarray's core in a language like C++, e.g., as Python interface to xframe.

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Performance: numpy indexes small amounts of data 1000 faster than xarray 416962458
469444519 https://github.com/pydata/xarray/issues/2799#issuecomment-469444519 https://api.github.com/repos/pydata/xarray/issues/2799 MDEyOklzc3VlQ29tbWVudDQ2OTQ0NDUxOQ== shoyer 1217238 2019-03-04T22:17:58Z 2019-03-04T22:17:58Z MEMBER

To be clear, pull requests improving performance (without significantly loss of readability) would be very welcome. Be sure to include a new benchmark in our benchmark suite.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Performance: numpy indexes small amounts of data 1000 faster than xarray 416962458
469439957 https://github.com/pydata/xarray/issues/2799#issuecomment-469439957 https://api.github.com/repos/pydata/xarray/issues/2799 MDEyOklzc3VlQ29tbWVudDQ2OTQzOTk1Nw== shoyer 1217238 2019-03-04T22:03:37Z 2019-03-04T22:16:49Z MEMBER

While python will always be slower than C when iterating over an array in this fashion, I would hope that xarray could be nearly as fast as numpy. I am not sure what the best way to improve this is though.

I'm sure it's possible to optimize this significantly, but short of rewriting this logic in a lower level language it's pretty much impossible to match the speed of NumPy.

This benchmark might give some useful context: ``` def dummy_isel(args, *kwargs): pass

def index_dummy(named_indices, arr): for named_index in named_indices: dummy_isel(arr, **named_index) %%timeit -n 10 index_dummy(named_indices, arr) ```

On my machine, this is already twice as slow as your NumPy benchmark (497 µs vs 251 µs) , and all it's doing is parsing *args and **kwargs! Every Python function/method call involving keyword arguments adds about 0.5 ns of overhead, because the highly optimized dict is (relatively) slow compared to positional arguments. In my experience it is almost impossible to get the overhead of a Python function call below a few microseconds.

Right now we're at about 130 µs per indexing operation. In the best case, we might make this 10x faster but even that would be quite challenging, e.g., consider that even creating a DataArray takes about 20 µs.

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Performance: numpy indexes small amounts of data 1000 faster than xarray 416962458

Advanced export

JSON shape: default, array, newline-delimited, object

CSV options:

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);
Powered by Datasette · Queries took 4790.554ms · About: xarray-datasette