issue_comments


8 rows where author_association = "MEMBER", issue = 416962458 and user = 5635139 sorted by updated_at descending


All 8 comments are by max-sixty (author_association MEMBER) on the issue "Performance: numpy indexes small amounts of data 1000 faster than xarray" (#2799, id 416962458).
max-sixty (MEMBER) · 2019-11-13T21:03:23Z · comment 553601146
https://github.com/pydata/xarray/issues/2799#issuecomment-553601146

That's great, and that's helpful, @nbren12. Maybe we should add this to the docs (we don't really have a performance section at the moment; maybe we should start one with performance tips?)

There's some info on the differences in the Terminology doc that @gwgundersen wrote: https://github.com/pydata/xarray/blob/master/doc/terminology.rst#L18

Essentially: by indexing on the variable, you ignore the coordinates, and so skip a bunch of code that takes the object apart and puts it back together. A variable is much more similar to a numpy array, so you can't do `sel`, for example.
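To make that concrete, here's a minimal sketch (using a small hand-built DataArray in place of any real data) showing that indexing `.variable` returns a plain Variable, which skips the coordinate machinery but also drops label-based selection:

```python
import numpy as np
import xarray as xr

# Hypothetical stand-in DataArray, just for illustration
da = xr.DataArray(
    np.arange(12.0).reshape(3, 4),
    dims=("time", "lon"),
    coords={"time": [0, 1, 2], "lon": [10.0, 20.0, 30.0, 40.0]},
)

sub = da[0]           # DataArray: coords are sliced and re-attached
var = da.variable[0]  # Variable: just dims + data, coords are ignored

print(type(sub).__name__)   # DataArray
print(type(var).__name__)   # Variable
print(hasattr(var, "sel"))  # False: no coords, so no label-based selection
```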

Reactions: none
max-sixty (MEMBER) · 2019-11-12T03:10:39Z · comment 552714604
https://github.com/pydata/xarray/issues/2799#issuecomment-552714604

One note: if you're indexing into a DataArray and don't care about the coords, index into the variable instead. That's 2x the numpy time, rather than 30x:

```python
In [26]: da = xr.tutorial.open_dataset('air_temperature')['air']

In [27]: da
Out[27]:
<xarray.DataArray 'air' (time: 2920, lat: 25, lon: 53)>
[3869000 values with dtype=float32]
Coordinates:
  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
  * time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Attributes:
    long_name:     4xDaily Air temperature at sigma level 995
    units:         degK
    precision:     2
    GRIB_id:       11
    GRIB_name:     TMP
    var_desc:      Air temperature
    dataset:       NMC Reanalysis
    level_desc:    Surface
    statistic:     Individual Obs
    parent_stat:   Other
    actual_range:  [185.16 322.1 ]

In [20]: %timeit da.variable[0]
28.2 µs ± 2.29 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [21]: %timeit da[0]
459 µs ± 37.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [22]: %timeit da.variable.values[0]
14.1 µs ± 183 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
```

Reactions: +1 × 2
max-sixty (MEMBER) · 2019-11-11T22:29:58Z · comment 552646381
https://github.com/pydata/xarray/issues/2799#issuecomment-552646381

To be clear, I think there's plenty we could do with relatively little complexity to speed up indexing operations on DataArrays. As an example, we could avoid the roundtrip through a temporary Dataset.

That's a different problem from making xarray as fast as indexing a numpy array, or allowing libraries to iterate through a DataArray in a hot loop.

Reactions: none
max-sixty (MEMBER) · 2019-10-07T16:39:54Z · comment 539100243
https://github.com/pydata/xarray/issues/2799#issuecomment-539100243

Great analysis, thanks

Do we have any idea which of those lines are the offenders? I used a tool called line_profiler a while ago, but maybe we know already (I'm guessing it's the two `_replace_with_new_dims` lines?)
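line_profiler gives per-line timings (via its `%lprun` magic in IPython); as a rough stdlib-only sketch of the same workflow, cProfile can at least confirm which function dominates. The function names below are made up purely for illustration:

```python
import cProfile
import io
import pstats

def _replace_with_new_dims_stub(n):
    # hypothetical stand-in for the suspected hot internal call
    return sum(range(n))

def index_once(n):
    for _ in range(200):
        _replace_with_new_dims_stub(n)

profiler = cProfile.Profile()
profiler.enable()
index_once(1_000)
profiler.disable()

# Print the five most expensive functions by cumulative time
out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(5)
report = out.getvalue()
print("_replace_with_new_dims_stub" in report)  # True: the hotspot shows up
```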

Reactions: none
max-sixty (MEMBER) · 2019-03-05T23:16:43Z · comment 469898607
https://github.com/pydata/xarray/issues/2799#issuecomment-469898607

> Cython + memoryviews isn't quite the right comparison here.

Right, to be clear, I'm only referring to the top two lines of the pasted benchmark; i.e. once we enter Python (even if only to access a numpy array) we're already losing a lot of speed relative to a loop that stays in C / Cython. So even if xarray were a Python front end to a C++ library, it still wouldn't be competitive where performance is paramount. ...unless PyPy sped that up; I'd be very interested to see.

Reactions: none
max-sixty (MEMBER) · 2019-03-05T21:19:31Z · comment 469861382
https://github.com/pydata/xarray/issues/2799#issuecomment-469861382

To put the relative speed of numpy access into perspective, I found this insightful: https://jakevdp.github.io/blog/2012/08/08/memoryview-benchmarks/ (it's now a few years out of date, but I think the fundamentals still stand)

Pasted from there:

Summary of the timing results from that post:

  • Python + numpy: 6510 ms
  • Cython + numpy: 668 ms
  • Cython + memviews (slicing): 22 ms
  • Cython + raw pointers: 2.47 ms
  • Cython + memviews (no slicing): 2.45 ms

So if we're running an inner loop over an array, accessing it using numpy from Python is an order of magnitude slower than accessing it using numpy from Cython (and that's an order of magnitude slower than using a memoryview slice, which is in turn an order of magnitude slower than raw pointers).

So: let's definitely speed xarray up (your benchmarks are excellent, thank you again, and I think you're right that there are opportunities for significant gains). But where speed is paramount above all else, we shouldn't use any access from Python, let alone the niceties of xarray access.
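The same boundary cost is easy to see without Cython: even over a plain numpy array, an element-by-element Python loop is dramatically slower than a single vectorized call that keeps the loop in C. A rough sketch (absolute times will vary by machine):

```python
import time
import numpy as np

x = np.random.rand(1_000_000)

t0 = time.perf_counter()
total = 0.0
for v in x:              # every access crosses the Python/C boundary
    total += v
loop_s = time.perf_counter() - t0

t0 = time.perf_counter()
fast = float(x.sum())    # one call; the loop stays in C
vec_s = time.perf_counter() - t0

print(f"python loop: {loop_s:.4f}s  numpy sum: {vec_s:.6f}s")
assert np.isclose(total, fast)
```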

Reactions: +1 × 1
max-sixty (MEMBER) · 2019-03-04T22:33:03Z · comment 469449165
https://github.com/pydata/xarray/issues/2799#issuecomment-469449165

You can always use xarray to process the data, and then extract the underlying array (`da.values`) for passing into something that expects a numpy array, or for running fast(ish) loops (we do this frequently).
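A minimal sketch of that pattern, with a tiny hypothetical DataArray in place of real data:

```python
import numpy as np
import xarray as xr

da = xr.DataArray(np.arange(6.0), dims="time")  # hypothetical data

da = da * 2        # do the labelled processing in xarray...
arr = da.values    # ...then pull out the underlying numpy array

# fast(ish) hot loop: plain numpy access, no xarray overhead per element
total = 0.0
for i in range(arr.shape[0]):
    total += arr[i]
print(total)  # 30.0
```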

Reactions: +1 × 1
max-sixty (MEMBER) · 2019-03-04T22:20:58Z · comment 469445483
https://github.com/pydata/xarray/issues/2799#issuecomment-469445483

Thanks for the benchmarks @nbren12, and for the clear explanation @shoyer

While we could do some performance work on that loop, I think we're most likely to see a material change by letting the external library access the array directly, without a looped Python call. That's consistent with the ideas @jhamman had a few days ago.

Reactions: none

Powered by Datasette · About: xarray-datasette