
issue_comments


12 rows where author_association = "CONTRIBUTOR" and issue = 416962458, sorted by updated_at descending


Comments per user: nbren12 (6), hmaarrfk (4), ashwinvis (2)

Issue: Performance: numpy indexes small amounts of data 1000 faster than xarray (416962458)
Each entry below lists the comment id, author (user id), created/updated timestamps, and author_association, followed by the comment URL, body, and any reactions.
1306327743 · hmaarrfk (90008) · created 2022-11-07T22:45:07Z · updated 2022-11-07T22:45:07Z · CONTRIBUTOR
https://github.com/pydata/xarray/issues/2799#issuecomment-1306327743

As I've recently been going down this performance rabbit hole, I think the discussion around https://github.com/pydata/xarray/issues/7045 is relevant; it provides some additional historical context as to why this performance penalty might be happening.

Reactions: none

786813358 · hmaarrfk (90008) · created 2021-02-26T18:19:28Z · updated 2021-02-26T18:19:28Z · CONTRIBUTOR
https://github.com/pydata/xarray/issues/2799#issuecomment-786813358

I hope the following can help users who struggle with the speed of xarray.

I've found that when doing numerical computation, I often use xarray to grab all the metadata relevant to my computation: scale, chromaticity, experimental information.

Eventually, I create a function that acts as a barrier:
  • xarray input (high-level experimental data)
  • computation-parameter output (the low-level information the implementation needs)

The low-level implementation can then operate on fast numpy arrays. I've found this to be the central tension between high-level APIs that sanitize inputs (xarray routines like `_validate_indexers` and `_broadcast_indexes`) and low-level APIs that are simply interested in moving and computing data.

For the example that @nbren12 brought up originally, it might be better to add xarray routines (if they don't exist already) that return fast iterators over the underlying numpy arrays, given the set of dimensions the user cares about.

Reactions: none

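A minimal sketch of the barrier pattern described in the comment above. The function names and the "scale" attribute are illustrative assumptions, not xarray API; only `DataArray.data`, `.attrs`, and `.copy(data=...)` are real:

```python
import numpy as np
import xarray as xr

def lowlevel_scale(data: np.ndarray, scale: float) -> np.ndarray:
    # Pure-numpy kernel: no xarray overhead inside the hot path.
    return data * scale

def scale_image(da: xr.DataArray) -> xr.DataArray:
    # Barrier: unpack the xarray object once, up front.
    scale = float(da.attrs.get("scale", 1.0))  # hypothetical metadata key
    result = lowlevel_scale(da.data, scale)
    # Re-wrap so callers keep dims, coords, and attrs.
    return da.copy(data=result)

da = xr.DataArray(np.ones((4, 4)), dims=("y", "x"), attrs={"scale": 2.0})
print(scale_image(da).values[0, 0])  # 2.0
```
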
786764651 · nbren12 (1386642) · created 2021-02-26T16:51:50Z · updated 2021-02-26T16:51:50Z · CONTRIBUTOR
https://github.com/pydata/xarray/issues/2799#issuecomment-786764651

@jhamman Weren't you talking about an xarray lite (TM) package?

Reactions: none

553294966 · nbren12 (1386642) · created 2019-11-13T08:32:05Z · updated 2019-11-13T08:32:16Z · CONTRIBUTOR
https://github.com/pydata/xarray/issues/2799#issuecomment-553294966

This Variable workaround is awesome, @max-sixty. Are there any guidelines on when to use Variable vs. DataArray? Some calculations (e.g. fast differences and derivative/stencil operations) seem cleaner without explicit coordinate labels.

Reactions: none

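For reference, a minimal sketch of that Variable workaround (`DataArray.variable` and `xr.Variable` are real xarray API; the difference kernel is a made-up example):

```python
import numpy as np
import xarray as xr

da = xr.DataArray(np.random.rand(1024, 1024), dims=("y", "x"))

# Drop to the Variable layer: dims and attrs survive, but there is no
# coordinate/index machinery to pay for on each operation.
v = da.variable

# A forward difference along x, written without coordinate labels.
diff = v[:, 1:] - v[:, :-1]

# Re-wrap as a DataArray once labels matter again.
out = xr.DataArray(diff)
```
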
552652019 · hmaarrfk (90008) · created 2019-11-11T22:47:47Z · updated 2019-11-11T22:47:47Z · CONTRIBUTOR
https://github.com/pydata/xarray/issues/2799#issuecomment-552652019

Sure, I just wanted to note that this operation should be more or less constant time, as opposed to dependent on the size of the array. Somebody had mentioned that it should increase with the size of the array.

Reactions: +1 × 1

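That constant-time claim is easy to check directly; a quick sketch with timeit:

```python
import timeit
import numpy as np

# Basic slicing returns a view, so the cost should not grow with array size.
for size in (256, 1024, 4096):
    n = np.zeros((size, size))
    t = timeit.timeit(lambda: n[size // 4 : size // 2, size // 4 : size // 2],
                      number=100_000)
    print(f"{size}x{size}: {t / 100_000 * 1e9:.0f} ns per slice")
```
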
552619589 · hmaarrfk (90008) · created 2019-11-11T21:16:36Z · updated 2019-11-11T21:16:36Z · CONTRIBUTOR
https://github.com/pydata/xarray/issues/2799#issuecomment-552619589

Hmm, slicing should basically be a no-op.

The fact that xarray makes it about 100x slower is a real killer. It seems from this conversation that it might be hard to work around.

```python
import xarray as xr
import numpy as np

n = np.zeros(shape=(1024, 1024))
x = xr.DataArray(n, dims=('y', 'x'))
the_slice = np.s_[256:512, 256:512]

%timeit n[the_slice]
# 186 ns ± 0.778 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

%timeit x[the_slice]
# 70.3 µs ± 593 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
```
Reactions: none

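A workaround consistent with those timings: when labels aren't needed, slice the underlying array through `x.data` (or the `x.variable` layer) and pay the DataArray cost only at the boundaries. A small sketch reusing the setup above:

```python
import numpy as np
import xarray as xr

n = np.zeros(shape=(1024, 1024))
x = xr.DataArray(n, dims=('y', 'x'))
the_slice = np.s_[256:512, 256:512]

sub_numpy = x.data[the_slice]    # plain numpy view: nanoseconds
sub_var = x.variable[the_slice]  # Variable path: keeps dims, skips indexes
sub_da = x[the_slice]            # full DataArray path: microseconds
```
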
539352070 · ashwinvis (9155111) · created 2019-10-08T06:08:27Z · updated 2019-10-08T06:08:48Z · CONTRIBUTOR
https://github.com/pydata/xarray/issues/2799#issuecomment-539352070

I suspect system jitter in the profiling, since the time for `Dataset.isel` went up. It would be useful to run `sudo python -m pyperf system tune` before running profilers/benchmarks.

Reactions: none

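Beyond system tuning, pyperf can also drive the benchmark itself, calibrating loop counts and spawning worker processes to average out jitter. A minimal sketch, run as a script (the toy Dataset is an assumption):

```python
import pyperf

runner = pyperf.Runner()
runner.timeit(
    "dataset isel",
    stmt="ds.isel(x=0)",
    setup=(
        "import numpy as np, xarray as xr\n"
        "ds = xr.Dataset({'a': (('x', 'y'), np.zeros((100, 100)))})"
    ),
)
```
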
538366978 · ashwinvis (9155111) · created 2019-10-04T11:57:10Z · updated 2019-10-04T11:57:10Z · CONTRIBUTOR
https://github.com/pydata/xarray/issues/2799#issuecomment-538366978

> At first sight it looks somewhat like a hybrid between Cython (for the ahead-of-time transpiling to C++) and numba (for having Python-compatible syntax).

Not really. Pythran always releases the GIL and does a bunch of optimizations between transpilation and compilation.

A good approach would be to try out different compilers and see what performance is obtained without losing readability (https://github.com/pydata/xarray/issues/2799#issuecomment-469444519). See scikit-image/scikit-image#4199, where the package transonic was being experimentally tested to replace Cython-only code with Python code plus type hints. As a bonus, you get to switch between Cython, Pythran, and Numba.

Reactions: none

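As a rough illustration of the decorator-based approach these tools share, here is the same idea with numba's `@njit` (transonic's `@boost` is analogous, with the backend left switchable); the kernel itself is a made-up example:

```python
import numpy as np
from numba import njit

@njit  # compiled to machine code on first call
def forward_diff(a):
    out = np.empty((a.shape[0], a.shape[1] - 1))
    for i in range(a.shape[0]):
        for j in range(a.shape[1] - 1):
            out[i, j] = a[i, j + 1] - a[i, j]
    return out

# First call triggers compilation; subsequent calls run at C-like speed.
forward_diff(np.random.rand(64, 64))
```
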
469451210 · nbren12 (1386642) · created 2019-03-04T22:40:07Z · updated 2019-03-04T22:40:07Z · CONTRIBUTOR
https://github.com/pydata/xarray/issues/2799#issuecomment-469451210

Sure, I've been using that as a workaround as well. Unfortunately, that approach throws away all the nice info (e.g. metadata, coordinates) that xarray objects have, and it requires duplicating much of xarray's indexing logic.

Reactions: +1 × 3

469447632 · nbren12 (1386642) · created 2019-03-04T22:27:57Z · updated 2019-03-04T22:27:57Z · CONTRIBUTOR
https://github.com/pydata/xarray/issues/2799#issuecomment-469447632

@max-sixty I tend to agree that this use case could be outside the scope of xarray. It sounds like significant progress might require re-implementing core xarray objects in C/Cython. Without more than a 10x improvement, I would probably just continue using numpy arrays.

Reactions: none

469443856 · nbren12 (1386642) · created 2019-03-04T22:15:49Z · updated 2019-03-04T22:15:49Z · CONTRIBUTOR
https://github.com/pydata/xarray/issues/2799#issuecomment-469443856

Thanks so much @shoyer. I didn't realize there was that much overhead in a single function call. OTOH, 2x slower than numpy would be way better than 1000x.

After looking at the profiling info more, I tend to agree with your estimate of a 10x maximum speed-up. A couple of particularly slow functions (e.g. `Dataset._validate_indexers`) account for about 75% of the run time, but the remaining 25% is split across several other pure-Python routines.

Reactions: +1 × 1

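The profiling behind numbers like these can be reproduced with the standard library; a minimal sketch using cProfile (the toy Dataset is an assumption):

```python
import cProfile
import pstats

import numpy as np
import xarray as xr

ds = xr.Dataset({"a": (("x", "y"), np.zeros((100, 100)))})

# Profile many small isel calls and print the most expensive functions;
# pure-Python routines like _validate_indexers show up near the top.
cProfile.run("for _ in range(10000): ds.isel(x=0)", "isel.prof")
pstats.Stats("isel.prof").sort_stats("cumulative").print_stats(10)
```
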
469394020 · nbren12 (1386642) · created 2019-03-04T19:45:11Z · updated 2019-03-04T19:45:11Z · CONTRIBUTOR
https://github.com/pydata/xarray/issues/2799#issuecomment-469394020

cc @rabernat

Reactions: +1 × 1


Table schema:

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);