issue_comments

5 rows where issue = 365678022 (DataArray.sel extremely slow), sorted by updated_at descending

426335414 · max-sixty (5635139) · MEMBER · created 2018-10-02T16:15:00Z · updated 2018-10-02T16:15:00Z
https://github.com/pydata/xarray/issues/2452#issuecomment-426335414

Thanks @mschrimpf. Hopefully we can get multi-dimensional groupbys, too.

Reactions: {"total_count": 1, "+1": 1, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0}
426329601 · mschrimpf (5308236) · NONE · created 2018-10-02T15:58:21Z · updated 2018-10-02T15:58:21Z
https://github.com/pydata/xarray/issues/2452#issuecomment-426329601

I posted a manual solution to the multi-dimensional grouping in the stackoverflow thread. Hopefully .sel can be made more efficient though; it's such an everyday method.

Reactions: {"total_count": 2, "+1": 2, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0}
426100334 · mschrimpf (5308236) · NONE · created 2018-10-01T23:47:43Z · updated 2018-10-02T14:29:18Z
https://github.com/pydata/xarray/issues/2452#issuecomment-426100334

Thanks @max-sixty, the checks per call make sense, although I still find 0.5 ms insane for a single-value lookup (python dict indexing takes about a 50th of that time to index every single item in the array).

The reason I'm looking into this is actually multi-dimensional grouping (#2438) which is unfortunately not implemented (the above code is essentially a step towards trying to implement that). Is there a way of vectorizing these calls with that in mind? I.e. apply a method for each group.
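
For concreteness, a minimal sketch of the "apply a method for each group" pattern done manually with per-combination .sel calls; the toy array and the mean reduction are assumptions for illustration, not code from this thread:

```
import numpy as np
import xarray as xr

# Toy array with two labelled dims, mirroring the d.sel(a=..., b=...) calls below.
d = xr.DataArray(
    np.random.rand(3, 4),
    dims=("a", "b"),
    coords={"a": list("abc"), "b": list("wxyz")},
)

# Manual multi-dimensional "groupby": one python-level .sel per (a, b)
# combination, so the per-call overhead is paid once for every group.
out = {
    (a, b): float(d.sel(a=a, b=b).mean())
    for a in d["a"].values
    for b in d["b"].values
}
```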

Reactions: {"total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0}
426106046 · max-sixty (5635139) · MEMBER · created 2018-10-02T00:21:17Z · updated 2018-10-02T00:21:17Z
https://github.com/pydata/xarray/issues/2452#issuecomment-426106046

> Is there a way of vectorizing these calls with that in mind? I.e. apply a method for each group.

I can't think of anything immediately, and doubt there's an easy way given it doesn't exist yet (though that logic can be a trap!). There's some hacky pandas reshaping you may be able to do to solve this as a one-off. Otherwise it likely requires a concerted effort with numbagg.

I occasionally hit this issue too, so I'm as keen as you are to find a solution. Thanks for giving it a try.
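
To make the "hacky pandas reshaping" route concrete, a rough sketch of a one-off two-key groupby by round-tripping through pandas; the toy data and the mean aggregation are assumptions, not something proposed in the thread:

```
import numpy as np
import xarray as xr

# Toy data: one sample dimension with two non-dimension coordinates to group on.
da = xr.DataArray(
    np.random.rand(500),
    dims="points",
    coords={
        "points": np.arange(500),
        "a": ("points", np.random.choice(list("xyz"), 500)),
        "b": ("points", np.random.choice(list("uv"), 500)),
    },
    name="value",
)

# Reshape to a DataFrame, group on both coordinates at once, then return to xarray.
df = da.to_dataframe().reset_index()
result = df.groupby(["a", "b"])["value"].mean().to_xarray()  # dims: ('a', 'b')
```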

Reactions: {"total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0}
426096521 · max-sixty (5635139) · MEMBER · created 2018-10-01T23:25:01Z · updated 2018-10-01T23:25:01Z
https://github.com/pydata/xarray/issues/2452#issuecomment-426096521

Thanks for the issue @mschrimpf

.sel is slow per operation, mainly because it's a python function call (although that's not the only reason; it's also doing a set of checks, potential alignments, etc.). When I say 'slow', I mean about 0.5 ms:

In [6]: %timeit d.sel(a='a', b='a')
1000 loops, best of 3: 522 µs per loop

While there's an overhead, the time is fairly consistent regardless of the number of items it's selecting. For example:

In [11]: %timeit d.sel(a=d['a'], b=d['b'])
1000 loops, best of 3: 1 ms per loop

So, as is often the case in the pandas / python ecosystem, if you can write code in a vectorized way, without using python in the tight loops, it's fast. If you need to run python in each loop, it's much slower.

Does that resonate?
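
A small sketch of that contrast; the toy array and the 1000 random lookups are assumptions for illustration, not part of the original comment:

```
import numpy as np
import xarray as xr

letters = [chr(c) for c in range(ord("a"), ord("z") + 1)]
d = xr.DataArray(
    np.random.rand(26, 26),
    dims=("a", "b"),
    coords={"a": letters, "b": letters},
)

# 1000 label pairs to look up.
labels_a = np.random.choice(letters, 1000)
labels_b = np.random.choice(letters, 1000)

# Python in the tight loop: pays the per-call .sel overhead 1000 times.
looped = [d.sel(a=la, b=lb).item() for la, lb in zip(labels_a, labels_b)]

# Vectorized: a single .sel with DataArray indexers, analogous to
# d.sel(a=d['a'], b=d['b']) in the timing above; the overhead is paid once.
vectorized = d.sel(
    a=xr.DataArray(labels_a, dims="points"),
    b=xr.DataArray(labels_b, dims="points"),
)
```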


While I think it's not the main point here, there might be some optimizations to be made to sel. It runs isinstance 144 times! And initializes a collection 13 times? Here's the %prun of the 0.5 ms command:

```
1077 function calls (1066 primitive calls) in 0.002 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        6    0.000    0.000    0.000    0.000 coordinates.py:169(<genexpr>)
       13    0.000    0.000    0.000    0.000 collections.py:50(__init__)
       14    0.000    0.000    0.000    0.000 _abcoll.py:548(update)
       33    0.000    0.000    0.000    0.000 _weakrefset.py:70(__contains__)
        2    0.000    0.000    0.001    0.000 dataset.py:881(_construct_dataarray)
      144    0.000    0.000    0.000    0.000 {isinstance}
        1    0.000    0.000    0.001    0.001 dataset.py:1496(isel)
       18    0.000    0.000    0.000    0.000 {numpy.core.multiarray.array}
        3    0.000    0.000    0.000    0.000 dataset.py:92(calculate_dimensions)
       13    0.000    0.000    0.000    0.000 abc.py:128(__instancecheck__)
       36    0.000    0.000    0.000    0.000 common.py:183(__setattr__)
        2    0.000    0.000    0.000    0.000 coordinates.py:167(variables)
        2    0.000    0.000    0.000    0.000 {method 'get_loc' of 'pandas._libs.index.IndexEngine' objects}
       26    0.000    0.000    0.000    0.000 variable.py:271(shape)
       65    0.000    0.000    0.000    0.000 collections.py:90(__iter__)
        5    0.000    0.000    0.000    0.000 variable.py:136(as_compatible_data)
        3    0.000    0.000    0.000    0.000 dataarray.py:165(__init__)
        2    0.000    0.000    0.000    0.000 indexing.py:1255(__getitem__)
        3    0.000    0.000    0.000    0.000 variable.py:880(isel)
       14    0.000    0.000    0.000    0.000 collections.py:71(__setitem__)
        1    0.000    0.000    0.000    0.000 dataset.py:1414(_validate_indexers)
        6    0.000    0.000    0.000    0.000 coordinates.py:38(__iter__)
        3    0.000    0.000    0.000    0.000 variable.py:433(_broadcast_indexes)
        2    0.000    0.000    0.000    0.000 variable.py:1826(to_index)
        3    0.000    0.000    0.000    0.000 dataset.py:636(_construct_direct)
        2    0.000    0.000    0.000    0.000 indexing.py:122(convert_label_indexer)
       15    0.000    0.000    0.000    0.000 utils.py:306(__init__)
        3    0.000    0.000    0.000    0.000 indexing.py:17(expanded_indexer)
       28    0.000    0.000    0.000    0.000 collections.py:138(iteritems)
        1    0.000    0.000    0.001    0.001 indexing.py:226(remap_label_indexers)
       15    0.000    0.000    0.000    0.000 numeric.py:424(asarray)
        1    0.000    0.000    0.001    0.001 indexing.py:193(get_dim_indexers)
    80/70    0.000    0.000    0.000    0.000 {len}
```
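
The same profile can be reproduced outside IPython with the standard-library profiler; a minimal sketch, assuming a toy DataArray with string coordinates a and b (not the data from the issue):

```
import cProfile
import pstats

import numpy as np
import xarray as xr

d = xr.DataArray(
    np.random.rand(3, 3),
    dims=("a", "b"),
    coords={"a": list("abc"), "b": list("abc")},
)

profiler = cProfile.Profile()
profiler.enable()
d.sel(a="a", b="a")  # the ~0.5 ms call being profiled
profiler.disable()

# Sort by internal time, as %prun does, and show the hottest entries.
pstats.Stats(profiler).sort_stats("tottime").print_stats(10)
```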
Reactions: {"total_count": 1, "+1": 1, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0}

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);
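
The row listing above corresponds to a simple query against this schema. A minimal sketch of running it locally with the standard library, assuming a SQLite copy of the database saved as github.db (the filename is hypothetical):

```
import sqlite3

conn = sqlite3.connect("github.db")
conn.row_factory = sqlite3.Row

# The page's query: comments on issue 365678022, newest update first.
rows = conn.execute(
    """
    SELECT id, user, created_at, updated_at, author_association, body
    FROM issue_comments
    WHERE issue = ?
    ORDER BY updated_at DESC
    """,
    (365678022,),
).fetchall()

for row in rows:
    print(row["id"], row["author_association"], row["updated_at"])
```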