
issue_comments


2 rows where issue = 341355638 and user = 6213168 sorted by updated_at descending

Comment 405743746
  html_url: https://github.com/pydata/xarray/issues/2289#issuecomment-405743746
  issue_url: https://api.github.com/repos/pydata/xarray/issues/2289
  node_id: MDEyOklzc3VlQ29tbWVudDQwNTc0Mzc0Ng==
  user: crusaderky (6213168)
  created_at: 2018-07-17T22:05:29Z
  updated_at: 2018-07-17T22:05:29Z
  author_association: MEMBER
  reactions: none
  issue: DataArray.to_csv() (341355638)

The thing is, I don't know whether dask.dataframe's performance is fixable without drastically changing its design. Also, while I think dask.array is an amazing building block for xarray, dask.dataframe does feel quite redundant to me...

Comment 405402029
  html_url: https://github.com/pydata/xarray/issues/2289#issuecomment-405402029
  issue_url: https://api.github.com/repos/pydata/xarray/issues/2289
  node_id: MDEyOklzc3VlQ29tbWVudDQwNTQwMjAyOQ==
  user: crusaderky (6213168)
  created_at: 2018-07-16T22:33:34Z
  updated_at: 2018-07-16T22:33:34Z
  author_association: MEMBER
  reactions: none
  issue: DataArray.to_csv() (341355638)

I assume you mean report.to_dataset('columns').to_dask_dataframe().to_csv(...)?

There are several problems with that:

1. It doesn't support a MultiIndex on the first dimension, which I need. It could be worked around, but only at the cost of a lot of ugly hacking.
2. It doesn't support writing to a single file, which means I'd need to manually reassemble the file afterwards; that translates to more code and either extra I/O ops or RAM sacrificed to /dev/shm.
3. From my benchmarks, it's 12 to 20 times slower than my implementation. I did not analyse it and I'm completely unfamiliar with dask.dataframe, so I'm not sure where the bottleneck is, but the fact that it doesn't fork into subprocesses (while pandas.DataFrame.to_csv() does not release the GIL) makes me suspicious.

benchmarks: https://gist.github.com/crusaderky/89819258ff960d06136d45526f7d05db

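For reference, the dask.dataframe route discussed in the comment above would look roughly like the following sketch. This is an illustration only, not the implementation or data that was benchmarked: the DataArray shape, names, and chunking below are made up.

import numpy as np
import xarray as xr
import dask

# A made-up 2D DataArray: one row per 'sample', one output column per
# label along 'columns'.
arr = xr.DataArray(
    np.random.rand(1_000_000, 4),
    dims=['sample', 'columns'],
    coords={'columns': ['a', 'b', 'c', 'd']},
).chunk({'sample': 100_000})

# Promote the 'columns' dimension to Dataset variables, then convert to
# a dask DataFrame. dask.dataframe's to_csv() writes one file per
# partition ('out-0.csv', 'out-1.csv', ...), so the pieces have to be
# reassembled afterwards to obtain a single CSV (problem 2 above).
ddf = arr.to_dataset('columns').to_dask_dataframe()

# pandas.DataFrame.to_csv() holds the GIL, so the default threaded
# scheduler gains little here; a process-based scheduler can be forced:
with dask.config.set(scheduler='processes'):
    ddf.to_csv('out-*.csv', index=False)

Later dask releases added a single_file=True option to to_csv(), which addresses the single-file limitation at the cost of writing the partitions sequentially.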


CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);
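
The rows shown on this page can also be pulled straight from the underlying SQLite database. A minimal sketch, assuming the Datasette instance is backed by a local file; the filename github.db is an assumption:

import sqlite3

# Hypothetical filename for the SQLite database behind this page.
conn = sqlite3.connect('github.db')

# The same filter this page applies: comments by user 6213168 on
# issue 341355638, newest first.
rows = conn.execute(
    """
    SELECT id, created_at, author_association, body
    FROM issue_comments
    WHERE issue = 341355638 AND [user] = 6213168
    ORDER BY updated_at DESC
    """
).fetchall()

for comment_id, created_at, association, body in rows:
    print(comment_id, created_at, association)
    print(body)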