issue_comments

Rows where issue = 341355638 and user = 1217238, sorted by updated_at descending

4 comments · user: shoyer · issue: DataArray.to_csv() · author_association: MEMBER

shoyer (MEMBER) · 2018-07-17T23:01:10Z · https://github.com/pydata/xarray/issues/2289#issuecomment-405755212

I would also be very happy to reference xarray_extras specifically (even including an example) for parallel CSV export in the relevant section of our docs, which could be renamed "CSV and other tabular formats".

shoyer (MEMBER) · 2018-07-17T22:34:48Z · https://github.com/pydata/xarray/issues/2289#issuecomment-405750184

> Thing is, I don't know if performance on dask.dataframe is fixable without drastically changing its design.

I suppose we could at least ask?

> Also while I think dask.array is an amazing building block of xarray, dask.dataframe does feel quite redundant to me...

I agree somewhat, but I hope you also understand my reluctance to grow CSV export and distributed computing logic directly in xarray :). Distributed CSV writing is very clearly in scope for dask.dataframe.

If we can push this core logic into dask somewhere, I would welcome a thin to_csv() method in xarray that simply calls the underlying dask method.
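
A thin wrapper along those lines might look like the following sketch. This is not xarray API: the function, the column handling via the last dimension, and the 2D restriction are all assumptions made here for illustration.

import xarray as xr

def to_csv(arr: xr.DataArray, filename: str, **kwargs):
    """Hypothetical thin wrapper; no such method exists in xarray.

    Assumes a 2D array whose last dimension has a coordinate that
    supplies the column labels.
    """
    # One data variable per label along the last dimension.
    ds = arr.to_dataset(dim=arr.dims[-1])
    # With a single dimension left, set_index=True yields a plain
    # (non-multi) index, which dask DataFrames can represent.
    ddf = ds.to_dask_dataframe(set_index=True)
    # Delegate the actual, possibly parallel, writing to dask.
    return ddf.to_csv(filename, **kwargs)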

Reactions: +1 × 1
shoyer (MEMBER) · 2018-07-17T21:53:37Z · https://github.com/pydata/xarray/issues/2289#issuecomment-405740643

> I assume you mean report.to_dataset('columns').to_dask_dataframe().to_csv(...)?

Yes, something like this :).

> it doesn't support a MultiIndex on the first dimension, which I need. It could be worked around but only at the cost of a lot of ugly hacking.

By default (if set_index=False), xarray will put variables in separate columns rather than a MultiIndex when converting into a dask dataframe. So this should work fine for exporting to CSV. I'm pretty sure you don't actually need a MultiIndex on each CSV chunk, since you could just pass index=False to to_csv() instead.

We could also potentially add a dask equivalent to the DataArray.to_pandas() method, which would preserve the dimensionality of the argument (e.g., a 2D DataArray directly to a 2D dask DataFrame).
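
As a sketch of that default path (the array and its names below are made up for illustration; set_index=False is the actual default of Dataset.to_dask_dataframe()):

import numpy as np
import xarray as xr

# Illustrative 2D array with a labelled 'columns' dimension.
report = xr.DataArray(
    np.arange(6.0).reshape(3, 2),
    dims=("row", "columns"),
    coords={"row": ["a", "b", "c"], "columns": ["x", "y"]},
)

# With the default set_index=False, the 'row' coordinate becomes an
# ordinary column rather than a (multi-)index.
ddf = report.to_dataset(dim="columns").to_dask_dataframe(set_index=False)

# index=False skips the synthetic integer index when writing, so no
# MultiIndex is needed on any CSV chunk.
ddf.to_csv("report-*.csv", index=False)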

> 1. it doesn't support writing to a single file, which means I'd need to manually reassemble the file afterwards, which translates to both more code and either I/O ops or RAM sacrificed to /dev/shm.
> 2. from my benchmarks, it's 12 to 20 times slower than my implementation. I did not analyse it and I'm completely unfamiliar with dask.dataframe, so I'm not sure where the bottleneck is, but the fact that it doesn't fork into subprocesses (while pandas.DataFrame.to_csv() does not release the GIL) makes me suspicious.

Both of these look like improvements that would be welcome in dask.dataframe, and would benefit far more users there than downstream in xarray.

I have been intentionally trying to push more complex code related to distributed computing (e.g., queues and subprocesses) upstream to dask. So far, we have avoided all uses of explicit task graphs in xarray, and have only used dask.delayed in a few places.
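
For point 1 above, the manual reassembly being described could look something like this sketch; the helper is hypothetical and not part of xarray or dask:

import dask
import dask.dataframe as dd

def write_single_csv(ddf: dd.DataFrame, path: str) -> None:
    """Hypothetical helper: write each partition to its own CSV in
    parallel, then stitch the pieces into a single file afterwards."""
    parts = ddf.to_delayed()  # one delayed pandas DataFrame per partition
    tasks = [
        # Only the first piece keeps the header row.
        part.to_csv(f"{path}.part{i}", index=False, header=(i == 0))
        for i, part in enumerate(parts)
    ]
    dask.compute(*tasks)  # write all pieces in parallel
    # The extra I/O pass that the quoted comment refers to:
    with open(path, "wb") as out:
        for i in range(len(tasks)):
            with open(f"{path}.part{i}", "rb") as piece:
                out.write(piece.read())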

shoyer (MEMBER) · 2018-07-16T04:28:31Z · https://github.com/pydata/xarray/issues/2289#issuecomment-405146697

Interesting. Would it be equivalent to export to a dask dataframe and write that to CSVs, e.g., xarray.concat(reports, dim='col').to_dask_dataframe().to_csv(...)? Or is there some reason why that would be slower/less efficient?
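
Spelled out with made-up data (the reports list and the 'col' labels below are illustrative, not from the issue):

import numpy as np
import xarray as xr

# Illustrative stand-ins for the per-column reports.
reports = [xr.DataArray(np.arange(4.0), dims="row") for _ in range(3)]

# Stack along a new 'col' dimension and label it so that to_dataset()
# can split the columns back out as named variables.
combined = xr.concat(reports, dim="col").assign_coords(col=["a", "b", "c"])

# Hand off to dask.dataframe for the actual CSV writing (by default,
# one output file per partition).
ddf = combined.to_dataset(dim="col").to_dask_dataframe()
ddf.to_csv("reports-*.csv", index=False)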

Table schema

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);
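
For reference, the filter shown at the top of this page corresponds to a query along these lines, sketched here with Python's sqlite3 and a hypothetical database path:

import sqlite3

# "github.db" is a placeholder; substitute the actual database file.
conn = sqlite3.connect("github.db")
rows = conn.execute(
    'select id, created_at, body from issue_comments '
    'where issue = ? and "user" = ? order by updated_at desc',
    (341355638, 1217238),
).fetchall()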