

issue_comments: 405740643


html_url issue_url id node_id user created_at updated_at author_association body reactions performed_via_github_app issue
https://github.com/pydata/xarray/issues/2289#issuecomment-405740643 https://api.github.com/repos/pydata/xarray/issues/2289 405740643 MDEyOklzc3VlQ29tbWVudDQwNTc0MDY0Mw== 1217238 2018-07-17T21:53:37Z 2018-07-17T21:53:37Z MEMBER

> I assume you mean report.to_dataset('columns').to_dask_dataframe().to_csv(...)?

Yes, something like this :).

> it doesn't support a MultiIndex on the first dimension, which I need. It could be worked around but only at the cost of a lot of ugly hacking.

By default (if set_index=False), xarray will put variables in separate columns rather than a MultiIndex when converting into a dask dataframe. So this should work fine for exporting to CSV. I'm pretty sure you don't actually need a MultiIndex on each CSV chunk, since you could just pass index=False in to_csv() instead.

We could also potentially add a dask equivalent to the DataArray.to_pandas() method, which would preserve the dimensionality of the argument (e.g., 2D DataArray directly to a 2D dask DataFrame).

> 1. it doesn't support writing to a single file, which means I'd need to manually reassemble the file afterwards, which translates to both more code and either I/O ops or RAM sacrificed to /dev/shm.
> 2. from my benchmarks, it's 12 to 20 times slower than my implementation. I did not analyse it and I'm completely unfamiliar with dask.dataframe, so I'm not sure where the bottleneck is, but the fact that it doesn't fork into subprocesses (while pandas.DataFrame.to_csv() does not release the GIL) makes me suspicious.

Both of these look like improvements that would be welcome in dask.dataframe, and benefit far more users there than downstream in xarray.

I have been intentionally trying to push more complex code related to distributed computing (e.g., queues and subprocesses) upstream to dask. So far, we have avoided all uses of explicit task graphs in xarray, and have only used dask.delayed in a few places.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  341355638