**Comments on [pydata/xarray#2289](https://github.com/pydata/xarray/issues/2289)**

**user 1217238** (MEMBER) · 2018-07-16T04:28:31Z · [comment](https://github.com/pydata/xarray/issues/2289#issuecomment-405146697)

Interesting. Would it be equivalent to export to a dask dataframe and write that to CSVs, e.g., `xarray.concat(reports, dim='col').to_dask_dataframe().to_csv(...)`? Or is there some reason why that would be slower/less efficient?

**user 6213168** (MEMBER) · 2018-07-16T22:33:34Z · [comment](https://github.com/pydata/xarray/issues/2289#issuecomment-405402029)

I assume you mean `report.to_dataset('columns').to_dask_dataframe().to_csv(...)`?

There are several problems with that:

1. it doesn't support a MultiIndex on the first dimension, which I need. It *could* be worked around, but only at the cost of a lot of ugly hacking.
2. it doesn't support writing to a single file, which means I'd need to manually reassemble the file afterwards, which translates to both more code and either I/O ops or RAM sacrificed to /dev/shm.
3. from my benchmarks, it's *12 to 20 times slower* than my implementation. I did not analyse it and I'm completely unfamiliar with `dask.dataframe`, so I'm not sure where the bottleneck is, but the fact that it doesn't fork into subprocesses (while `pandas.DataFrame.to_csv()` does not release the GIL) makes me suspicious.

Benchmarks: https://gist.github.com/crusaderky/89819258ff960d06136d45526f7d05db

**user 1217238** (MEMBER) · 2018-07-17T21:53:37Z · [comment](https://github.com/pydata/xarray/issues/2289#issuecomment-405740643)

> I assume you mean `report.to_dataset('columns').to_dask_dataframe().to_csv(...)`?

Yes, something like this :).

> it doesn't support a MultiIndex on the first dimension, which I need. It could be worked around but only at the cost of a lot of ugly hacking.

By default (if `set_index=False`), xarray will put variables in separate columns rather than a MultiIndex when converting into a dask dataframe. So this should work fine for exporting to CSV. I'm pretty sure you don't actually need a MultiIndex on each CSV chunk, since you could just pass `index=False` to `to_csv()` instead.

We could also potentially add a dask equivalent to the `DataArray.to_pandas()` method, which would preserve the dimensionality of the argument (e.g., a 2D DataArray directly to a 2D dask DataFrame).

> 2. it doesn't support writing to a single file, which means I'd need to manually reassemble the file afterwards, which translates to both more code and either I/O ops or RAM sacrificed to /dev/shm.
> 3. from my benchmarks, it's 12 to 20 times slower than my implementation. I did not analyse it and I'm completely unfamiliar with dask.dataframe, so I'm not sure where the bottleneck is, but the fact that it doesn't fork into subprocesses (while pandas.DataFrame.to_csv() does not release the GIL) makes me suspicious.

Both of these look like improvements that would be welcome in dask.dataframe, and would benefit far more users there than downstream in xarray. I have been intentionally trying to push more complex code related to distributed computing (e.g., queues and subprocesses) upstream to dask. So far, we have avoided all uses of explicit task graphs in xarray, and have only used dask.delayed in a few places.
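
A minimal sketch of the route discussed in the last two comments, assuming an illustrative 2-D, dask-backed DataArray named `report` with dimensions `('rows', 'columns')`; the data, chunk size, and file pattern are made up for the example and are not from the thread:

```python
import numpy as np
import xarray as xr

# Illustrative stand-in for the real `report` array from the discussion above.
report = xr.DataArray(
    np.random.rand(100_000, 3),
    dims=('rows', 'columns'),
    coords={'columns': ['price', 'delta', 'gamma']},
).chunk({'rows': 10_000})

# Split 'columns' into separate Dataset variables, convert to a dask DataFrame
# without building a (Multi)Index, and write one CSV file per dask partition.
ddf = report.to_dataset(dim='columns').to_dask_dataframe(set_index=False)
ddf.to_csv('report-*.csv', index=False)  # writes report-0.csv, report-1.csv, ...
```

Newer dask releases also accept `single_file=True` in `to_csv()`, which addresses point 2 above, though the partitions are then appended to a single file in order.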

**user 6213168** (MEMBER) · 2018-07-17T22:05:29Z · [comment](https://github.com/pydata/xarray/issues/2289#issuecomment-405743746)

Thing is, I don't know if performance on dask.dataframe is fixable without drastically changing its design.
Also, while I think dask.array is an amazing building block of xarray, dask.dataframe does feel quite redundant to me...

**user 1217238** (MEMBER) · 2018-07-17T22:34:48Z · [comment](https://github.com/pydata/xarray/issues/2289#issuecomment-405750184)

> Thing is, I don't know if performance on dask.dataframe is fixable without drastically changing its design.

I suppose we could at least ask?

> Also while I think dask.array is an amazing building block of xarray, dask.dataframe does feel quite redundant to me...

I agree somewhat, but I hope you also understand my reluctance to grow CSV export and distributed computing logic directly in xarray :). Distributed CSV writing is very clearly in scope for dask.dataframe. If we can push this core logic into dask somewhere, I would welcome a thin `to_csv()` method in xarray that simply calls the underlying dask method.

**user 1217238** (MEMBER) · 2018-07-17T23:01:10Z · [comment](https://github.com/pydata/xarray/issues/2289#issuecomment-405755212)

I would also be very happy to reference xarray_extras specifically (even including an example) for parallel CSV export in the [relevant section](http://xarray.pydata.org/en/stable/io.html#formats-supported-by-pandas) of our docs, which could be renamed "CSV and other tabular formats".
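
A hypothetical sketch of the thin `to_csv()` wrapper floated above; the function name, signature, and the choice to forward keyword arguments to dask are assumptions for illustration, not an API that exists in xarray:

```python
import xarray as xr

def to_csv(obj, filename, **kwargs):
    """Hypothetical thin wrapper: delegate distributed CSV writing to dask.dataframe."""
    if isinstance(obj, xr.DataArray):
        # Go through a single-variable Dataset so the data gets a column name.
        obj = obj.to_dataset(name=obj.name or 'values')
    ddf = obj.to_dask_dataframe(set_index=False)
    # Remaining keyword arguments (e.g. single_file, compression) pass through to dask.
    return ddf.to_csv(filename, index=False, **kwargs)

# Hypothetical usage: to_csv(ds, 'out-*.csv')
```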