issue_comments
6 comments on issue 341355638 (DataArray.to_csv()), sorted by updated_at descending
405755212 · shoyer (MEMBER) · 2018-07-17T23:01:10Z
https://github.com/pydata/xarray/issues/2289#issuecomment-405755212

I would also be very happy to reference xarray_extras specifically (even including an example) for parallel CSV export in the relevant section of our docs, which could be renamed "CSV and other tabular formats".

Reactions: none
405750184 · shoyer (MEMBER) · 2018-07-17T22:34:48Z
https://github.com/pydata/xarray/issues/2289#issuecomment-405750184

> I suppose we could at least ask?

I agree somewhat, but I hope you also understand my reluctance to grow CSV export and distributed computing logic directly in xarray :). Distributed CSV writing is very clearly in scope for dask.dataframe. If we can push this core logic into dask somewhere, I would welcome a thin […]

Reactions: +1 × 1
405743746 · crusaderky (MEMBER) · 2018-07-17T22:05:29Z
https://github.com/pydata/xarray/issues/2289#issuecomment-405743746

Thing is, I don't know if performance on dask.dataframe is fixable without drastically changing its design. Also, while I think dask.array is an amazing building block of xarray, dask.dataframe does feel quite redundant to me...

Reactions: none
405740643 · shoyer (MEMBER) · 2018-07-17T21:53:37Z
https://github.com/pydata/xarray/issues/2289#issuecomment-405740643

Yes, something like this :).

By default (if […]) […]. We could also potentially add a dask equivalent to the […].

Both of these look like improvements that would be welcome in dask.dataframe, and would benefit far more users there than downstream in xarray. I have been intentionally trying to push more complex code related to distributed computing (e.g., queues and subprocesses) upstream to dask. So far, we have avoided all uses of explicit task graphs in xarray, and have only used dask.delayed in a few places.

Reactions: none
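As an illustration of the dask.delayed pattern mentioned above, the following is a minimal sketch of parallel per-partition CSV writing followed by hand-rolled single-file assembly. The `write_part` helper, file names, and partitioning are all hypothetical, not anything dask or xarray provides:

```python
import dask
import numpy as np
import pandas as pd

def write_part(df: pd.DataFrame, path: str, header: bool) -> str:
    # Serialize one partition; only the first partition writes the header.
    df.to_csv(path, header=header, index=False)
    return path

# Hypothetical partitions of a larger table
parts = [pd.DataFrame({"v": np.arange(i * 3, i * 3 + 3)}) for i in range(3)]

# Write partitions in parallel via dask.delayed, ...
delayed_paths = [
    dask.delayed(write_part)(df, f"part-{i}.csv", header=(i == 0))
    for i, df in enumerate(parts)
]
paths = dask.compute(*delayed_paths)

# ... then reassemble them serially into a single CSV.
with open("single.csv", "wb") as out:
    for p in paths:
        with open(p, "rb") as f:
            out.write(f.read())
```

The serial reassembly step is exactly the extra I/O cost crusaderky objects to below when no single-file mode is available.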
405402029 · crusaderky (MEMBER) · 2018-07-16T22:33:34Z
https://github.com/pydata/xarray/issues/2289#issuecomment-405402029

I assume you mean […]. There are several problems with that:

1. it doesn't support a MultiIndex on the first dimension, which I need. It could be worked around, but only at the cost of a lot of ugly hacking.
2. it doesn't support writing to a single file, which means I'd need to manually reassemble the file afterwards; that translates to both more code and either extra I/O ops or RAM sacrificed to /dev/shm.
3. from my benchmarks, it's 12 to 20 times slower than my implementation. I did not analyse it further; benchmarks: https://gist.github.com/crusaderky/89819258ff960d06136d45526f7d05db

Reactions: none
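For reference, problem 1 above concerns output like the following, which plain pandas produces directly from a MultiIndex (illustrative data, not crusaderky's actual schema):

```python
import pandas as pd

# A MultiIndex on the row dimension, e.g. (simulation, timestep)
idx = pd.MultiIndex.from_product([["a", "b"], [1, 2]], names=["sim", "t"])
df = pd.DataFrame({"value": [0.1, 0.2, 0.3, 0.4]}, index=idx)

# pandas writes the index levels as leading columns
df.to_csv("multi.csv")
```

dask.dataframe has no equivalent, since it does not support MultiIndex indexes on its partitions.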
405146697 · shoyer (MEMBER) · 2018-07-16T04:28:31Z
https://github.com/pydata/xarray/issues/2289#issuecomment-405146697

Interesting. Would it be equivalent to export to a dask dataframe and write that to CSVs, e.g., […]?

Reactions: none
CREATE TABLE [issue_comments] (
    [html_url] TEXT,
    [issue_url] TEXT,
    [id] INTEGER PRIMARY KEY,
    [node_id] TEXT,
    [user] INTEGER REFERENCES [users]([id]),
    [created_at] TEXT,
    [updated_at] TEXT,
    [author_association] TEXT,
    [body] TEXT,
    [reactions] TEXT,
    [performed_via_github_app] TEXT,
    [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue] ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user] ON [issue_comments] ([user]);