issues: 341355638

id: 341355638
node_id: MDU6SXNzdWUzNDEzNTU2Mzg=
number: 2289
title: DataArray.to_csv()
user: 6213168
state: closed
locked: 0
comments: 6
created_at: 2018-07-15T21:56:20Z
updated_at: 2019-03-12T15:01:18Z
closed_at: 2019-03-12T15:01:18Z
author_association: MEMBER
reactions: 1 (+1: 1)
state_reason: completed
repo: 13221727
type: issue

I'm using xarray to aggregate 38 GB worth of NetCDF data into a bunch of CSV reports. I have two problems:

  1. The reports are 500,000 rows by 2,000 columns. Before somebody says "if you're using CSV for this size of data you're doing it wrong" - yes, I know, but it was the only way to make the data accessible to a bunch of people who only know how to use Excel and VBA. :tired_face: The sheer size of the reports means that (1) it's unsavory to keep the whole thing in RAM and (2) pandas' to_csv will take ages to complete, as it's single-threaded. The slowness is compounded by the fact that I have to compress everything with gzip.
  2. I have to produce up to 40 reports from the exact same NetCDF files. I use dask to perform the computation, and different reports share a large number of intermediate graph nodes, so I need to do everything in a single invocation of dask.compute() to let the dask scheduler de-duplicate those nodes (see the sketch right after this list).
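
To make point 2 concrete, here is a tiny self-contained illustration (the function names are made up): when two delayed graphs share a node, a single dask.compute() call merges the graphs and evaluates that node once, while separate .compute() calls evaluate it once per graph.

```
import dask

@dask.delayed
def expensive_shared():
    # Stands in for the intermediate nodes shared between reports
    print("computed")
    return 42

@dask.delayed
def report_a(x):
    return x + 1

@dask.delayed
def report_b(x):
    return x * 2

shared = expensive_shared()
a = report_a(shared)
b = report_b(shared)

# Single invocation: the graphs are merged, so the shared node runs once.
print(dask.compute(a, b))   # prints "computed" once, then (43, 84)

# Separate invocations: each graph is evaluated independently, so the
# shared node runs again for each call.
print(a.compute())          # prints "computed" again
print(b.compute())          # ...and again
```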

To solve both problems, I wrote a new function: http://xarray-extras.readthedocs.io/en/latest/api/csv.html
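
For reference, the core idea is roughly the following. This is only a minimal sketch, not the actual xarray_extras.csv.to_csv implementation; to_csv_delayed and block_to_csv are hypothetical names, and it assumes a 2-D DataArray chunked only along the row dimension, with coordinate labels on both dimensions. Each row-chunk is serialized (and, because gzip streams can be concatenated, compressed) in its own dask task, and a final task concatenates the pieces into one file.

```
import gzip

import dask
import numpy as np
import pandas as pd


def to_csv_delayed(darray, path, compression=None):
    """Return a dask.delayed object that, when computed, writes the
    2-D dask-backed ``darray`` to ``path`` as CSV. Sketch only."""
    row_dim, col_dim = darray.dims
    rows = darray[row_dim].values
    columns = darray[col_dim].values
    # One delayed numpy block per row-chunk (chunked along rows only)
    blocks = darray.data.to_delayed().ravel()
    starts = np.cumsum((0,) + darray.chunks[0])

    @dask.delayed
    def block_to_csv(block, start, header):
        # Serialize one chunk of rows; only the first chunk gets a header.
        df = pd.DataFrame(block, index=rows[start:start + len(block)],
                          columns=columns)
        payload = df.to_csv(header=header).encode()
        # gzip members concatenate into a valid stream, so each chunk
        # can be compressed in parallel inside its own task.
        return gzip.compress(payload) if compression == 'gzip' else payload

    pieces = [block_to_csv(block, start, header=(i == 0))
              for i, (block, start) in enumerate(zip(blocks, starts))]

    @dask.delayed
    def write(chunks):
        # Concatenate the per-chunk fragments into a single file.
        with open(path, 'wb') as fh:
            fh.writelines(chunks)

    return write(pieces)
```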

And now my high level wrapper code looks like this:

```
# DataSet from 200 .nc files, with a total of 500000 points on the 'row' dimension
nc = xarray.open_mfdataset('inputs.*.nc')

reports = [
    # DataArrays with shape (500000, 2000), with the rows split in 200 chunks
    gen_report0(nc),
    gen_report1(nc),
    ....
    gen_report39(nc),
]

futures = [
    # dask.delayed objects
    to_csv(reports[0], 'report0.csv.gz', compression='gzip'),
    to_csv(reports[1], 'report1.csv.gz', compression='gzip'),
    ....
    to_csv(reports[39], 'report39.csv.gz', compression='gzip'),
]

dask.compute(futures)
```

The function is currently production quality in xarray-extras, but it would be very easy to refactor it as a method of xarray.DataArray in the main library.

Opinions?


Links from other tables

  • 1 row from issues_id in issues_labels
  • 6 rows from issue in issue_comments