issues: 99026442

id: 99026442
node_id: MDU6SXNzdWU5OTAyNjQ0Mg==
number: 516
title: Wall time much greater than CPU time
user: 3688009
state: closed
locked: 0
comments: 5
created_at: 2015-08-04T18:10:50Z
updated_at: 2016-12-29T01:09:54Z
closed_at: 2016-12-29T01:09:54Z
author_association: NONE

I have a very large data set. It consists of multiple files, and each file is internally compressed. I have it all downloaded locally onto my external hard drive. The data is so large that this is the only place I can currently hold it, and I think this has some effect, but I am not sure if it is the entire effect. It is a USB 3 external drive, though, which helps. Not ideal either way.

I load my data with

xray.open_mfdataset(filename, chunks={'time':365, 'lat':180, 'lon':360})

and a simple selection along one of the dimensions takes:

%time data.tasmax[:, 360, 720].values
CPU times: user 31.4 s, sys: 4.07 s, total: 35.4 s
Wall time: 8min 38s

So I decided that maybe the internal compression was really my limiting factor, and I saved my entire data set using

data.to_netcdf('nc_test.nc')

which resulted in a single 268 GB file and took between 12 and 16 hours to run. Now I can load that new dataset using xray.open_dataset instead of open_mfdataset.
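As a rough sanity check on that file size, the sketch below works out the uncompressed size of the array. It rests on two assumptions that the post does not state outright: that the array shape is (34675, 720, 1440), taken from the largest chunk spec used later in this post, and that values are stored as 8-byte floats. It also computes how far apart consecutive time steps of one grid cell would sit on disk if the file were laid out contiguously in (time, lat, lon) order, which is again an assumption.

```python
# Assumed, not stated in the post: the array shape (from the largest chunk
# spec used for new4) and 8-byte (float64) values stored uncompressed.
n_time, n_lat, n_lon = 34675, 720, 1440
bytes_per_value = 8

total_bytes = n_time * n_lat * n_lon * bytes_per_value
total_gib = total_bytes / 2**30

# Assuming a contiguous, C-ordered (time, lat, lon) layout, consecutive time
# steps of a single grid cell are one full lat/lon slab apart on disk.
stride_bytes = n_lat * n_lon * bytes_per_value

print(f"uncompressed size: {total_gib:.1f} GiB")
print(f"stride between time steps of one cell: {stride_bytes / 1e6:.1f} MB")
print(f"separate small reads for one cell's full time series: {n_time}")
```

Under these assumptions the size comes to roughly 268 GiB, which is consistent with the reported file. If the layout assumption also holds, a selection like tasmax[:, 360, 720] needs one small read per time step, spread across the whole file, so it would spend most of its time waiting on the disk rather than on the CPU.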

So I tried a few different loads to see whether chunking helped.

```
data = xray.open_mfdataset('/mnt/usb/CANESM2/tasmaxrcp45Can*', chunks={'time':365, 'lat':180, 'lon':360})

new = xray.open_dataset('/mnt/usb/CANESM2/nasa_CANESM2_prediction.nc', chunks={'time':365, 'lat':180, 'lon':360})

new2 = xray.open_dataset('/mnt/usb/CANESM2/nasa_CANESM2_prediction.nc', chunks={'time':365, 'lat':360, 'lon':720})

new3 = xray.open_dataset('/mnt/usb/CANESM2/nasa_CANESM2_prediction.nc', chunks={'time':365, 'lat':720, 'lon':1440})

new4 = xray.open_dataset('/mnt/usb/CANESM2/nasa_CANESM2_prediction.nc', chunks={'time':34675, 'lat':720, 'lon':1440})

new5 = xray.open_mfdataset('/mnt/usb/CANESM2/nasa_CANESM2_prediction.nc', chunks={'time':365, 'lat':180, 'lon':360})
```
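To make the chunk specs above easier to compare, here is a small sketch that counts how many chunks each one produces and how many of them a single-cell time series such as tasmax[:, 360, 720] would touch. The array shape (34675, 720, 1440) is an assumption inferred from the new4 chunk spec, and the helper names are made up for illustration. They are not part of xray.

```python
import math

# Assumed shape, inferred from the chunk spec used for new4.
shape = {"time": 34675, "lat": 720, "lon": 1440}

# Chunk specs copied from the open_dataset calls above.
specs = {
    "new":  {"time": 365,   "lat": 180, "lon": 360},
    "new2": {"time": 365,   "lat": 360, "lon": 720},
    "new3": {"time": 365,   "lat": 720, "lon": 1440},
    "new4": {"time": 34675, "lat": 720, "lon": 1440},
}

def chunks_per_dim(shape, chunks):
    """Number of chunks along each dimension (the last chunk may be partial)."""
    return {dim: math.ceil(shape[dim] / chunks[dim]) for dim in shape}

def total_chunks(shape, chunks):
    """Total number of chunks in the whole array."""
    total = 1
    for count in chunks_per_dim(shape, chunks).values():
        total *= count
    return total

def chunks_touched_by_point_series(shape, chunks):
    """A fixed (lat, lon) cell lies in one chunk per step along time."""
    return chunks_per_dim(shape, chunks)["time"]

for name, chunks in specs.items():
    print(name, total_chunks(shape, chunks),
          chunks_touched_by_point_series(shape, chunks))
```

Under the assumed shape, the first three specs all split time into the same number of pieces, so a single-cell time series touches the same number of chunks in each. Only new4 changes that, by putting the whole array in one chunk.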

and the resulting times (note that I change the index in each one, to avoid file caching):

```
%time data.tasmax[:, 360, 720].values
CPU times: user 31.4 s, sys: 4.07 s, total: 35.4 s
Wall time: 8min 38s

%time new.tasmax[:, 360, 720].values
CPU times: user 1.53 s, sys: 2.9 s, total: 4.43 s
Wall time: 5min 35s

%time new2.tasmax[:, 362, 721].values
CPU times: user 817 ms, sys: 2.89 s, total: 3.71 s
Wall time: 4min 7s

%time new3.tasmax[:, 361, 720].values
CPU times: user 987 ms, sys: 3.34 s, total: 4.33 s
Wall time: 4min 17s

%time new4.tasmax[:, 360, 720].values
CPU times: user 713 ms, sys: 2.68 s, total: 3.4 s
Wall time: 6min 5s

%time new5.tasmax[:, 361, 720].values
CPU times: user 1.25 s, sys: 2.79 s, total: 4.04 s
Wall time: 5min 2s
```

The wall time is always much greater than the CPU time (user and sys combined). Any insight? I can provide more info on request.
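For reference, the gap between the two numbers can be reproduced outside IPython with the standard library alone. This is a minimal sketch, not what %time does internally. The helper name cpu_and_wall is hypothetical, and time.sleep stands in for a read that mostly waits on the disk.

```python
import time

def cpu_and_wall(fn, *args, **kwargs):
    """Run fn once and return (result, cpu_seconds, wall_seconds)."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()  # user + sys CPU time of this process
    result = fn(*args, **kwargs)
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return result, cpu, wall

# Stand-in for an I/O-bound call: sleeping consumes wall time but almost no CPU.
_, cpu, wall = cpu_and_wall(time.sleep, 0.2)
print(f"CPU {cpu:.3f} s, wall {wall:.3f} s, spent waiting {wall - cpu:.3f} s")
```

Whatever part of the wall time is not accounted for by CPU time is time the process spent waiting, for example on disk reads. That would fit the timings above if the external drive is the bottleneck, though the post alone does not confirm it.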

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/516/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
state_reason: completed
repo: 13221727
type: issue
