issues: 218260909


| field | value |
| --- | --- |
| id | 218260909 |
| node_id | MDU6SXNzdWUyMTgyNjA5MDk= |
| number | 1340 |
| title | round-trip performance with save_mfdataset / open_mfdataset |
| user | 1197350 |
| state | closed |
| locked | 0 |
| comments | 11 |
| created_at | 2017-03-30T16:52:26Z |
| updated_at | 2019-05-01T22:12:06Z |
| closed_at | 2019-05-01T22:12:06Z |
| author_association | MEMBER |
| state_reason | completed |
| repo | 13221727 |
| type | issue |

I have encountered some major performance bottlenecks in trying to write and then read multi-file netCDF datasets.

I start with an xarray dataset created by xgcm with the following repr:

```
<xarray.Dataset>
Dimensions:              (XC: 400, XG: 400, YC: 400, YG: 400, Z: 40, Zl: 40, Zp1: 41, Zu: 40, layer_1TH_bounds: 43, layer_1TH_center: 42, layer_1TH_interface: 41, time: 1566)
Coordinates:
    iter                 (time) int64 8294400 8294976 8295552 8296128 ...
  * time                 (time) int64 8294400 8294976 8295552 8296128 ...
  * XC                   (XC) >f4 2500.0 7500.0 12500.0 17500.0 22500.0 ...
  * YG                   (YG) >f4 0.0 5000.0 10000.0 15000.0 20000.0 25000.0 ...
  * XG                   (XG) >f4 0.0 5000.0 10000.0 15000.0 20000.0 25000.0 ...
  * YC                   (YC) >f4 2500.0 7500.0 12500.0 17500.0 22500.0 ...
  * Zu                   (Zu) >f4 -10.0 -20.0 -30.0 -42.0 -56.0 -72.0 -91.0 ...
  * Zl                   (Zl) >f4 0.0 -10.0 -20.0 -30.0 -42.0 -56.0 -72.0 ...
  * Zp1                  (Zp1) >f4 0.0 -10.0 -20.0 -30.0 -42.0 -56.0 -72.0 ...
  * Z                    (Z) >f4 -5.0 -15.0 -25.0 -36.0 -49.0 -64.0 -81.5 ...
    rAz                  (YG, XG) >f4 2.5e+07 2.5e+07 2.5e+07 2.5e+07 ...
    dyC                  (YG, XC) >f4 5000.0 5000.0 5000.0 5000.0 5000.0 ...
    rAw                  (YC, XG) >f4 2.5e+07 2.5e+07 2.5e+07 2.5e+07 ...
    dxC                  (YC, XG) >f4 5000.0 5000.0 5000.0 5000.0 5000.0 ...
    dxG                  (YG, XC) >f4 5000.0 5000.0 5000.0 5000.0 5000.0 ...
    dyG                  (YC, XG) >f4 5000.0 5000.0 5000.0 5000.0 5000.0 ...
    rAs                  (YG, XC) >f4 2.5e+07 2.5e+07 2.5e+07 2.5e+07 ...
    Depth                (YC, XC) >f4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...
    rA                   (YC, XC) >f4 2.5e+07 2.5e+07 2.5e+07 2.5e+07 ...
    PHrefF               (Zp1) >f4 0.0 98.1 196.2 294.3 412.02 549.36 706.32 ...
    PHrefC               (Z) >f4 49.05 147.15 245.25 353.16 480.69 627.84 ...
    drC                  (Zp1) >f4 5.0 10.0 10.0 11.0 13.0 15.0 17.5 20.5 ...
    drF                  (Z) >f4 10.0 10.0 10.0 12.0 14.0 16.0 19.0 22.0 ...
    hFacC                (Z, YC, XC) >f4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...
    hFacW                (Z, YC, XG) >f4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...
    hFacS                (Z, YG, XC) >f4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...
  * layer_1TH_bounds     (layer_1TH_bounds) >f4 -0.2 0.0 0.2 0.4 0.6 0.8 1.0 ...
  * layer_1TH_interface  (layer_1TH_interface) >f4 0.0 0.2 0.4 0.6 0.8 1.0 ...
  * layer_1TH_center     (layer_1TH_center) float32 -0.1 0.1 0.3 0.5 0.7 0.9 ...
Data variables:
    T                    (time, Z, YC, XC) float32 0.0 0.0 0.0 0.0 0.0 0.0 ...
    U                    (time, Z, YC, XG) float32 0.0 0.0 0.0 0.0 0.0 0.0 ...
    V                    (time, Z, YG, XC) float32 0.0 0.0 0.0 0.0 0.0 0.0 ...
    S                    (time, Z, YC, XC) float32 0.0 0.0 0.0 0.0 0.0 0.0 ...
    Eta                  (time, YC, XC) float32 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...
    W                    (time, Zl, YC, XC) float32 -0.0 -0.0 -0.0 -0.0 -0.0 ...
```

An important point to note is that there are lots of "non-dimension coordinates" corresponding to various parameters of the numerical grid.
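To make the distinction concrete, here is a minimal sketch (with a tiny made-up grid, not the real xgcm output) of a dimension coordinate versus a non-dimension coordinate:

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {"T": (("YC", "XC"), np.zeros((3, 3)))},
    coords={
        "XC": ("XC", [2500.0, 7500.0, 12500.0]),  # dimension coordinate (indexed)
        "YC": ("YC", [2500.0, 7500.0, 12500.0]),  # dimension coordinate (indexed)
        # non-dimension coordinate: 2D grid metadata that rides along with
        # the data but has no index of its own
        "rA": (("YC", "XC"), np.full((3, 3), 2.5e7)),
    },
)
print(ds.coords)  # "rA" prints without the "*" that marks indexed coordinates
```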

I save this dataset to a multi-file netCDF dataset as follows:

```python
iternums, datasets = zip(*ds.groupby('time'))
paths = [outdir + 'xmitgcm_data.%010d.nc' % it for it in iternums]
xr.save_mfdataset(datasets, paths)
```

This takes many hours to run, since it has to read and write all the data. (I think there are some performance issues here too, related to how dask schedules the read / write tasks, but that is probably a separate issue.)
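On the scheduling point, one option is deferring the writes; this is a sketch, and note that the `compute=False` keyword on `save_mfdataset` belongs to a newer xarray API than the version this issue was filed against:

```python
# Deferred variant of the write above: compute=False returns a single
# dask.delayed object covering all files, so the reads and writes are
# handed to the scheduler in one pass rather than file by file.
delayed_writes = xr.save_mfdataset(datasets, paths, compute=False)
delayed_writes.compute()
```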

Then I try to re-load this dataset:

```python
ds_nc = xr.open_mfdataset('xmitgcm_data.*.nc')
```

This raises an error:

```
ValueError: too many different dimensions to concatenate: {'YG', 'Z', 'Zl', 'Zp1', 'layer_1TH_interface', 'YC', 'XC', 'layer_1TH_center', 'Zu', 'layer_1TH_bounds', 'XG'}
```

I need to specify `concat_dim='time'` in order to properly concatenate the data. It seems like this should be unnecessary, since I am reading back data that was just written with xarray, but I understand why: the data variables in each file have dimensions Z, YC, XC only, with no time dimension. Once I do that, it works, but it takes 18 minutes to load the dataset. I assume this is because it has to check the compatibility of all the non-dimension coordinates.
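For reference, the working call is below, along with a variant for later xarray releases; the `combine`, `coords='minimal'`, and `compat='override'` keywords postdate this issue, so treat the second call as a sketch of the newer API rather than something available at the time:

```python
import xarray as xr

# The call that works here, with the concatenation dimension made explicit:
ds_nc = xr.open_mfdataset('xmitgcm_data.*.nc', concat_dim='time')

# Sketch for xarray >= 0.12.3, where concat_dim requires combine='nested'.
# Taking coordinates from the first file instead of comparing them across
# all 1566 files avoids most of the coordinate-compatibility cost:
ds_fast = xr.open_mfdataset(
    'xmitgcm_data.*.nc',
    combine='nested',
    concat_dim='time',
    data_vars='minimal',  # only concatenate variables that have a time dim
    coords='minimal',     # likewise for coordinates
    compat='override',    # skip equality checks; take values from the first file
)
```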

I just thought I would document this, because 18 minutes seems way too long to load a dataset.

