issue_comments: 370064483


html_url: https://github.com/pydata/xarray/issues/1385#issuecomment-370064483
issue_url: https://api.github.com/repos/pydata/xarray/issues/1385
id: 370064483
node_id: MDEyOklzc3VlQ29tbWVudDM3MDA2NDQ4Mw==
user: 1197350
author_association: MEMBER
created_at: 2018-03-02T21:57:26Z
updated_at: 2018-03-02T21:57:26Z
issue: 224553135

An update on this long-standing issue.

I have learned that `open_mfdataset` can be blazingly fast if `decode_cf=False` but extremely slow with `decode_cf=True`.

As an example, I am loading a POP dataset on cheyenne. Anyone with access can try this example.

```python
import os
import xarray as xr

base_dir = '/glade/scratch/rpa/'
prefix = 'BRCP85C5CN_ne120_t12_pop62.c13b17.asdphys.001'
code = 'pop.h.nday1.SST'
glob_pattern = os.path.join(base_dir, prefix, '%s.%s.*.nc' % (prefix, code))

def non_time_coords(ds):
    # variables that do not depend on time (grid metrics, constants, etc.)
    return [v for v in ds.data_vars if 'time' not in ds[v].dims]

def drop_non_essential_vars_pop(ds):
    # drop those variables before concatenation across files
    return ds.drop(non_time_coords(ds))

# this runs almost instantly
ds = xr.open_mfdataset(glob_pattern, decode_times=False, chunks={'time': 1},
                       preprocess=drop_non_essential_vars_pop, decode_cf=False)
```

And returns this:

```
<xarray.Dataset>
Dimensions:     (d2: 2, nlat: 2400, nlon: 3600, time: 16401, z_t: 62,
                 z_t_150m: 15, z_w: 62, z_w_bot: 62, z_w_top: 62)
Coordinates:
  * z_w_top     (z_w_top) float32 0.0 1000.0 2000.0 3000.0 4000.0 5000.0 ...
  * z_t         (z_t) float32 500.0 1500.0 2500.0 3500.0 4500.0 5500.0 ...
  * z_w         (z_w) float32 0.0 1000.0 2000.0 3000.0 4000.0 5000.0 6000.0 ...
  * z_t_150m    (z_t_150m) float32 500.0 1500.0 2500.0 3500.0 4500.0 5500.0 ...
  * z_w_bot     (z_w_bot) float32 1000.0 2000.0 3000.0 4000.0 5000.0 6000.0 ...
  * time        (time) float64 7.322e+05 7.322e+05 7.322e+05 7.322e+05 ...
Dimensions without coordinates: d2, nlat, nlon
Data variables:
    time_bound  (time, d2) float64 dask.array<shape=(16401, 2), chunksize=(1, 2)>
    SST         (time, nlat, nlon) float32 dask.array<shape=(16401, 2400, 3600), chunksize=(1, 2400, 3600)>
Attributes:
    nsteps_total:  480
    tavg_sum:      64800.0
    title:         BRCP85C5CN_ne120_t12_pop62.c13b17.asdphys.001
    start_time:    This dataset was created on 2016-03-14 at 05:32:30.3
    Conventions:   CF-1.0; http://www.cgd.ucar.edu/cms/eaton/netcdf/CF-curren...
    source:        CCSM POP2, the CCSM Ocean Component
    cell_methods:  cell_methods = time: mean ==> the variable values are aver...
    calendar:      All years have exactly 365 days.
    history:       none
    contents:      Diagnostic and Prognostic Variables
    revision:      $Id: tavg.F90 56176 2013-12-20 18:35:46Z mlevy@ucar.edu $
```
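For context (my gloss, not part of the original comment): the `preprocess` callback drops every variable without a `time` dimension, so `open_mfdataset` does not have to align and concatenate those grid variables across all of the files. A quick way to see which variables would be dropped, assuming at least one file matches `glob_pattern`:

```python
import glob

# Open a single file (undecoded) and list the variables that the
# preprocess step would drop; reuses xr, glob_pattern and
# non_time_coords from the example above.
first_file = sorted(glob.glob(glob_pattern))[0]
with xr.open_dataset(first_file, decode_cf=False) as single:
    print(non_time_coords(single))
```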

The dataset above is roughly 45 years of daily data, one file per year.
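As a quick sanity check (my own arithmetic, not part of the original comment), the file count and the noleap calendar are consistent with the 16401 timesteps shown in the repr above:

```python
# 45 files (one per year) of daily output; the attributes above say
# "All years have exactly 365 days".
n_years = 45
days_per_year = 365
print(n_years * days_per_year)  # 16425, in line with the 16401 timesteps above
```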

Instead, if I just change to `decode_cf=True` (the default), it takes forever. I can monitor what is happening via the distributed dashboard; it looks like this:

[screenshot of the dask distributed dashboard]

There are many more of these `open_dataset` tasks than there are files (45), so I can only presume there are 16401 individual tasks (one for each timestep), each of which takes about 1 s in serial (on the order of 4.5 hours if run back to back).
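A minimal sketch of one possible workaround (my suggestion, not something proposed in the original comment, and untested on this dataset): keep the fast undecoded open shown above, then apply CF decoding once to the already-combined dataset with `xr.decode_cf` instead of once per file. It reuses `glob_pattern` and `drop_non_essential_vars_pop` from the example above; whether it preserves the speedup here is an assumption.

```python
# Fast, undecoded multi-file open (same call as above) ...
ds_raw = xr.open_mfdataset(glob_pattern, decode_times=False, chunks={'time': 1},
                           preprocess=drop_non_essential_vars_pop, decode_cf=False)
# ... followed by a single decode pass on the combined dataset.
ds = xr.decode_cf(ds_raw, decode_times=True)
```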

This is a real failure of lazy decoding. Maybe it can be fixed by #1725, possibly related to #1372.

cc Pangeo folks: @jhamman, @mrocklin

reactions:
{
    "total_count": 2,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 2,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}