---
**Comment by user 10554254 — 2017-11-16**
<https://github.com/pydata/xarray/issues/1301#issuecomment-344949160>

Looks like it has been resolved! Tested with the latest pre-release, v0.10.0rc2, on the dataset linked by najascutellatus above:
https://marine.rutgers.edu/~michaesm/netcdf/data/

```
da.set_options(get=da.async.get_sync)
%prun -l 10 ds = xr.open_mfdataset('./*.nc')
```

xarray==0.10.0rc2-1-g8267fdb, dask==0.15.4:

```
194381 function calls (188429 primitive calls) in 0.869 seconds

Ordered by: internal time
List reduced from 469 to 10 due to restriction <10>

    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        50    0.393    0.008    0.393    0.008 {numpy.core.multiarray.arange}
        50    0.164    0.003    0.557    0.011 indexing.py:266(_index_indexer_1d)
         5    0.083    0.017    0.085    0.017 netCDF4_.py:185(_open_netcdf4_group)
       190    0.024    0.000    0.066    0.000 netCDF4_.py:256(open_store_variable)
       190    0.022    0.000    0.022    0.000 netCDF4_.py:29(__init__)
        50    0.018    0.000    0.021    0.000 {operator.getitem}
 5145/3605    0.012    0.000    0.019    0.000 indexing.py:493(shape)
 2317/1291    0.009    0.000    0.094    0.000 _abcoll.py:548(update)
     26137    0.006    0.000    0.013    0.000 {isinstance}
       720    0.005    0.000    0.006    0.000 {method 'getncattr' of 'netCDF4._netCDF4.Variable' objects}
```

xarray==0.9.1, dask==0.13.0:

```
241253 function calls (229881 primitive calls) in 98.123 seconds

Ordered by: internal time
List reduced from 659 to 10 due to restriction <10>

    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        30   87.527    2.918   87.527    2.918 {pandas._libs.tslib.array_to_timedelta64}
        65    7.055    0.109    7.059    0.109 {operator.getitem}
        80    0.799    0.010    0.799    0.010 {numpy.core.multiarray.arange}
 7895/4420    0.502    0.000    0.524    0.000 utils.py:412(shape)
        68    0.442    0.007    0.442    0.007 {pandas._libs.algos.ensure_object}
        80    0.350    0.004    1.150    0.014 indexing.py:318(_index_indexer_1d)
     60/30    0.296    0.005   88.407    2.947 timedeltas.py:158(_convert_listlike)
        30    0.284    0.009    0.298    0.010 algorithms.py:719(checked_add_with_arr)
       123    0.140    0.001    0.140    0.001 {method 'astype' of 'numpy.ndarray' objects}
  1049/719    0.096    0.000   96.513    0.134 {numpy.core.multiarray.array}
```

---
**Comment by user 10554254 — 2017-04-12**
<https://github.com/pydata/xarray/issues/1301#issuecomment-293619896>

`decode_times=False` significantly reduces read time, but the proportional performance gap between xarray 0.8.2 and 0.9.1 remains the same.

---
**Comment by user 865212 — 2017-04-12**
<https://github.com/pydata/xarray/issues/1301#issuecomment-293593843>

@friedrichknuth Did you run the tests with the most recent version and `decode_times=True`/`False` on a single-file read?
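---
A note for re-running the benchmark above on a current stack: the `da.set_options(get=da.async.get_sync)` call is the pre-0.18 dask API and no longer exists (`async` later became a reserved word in Python). Below is a minimal, untested sketch of the same measurement with today's synchronous scheduler, using `cProfile` in place of IPython's `%prun` and the same `./*.nc` glob from the comments above:

```python
import cProfile

import dask
import xarray as xr

# Modern replacement for `da.set_options(get=da.async.get_sync)`: run every
# task on the single-threaded synchronous scheduler so the profile is not
# split across worker threads.
dask.config.set(scheduler="synchronous")

profiler = cProfile.Profile()
profiler.enable()
ds = xr.open_mfdataset("./*.nc")  # same glob as the benchmarks in this thread
profiler.disable()
profiler.print_stats(sort="tottime")  # "internal time", as in `%prun`
```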
---
**Comment by user 1360241 — 2017-04-04**
<https://github.com/pydata/xarray/issues/1301#issuecomment-291512017>

@rabernat This data is computed on demand from the OOI (http://oceanobservatories.org/cyberinfrastructure-technology/). Datasets can be massive, so they appear to be split into ~500 MB files when the data grows too big; that is why `obs` changes from file to file. Would making `obs` consistent across all files potentially make `open_mfdataset` faster?

---
**Comment by user 10554254 — 2017-03-13**
<https://github.com/pydata/xarray/issues/1301#issuecomment-286220522>

Looks like the issue might be that xarray 0.9.1 is decoding all timestamps on load.

xarray==0.9.1, dask==0.13.0:

```
da.set_options(get=da.async.get_sync)
%prun -l 10 ds = xr.open_mfdataset('./*.nc')

167305 function calls (160352 primitive calls) in 59.688 seconds

Ordered by: internal time
List reduced from 625 to 10 due to restriction <10>

    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        18   57.057    3.170   57.057    3.170 {pandas.tslib.array_to_timedelta64}
        39    0.860    0.022    0.863    0.022 {operator.getitem}
        48    0.402    0.008    0.402    0.008 {numpy.core.multiarray.arange}
 4341/2463    0.257    0.000    0.273    0.000 utils.py:412(shape)
        88    0.245    0.003    0.245    0.003 {pandas.algos.ensure_object}
        48    0.158    0.003    0.561    0.012 indexing.py:318(_index_indexer_1d)
     36/18    0.135    0.004   57.509    3.195 timedeltas.py:150(_convert_listlike)
        18    0.126    0.007    0.130    0.007 nanops.py:815(_checked_add_with_arr)
        51    0.070    0.001    0.070    0.001 {method 'astype' of 'numpy.ndarray' objects}
   676/475    0.047    0.000   58.853    0.124 {numpy.core.multiarray.array}
```

`pandas.tslib.array_to_timedelta64` is by far the most expensive call here, and it is not run at all under xarray 0.8.2.

xarray==0.8.2, dask==0.13.0:

```
da.set_options(get=da.async.get_sync)
%prun -l 10 ds = xr.open_mfdataset('./*.nc')

140668 function calls (136769 primitive calls) in 0.766 seconds

Ordered by: internal time
List reduced from 621 to 10 due to restriction <10>

    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
 2571/1800    0.178    0.000    0.184    0.000 utils.py:387(shape)
        18    0.174    0.010    0.174    0.010 {numpy.core.multiarray.arange}
        16    0.079    0.005    0.079    0.005 {numpy.core.multiarray.concatenate}
   483/420    0.077    0.000    0.125    0.000 {numpy.core.multiarray.array}
        15    0.054    0.004    0.197    0.013 indexing.py:259(_index_indexer_1d)
         3    0.041    0.014    0.043    0.014 netCDF4_.py:181(__init__)
       105    0.013    0.000    0.057    0.001 netCDF4_.py:196(open_store_variable)
        15    0.012    0.001    0.013    0.001 {operator.getitem}
 2715/1665    0.007    0.000    0.178    0.000 indexing.py:343(shape)
      5971    0.006    0.000    0.006    0.000 collections.py:71(__setitem__)
```

The dask version is held constant across both tests.
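---
A minimal sketch of the `decode_times=False` workaround discussed above: open the files without decoding times, then apply CF decoding once to the combined dataset. This assumes the files carry CF-style time encodings that `xr.decode_cf` can interpret, and reuses the same `./*.nc` glob as the profiles:

```python
import xarray as xr

# Fast path: skip per-file conversion of time/timedelta values at open time.
ds_raw = xr.open_mfdataset("./*.nc", decode_times=False)

# Decode CF conventions (including the time coordinates) once, afterwards.
ds = xr.decode_cf(ds_raw)
```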
---
**Comment by user 1360241 — 2017-03-13**
<https://github.com/pydata/xarray/issues/1301#issuecomment-286212647>

Data: five files, approximately 450 MB each.

venv1:
- dask 0.13.0 py27_0 conda-forge
- xarray 0.8.2 py27_0 conda-forge
- 1.51642394066 seconds to load using `open_mfdataset`

venv2:
- dask 0.13.0 py27_0 conda-forge
- xarray 0.9.1 py27_0 conda-forge
- 279.011202097 seconds to load using `open_mfdataset`

I ran the same code as in the original post in two conda environments with the same dask version but different xarray versions; loading took roughly 180x longer under 0.9.1. I've posted the data on my work site if anyone wants to double-check:
https://marine.rutgers.edu/~michaesm/netcdf/data/
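---
For completeness, a hedged sketch of the venv1/venv2 comparison methodology: run the same snippet in each conda environment and compare wall-clock load times. This version is written for Python 3 (the original environments were Python 2.7) and assumes the glob matches the five ~450 MB files above:

```python
import time

import dask
import xarray as xr

# Print the library versions so results from the two environments
# can be told apart.
print("xarray", xr.__version__, "| dask", dask.__version__)

start = time.perf_counter()
ds = xr.open_mfdataset("./*.nc")
elapsed = time.perf_counter() - start
print(f"{elapsed:.3f} seconds to load using open_mfdataset")
```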