# Excessive memory usage when printing multi-file Dataset

pydata/xarray#1481 · opened 2017-07-19 · closed as completed 2017-09-22 · 9 comments

I have a dataset comprising 25 output files from the ROMS ocean model. They are netCDF files ("averages" files in ROMS jargon) containing a number of variables, but most of the storage is devoted to a few time-varying oceanographic variables, either 2D or 3D in space. I have post-processed the files by packing the oceanographic variables to int32 form using the netCDF `add_offset` and `scale_factor` attributes. Each file has 100 records in the unlimited dimension (`ocean_time`), so the complete dataset has 2500 records. The 25 files total 56.8 GiB and would expand to roughly 230 GiB in float64 form.

I open the 25 files with `xarray.open_mfdataset`, concatenating along the unlimited dimension. This takes a few seconds. I then `print()` the resulting `xarray.Dataset`. This takes a few seconds more. All good so far.

But when I vary the number of these files, _n_, that I include in my `xarray.Dataset`, I get surprising and inconvenient results. All works as expected, in reasonable time, with _n_ <= 8 and with _n_ >= 19. But with 9 <= _n_ <= 18, the interpreter that's processing the code (pythonw.exe via IPython) consumes steadily more memory until the 12-14 GiB available on my machine is exhausted.

The attached script exposes the problem. In this case the file sequence consists of one file name repeated _n_ times. The value of _n_ currently hard-coded into the script is 10. With this value, the final statement in the script (printing the dataset) exhausts the memory on my PC in about 10 seconds, if I fail to kill the process first.

I have put a copy of the ROMS output file here: ftp://ftp.niwa.co.nz/incoming/hadfield/roms_avg_0001.nc

[mgh_example_test_mfdataset.py.txt](https://github.com/pydata/xarray/files/1158093/mgh_example_test_mfdataset.py.txt):

```python
"""Explore a performance/memory bug relating to multi-file datasets"""

import xarray as xr
import os

#%% Specify the list of files. Repeat the same file to get a multi-file dataset

root = os.path.join(
    'D:\\', 'Mirror', 'hpcf', 'working', 'hadfield', 'work', 'cook', 'roms',
    'sim34', 'run', 'bran-2009-2012-wrfnz-1.20')

file = [os.path.join(root, 'roms_avg_0001.nc')]
file = file*10

print('Number of files:', len(file))

#%% Create a multi-file dataset with the open_mfdataset function.

ds = xr.open_mfdataset(file, concat_dim='ocean_time')

print('The dataset has been successfully opened')

#%% Print a summary

print(ds)

print('The dataset has been printed')
```
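For context on the packing the report describes, here is a minimal sketch of how a variable can be stored as integers via the netCDF `scale_factor`/`add_offset` convention and transparently decoded by xarray on read. The variable name, dimensions, file name, and packing parameters below are invented for illustration; they are not the values used in the ROMS files.

```python
"""Sketch: pack a float field to integers on write, decode on read.

All names and scaling values here are hypothetical, chosen only to
illustrate the scale_factor/add_offset convention the report mentions.
"""
import numpy as np
import xarray as xr

# A floating-point field we want to store compactly (hypothetical data).
temp = xr.DataArray(
    np.random.uniform(0.0, 30.0, size=(100, 50)),
    dims=('ocean_time', 'xi_rho'), name='temp')

# Ask xarray to pack on write: stored = round((value - add_offset) / scale_factor).
temp.to_dataset().to_netcdf(
    'packed.nc',
    encoding={'temp': {'dtype': 'int32',
                       'scale_factor': 1e-6,
                       'add_offset': 15.0}})

# On read, the default mask_and_scale decoding applies scale_factor and
# add_offset, so the data come back as floating point.
ds = xr.open_dataset('packed.nc')
print(ds['temp'].dtype)
```

This is why the on-disk files are roughly a quarter the size of their float64 equivalent while still presenting floating-point values to the user.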
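As a possible mitigation (a sketch only, untested against this dataset and not a confirmed fix for the reported bug), one could pass an explicit `chunks` argument to `open_mfdataset` so the backing dask arrays use a known chunk size along the record dimension rather than whatever default the multi-file path chooses. The file names and chunk size here are hypothetical.

```python
import xarray as xr

# Hypothetical file list; substitute the real ROMS output paths.
files = ['roms_avg_%04d.nc' % i for i in range(1, 26)]

# Explicit chunking along the record dimension; the chunk size of 10 is a
# guess and may need tuning. Note this uses the xarray API of the era of
# this report; current versions also require combine='nested' when
# concat_dim is given.
ds = xr.open_mfdataset(files, concat_dim='ocean_time',
                       chunks={'ocean_time': 10})

# Printing a dask-backed Dataset should only summarize metadata, so this
# ought to stay cheap regardless of the number of files.
print(ds)
```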