# Excessive memory usage when printing multi-file Dataset

pydata/xarray#1481 · opened 2017-07-19 · closed as completed 2017-09-22 · 9 comments

I have a dataset comprising 25 output files from the ROMS ocean model. They are netCDF files ("averages" files in ROMS jargon) containing a number of variables, but most of the storage is devoted to a few time-varying oceanographic variables, either 2D or 3D in space. I have post-processed the files by packing the oceanographic variables to int32 form using the netCDF `add_offset` and `scale_factor` attributes. Each file has 100 records in the unlimited dimension (`ocean_time`), so the complete dataset has 2500 records. The 25 files total 56.8 GiB and would expand to roughly 230 GiB in float64 form.

I open the 25 files with `xarray.open_mfdataset`, concatenating along the unlimited dimension. This takes a few seconds. I then `print()` the resulting `xarray.Dataset`. This takes a few seconds more. All good so far.

But when I vary the number of these files, _n_, that I include in my `xarray.Dataset`, I get surprising and inconvenient results. All works as expected, in reasonable time, with _n_ <= 8 and with _n_ >= 19. But with 9 <= _n_ <= 18, the interpreter that's processing the code (pythonw.exe via IPython) consumes steadily more memory until the 12-14 GiB available on my machine is exhausted.

The attached script exposes the problem. In this case the file sequence consists of one file name repeated _n_ times. The value of _n_ currently hard-coded into the script is 10. With this value, the final statement in the script (printing the dataset) exhausts the memory on my PC in about 10 seconds, if I fail to kill the process first.

I have put a copy of the ROMS output file here: ftp://ftp.niwa.co.nz/incoming/hadfield/roms_avg_0001.nc

[mgh_example_test_mfdataset.py.txt](https://github.com/pydata/xarray/files/1158093/mgh_example_test_mfdataset.py.txt):

```python
"""Explore a performance/memory bug relating to multi-file datasets"""

import xarray as xr
import os

#%% Specify the list of files. Repeat the same file to get a multi-file dataset

root = os.path.join(
    'D:\\', 'Mirror', 'hpcf', 'working', 'hadfield', 'work', 'cook', 'roms',
    'sim34', 'run', 'bran-2009-2012-wrfnz-1.20')

file = [os.path.join(root, 'roms_avg_0001.nc')]
file = file*10

print('Number of files:', len(file))

#%% Create a multi-file dataset with the open_mfdataset function.

ds = xr.open_mfdataset(file, concat_dim='ocean_time')

print('The dataset has been successfully opened')

#%% Print a summary

print(ds)

print('The dataset has been printed')
```
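For context on the packing the report describes, here is a minimal sketch of how a variable can be stored as integers via the netCDF `scale_factor`/`add_offset` convention and transparently decoded by xarray on read. The variable name, dimensions, file name, and packing parameters below are invented for illustration; they are not the values used in the ROMS files.

```python
"""Sketch: pack a float field to integers on write, decode on read.

All names and scaling values here are hypothetical, chosen only to
illustrate the scale_factor/add_offset convention the report mentions.
"""
import numpy as np
import xarray as xr

# A floating-point field we want to store compactly (hypothetical data).
temp = xr.DataArray(
    np.random.uniform(0.0, 30.0, size=(100, 50)),
    dims=('ocean_time', 'xi_rho'), name='temp')

# Ask xarray to pack on write: stored = round((value - add_offset) / scale_factor).
temp.to_dataset().to_netcdf(
    'packed.nc',
    encoding={'temp': {'dtype': 'int32',
                       'scale_factor': 1e-6,
                       'add_offset': 15.0}})

# On read, the default mask_and_scale decoding applies scale_factor and
# add_offset, so the data come back as floating point.
ds = xr.open_dataset('packed.nc')
print(ds['temp'].dtype)
```

This is why the on-disk files are roughly a quarter the size of their float64 equivalent while still presenting floating-point values to the user.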
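As a possible mitigation (a sketch only, untested against this dataset and not a confirmed fix for the reported bug), one could pass an explicit `chunks` argument to `open_mfdataset` so the backing dask arrays use a known chunk size along the record dimension rather than whatever default the multi-file path chooses. The file names and chunk size here are hypothetical.

```python
import xarray as xr

# Hypothetical file list; substitute the real ROMS output paths.
files = ['roms_avg_%04d.nc' % i for i in range(1, 26)]

# Explicit chunking along the record dimension; the chunk size of 10 is a
# guess and may need tuning. Note this uses the xarray API of the era of
# this report; current versions also require combine='nested' when
# concat_dim is given.
ds = xr.open_mfdataset(files, concat_dim='ocean_time',
                       chunks={'ocean_time': 10})

# Printing a dask-backed Dataset should only summarize metadata, so this
# ought to stay cheap regardless of the number of files.
print(ds)
```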