issues: 243927150
field | value
---|---
id | 243927150
node_id | MDU6SXNzdWUyNDM5MjcxNTA=
number | 1481
title | Excessive memory usage when printing multi-file Dataset
user | 29717790
state | closed
locked | 0
assignee | 
milestone | 
comments | 9
created_at | 2017-07-19T05:34:25Z
updated_at | 2017-09-22T19:30:03Z
closed_at | 2017-09-22T19:30:03Z
author_association | NONE
active_lock_reason | 
draft | 
pull_request | 
reactions | { "url": "https://api.github.com/repos/pydata/xarray/issues/1481/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }
performed_via_github_app | 
state_reason | completed
repo | 13221727
type | issue

body:

I have a dataset comprising 25 output files from the ROMS ocean model. They are netCDF files ("averages" files in ROMS jargon) containing a number of variables, but most of the storage is devoted to a few time-varying oceanographic variables, either 2D or 3D in space. I have post-processed the files by packing the oceanographic variables to int32 form using the netCDF add_offset and scale_factor attributes. Each file has 100 records in the unlimited dimension (ocean_time), so the complete dataset has 2500 records. The 25 files total 56.8 GiB and would expand to roughly 230 GiB in float64 form.

I open the 25 files with xarray.open_mfdataset, concatenating along the unlimited dimension. This takes a few seconds. I then print() the resulting xarray.Dataset. This takes a few seconds more. All good so far.

But when I vary the number of these files, n, that I include in my xarray.Dataset, I get surprising and inconvenient results. All works as expected, in reasonable time, with n <= 8 and with n >= 19. But with 9 <= n <= 18, the interpreter processing the code (pythonw.exe via IPython) consumes steadily more memory until the 12-14 GiB available on my machine is exhausted.

The attached script exposes the problem. In this case the file sequence consists of one file name repeated n times. The value of n currently hard-coded into the script is 10. With this value, the final statement in the script (printing the dataset) will exhaust the memory on my PC in about 10 seconds if I fail to kill the process first. I have put a copy of the ROMS output file here: ftp://ftp.niwa.co.nz/incoming/hadfield/roms_avg_0001.nc

mgh_example_test_mfdataset.py.txt:

```python
"""Explore a performance/memory bug relating to multi-file datasets."""

import os

import xarray as xr

# %% Specify the list of files. Repeat the same file to get a multi-file dataset.
root = os.path.join(
    'D:\\', 'Mirror', 'hpcf', 'working', 'hadfield', 'work', 'cook', 'roms',
    'sim34', 'run', 'bran-2009-2012-wrfnz-1.20')
file = [os.path.join(root, 'roms_avg_0001.nc')]
file = file * 10
print('Number of files:', len(file))

# %% Create a multi-file dataset with the open_mfdataset function.
ds = xr.open_mfdataset(file, concat_dim='ocean_time')
print('The dataset has been successfully opened')

# %% Print a summary.
print(ds)
print('The dataset has been printed')
```
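The reproduction script assumes input files that have already been packed to integers with add_offset and scale_factor, as described in the issue body. The sketch below (not part of the original report) shows one way such packing can be written through xarray's netCDF encoding support; the variable name `temp`, its dimensions, and the value range are hypothetical stand-ins for the reporter's ROMS variables.

```python
"""Minimal sketch: pack a floating-point variable to int32 on write
using CF scale_factor/add_offset encoding (hypothetical data)."""

import numpy as np
import xarray as xr

# Hypothetical 3D oceanographic variable stored as float64 in memory.
ds = xr.Dataset(
    {'temp': (('ocean_time', 'eta_rho', 'xi_rho'),
              np.random.uniform(-2.0, 35.0, size=(10, 50, 80)))})

# Choose scale/offset so the expected value range maps onto the int32 range.
vmin, vmax = -2.0, 35.0
scale = (vmax - vmin) / (2**31 - 2)
encoding = {'temp': {'dtype': 'int32',
                     'scale_factor': scale,
                     'add_offset': vmin,
                     '_FillValue': -2**31}}

# xarray applies the packing on write; the file stores int32 values plus
# the scale_factor/add_offset attributes.
ds.to_netcdf('packed_example.nc', encoding=encoding)
```

On read, xarray's default CF decoding (decode_cf=True) unpacks these integers back to floating point, which is why the in-memory size of such a dataset exceeds its on-disk size.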