issues: 243927150

id: 243927150
node_id: MDU6SXNzdWUyNDM5MjcxNTA=
number: 1481
title: Excessive memory usage when printing multi-file Dataset
user: 29717790
state: closed (state_reason: completed)
locked: 0
comments: 9
created_at: 2017-07-19T05:34:25Z
updated_at: 2017-09-22T19:30:03Z
closed_at: 2017-09-22T19:30:03Z
author_association: NONE
repo: 13221727
type: issue

I have a dataset comprising 25 output files from the ROMS ocean model. They are netCDF files ("averages" files in ROMS jargon) containing a number of variables, but most of the storage is devoted to a few time-varying oceanographic variables, either 2D or 3D in space. I have post-processed the files by packing the oceanographic variables into int16 form using the netCDF add_offset and scale_factor attributes. Each file has 100 records in the unlimited dimension (ocean_time), so the complete dataset has 2500 records. The 25 files total 56.8 GiB and would therefore expand to roughly 230 GiB in float64 form.
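For readers unfamiliar with this convention: the file stores an integer variable plus scale_factor and add_offset attributes, and a CF-aware reader such as xarray reverses the transform on decode. A minimal numpy sketch of the idea, with illustrative shapes and values (none of these numbers come from the actual ROMS files):

```python
import numpy as np

# Illustrative float64 field standing in for one oceanographic variable.
temp = np.random.uniform(5.0, 25.0, size=(100, 300, 400))

# Choose packing parameters that span the data range (int16 convention).
scale_factor = (temp.max() - temp.min()) / (2**16 - 2)
add_offset = (temp.max() + temp.min()) / 2

# Pack: float64 -> int16, as stored in the netCDF files on disk.
packed = np.round((temp - add_offset) / scale_factor).astype(np.int16)

# Unpack: what a CF-aware reader does on decode, restoring float64 and
# quadrupling the in-memory size relative to the int16 data.
unpacked = packed * scale_factor + add_offset
assert np.abs(unpacked - temp).max() <= scale_factor  # quantisation error only
```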

I open the 25 files with xarray.open_mfdataset, concatenating along the unlimited dimension. This takes a few seconds. I then print() the resulting xarray.Dataset. This takes a few seconds more. All good so far.

But when I vary the number of these files, n, that I include in my xarray.Dataset, I get surprising and inconvenient results. All works as expected, in reasonable time, with n <= 8 and with n >= 19. But with 9 <= n <= 18, the interpreter that's processing the code (pythonw.exe via IPython) consumes steadily more memory until the 12-14 GiB available on my machine is exhausted.
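A sketch of how one might track memory while sweeping n (the psutil dependency, the local file path, and the loop values are my additions, not part of the original report; for 9 <= n <= 18 the print itself is reported to exhaust memory, so run with care):

```python
"""Print datasets built from n copies of one ROMS file, tracking memory."""

import os

import psutil  # assumed available; used only to read the process RSS
import xarray as xr

FILE = 'roms_avg_0001.nc'  # hypothetical local copy of the file linked below
proc = psutil.Process(os.getpid())

for n in (1, 8, 9, 18, 19):  # brackets the reported problem range
    ds = xr.open_mfdataset([FILE] * n, concat_dim='ocean_time')
    print(ds)  # the step that reportedly exhausts memory for 9 <= n <= 18
    print('n = %d: RSS = %.2f GiB' % (n, proc.memory_info().rss / 2**30))
    ds.close()
```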

The attached script exposes the problem. In this case the file sequence consists of one file name repeated n times. The value of n currently hard-coded into the script is 10. With this value, the final statement in the script, printing the dataset, will exhaust the memory on my PC in about 10 seconds unless I kill the process first.

I have put a copy of the ROMS output file here:

ftp://ftp.niwa.co.nz/incoming/hadfield/roms_avg_0001.nc

mgh_example_test_mfdataset.py.txt:

```python
"""Explore a performance/memory bug relating to multi-file datasets."""

import os

import xarray as xr

# %% Specify the list of files. Repeat the same file to get a multi-file dataset.

root = os.path.join(
    'D:\\', 'Mirror', 'hpcf', 'working', 'hadfield', 'work', 'cook', 'roms',
    'sim34', 'run', 'bran-2009-2012-wrfnz-1.20')
file = [os.path.join(root, 'roms_avg_0001.nc')]

file = file*10

print('Number of files:', len(file))

# %% Create a multi-file dataset with the open_mfdataset function.

ds = xr.open_mfdataset(file, concat_dim='ocean_time')

print('The dataset has been successfully opened')

# %% Print a summary.

print(ds)

print('The dataset has been printed')
```
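While investigating, it may help to look at the dataset's metadata piecemeal rather than via print(ds), which is the statement that triggers the runaway memory use. A minimal diagnostic sketch, assuming the `ds` object from the script above (this is an inspection aid, not a fix):

```python
# Inspect the combined dataset without building the full repr.
print('Dimensions:', dict(ds.dims))
print('Data variables:', list(ds.data_vars))
print('Coordinates:', list(ds.coords))
print('ocean_time length:', ds.dims['ocean_time'])
```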

