issue_comments: 520182139

html_url: https://github.com/pydata/xarray/issues/3200#issuecomment-520182139
issue_url: https://api.github.com/repos/pydata/xarray/issues/3200
id: 520182139
node_id: MDEyOklzc3VlQ29tbWVudDUyMDE4MjEzOQ==
user: 1217238
created_at: 2019-08-10T21:51:25Z
updated_at: 2019-08-10T21:52:24Z
author_association: MEMBER

Thanks for the profiling script. I ran a few permutations of this:

- xarray.open_mfdataset with engine='netcdf4' (default)
- xarray.open_mfdataset with engine='h5netcdf'
- xarray.open_dataset with engine='netcdf4' (default)
- xarray.open_dataset with engine='h5netcdf'

Here are some plots:

- xarray.open_mfdataset with engine='netcdf4': pretty noticeable memory leak, about 0.5 MB / open_mfdataset call.
- xarray.open_mfdataset with engine='h5netcdf': looks like a small memory leak, about 0.1 MB / open_mfdataset call.
- xarray.open_dataset with engine='netcdf4' (default): definitely has a memory leak.
- xarray.open_dataset with engine='h5netcdf': does not appear to have a memory leak.
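As a rough indication of how those per-call numbers can be estimated, here is a minimal sketch (not the original profiling script; it assumes psutil is installed and that the testfile_*.nc files from the script at the end of this comment already exist):

```python
# Rough per-call leak estimate (a sketch, not the original profiling script).
# Assumes psutil is installed and that testfile_*.nc files already exist.
import glob
import os

import psutil
import xarray as xr

proc = psutil.Process(os.getpid())
files = sorted(glob.glob('testfile_*.nc'))


def rss_mb():
    # resident set size of this process, in MB
    return proc.memory_info().rss / 1e6


n = 100
for engine in ['netcdf4', 'h5netcdf']:
    start = rss_mb()
    for _ in range(n):
        # newer xarray versions may also need combine='nested' with concat_dim
        ds = xr.open_mfdataset(files, engine=engine, concat_dim='time')
        ds.close()
    print(engine, 'approx MB leaked per open_mfdataset call:', (rss_mb() - start) / n)
```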

So in conclusion, it looks like there are memory leaks:

1. when using netCDF4-Python (I was also able to confirm these without using xarray at all, just using netCDF4.Dataset; a sketch of that check is below)
2. when using xarray.open_mfdataset
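Here is roughly what that xarray-free check looks like (a sketch, not the exact script I ran; it assumes psutil and netCDF4 are installed and reuses one of the test files written by the script below):

```python
# Sketch: open/close the same file repeatedly with plain netCDF4-Python
# (no xarray involved) and watch the process memory grow.
# Assumes psutil and netCDF4 are installed and testfile_0.nc exists.
import os

import netCDF4
import psutil

proc = psutil.Process(os.getpid())
start = proc.memory_info().rss

n = 1000
for _ in range(n):
    nc = netCDF4.Dataset('testfile_0.nc', mode='r')
    nc.close()

leaked_mb = (proc.memory_info().rss - start) / 1e6
print('approx MB leaked per open/close:', leaked_mb / n)
```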

(1) looks like by far the bigger issue, which you can work around by switching to scipy or h5netcdf to read your files.
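Concretely, the workaround is just to request a different engine when opening the files (a sketch; h5netcdf needs to be installed separately, and the scipy engine can only read netCDF3 files):

```python
# Workaround sketch: avoid the netCDF4-Python engine entirely.
import xarray as xr

ds = xr.open_dataset('testfile_0.nc', engine='h5netcdf')
ds.close()
```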

(2) is an issue for xarray. We do some caching, specifically with our backend file manager, but given that the issue only seems to appear when using open_mfdataset, I suspect it may have more to do with the interaction with Dask, though to be honest I'm not exactly sure how.

Note: I modified your script to set xarray's file cache size to 1, which helps smooth out the memory usage:

```python
import glob

import numpy as np
import xarray as xr
from memory_profiler import profile  # assuming @profile comes from memory_profiler


def CreateTestFiles():
    # create a bunch of files
    xlen = int(1e2)
    ylen = int(1e2)
    xdim = np.arange(xlen)
    ydim = np.arange(ylen)

    nfiles = 100
    for i in range(nfiles):
        data = np.random.rand(xlen, ylen, 1)
        datafile = xr.DataArray(data, coords=[xdim, ydim, [i]], dims=['x', 'y', 'time'])
        datafile.to_netcdf('testfile_{}.nc'.format(i))


@profile
def ReadFiles():
    # for i in range(100):
    #     ds = xr.open_dataset('testfile_{}.nc'.format(i), engine='netcdf4')
    #     ds.close()
    ds = xr.open_mfdataset(glob.glob('testfile_*'), engine='h5netcdf', concat_dim='time')
    ds.close()


if __name__ == '__main__':
    # write out files for testing
    CreateTestFiles()

    xr.set_options(file_cache_maxsize=1)

    # loop thru file read step
    for i in range(100):
        ReadFiles()
```
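(If memory_profiler is indeed the profiler behind the `@profile` decorator here, a script like this can be run with `mprof run script.py` followed by `mprof plot` to produce memory-over-time plots like the ones above.)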

reactions:
{
    "total_count": 2,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 1,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
issue: 479190812