issues: 326533369
id | node_id | number | title | user | state | locked | assignee | milestone | comments | created_at | updated_at | closed_at | author_association | active_lock_reason | draft | pull_request | body | reactions | performed_via_github_app | state_reason | repo | type |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
326533369 | MDU6SXNzdWUzMjY1MzMzNjk= | 2186 | Memory leak while looping through a Dataset | 12929327 | closed | 0 |  |  | 13 | 2018-05-25T13:53:31Z | 2022-03-08T10:00:07Z | 2019-01-14T21:09:36Z | NONE |  |  |  | (full issue body below) | { "url": "https://api.github.com/repos/pydata/xarray/issues/2186/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |  | completed | 13221727 | issue |

I'm encountering a detrimental memory leak when simply accessing data from a Dataset repeatedly within a loop. I'm opening netCDF files concatenated in time and looping through time to create plots; in this case the x-y slices are about 5000 x 5000 in size.

```python
import xarray as xr
import os, psutil

# Report this process's resident memory as we loop.
process = psutil.Process(os.getpid())

ds = xr.open_mfdataset('*.nc', chunks={'x': 4000, 'y': 4000}, concat_dim='t')

for k in range(ds.dims['t']):
    data = ds.datavar[k, :, :].values
    print('memory =', process.memory_info().rss)
```
Strangely, in this simplified example I can greatly reduce the memory growth by using much smaller chunk sizes, but in my real-world case opening all the data with smaller chunks does not mitigate the problem. Either way, it's not clear to me why memory usage should grow at all, for any chunk size.

```python
ds = xr.open_mfdataset('*.nc', chunks={'x': 1000, 'y': 1000}, concat_dim='t')
```
I can also generate memory growth when cutting dask out entirely:

```python
ds = xr.open_dataset('data.nc', chunks=None)  # single x-y dataset, 5424 x 5424

for var in ['var1', 'var2', ... , 'var15']:
    data = ds[var].values
    print('memory =', process.memory_info().rss)
```
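As a point of comparison, here is a minimal diagnostic sketch, not from the original report, that reopens the file for each variable instead of keeping one Dataset open (the filename `data.nc` and the variable names `var1`…`var15` are the same placeholders as above). If memory stays flat in this variant while it grows in the loop above, that would suggest the growth comes from state held by the open Dataset (e.g. backend caches) rather than from the extracted NumPy arrays themselves.

```python
import os

import psutil
import xarray as xr

process = psutil.Process(os.getpid())

# Placeholder variable names, matching the snippet above.
varnames = ['var1', 'var2', 'var15']

for var in varnames:
    # Open the file, pull one variable into memory, and close it again
    # before the next iteration.
    with xr.open_dataset('data.nc') as ds:
        data = ds[var].values
    print(var, 'memory =', process.memory_info().rss)
```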
Strangely, though, the growth stops after several iterations. This isn't always the case: sometimes it plateaus for a few iterations and then begins growing again. I feel like I'm missing something fundamental about xarray memory management. It seems like a great impediment if arrays (or something) read from a Dataset are not garbage collected while looping through that Dataset, which rather defeats the purpose of only accessing and working with the data you need. I have to access rather large chunks of data at a time, so being able to discard each slice and move on to the next one without filling up RAM is a big deal. Any ideas what's going on? Or what am I missing?
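One way to test the garbage-collection hypothesis directly, a sketch added here for illustration rather than something from the original report, is to drop the only reference to each slice and force a collection on every iteration. If RSS still climbs, the memory is being retained by something other than uncollected slice arrays, e.g. caches inside the netCDF backend or dask.

```python
import gc
import os

import psutil
import xarray as xr

process = psutil.Process(os.getpid())
ds = xr.open_mfdataset('*.nc', chunks={'x': 4000, 'y': 4000}, concat_dim='t')

for k in range(ds.dims['t']):
    data = ds.datavar[k, :, :].values  # 'datavar' is the placeholder name used above
    del data       # drop the only reference to this slice
    gc.collect()   # force a full collection pass
    print('memory =', process.memory_info().rss)
```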
Output of `print(ds)` for the dataset returned by `open_mfdataset()`:

```
<xarray.Dataset>
Dimensions:  (band: 1, number_of_image_bounds: 2, number_of_time_bounds: 2, t: 4, x: 5424, y: 5424)
Coordinates:
  * y        (y) float32 0.151844 0.151788 ...
  * x        (x) float32 -0.151844 -0.151788 ...
  * t        (t) datetime64[ns] 2018-05-25T00:36:02.796268032 ...
Data variables:
    data     (t, y, x) float32 dask.array<shape=(4, 5424, 5424), chunksize=(1, 4000, 4000)>
```
Output of `xr.show_versions()`:

```
INSTALLED VERSIONS
------------------
commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Linux
OS-release: 4.16.8-300.fc28.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

xarray: 0.10.4
pandas: 0.22.0
numpy: 1.14.3
scipy: 1.1.0
netCDF4: 1.4.0
h5netcdf: 0.5.1
h5py: 2.8.0
Nio: None
zarr: None
bottleneck: 1.2.1
cyordereddict: None
dask: 0.17.4
distributed: 1.21.8
matplotlib: 2.2.2
cartopy: 0.16.0
seaborn: None
setuptools: 39.1.0
pip: 10.0.1
conda: None
pytest: None
IPython: 6.4.0
sphinx: None
```