# Training on xarray files leads to CPU memory leak (PyTorch)

**pydata/xarray** · issue #7429 · state: closed · 2 comments · opened 2023-01-09T12:57:23Z · closed 2023-01-13T13:17:42Z

### What happened?

## Description

At each training batch, CPU memory increases a bit until I run out of memory (total RAM: 376 GB). After that I cannot even ssh into the machine the Jupyter notebook was served from (nor see any errors raised). I cannot understand where or why memory is being cached forever on the CPU: my training is done on the GPU.

I have written minimal reproduction code, which I share below. Two dataset versions are tested: one uses xarray and the other numpy `.npz` files. The bug is reproduced only with the xarray dataset.

## Setup

The cluster I use is managed by SLURM and uses a Lustre filesystem. Training is performed on an NVIDIA GPU. My Python 3.8.15 comes from an official Debian Bullseye Docker image, which I pulled and now run via Singularity (a must, as my cluster does not allow Docker directly).

## Dependencies

- numpy==1.22.4 (also tested on 1.23.5)
- xarray==2022.11.0 (also tested on 2022.12.0)
- h5netcdf==1.0.2 (also tested on 1.1.0)
- torch==1.13.0 (also tested on 1.13.1)

## Current workaround

One workaround we use is loading data with more workers: the memory is forced to be freed when the epoch ends because the worker processes are torn down, I suppose (see the loader sketch after the observations below). Another stopgap is to force fewer batches per epoch, which keeps the leak under control for a while.

**Issue based on [Unexpected eternal CPU RAM growth during training #16227](https://github.com/Lightning-AI/lightning/issues/16227#top)**

### What did you expect to happen?

Reading xarray files should keep data around only until the data has been read and the reader is closed. For some reason, the data seems to stay cached somewhere.
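For illustration, the expected "read, copy, close" behaviour corresponds to a reader variant like the one below, which copies the arrays eagerly and closes the file through a context manager. This is only a sketch (the helper name and the explicit `np.array` copies are illustrative, not from the report, and not a confirmed fix):

```python
# Illustrative sketch only (helper name and explicit copies are not from the
# original report): read one sample, copy it into plain numpy arrays, and make
# sure the file handle is closed before the tensors leave this function.
from pathlib import Path

import numpy as np
import torch
import xarray as xr


def read_sample_eagerly(path: Path) -> dict:
    with xr.open_dataset(path) as ds:  # context manager closes the file
        x = np.array(ds["x"].values)   # np.array(...) forces an independent copy
        y = np.array(ds["y"].values)
    return {"x": torch.from_numpy(x), "y": torch.from_numpy(y)}
```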
### Minimal Complete Verifiable Example

```Python
# Imports
from pathlib import Path

import psutil
import numpy as np
import xarray as xr
import torch

data_dir = Path.cwd() / "data"


# Defining equivalent XArray and NPZ datasets
class BaseDataset(torch.utils.data.Dataset):
    def __init__(self, data_dir=None, transform=None, shape=(1, 128, 128, 128), size=1000):
        self.data_dir = Path(data_dir)
        self.transform = transform
        self.shape = shape
        self.size = size
        self.prepared = False

    def __len__(self):
        return self.size

    def get_fake_sample(self):
        x = np.random.normal(size=self.shape).astype(np.float32)
        y = (x > .7).astype(np.int8)
        return {"x": x, "y": y}


class XarrayDataset(BaseDataset):
    def __getitem__(self, idx):
        ds = xr.open_dataset(self.data_path)
        sample = {"x": torch.as_tensor(ds["x"].data),
                  "y": torch.as_tensor(ds["y"].data)}
        ds.close()
        if self.transform:
            sample = self.transform(sample)
        return sample

    @property
    def data_path(self):
        return self.data_dir / "data.nc"

    def prepare_data(self):
        if self.data_path.exists():
            return
        self.data_dir.mkdir(exist_ok=True)
        sample = self.get_fake_sample()
        ds = xr.Dataset({
            var: xr.DataArray(arr)
            for var, arr in sample.items()
        })
        ds.to_netcdf(self.data_path)
        ds.close()


class NpzDataset(BaseDataset):
    def __getitem__(self, idx):
        npz = np.load(self.data_path)
        sample = {"x": torch.as_tensor(npz["x"]),
                  "y": torch.as_tensor(npz["y"])}
        if self.transform:
            sample = self.transform(sample)
        return sample

    @property
    def data_path(self):
        return self.data_dir / "data.npz"

    def prepare_data(self):
        if self.data_path.exists():
            return
        self.data_dir.mkdir(exist_ok=True)
        sample = self.get_fake_sample()
        np.savez_compressed(self.data_path, **sample)


class ComplicatedTransform():
    def __init__(self, concat_operations=1):
        self.concat_operations = concat_operations

    def __call__(self, sample):
        x = sample["x"]
        for _ in range(self.concat_operations):
            x = torch.cat([x, x**2])
        sample["x"] = x
        return sample


# Prepare training
ChosenDataset = XarrayDataset
# ChosenDataset = NpzDataset

max_epochs = 10
concat_operations = 4

dataset = ChosenDataset(
    data_dir=data_dir,
    transform=ComplicatedTransform(concat_operations),
)
dataset.prepare_data()

loader = torch.utils.data.DataLoader(
    dataset,
    batch_size=64,
    num_workers=0,
    drop_last=True,
    shuffle=True,
)

loss_fn = torch.nn.CrossEntropyLoss()
model = SimpleModel = torch.nn.Sequential(
    torch.nn.LazyConv3d(out_channels=1, kernel_size=1),
    torch.nn.Sigmoid(),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

# Train
device = "cuda"

print(f"## {ChosenDataset.__name__}: {concat_operations=}", end="\n\n")
print(f"| epoch | memory (GB) |")
print(f"|-------|-------------|")

model = model.to(device)
for epoch in range(max_epochs):
    memory = psutil.Process().memory_info().rss / (1024 ** 3)  # GB
    print(f"| {epoch} | {memory:.3f} |")
    for batch in loader:
        X = batch["x"].to(device)
        Y = batch["y"].to(device)

        optimizer.zero_grad()
        Y_pred_proba = model(X)
        loss = loss_fn(Y_pred_proba, Y.to(torch.float16))
        loss.backward()
        optimizer.step()

        del X
        del Y
```
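To help narrow down where the memory accumulates, a stripped-down probe that exercises only the xarray read path (no DataLoader, no model, no GPU) could be run against the same `data/data.nc` file created by `prepare_data()` above. This is a hypothetical diagnostic sketch, not part of the original report:

```python
# Hypothetical diagnostic sketch (not part of the original report): repeatedly
# open/read/close the same netCDF file and watch process RSS. If RSS climbs
# here too, the accumulation is in the read path itself; if it stays flat,
# the interaction with torch tensors / the DataLoader is the more likely cause.
from pathlib import Path

import numpy as np
import psutil
import xarray as xr

path = Path.cwd() / "data" / "data.nc"  # file created by prepare_data() above

for i in range(200):
    with xr.open_dataset(path) as ds:
        x = np.asarray(ds["x"].data)    # read the variables into memory
        y = np.asarray(ds["y"].data)
    if i % 50 == 0:
        rss_gb = psutil.Process().memory_info().rss / 1024**3
        print(f"iteration {i:3d}: rss = {rss_gb:.3f} GB")
```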
### MVCE confirmation

- [X] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
- [X] Complete example — the example is self-contained, including all data and the text of any traceback.
- [X] Verifiable example — the example copy & pastes into an IPython prompt or [Binder notebook](https://mybinder.org/v2/gh/pydata/xarray/main?urlpath=lab/tree/doc/examples/blank_template.ipynb), returning the result.
- [X] New issue — a search of GitHub Issues suggests this is not a duplicate.

### Relevant log output

_No response_

### Anything else we need to know?

With the xarray dataset and `concat_operations=1`, I see memory growth of ~16 GB per epoch; with `concat_operations=4`, ~30 GB per epoch; with no `concat_operations`, there is no memory growth. With the NPZ dataset, I could not reproduce any RAM accumulation across epochs.
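The workaround described above (more loader workers, so whatever memory they accumulate is released when the worker processes exit at the end of each epoch) corresponds roughly to the configuration below. This is only a sketch: it assumes the `dataset` object from the MCVE, and the worker count is illustrative.

```python
# Sketch of the workaround, assuming `dataset` is the XarrayDataset built in the
# MCVE above. With num_workers > 0 and persistent_workers=False (the default),
# worker processes are shut down after every epoch, so their accumulated memory
# is returned to the OS instead of growing for the whole run.
import torch

loader = torch.utils.data.DataLoader(
    dataset,
    batch_size=64,
    num_workers=4,             # illustrative value; > 0 is the relevant part
    persistent_workers=False,  # default: recreate workers each epoch
    drop_last=True,
    shuffle=True,
)
```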
### Environment

```
INSTALLED VERSIONS
------------------
commit: None
python: 3.10.6 (main, Aug 23 2022, 08:25:41) [GCC 10.2.1 20210110]
python-bits: 64
OS: Linux
OS-release: 3.10.0-1160.49.1.el7.x86_64
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.2
libnetcdf: 4.9.0

xarray: 2022.11.0
pandas: 1.4.3
numpy: 1.23.4
scipy: 1.8.1
netCDF4: 1.6.1
pydap: None
h5netcdf: 1.0.2
h5py: 3.7.0
Nio: None
zarr: None
cftime: 1.6.2
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2022.10.2
distributed: None
matplotlib: 3.6.2
cartopy: None
seaborn: None
numbagg: None
fsspec: 2022.7.0
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 63.2.0
pip: 22.3.1
conda: None
pytest: None
IPython: 8.6.0
sphinx: None
```