# Process getting killed due to high memory consumption of xarray's `nbytes` method

### What is your issue?

The `nbytes` method in `xarray` calculates the number of bytes used by a DataArray or Dataset object. However, I have noticed that this method can lead to high memory consumption, especially when dealing with large datasets.

When I call `nbytes` on a large dataset, it can take significant time and memory to compute. For example, when I tried to calculate the size of the dataset below, `nbytes` consumed several gigabytes of memory and took a few minutes to complete.

Below is an example of the whole process:

1. First we create a dataset.
2. Then we observe the memory consumption of the `nbytes` method.

The code to generate the sample dataset:

```python
import netCDF4 as nc
import numpy as np

fn = '~/test_1.nc'
ncfile = nc.Dataset(fn, 'w', format='NETCDF4')

lat_dim = ncfile.createDimension('lat', 7210)    # latitude axis
lon_dim = ncfile.createDimension('lon', 7440)    # longitude axis
time_dim = ncfile.createDimension('time', None)

lat = ncfile.createVariable('lat', np.float32, ('lat',))
lon = ncfile.createVariable('lon', np.float32, ('lon',))
time = ncfile.createVariable('time', np.float64, ('time',))
var_1 = ncfile.createVariable('var_1', np.float64, ('time', 'lat', 'lon'), zlib=True)
var_2 = ncfile.createVariable('var_2', np.float64, ('time', 'lat', 'lon'), zlib=True)
var_3 = ncfile.createVariable('var_3', np.float64, ('time', 'lat', 'lon'), zlib=True)
var_4 = ncfile.createVariable('var_4', np.float64, ('time', 'lat', 'lon'), zlib=True)

nlats = len(lat_dim); nlons = len(lon_dim); ntimes = 5
lat[:] = -90. + (180./nlats)*np.arange(nlats)
lon[:] = (180./nlats)*np.arange(nlons)

data_arr = np.random.uniform(low=100, high=330, size=(ntimes, nlats, nlons))
var_1[:, :, :] = data_arr
var_2[:, :, :] = data_arr
var_3[:, :, :] = data_arr
var_4[:, :, :] = data_arr

ncfile.close()
```

After running the above script, `test_1.nc` is generated with a compressed on-disk size of approximately 7.1 gigabytes and an uncompressed size of approximately 8.6 gigabytes.

To measure the memory consumed by the `nbytes` method, we can use the following code:

```python
import xarray as xa
from memory_profiler import profile

@profile
def get_dataset_size():
    dataset = xa.open_dataset("test_1.nc")
    print(dataset.nbytes)

if __name__ == "__main__":
    get_dataset_size()
```

```
8582842640
Filename: demo.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     4     97.8 MiB     97.8 MiB           1   @profile
     5                                         def get_dataset_size() :
     6    160.4 MiB     62.6 MiB           1       dataset = xa.open_dataset("test_1.nc")
     7   8434.2 MiB   8273.8 MiB           1       print(dataset.nbytes)
```

We can observe that it consumes ~8.6 GB of RAM, which is approximately equal to the uncompressed size of the dataset (the value returned by `nbytes`).
**Note: if the machine has only 8 gigabytes of memory, this process will be killed.**

Instead, we can use another method to calculate the size of the dataset, which does not consume much memory and returns the same result as `nbytes`:

```python
import xarray as xa
from memory_profiler import profile

@profile
def get_dataset_size():
    dataset = xa.open_dataset("test_1.nc")
    print(sum(v.size * v.dtype.itemsize for v in dataset.variables.values()))

if __name__ == "__main__":
    get_dataset_size()
```

```
8582842640
Filename: demo.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     4     97.8 MiB     97.8 MiB           1   @profile
     5                                         def get_dataset_size() :
     6    160.5 MiB     62.7 MiB           1       dataset = xa.open_dataset("test_1.nc")
     7    160.5 MiB      0.0 MiB          17       print(sum(v.size * v.dtype.itemsize for v in dataset.variables.values()))
```

While exploring the code, I found that at [line 396](https://github.com/pydata/xarray/blob/9ff932a564ec1e19918120bab0ec78f1f87df07b/xarray/core/variable.py#L396) `Variable.nbytes` uses `self._data.nbytes`, which causes the memory issue: accessing `nbytes` on the lazily indexed backend array appears to force all of the data into memory. The very next lines fall back to the `size * itemsize` computation shown above. So why have that `if` block at line 396?
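For illustration, here is a standalone sketch of the metadata-only computation (my own example, not xarray's code; `LazyArray` is a hypothetical stand-in for a file-backed array). The point is that `size * itemsize` needs only the shape and dtype, so no data is ever read:

```python
import numpy as np

class LazyArray:
    """Hypothetical stand-in for a lazily indexed, file-backed array:
    shape and dtype are known up front, but reading the values would
    pull the whole variable into memory."""
    def __init__(self, shape, dtype):
        self.shape = shape
        self.dtype = np.dtype(dtype)

def nbytes_from_metadata(arr) -> int:
    # size * itemsize uses only metadata -- no I/O, no allocation
    return int(np.prod(arr.shape, dtype=np.int64)) * arr.dtype.itemsize

# One of the four data variables above: 5 x 7210 x 7440 float64 values
print(nbytes_from_metadata(LazyArray((5, 7210, 7440), np.float64)))
# -> 2145696000 bytes (~2 GB per variable), computed instantly
```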
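As an aside, a workaround that avoids the blow-up without touching xarray's internals (assuming dask is installed): opening the file with `chunks={}` wraps each variable in a lazy dask array, whose `nbytes` is derived from shape and dtype rather than from loaded data:

```python
import xarray as xa

# With chunks={}, variables are wrapped in lazy dask arrays; dask
# computes nbytes from shape/dtype metadata, so no data is read.
dataset = xa.open_dataset("test_1.nc", chunks={})
print(dataset.nbytes)  # same 8582842640, with negligible memory use
```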