# Process getting killed due to high memory consumption of xarray's nbytes method

### What is your issue?

The `nbytes` property in `xarray` reports the number of bytes used by a `DataArray` or `Dataset` object. However, I have noticed that accessing it can lead to very high memory consumption, especially when dealing with large datasets.

When I call `nbytes` on a large dataset, it can take significant time and memory to compute. For example, when I calculated the size of the dataset described below, the `nbytes` access consumed several gigabytes of memory and took a few minutes to complete.
The whole process is reproduced below:
 1. First we create a sample dataset
 2. Then we observe the memory consumption of the `nbytes` property
 
The code to generate the sample dataset is below:

```python
import netCDF4 as nc
import numpy as np

# Note: netCDF4 does not expand '~', and the profiling code below opens
# a relative path, so the file is created in the working directory.
fn = 'test_1.nc'
ncfile = nc.Dataset(fn, 'w', format='NETCDF4')

lat_dim = ncfile.createDimension('lat', 7210)    # latitude axis
lon_dim = ncfile.createDimension('lon', 7440)    # longitude axis
time_dim = ncfile.createDimension('time', None)  # unlimited time axis

# Coordinate variables
lat = ncfile.createVariable('lat', np.float32, ('lat',))
lon = ncfile.createVariable('lon', np.float32, ('lon',))
time = ncfile.createVariable('time', np.float64, ('time',))

# Four compressed float64 data variables
var_1 = ncfile.createVariable('var_1', np.float64, ('time', 'lat', 'lon'), zlib=True)
var_2 = ncfile.createVariable('var_2', np.float64, ('time', 'lat', 'lon'), zlib=True)
var_3 = ncfile.createVariable('var_3', np.float64, ('time', 'lat', 'lon'), zlib=True)
var_4 = ncfile.createVariable('var_4', np.float64, ('time', 'lat', 'lon'), zlib=True)

nlats = len(lat_dim)
nlons = len(lon_dim)
ntimes = 5

lat[:] = -90. + (180. / nlats) * np.arange(nlats)
lon[:] = (360. / nlons) * np.arange(nlons)  # the original used 180./nlats here, presumably a typo

# Write the same random field into all four variables
data_arr = np.random.uniform(low=100, high=330, size=(ntimes, nlats, nlons))
var_1[:, :, :] = data_arr
var_2[:, :, :] = data_arr
var_3[:, :, :] = data_arr
var_4[:, :, :] = data_arr

ncfile.close()
```
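
As a sanity check (my own back-of-the-envelope arithmetic, not part of the original script), the uncompressed size follows directly from the shapes and dtypes:

```python
# Four float64 variables of shape (5, 7210, 7440), plus the coordinates.
data_vars = 4 * 5 * 7210 * 7440 * 8   # 8,582,784,000 bytes
coords = 7210 * 4 + 7440 * 4 + 5 * 8  # lat + lon + time = 58,640 bytes
print(data_vars + coords)             # 8582842640, the value nbytes prints below
```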

After running the above script (`python <file_name>.py`), `test_1.nc` is generated with a compressed on-disk size of approximately 7.1 gigabytes and an uncompressed size of approximately 8.6 gigabytes. To measure the memory consumed by the `nbytes` property, we can use the following code:
```python
import xarray as xa
from memory_profiler import profile

@profile
def get_dataset_size():
    dataset = xa.open_dataset("test_1.nc")
    print(dataset.nbytes)

if __name__ == "__main__":
    get_dataset_size()
```
```
8582842640
Filename: demo.py
Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     4     97.8 MiB     97.8 MiB           1   @profile
     5                                         def get_dataset_size() :
     6    160.4 MiB     62.6 MiB           1       dataset = xa.open_dataset("test_1.nc")
     7   8434.2 MiB   8273.8 MiB           1       print(dataset.nbytes)
```
We can observe that the `nbytes` access takes ~8.6 GB of RAM, approximately equal to the uncompressed size of the dataset (which is exactly the value `nbytes` returns).

**Note: if the machine's memory size is only 8 gigabytes, this process will be killed.**
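
As an aside, one way to sidestep the load entirely is to open the file with dask-backed variables (a sketch, assuming `dask` is installed): a dask array reports `.nbytes` from its shape and dtype without reading any data.

```python
import xarray as xa

# chunks={} makes each variable a lazy dask array; dask derives
# .nbytes from metadata, so no data is pulled into memory.
dataset = xa.open_dataset("test_1.nc", chunks={})
print(dataset.nbytes)  # same 8582842640, without the ~8 GB allocation
```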

Instead, we can compute the same result from each variable's shape and dtype, which produces the value `nbytes` returns while consuming almost no memory:

Alternative method:
```python
import xarray as xa
from memory_profiler import profile

@profile
def get_dataset_size():
    dataset = xa.open_dataset("test_1.nc")
    print(sum(v.size * v.dtype.itemsize for v in dataset.variables.values()))

if __name__ == "__main__":
    get_dataset_size()
```
```
8582842640
Filename: demo.py
Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     4     97.8 MiB     97.8 MiB           1   @profile
     5                                         def get_dataset_size() :
     6    160.5 MiB     62.7 MiB           1       dataset = xa.open_dataset("test_1.nc")
     7    160.5 MiB      0.0 MiB          17       print(sum(v.size * v.dtype.itemsize for v in dataset.variables.values()))
```
While exploring the code, I found that at [line 396](https://github.com/pydata/xarray/blob/9ff932a564ec1e19918120bab0ec78f1f87df07b/xarray/core/variable.py#L396) the `nbytes` property returns `self._data.nbytes` whenever the underlying data object exposes an `nbytes` attribute, and that is what triggers the memory blow-up here. Only in the `else` branch does it fall back to the cheap `self.size * self.dtype.itemsize` computation used above.
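
For reference, the property in question reads roughly like this (paraphrased from the linked `variable.py`, not copied verbatim):

```python
@property
def nbytes(self) -> int:
    if hasattr(self._data, "nbytes"):
        # For a file-backed lazy array, accessing .nbytes here can
        # force the entire variable to be loaded into memory.
        return self._data.nbytes
    else:
        return self.size * self.dtype.itemsize
```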

So why have that `if` block at line 396?
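
One possible tightening, sketched here as a hypothetical standalone helper (`lazy_safe_nbytes` is my own name, and a real patch would also need to admit duck arrays such as dask), would be to trust `_data.nbytes` only when the data is already realized in memory:

```python
import numpy as np

def lazy_safe_nbytes(var) -> int:
    # Hypothetical: use the wrapped object's nbytes only if the data is
    # already an in-memory numpy array; otherwise derive the size from
    # metadata so a lazy file-backed wrapper is never forced to load.
    if isinstance(var._data, np.ndarray):
        return var._data.nbytes
    return var.size * var.dtype.itemsize
```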