pydata/xarray issue #7772: Process getting killed due to high memory consumption of xarray's nbytes method

state: closed (completed) · comments: 6 · author association: NONE
created: 2023-04-20T11:46:02Z · closed: 2023-04-24T10:50:02Z

What is your issue?

The nbytes property in xarray calculates the number of bytes used by a DataArray or Dataset object. However, I have noticed that it can lead to high memory consumption, especially when dealing with large datasets.
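For context, nbytes itself is straightforward on an in-memory array; here is a minimal sketch (not from the original report) of what it returns:

```python
import numpy as np
import xarray as xa

# for an in-memory DataArray, nbytes is just size * itemsize
da = xa.DataArray(np.zeros((1000, 1000), dtype=np.float64))
print(da.nbytes)  # 8000000 (1000 * 1000 * 8 bytes per float64)
```

The problem described below only appears with large file-backed datasets.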

When I call nbytes on a large dataset, it can take significant time and memory to compute. For example, when I tried to calculate the size of the dataset below, nbytes consumed several gigabytes of memory and took a few minutes to complete. The whole process is:

1. First, create a dataset.
2. Observe the memory consumption when calling nbytes on it.

The code to generate the sample dataset is below:

```python
import os

import netCDF4 as nc
import numpy as np

# netCDF4 does not expand '~', so expand it explicitly
fn = os.path.expanduser('~/test_1.nc')
ncfile = nc.Dataset(fn, 'w', format='NETCDF4')

lat_dim = ncfile.createDimension('lat', 7210)    # latitude axis
lon_dim = ncfile.createDimension('lon', 7440)    # longitude axis
time_dim = ncfile.createDimension('time', None)  # unlimited time axis

lat = ncfile.createVariable('lat', np.float32, ('lat',))
lon = ncfile.createVariable('lon', np.float32, ('lon',))
time = ncfile.createVariable('time', np.float64, ('time',))

# four identical float64 variables, zlib-compressed
var_1 = ncfile.createVariable('var_1', np.float64, ('time', 'lat', 'lon'), zlib=True)
var_2 = ncfile.createVariable('var_2', np.float64, ('time', 'lat', 'lon'), zlib=True)
var_3 = ncfile.createVariable('var_3', np.float64, ('time', 'lat', 'lon'), zlib=True)
var_4 = ncfile.createVariable('var_4', np.float64, ('time', 'lat', 'lon'), zlib=True)

nlats = len(lat_dim); nlons = len(lon_dim); ntimes = 5
lat[:] = -90. + (180. / nlats) * np.arange(nlats)
lon[:] = (180. / nlats) * np.arange(nlons)

# fill all four variables with the same random data
data_arr = np.random.uniform(low=100, high=330, size=(ntimes, nlats, nlons))
var_1[:, :, :] = data_arr
var_2[:, :, :] = data_arr
var_3[:, :, :] = data_arr
var_4[:, :, :] = data_arr

ncfile.close()
```
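As a quick sanity check on the generated file (a hypothetical snippet, not part of the original report; the exact on-disk size depends on how well zlib compresses the random data):

```python
import os

# compressed size on disk; the uncompressed data is
# 5 * 7210 * 7440 * 8 bytes * 4 variables ≈ 8.6 GB
print(os.path.getsize(os.path.expanduser('~/test_1.nc')) / 1e9, "GB")
```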

After running the above code (`python <file_name>.py`), test_1.nc is generated with a compressed size of approximately 7.1 gigabytes and an uncompressed size of approximately 8.6 gigabytes. To measure the memory consumed by nbytes, we can use the following code:

```python
import xarray as xa
from memory_profiler import profile

@profile
def get_dataset_size():
    dataset = xa.open_dataset("test_1.nc")
    print(dataset.nbytes)

if __name__ == "__main__":
    get_dataset_size()
```

```
8582842640
Filename: demo.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     4     97.8 MiB     97.8 MiB           1   @profile
     5                                         def get_dataset_size():
     6    160.4 MiB     62.6 MiB           1       dataset = xa.open_dataset("test_1.nc")
     7   8434.2 MiB   8273.8 MiB           1       print(dataset.nbytes)
```

We can observe that it takes ~8.6 GB of RAM, approximately equal to the uncompressed size of the dataset (the value returned by nbytes).

Note: if the machine has only 8 gigabytes of memory, this process will be killed.
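One way to sidestep the blow-up (a sketch, assuming dask is installed; not part of the original report) is to open the file with chunks, so each variable is backed by a lazy dask array whose nbytes comes from shape and dtype metadata rather than loaded data:

```python
import xarray as xa

# chunks={} wraps every variable in a dask array without reading the data
dataset = xa.open_dataset("test_1.nc", chunks={})
print(dataset.nbytes)  # ~8.6 GB reported, but almost no RAM used
```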

Instead, we can use another method to calculate the size of the dataset, one that does not consume much memory and returns the same result as nbytes:

The other method's code:

```python
import xarray as xa
from memory_profiler import profile

@profile
def get_dataset_size():
    dataset = xa.open_dataset("test_1.nc")
    print(sum(v.size * v.dtype.itemsize for v in dataset.variables.values()))

if __name__ == "__main__":
    get_dataset_size()
```

```
8582842640
Filename: demo.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     4     97.8 MiB     97.8 MiB           1   @profile
     5                                         def get_dataset_size():
     6    160.5 MiB     62.7 MiB           1       dataset = xa.open_dataset("test_1.nc")
     7    160.5 MiB      0.0 MiB          17       print(sum(v.size * v.dtype.itemsize for v in dataset.variables.values()))
```

While exploring the code, I found that at [line 396](https://github.com/pydata/xarray/blob/9ff932a564ec1e19918120bab0ec78f1f87df07b/xarray/core/variable.py#L396) it uses `self._data.nbytes`, which is what causes the memory issue. But in the next lines, it uses the method stated above.
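For reference, the property in question looks roughly like this at the linked commit (a paraphrase, not a verbatim copy):

```python
@property
def nbytes(self) -> int:
    if hasattr(self._data, "nbytes"):
        return self._data.nbytes  # line 396: can load the backing data into memory
    else:
        return self.size * self.dtype.itemsize  # metadata only, no data load
```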

So why have that if block at line 396?

