pydata/xarray issue #7772: Process getting killed due to high memory consumption of xarray's nbytes method

state: closed (completed) · comments: 6 · author association: NONE
created: 2023-04-20T11:46:02Z · closed: 2023-04-24T10:50:02Z

What is your issue?

The nbytes property in xarray calculates the number of bytes used by a DataArray or Dataset object. However, I have noticed that it can lead to high memory consumption, especially when dealing with large datasets.
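For context, nbytes itself is straightforward on an in-memory array; here is a minimal sketch (not from the original report) of what it returns:

```python
import numpy as np
import xarray as xa

# for an in-memory DataArray, nbytes is just size * itemsize
da = xa.DataArray(np.zeros((1000, 1000), dtype=np.float64))
print(da.nbytes)  # 8000000 (1000 * 1000 * 8 bytes per float64)
```

The problem described below only appears with large file-backed datasets.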

When I call nbytes on a large dataset, it can take significant time and memory to compute. For example, when I tried to calculate the size of the dataset below, nbytes consumed several gigabytes of memory and took a few minutes to complete. The whole process is:

1. First, create a dataset.
2. Observe the memory consumption when calling nbytes on it.

The code to generate the sample dataset is below:

```python
import os

import netCDF4 as nc
import numpy as np

# netCDF4 does not expand '~', so expand it explicitly
fn = os.path.expanduser('~/test_1.nc')
ncfile = nc.Dataset(fn, 'w', format='NETCDF4')

lat_dim = ncfile.createDimension('lat', 7210)    # latitude axis
lon_dim = ncfile.createDimension('lon', 7440)    # longitude axis
time_dim = ncfile.createDimension('time', None)  # unlimited time axis

lat = ncfile.createVariable('lat', np.float32, ('lat',))
lon = ncfile.createVariable('lon', np.float32, ('lon',))
time = ncfile.createVariable('time', np.float64, ('time',))

# four identical float64 variables, zlib-compressed
var_1 = ncfile.createVariable('var_1', np.float64, ('time', 'lat', 'lon'), zlib=True)
var_2 = ncfile.createVariable('var_2', np.float64, ('time', 'lat', 'lon'), zlib=True)
var_3 = ncfile.createVariable('var_3', np.float64, ('time', 'lat', 'lon'), zlib=True)
var_4 = ncfile.createVariable('var_4', np.float64, ('time', 'lat', 'lon'), zlib=True)

nlats = len(lat_dim); nlons = len(lon_dim); ntimes = 5
lat[:] = -90. + (180. / nlats) * np.arange(nlats)
lon[:] = (180. / nlats) * np.arange(nlons)

# fill all four variables with the same random data
data_arr = np.random.uniform(low=100, high=330, size=(ntimes, nlats, nlons))
var_1[:, :, :] = data_arr
var_2[:, :, :] = data_arr
var_3[:, :, :] = data_arr
var_4[:, :, :] = data_arr

ncfile.close()
```
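As a quick sanity check on the generated file (a hypothetical snippet, not part of the original report; the exact on-disk size depends on how well zlib compresses the random data):

```python
import os

# compressed size on disk; the uncompressed data is
# 5 * 7210 * 7440 * 8 bytes * 4 variables ≈ 8.6 GB
print(os.path.getsize(os.path.expanduser('~/test_1.nc')) / 1e9, "GB")
```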

After running the above code (`python <file_name>.py`), test_1.nc is generated with a compressed size of approximately 7.1 gigabytes and an uncompressed size of approximately 8.6 gigabytes. To measure the memory consumed by nbytes, we can use the following code:

```python
import xarray as xa
from memory_profiler import profile

@profile
def get_dataset_size():
    dataset = xa.open_dataset("test_1.nc")
    print(dataset.nbytes)

if __name__ == "__main__":
    get_dataset_size()
```

```
8582842640
Filename: demo.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     4     97.8 MiB     97.8 MiB           1   @profile
     5                                         def get_dataset_size():
     6    160.4 MiB     62.6 MiB           1       dataset = xa.open_dataset("test_1.nc")
     7   8434.2 MiB   8273.8 MiB           1       print(dataset.nbytes)
```

We can observe that it takes ~8.6 GB of RAM, approximately equal to the uncompressed size of the dataset (the value returned by nbytes).

Note: if the machine has only 8 gigabytes of memory, this process will be killed.
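One way to sidestep the blow-up (a sketch, assuming dask is installed; not part of the original report) is to open the file with chunks, so each variable is backed by a lazy dask array whose nbytes comes from shape and dtype metadata rather than loaded data:

```python
import xarray as xa

# chunks={} wraps every variable in a dask array without reading the data
dataset = xa.open_dataset("test_1.nc", chunks={})
print(dataset.nbytes)  # ~8.6 GB reported, but almost no RAM used
```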

Instead, we can use another method to calculate the size of the dataset, one that does not consume much memory and returns the same result as nbytes:

The other method's code:

```python
import xarray as xa
from memory_profiler import profile

@profile
def get_dataset_size():
    dataset = xa.open_dataset("test_1.nc")
    print(sum(v.size * v.dtype.itemsize for v in dataset.variables.values()))

if __name__ == "__main__":
    get_dataset_size()
```

```
8582842640
Filename: demo.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     4     97.8 MiB     97.8 MiB           1   @profile
     5                                         def get_dataset_size():
     6    160.5 MiB     62.7 MiB           1       dataset = xa.open_dataset("test_1.nc")
     7    160.5 MiB      0.0 MiB          17       print(sum(v.size * v.dtype.itemsize for v in dataset.variables.values()))
```

While exploring the code, I found that at [line 396](https://github.com/pydata/xarray/blob/9ff932a564ec1e19918120bab0ec78f1f87df07b/xarray/core/variable.py#L396) it uses `self._data.nbytes`, which is what causes the memory issue. But in the next lines, it uses the method stated above.
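For reference, the property in question looks roughly like this at the linked commit (a paraphrase, not a verbatim copy):

```python
@property
def nbytes(self) -> int:
    if hasattr(self._data, "nbytes"):
        return self._data.nbytes  # line 396: can load the backing data into memory
    else:
        return self.size * self.dtype.itemsize  # metadata only, no data load
```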

So why have that if block at line 396?

