
Issue #7397: Memory issue merging NetCDF files using xarray.open_mfdataset and to_netcdf

  • id: 1506437087
  • node_id: I_kwDOAMm_X85Zymff
  • user: 720460
  • state: open
  • locked: 0
  • comments: 9
  • created_at: 2022-12-21T15:00:05Z
  • updated_at: 2023-09-16T12:42:51Z
  • author_association: NONE
  • repo: 13221727
  • type: issue

What happened?

I have 5 NetCDF files (1 GiB each). They have 4 dimensions: time, depth, lat, lon. All the files share exactly the same depth, lat, and lon coordinates. The time axis has a uniform interval with no gaps in any of the 5 files, and the axis is continuous from one file to the next.

All I am doing is merging the files along the time axis and saving the result to a new NetCDF file.

To run the script, I allocated 185 GiB of memory (the maximum on my cluster).

The program runs until to_netcdf() is called; at that point I get an error stating there is not enough memory.

What did you expect to happen?

As the 5 files are 1 GiB each and I allocated 185 GiB (far more than the 5 × 1 GiB they occupy on disk), I expected the program to run without exceeding the allocated memory; after all, I gave it 37 times the combined size of the files.

Minimal Complete Verifiable Example

```Python
import xarray as xr

path = './data/data_*.nc'  # files are: data_1.nc data_2.nc data_3.nc data_4.nc data_5.nc
data = xr.open_mfdataset(path)

data = data.load()  # uses 5 GiB - tested with a memory profiler

data.to_netcdf('./output/combined.nc')
```
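
For contrast, here is a minimal sketch of a lazy, chunked variant of the same pipeline (untested; the chunks value is an arbitrary assumption). Keeping the arrays dask-backed instead of calling .load() should let to_netcdf() encode and write one chunk at a time, so peak memory tracks the chunk size rather than the full merged array:

```Python
import xarray as xr

path = './data/data_*.nc'
# chunks={'time': 10} is a placeholder value; any chunking keeps the arrays lazy
data = xr.open_mfdataset(path, chunks={'time': 10})

# No .load(): dask streams one chunk at a time through encoding and writing
data.to_netcdf('./output/combined.nc')
```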

MVCE confirmation

  • [X] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • [ ] Complete example — the example is self-contained, including all data and the text of any traceback.
  • [ ] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • [X] New issue — a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

```Python
Traceback (most recent call last):
  File "/users/me/code/par2.py", line 78, in <module>
    preprocess_data(year, month)
  File "/users/me/code/par2.py", line 69, in preprocess_data
    data.to_netcdf(path=outpath)
  File "/CSC_CONTAINER/miniconda/envs/env1/lib/python3.10/site-packages/xarray/core/dataset.py", line 1882, in to_netcdf
    return to_netcdf(  # type: ignore  # mypy cannot resolve the overloads:(
  File "/CSC_CONTAINER/miniconda/envs/env1/lib/python3.10/site-packages/xarray/backends/api.py", line 1210, in to_netcdf
    dump_to_store(
  File "/CSC_CONTAINER/miniconda/envs/env1/lib/python3.10/site-packages/xarray/backends/api.py", line 1257, in dump_to_store
    store.store(variables, attrs, check_encoding, writer, unlimited_dims=unlimited_dims)
  File "/CSC_CONTAINER/miniconda/envs/env1/lib/python3.10/site-packages/xarray/backends/common.py", line 263, in store
    variables, attributes = self.encode(variables, attributes)
  File "/CSC_CONTAINER/miniconda/envs/env1/lib/python3.10/site-packages/xarray/backends/common.py", line 352, in encode
    variables, attributes = cf_encoder(variables, attributes)
  File "/CSC_CONTAINER/miniconda/envs/env1/lib/python3.10/site-packages/xarray/conventions.py", line 864, in cf_encoder
    new_vars = {k: encode_cf_variable(v, name=k) for k, v in variables.items()}
  File "/CSC_CONTAINER/miniconda/envs/env1/lib/python3.10/site-packages/xarray/conventions.py", line 864, in <dictcomp>
    new_vars = {k: encode_cf_variable(v, name=k) for k, v in variables.items()}
  File "/CSC_CONTAINER/miniconda/envs/env1/lib/python3.10/site-packages/xarray/conventions.py", line 273, in encode_cf_variable
    var = coder.encode(var, name=name)
  File "/CSC_CONTAINER/miniconda/envs/env1/lib/python3.10/site-packages/xarray/coding/variables.py", line 170, in encode
    data = duck_array_ops.fillna(data, fill_value)
  File "/CSC_CONTAINER/miniconda/envs/env1/lib/python3.10/site-packages/xarray/core/duck_array_ops.py", line 283, in fillna
    return where(notnull(data), data, other)
  File "/CSC_CONTAINER/miniconda/envs/env1/lib/python3.10/site-packages/xarray/core/duck_array_ops.py", line 270, in where
    return _where(condition, *as_shared_dtype([x, y]))
  File "<__array_function__ internals>", line 180, in where
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 43.6 GiB for an array with shape (280, 200, 277, 754) and data type float32
```
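
For scale: the 43.6 GiB in the last line matches one dense float32 array of the reported shape, so a single temporary produced by the fillna()/where() step of CF encoding is already far larger than the source files. A quick arithmetic check:

```Python
import numpy as np

# Size of one dense float32 array with the shape from the traceback
shape = (280, 200, 277, 754)
nbytes = np.prod(shape, dtype=np.int64) * np.dtype('float32').itemsize
print(f"{nbytes / 2**30:.1f} GiB")  # -> 43.6 GiB
```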

Anything else we need to know?

I allocated 185 GiB for this job; from my understanding, this means that merging 5 datasets of 1 GiB each requires more than 185 GiB of memory. It sounds like a memory leak to me.
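
One possible check (an assumption worth ruling out, not a claim about this data): NetCDF files are usually compressed on disk, so the decoded in-memory size can be far larger than the file sizes suggest. Dataset.nbytes reports the uncompressed size:

```Python
import xarray as xr

# Hypothetical diagnostic: compare on-disk size with decoded in-memory size
data = xr.open_mfdataset('./data/data_*.nc')
print(f"decoded in-memory size: {data.nbytes / 2**30:.1f} GiB")
```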

I am not the only one with this issue; cf. https://github.com/pydata/xarray/discussions/4890

Environment

```
/CSC_CONTAINER/miniconda/envs/env1/lib/python3.10/site-packages/_distutils_hack/__init__.py:33: UserWarning: Setuptools is replacing distutils.
  warnings.warn("Setuptools is replacing distutils.")

INSTALLED VERSIONS
------------------
commit: None
python: 3.10.6 | packaged by conda-forge | (main, Aug 22 2022, 20:35:26) [GCC 10.4.0]
python-bits: 64
OS: Linux
OS-release: 4.18.0-372.26.1.el8_6.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.2
libnetcdf: 4.8.1
xarray: 2022.6.0
pandas: 1.4.4
numpy: 1.23.2
scipy: 1.9.1
netCDF4: 1.6.0
pydap: None
h5netcdf: None
h5py: 3.7.0
Nio: None
zarr: None
cftime: 1.6.1
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2022.9.0
distributed: 2022.9.0
matplotlib: 3.5.3
cartopy: None
seaborn: 0.12.0
numbagg: None
fsspec: 2022.8.2
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 65.3.0
pip: 22.2.2
conda: None
pytest: 7.1.3
IPython: 7.33.0
sphinx: 5.1.1
```
