# Memory issue merging NetCDF files using xarray.open_mfdataset and to_netcdf (#7397)

### What happened?

I have 5 NetCDF files (1 GiB each). They have 4 dimensions: time, depth, lat, lon. All the files have exactly the same depth, lat, and lon coordinates. The time axis has the same interval in every file, there are no gaps within any file, and the axis is continuous from one file to the next. All I am doing is merging the files along the time axis and saving the result to a new NetCDF file.

To run the script I allocated 185 GiB of memory (the maximum on my cluster). The program runs until the to_netcdf() call, where it fails with an error stating there is not enough memory.

### What did you expect to happen?

As the 5 files are 1 GiB each and I allocated 185 GiB (far more than 5² GiB), I expected the program to run without needing more than the allocated memory (after all, I gave it 37 times the combined size of the files).

### Minimal Complete Verifiable Example

```Python
import xarray as xr

path = './data/data_*.nc'  # files are: data_1.nc data_2.nc data_3.nc data_4.nc data_5.nc
data = xr.open_mfdataset(path)
data = data.load()  # uses 5 GiB - tested with a memory profiler
data.to_netcdf('./output/combined.nc')
```

### MVCE confirmation

- [X] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
- [ ] Complete example — the example is self-contained, including all data and the text of any traceback.
- [ ] Verifiable example — the example copy & pastes into an IPython prompt or [Binder notebook](https://mybinder.org/v2/gh/pydata/xarray/main?urlpath=lab/tree/doc/examples/blank_template.ipynb), returning the result.
- [X] New issue — a search of GitHub Issues suggests this is not a duplicate.
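For comparison, here is a lazier variant of the same merge. This is only a sketch I have not benchmarked, assuming the dask-backed arrays behave as documented: open_mfdataset() already returns dask-backed variables, so skipping the eager load() should let to_netcdf() write chunk by chunk instead of materialising everything first. The output name combined_lazy.nc is just a placeholder.

```Python
import xarray as xr

# Sketch only (not benchmarked): keep the dask-backed arrays returned by
# open_mfdataset() and let to_netcdf() stream the write chunk by chunk.
path = './data/data_*.nc'
data = xr.open_mfdataset(path)               # lazy, dask-backed variables
data.to_netcdf('./output/combined_lazy.nc')  # placeholder output path
```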
### Relevant log output

```Python
Traceback (most recent call last):
  File "/users/me/code/par2.py", line 78, in <module>
    preprocess_data(year, month)
  File "/users/me/code/par2.py", line 69, in preprocess_data
    data.to_netcdf(path=outpath)
  File "/CSC_CONTAINER/miniconda/envs/env1/lib/python3.10/site-packages/xarray/core/dataset.py", line 1882, in to_netcdf
    return to_netcdf(  # type: ignore  # mypy cannot resolve the overloads:(
  File "/CSC_CONTAINER/miniconda/envs/env1/lib/python3.10/site-packages/xarray/backends/api.py", line 1210, in to_netcdf
    dump_to_store(
  File "/CSC_CONTAINER/miniconda/envs/env1/lib/python3.10/site-packages/xarray/backends/api.py", line 1257, in dump_to_store
    store.store(variables, attrs, check_encoding, writer, unlimited_dims=unlimited_dims)
  File "/CSC_CONTAINER/miniconda/envs/env1/lib/python3.10/site-packages/xarray/backends/common.py", line 263, in store
    variables, attributes = self.encode(variables, attributes)
  File "/CSC_CONTAINER/miniconda/envs/env1/lib/python3.10/site-packages/xarray/backends/common.py", line 352, in encode
    variables, attributes = cf_encoder(variables, attributes)
  File "/CSC_CONTAINER/miniconda/envs/env1/lib/python3.10/site-packages/xarray/conventions.py", line 864, in cf_encoder
    new_vars = {k: encode_cf_variable(v, name=k) for k, v in variables.items()}
  File "/CSC_CONTAINER/miniconda/envs/env1/lib/python3.10/site-packages/xarray/conventions.py", line 864, in <dictcomp>
    new_vars = {k: encode_cf_variable(v, name=k) for k, v in variables.items()}
  File "/CSC_CONTAINER/miniconda/envs/env1/lib/python3.10/site-packages/xarray/conventions.py", line 273, in encode_cf_variable
    var = coder.encode(var, name=name)
  File "/CSC_CONTAINER/miniconda/envs/env1/lib/python3.10/site-packages/xarray/coding/variables.py", line 170, in encode
    data = duck_array_ops.fillna(data, fill_value)
  File "/CSC_CONTAINER/miniconda/envs/env1/lib/python3.10/site-packages/xarray/core/duck_array_ops.py", line 283, in fillna
    return where(notnull(data), data, other)
  File "/CSC_CONTAINER/miniconda/envs/env1/lib/python3.10/site-packages/xarray/core/duck_array_ops.py", line 270, in where
    return _where(condition, *as_shared_dtype([x, y]))
  File "<__array_function__ internals>", line 180, in where
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 43.6 GiB for an array with shape (280, 200, 277, 754) and data type float32
```

### Anything else we need to know?

I allocated 185 GiB for this job; from my understanding, this means that merging 5 datasets of 1 GiB each requires more than 185 GiB of memory. It sounds like a memory leak to me. I am not the only one with this issue, cf. https://github.com/pydata/xarray/discussions/4890
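For context on the number in the traceback: the failing allocation corresponds to a single full-size float32 array of the reported shape. The CF-encoding step (duck_array_ops.fillna → where) operates on the whole variable at once, so on numpy-backed data it needs at least one temporary of that size, on top of the notnull mask and the output of where(). A quick check, using only the shape and dtype from the error message:

```Python
import numpy as np

# Shape and dtype copied from the MemoryError in the traceback above.
shape = (280, 200, 277, 754)
nbytes = np.prod(shape) * np.dtype('float32').itemsize
print(f'{nbytes / 2**30:.1f} GiB')  # -> 43.6 GiB for one full-size temporary
```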
### Environment

```
/CSC_CONTAINER/miniconda/envs/env1/lib/python3.10/site-packages/_distutils_hack/__init__.py:33: UserWarning: Setuptools is replacing distutils.
  warnings.warn("Setuptools is replacing distutils.")

INSTALLED VERSIONS
------------------
commit: None
python: 3.10.6 | packaged by conda-forge | (main, Aug 22 2022, 20:35:26) [GCC 10.4.0]
python-bits: 64
OS: Linux
OS-release: 4.18.0-372.26.1.el8_6.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.2
libnetcdf: 4.8.1

xarray: 2022.6.0
pandas: 1.4.4
numpy: 1.23.2
scipy: 1.9.1
netCDF4: 1.6.0
pydap: None
h5netcdf: None
h5py: 3.7.0
Nio: None
zarr: None
cftime: 1.6.1
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2022.9.0
distributed: 2022.9.0
matplotlib: 3.5.3
cartopy: None
seaborn: 0.12.0
numbagg: None
fsspec: 2022.8.2
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 65.3.0
pip: 22.2.2
conda: None
pytest: 7.1.3
IPython: 7.33.0
sphinx: 5.1.1
```
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/7397/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,issue