html_url,issue_url,id,node_id,user,created_at,updated_at,author_association,body,reactions,performed_via_github_app,issue
https://github.com/pydata/xarray/issues/7397#issuecomment-1363988341,https://api.github.com/repos/pydata/xarray/issues/7397,1363988341,IC_kwDOAMm_X85RTM91,720460,2022-12-23T14:15:25Z,2022-12-23T14:15:53Z,NONE,"Because I want to have worry-free holidays, I wrote a bit of code that essentially creates a new NetCDF file from scratch: I load the data with xarray, convert it to NumPy arrays, and write the files with the netCDF4 library (this does what I want).
In the process, I also slice the data and drop unwanted variables to keep just the bits I want (unlike in my original post).
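Roughly, the workaround looks like this (a minimal sketch; the variable selection, slice, and paths are illustrative rather than my exact script):
```
import numpy as np
import xarray as xr
from netCDF4 import Dataset

ds = xr.open_mfdataset('./data/data_*.nc')
ds = ds[['toce', 'soce']]           # keep only the variables I want
ds = ds.isel(deptht=slice(0, 50))   # illustrative slice

with Dataset('out.nc', 'w') as nc:
    for dim, size in ds.sizes.items():
        nc.createDimension(dim, size)
    for name, var in ds.items():
        out = nc.createVariable(name, 'f4', var.dims)
        out[:] = np.asarray(var.values)  # pull one variable at a time into NumPy
```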
If I call .load() or .compute() on my xarray variable, the memory usage goes crazy (even when I am dropping unwanted variables, which I would expect to release memory). The same happens for slicing followed by .compute().
Unfortunately, the MCVE will have to wait until I am back from my holidays.
Happy holidays to all!","{""total_count"": 1, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 1, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1506437087
https://github.com/pydata/xarray/issues/7397#issuecomment-1362583979,https://api.github.com/repos/pydata/xarray/issues/7397,1362583979,IC_kwDOAMm_X85RN2Gr,720460,2022-12-22T09:04:17Z,2022-12-22T09:04:17Z,NONE,"By the way, prior to filing this ticket, I also tried the following (which did not help):
Dropping all the variables I do not care about, keeping only the dimensions plus toce and soce; I would expect to need less memory after that.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1506437087
https://github.com/pydata/xarray/issues/7397#issuecomment-1362564754,https://api.github.com/repos/pydata/xarray/issues/7397,1362564754,IC_kwDOAMm_X85RNxaS,720460,2022-12-22T08:44:06Z,2022-12-22T08:44:06Z,NONE,"Answering the question 'Did you do some processing with the data, changing attributes/encoding etc?':
No processing. I just ask xarray to load the data (I also tried loading + computing) and the final outcome is the same.
I will now try to put together an MCVE with dummy data.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1506437087
https://github.com/pydata/xarray/issues/7397#issuecomment-1362562275,https://api.github.com/repos/pydata/xarray/issues/7397,1362562275,IC_kwDOAMm_X85RNwzj,720460,2022-12-22T08:41:21Z,2022-12-22T08:41:21Z,NONE,"Just tested with `to_zarr` and it goes through:
```
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 2
CPU Utilized: 00:07:55
CPU Efficiency: 63.00% of 00:12:34 core-walltime
Job Wall-clock time: 00:06:17
Memory Utilized: 164.89 GB
Memory Efficiency: 44.56% of 370.00 GB
```
I did an extra run using a memory profiler, as follows:
```
import xarray as xr
import zarr
from memory_profiler import profile

@profile
def main():
    path = './data/data_*.nc'  # files are: data_1.nc data_2.nc data_3.nc data_4.nc data_5.nc
    data = xr.open_mfdataset(path)
    data = data.load()
    data = data.compute()
    data.to_zarr()

if __name__ == '__main__':
    main()
```
The profiled code also completed successfully:
```
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 2
CPU Utilized: 00:07:52
CPU Efficiency: 63.61% of 00:12:22 core-walltime
Job Wall-clock time: 00:06:11
Memory Utilized: 165.53 GB
Memory Efficiency: 44.74% of 370.00 GB
```
Here is the memory profiling output:
```
Line # Mem usage Increment Occurrences Line Contents
=============================================================
5 156.9 MiB 156.9 MiB 1 @profile
6 def main():
7 156.9 MiB 0.0 MiB 1 path = './data/data_*.nc' # files are: data_1.nc data_2.nc data_3.nc data_4.nc data_5.nc
8 209.3 MiB 52.4 MiB 1 data = xr.open_mfdataset(path)
9
10 82150.1 MiB 81940.8 MiB 1 data = data.load()
11 82101.2 MiB -49.0 MiB 1 data = data.compute()
12
13 90091.2 MiB 7990.0 MiB 1 data.to_zarr()
```
PS: I just realized that in this test I loaded 8 files instead of 5.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1506437087
https://github.com/pydata/xarray/issues/7397#issuecomment-1362544813,https://api.github.com/repos/pydata/xarray/issues/7397,1362544813,IC_kwDOAMm_X85RNsit,720460,2022-12-22T08:21:31Z,2022-12-22T08:21:31Z,NONE,"A single file (output of `ncdump -h`):
```
dimensions:
    axis_nbounds = 2 ;
    x = 754 ;
    y = 277 ;
    deptht = 200 ;
    time_counter = UNLIMITED ; // (28 currently)
variables:
    float nav_lat(y, x) ;
        nav_lat:standard_name = ""latitude"" ;
        nav_lat:long_name = ""Latitude"" ;
        nav_lat:units = ""degrees_north"" ;
    float nav_lon(y, x) ;
        nav_lon:standard_name = ""longitude"" ;
        nav_lon:long_name = ""Longitude"" ;
        nav_lon:units = ""degrees_east"" ;
    float deptht(deptht) ;
        deptht:name = ""deptht"" ;
        deptht:long_name = ""Vertical T levels"" ;
        deptht:units = ""m"" ;
        deptht:positive = ""down"" ;
        deptht:bounds = ""deptht_bounds"" ;
    float deptht_bounds(deptht, axis_nbounds) ;
        deptht_bounds:units = ""m"" ;
    double time_centered(time_counter) ;
        time_centered:standard_name = ""time"" ;
        time_centered:long_name = ""Time axis"" ;
        time_centered:calendar = ""gregorian"" ;
        time_centered:units = ""seconds since 1900-01-01 00:00:00"" ;
        time_centered:time_origin = ""1900-01-01 00:00:00"" ;
        time_centered:bounds = ""time_centered_bounds"" ;
    double time_centered_bounds(time_counter, axis_nbounds) ;
    double time_counter(time_counter) ;
        time_counter:axis = ""T"" ;
        time_counter:standard_name = ""time"" ;
        time_counter:long_name = ""Time axis"" ;
        time_counter:calendar = ""gregorian"" ;
        time_counter:units = ""seconds since 1900-01-01 00:00:00"" ;
        time_counter:time_origin = ""1900-01-01 00:00:00"" ;
        time_counter:bounds = ""time_counter_bounds"" ;
    double time_counter_bounds(time_counter, axis_nbounds) ;
    float toce(time_counter, deptht, y, x) ;
        toce:standard_name = ""sea_water_potential_temperature"" ;
        toce:long_name = ""temperature"" ;
        toce:units = ""degC"" ;
        toce:online_operation = ""average"" ;
        toce:interval_operation = ""60 s"" ;
        toce:interval_write = ""6 h"" ;
        toce:cell_methods = ""time: mean (interval: 60 s)"" ;
        toce:_FillValue = 1.e+20f ;
        toce:missing_value = 1.e+20f ;
        toce:coordinates = ""time_centered nav_lat nav_lon"" ;
    float soce(time_counter, deptht, y, x) ;
        soce:standard_name = ""sea_water_practical_salinity"" ;
        soce:long_name = ""salinity"" ;
        soce:units = ""1e-3"" ;
        soce:online_operation = ""average"" ;
        soce:interval_operation = ""60 s"" ;
        soce:interval_write = ""6 h"" ;
        soce:cell_methods = ""time: mean (interval: 60 s)"" ;
        soce:_FillValue = 1.e+20f ;
        soce:missing_value = 1.e+20f ;
        soce:coordinates = ""time_centered nav_lat nav_lon"" ;
    float taum(time_counter, y, x) ;
        taum:standard_name = ""magnitude_of_surface_downward_stress"" ;
        taum:long_name = ""wind stress module"" ;
        taum:units = ""N/m2"" ;
        taum:online_operation = ""average"" ;
        taum:interval_operation = ""120 s"" ;
        taum:interval_write = ""6 h"" ;
        taum:cell_methods = ""time: mean (interval: 120 s)"" ;
        taum:_FillValue = 1.e+20f ;
        taum:missing_value = 1.e+20f ;
        taum:coordinates = ""time_centered nav_lat nav_lon"" ;
    float wspd(time_counter, y, x) ;
        wspd:standard_name = ""wind_speed"" ;
        wspd:long_name = ""wind speed module"" ;
        wspd:units = ""m/s"" ;
        wspd:online_operation = ""average"" ;
        wspd:interval_operation = ""120 s"" ;
        wspd:interval_write = ""6 h"" ;
        wspd:cell_methods = ""time: mean (interval: 120 s)"" ;
        wspd:_FillValue = 1.e+20f ;
        wspd:missing_value = 1.e+20f ;
        wspd:coordinates = ""time_centered nav_lat nav_lon"" ;
```
After the merge, the only difference is the time_counter dimension, which goes from 28 to roughly 280.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1506437087
https://github.com/pydata/xarray/issues/7397#issuecomment-1362507511,https://api.github.com/repos/pydata/xarray/issues/7397,1362507511,IC_kwDOAMm_X85RNjb3,5821660,2022-12-22T07:33:39Z,2022-12-22T07:33:39Z,MEMBER,"IIUC the amount of memory is about what the dimensions suggest (assuming a 4-byte dtype):
(280 * 200 * 277 * 754 * 4 bytes) / 1024³ ≈ 43.57 GiB
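The same arithmetic as a quick Python check (illustrative, not from the issue):
```
# one float32 variable of shape (time: 280, depth: 200, y: 277, x: 754)
print(280 * 200 * 277 * 754 * 4 / 1024**3)  # -> ~43.57 (GiB)
```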
I'm not that familiar with the data flow in `to_netcdf`, but it's clear that the whole dataset is read into memory for some reason. The error happens at the backend level, so I'm assuming `engine=""netcdf4""`. You might try `engine=""h5netcdf""`, or consider @TomNicholas's suggestion of using `to_zarr`, to take the backends out of the equation.
Some questions, @benoitespinola:
- Can you show the reprs of the single-file Datasets and the repr of the combined one?
- Are your final data variables really of that size (time: 280, depth: 200, lat: 277, lon: 754)?
- Did you do some processing of the data, changing attributes/encoding etc.?
- Is it possible to create your source data files from scratch with random data? An MCVE showing that would help.
Further suggestions:
- If you have multiple data variables, drop all but one prior to saving. Is the behaviour consistent for each of your variables?
- Be explicit in the call to `open_mfdataset` (e.g. add the `chunks` keyword).
- Try opening the individual files and combining them with `xr.merge`/`xr.concat`.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1506437087
https://github.com/pydata/xarray/issues/7397#issuecomment-1362271360,https://api.github.com/repos/pydata/xarray/issues/7397,1362271360,IC_kwDOAMm_X85RMpyA,35968931,2022-12-22T01:04:39Z,2022-12-22T01:04:39Z,MEMBER,"Thanks for this bug report. FWIW I have also seen this bug recently when helping out a student.
The question here is whether this is an xarray, numpy, or netcdf bug (or some combination). Can you reproduce the problem using `to_zarr()`? If so, that would rule out netcdf as the culprit.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1506437087
https://github.com/pydata/xarray/issues/7397#issuecomment-1361621826,https://api.github.com/repos/pydata/xarray/issues/7397,1361621826,IC_kwDOAMm_X85RKLNC,720460,2022-12-21T16:28:15Z,2022-12-21T16:28:15Z,NONE,"By the way, inspecting `.encoding` on my data yields `'complevel': 1`.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1506437087