html_url,issue_url,id,node_id,user,created_at,updated_at,author_association,body,reactions,performed_via_github_app,issue
https://github.com/pydata/xarray/issues/7397#issuecomment-1363988341,https://api.github.com/repos/pydata/xarray/issues/7397,1363988341,IC_kwDOAMm_X85RTM91,720460,2022-12-23T14:15:25Z,2022-12-23T14:15:53Z,NONE,"Because I want to have worry-free holidays, I wrote a bit of code that essentially creates a new NetCDF file from scratch: I load the data with xarray, convert it to NumPy arrays, and write the files with the netCDF4 library (this does what I want).
In the process, I also slice the data and drop unwanted variables to keep just the bits I want (unlike in my original post).
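Roughly, the workaround looks like this (a minimal sketch; the variable selection, slice, and paths are illustrative rather than my exact script):
```
import numpy as np
import xarray as xr
from netCDF4 import Dataset

ds = xr.open_mfdataset('./data/data_*.nc')
ds = ds[['toce', 'soce']]           # keep only the variables I want
ds = ds.isel(deptht=slice(0, 50))   # illustrative slice

with Dataset('out.nc', 'w') as nc:
    for dim, size in ds.sizes.items():
        nc.createDimension(dim, size)
    for name, var in ds.items():
        out = nc.createVariable(name, 'f4', var.dims)
        out[:] = np.asarray(var.values)  # pull one variable at a time into NumPy
```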
If I call .load() or .compute() on my xarray variable, the memory usage goes crazy (even when I am dropping unwanted variables, which I would expect to release memory). The same happens for slicing followed by .compute().
Unfortunately, the MCVE will have to wait until I am back from my holidays.
Happy holidays to all!","{""total_count"": 1, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 1, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1506437087
https://github.com/pydata/xarray/issues/7397#issuecomment-1362583979,https://api.github.com/repos/pydata/xarray/issues/7397,1362583979,IC_kwDOAMm_X85RN2Gr,720460,2022-12-22T09:04:17Z,2022-12-22T09:04:17Z,NONE,"By the way, prior to filing this ticket, I also tried the following (which did not help):
Dropping all the variables I do not care about, keeping only the dimensions plus toce and soce; I would expect to need less memory after that.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1506437087
https://github.com/pydata/xarray/issues/7397#issuecomment-1362564754,https://api.github.com/repos/pydata/xarray/issues/7397,1362564754,IC_kwDOAMm_X85RNxaS,720460,2022-12-22T08:44:06Z,2022-12-22T08:44:06Z,NONE,"Answering the question 'Did you do some processing with the data, changing attributes/encoding etc?':
No processing. I just ask xarray to load the data (I also tried loading + computing) and the final outcome is the same.
I will now try to put together an MCVE with dummy data.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1506437087
https://github.com/pydata/xarray/issues/7397#issuecomment-1362562275,https://api.github.com/repos/pydata/xarray/issues/7397,1362562275,IC_kwDOAMm_X85RNwzj,720460,2022-12-22T08:41:21Z,2022-12-22T08:41:21Z,NONE,"Just tested with `to_zarr` and it goes through:
```
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 2
CPU Utilized: 00:07:55
CPU Efficiency: 63.00% of 00:12:34 core-walltime
Job Wall-clock time: 00:06:17
Memory Utilized: 164.89 GB
Memory Efficiency: 44.56% of 370.00 GB
```
I did an extra run using a memory profiler, as follows:
```
import xarray as xr
import zarr
from memory_profiler import profile

@profile
def main():
    path = './data/data_*.nc'  # files are: data_1.nc data_2.nc data_3.nc data_4.nc data_5.nc
    data = xr.open_mfdataset(path)
    data = data.load()
    data = data.compute()
    data.to_zarr()

if __name__ == '__main__':
    main()
```
The profiled code also completed successfully:
```
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 2
CPU Utilized: 00:07:52
CPU Efficiency: 63.61% of 00:12:22 core-walltime
Job Wall-clock time: 00:06:11
Memory Utilized: 165.53 GB
Memory Efficiency: 44.74% of 370.00 GB
```
Here is the memory profiling output:
```
Line # Mem usage Increment Occurrences Line Contents
=============================================================
5 156.9 MiB 156.9 MiB 1 @profile
6 def main():
7 156.9 MiB 0.0 MiB 1 path = './data/data_*.nc' # files are: data_1.nc data_2.nc data_3.nc data_4.nc data_5.nc
8 209.3 MiB 52.4 MiB 1 data = xr.open_mfdataset(path)
9
10 82150.1 MiB 81940.8 MiB 1 data = data.load()
11 82101.2 MiB -49.0 MiB 1 data = data.compute()
12
13 90091.2 MiB 7990.0 MiB 1 data.to_zarr()
```
PS: I just realized that in this test I loaded 8 files instead of 5.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1506437087
https://github.com/pydata/xarray/issues/7397#issuecomment-1362544813,https://api.github.com/repos/pydata/xarray/issues/7397,1362544813,IC_kwDOAMm_X85RNsit,720460,2022-12-22T08:21:31Z,2022-12-22T08:21:31Z,NONE,"A single file (output of `ncdump -h`):
```
dimensions:
    axis_nbounds = 2 ;
    x = 754 ;
    y = 277 ;
    deptht = 200 ;
    time_counter = UNLIMITED ; // (28 currently)
variables:
    float nav_lat(y, x) ;
        nav_lat:standard_name = ""latitude"" ;
        nav_lat:long_name = ""Latitude"" ;
        nav_lat:units = ""degrees_north"" ;
    float nav_lon(y, x) ;
        nav_lon:standard_name = ""longitude"" ;
        nav_lon:long_name = ""Longitude"" ;
        nav_lon:units = ""degrees_east"" ;
    float deptht(deptht) ;
        deptht:name = ""deptht"" ;
        deptht:long_name = ""Vertical T levels"" ;
        deptht:units = ""m"" ;
        deptht:positive = ""down"" ;
        deptht:bounds = ""deptht_bounds"" ;
    float deptht_bounds(deptht, axis_nbounds) ;
        deptht_bounds:units = ""m"" ;
    double time_centered(time_counter) ;
        time_centered:standard_name = ""time"" ;
        time_centered:long_name = ""Time axis"" ;
        time_centered:calendar = ""gregorian"" ;
        time_centered:units = ""seconds since 1900-01-01 00:00:00"" ;
        time_centered:time_origin = ""1900-01-01 00:00:00"" ;
        time_centered:bounds = ""time_centered_bounds"" ;
    double time_centered_bounds(time_counter, axis_nbounds) ;
    double time_counter(time_counter) ;
        time_counter:axis = ""T"" ;
        time_counter:standard_name = ""time"" ;
        time_counter:long_name = ""Time axis"" ;
        time_counter:calendar = ""gregorian"" ;
        time_counter:units = ""seconds since 1900-01-01 00:00:00"" ;
        time_counter:time_origin = ""1900-01-01 00:00:00"" ;
        time_counter:bounds = ""time_counter_bounds"" ;
    double time_counter_bounds(time_counter, axis_nbounds) ;
    float toce(time_counter, deptht, y, x) ;
        toce:standard_name = ""sea_water_potential_temperature"" ;
        toce:long_name = ""temperature"" ;
        toce:units = ""degC"" ;
        toce:online_operation = ""average"" ;
        toce:interval_operation = ""60 s"" ;
        toce:interval_write = ""6 h"" ;
        toce:cell_methods = ""time: mean (interval: 60 s)"" ;
        toce:_FillValue = 1.e+20f ;
        toce:missing_value = 1.e+20f ;
        toce:coordinates = ""time_centered nav_lat nav_lon"" ;
    float soce(time_counter, deptht, y, x) ;
        soce:standard_name = ""sea_water_practical_salinity"" ;
        soce:long_name = ""salinity"" ;
        soce:units = ""1e-3"" ;
        soce:online_operation = ""average"" ;
        soce:interval_operation = ""60 s"" ;
        soce:interval_write = ""6 h"" ;
        soce:cell_methods = ""time: mean (interval: 60 s)"" ;
        soce:_FillValue = 1.e+20f ;
        soce:missing_value = 1.e+20f ;
        soce:coordinates = ""time_centered nav_lat nav_lon"" ;
    float taum(time_counter, y, x) ;
        taum:standard_name = ""magnitude_of_surface_downward_stress"" ;
        taum:long_name = ""wind stress module"" ;
        taum:units = ""N/m2"" ;
        taum:online_operation = ""average"" ;
        taum:interval_operation = ""120 s"" ;
        taum:interval_write = ""6 h"" ;
        taum:cell_methods = ""time: mean (interval: 120 s)"" ;
        taum:_FillValue = 1.e+20f ;
        taum:missing_value = 1.e+20f ;
        taum:coordinates = ""time_centered nav_lat nav_lon"" ;
    float wspd(time_counter, y, x) ;
        wspd:standard_name = ""wind_speed"" ;
        wspd:long_name = ""wind speed module"" ;
        wspd:units = ""m/s"" ;
        wspd:online_operation = ""average"" ;
        wspd:interval_operation = ""120 s"" ;
        wspd:interval_write = ""6 h"" ;
        wspd:cell_methods = ""time: mean (interval: 120 s)"" ;
        wspd:_FillValue = 1.e+20f ;
        wspd:missing_value = 1.e+20f ;
        wspd:coordinates = ""time_centered nav_lat nav_lon"" ;
```
After the merge, the only difference is the time_counter dimension, which goes from 28 to roughly 280.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1506437087
https://github.com/pydata/xarray/issues/7397#issuecomment-1362507511,https://api.github.com/repos/pydata/xarray/issues/7397,1362507511,IC_kwDOAMm_X85RNjb3,5821660,2022-12-22T07:33:39Z,2022-12-22T07:33:39Z,MEMBER,"IIUC the amount of memory is about what the dimensions suggest (assuming a 4-byte dtype):
(280 * 200 * 277 * 754 * 4 bytes) / 1024³ ≈ 43.57 GiB
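The same arithmetic as a quick Python check (illustrative, not from the issue):
```
# one float32 variable of shape (time: 280, depth: 200, y: 277, x: 754)
print(280 * 200 * 277 * 754 * 4 / 1024**3)  # -> ~43.57 (GiB)
```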
I'm not that familiar with the data flow in `to_netcdf`, but it's clear that the whole dataset is read into memory for some reason. The error happens at the backend level, so I'm assuming `engine=""netcdf4""`. You might try `engine=""h5netcdf""`, or consider @TomNicholas's suggestion of using `to_zarr`, to take the backends out of the equation.
Some questions, @benoitespinola:
- Can you show the reprs of the single-file Datasets and the repr of the combined one?
- Are your final data variables really of that size (time: 280, depth: 200, lat: 277, lon: 754)?
- Did you do some processing of the data, changing attributes/encoding etc.?
- Is it possible to create your source data files from scratch with random data? An MCVE showing that would help.
Further suggestions:
- If you have multiple data variables, drop all but one prior to saving. Is the behaviour consistent for each of your variables?
- Be explicit in the call to `open_mfdataset` (e.g. add the `chunks` keyword).
- Try opening the individual files and combining them with `xr.merge`/`xr.concat`.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1506437087
https://github.com/pydata/xarray/issues/7397#issuecomment-1362271360,https://api.github.com/repos/pydata/xarray/issues/7397,1362271360,IC_kwDOAMm_X85RMpyA,35968931,2022-12-22T01:04:39Z,2022-12-22T01:04:39Z,MEMBER,"Thanks for this bug report. FWIW I have also seen this bug recently when helping out a student.
The question here is whether this is an xarray, numpy, or netcdf bug (or some combination). Can you reproduce the problem using `to_zarr()`? If so, that would rule out netcdf as the culprit.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1506437087
https://github.com/pydata/xarray/issues/7397#issuecomment-1361621826,https://api.github.com/repos/pydata/xarray/issues/7397,1361621826,IC_kwDOAMm_X85RKLNC,720460,2022-12-21T16:28:15Z,2022-12-21T16:28:15Z,NONE,"By the way, inspecting `.encoding` on my data yields `'complevel': 1`.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1506437087