**user 720460 (NONE), 2022-12-23T14:15:25Z**
<https://github.com/pydata/xarray/issues/7397#issuecomment-1363988341>

Because I want worry-free holidays, I wrote a bit of code that creates a new NetCDF file from scratch: I load the data with xarray, convert the variables to NumPy arrays, and use the netCDF4 library to write the files (this does what I want). In the process I also slice the data and drop unwanted variables to keep just the bits I need (unlike in my original post). If I call `.load()` or `.compute()` on my xarray variable, memory usage explodes, even though I drop the unwanted variables first (which I would expect to release memory). The same happens for slicing followed by `.compute()`. Unfortunately, the MCVE will have to wait until I am back from my holidays. Happy holidays to all!

---

**user 720460 (NONE), 2022-12-22T09:04:17Z**
<https://github.com/pydata/xarray/issues/7397#issuecomment-1362583979>

By the way, prior to filing this ticket I also tried dropping the variables I do not care about, keeping only the dimensions plus `toce` and `soce`; I would expect to need less memory after that. It did not help.

---

**user 720460 (NONE), 2022-12-22T08:44:06Z**
<https://github.com/pydata/xarray/issues/7397#issuecomment-1362564754>

Answering the question "Did you do some processing with the data, changing attributes/encoding etc.?": no processing. I ask xarray to load the data (I also tried loading plus computing) and the final outcome is the same. I will now try to put together an MCVE with dummy data.
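A minimal sketch of the kind of dummy-data MCVE mentioned above. The dimension names follow the `ncdump` output quoted later in the thread, but the sizes here are scaled down so the files stay small, and the file names are illustrative assumptions:

```python
import numpy as np
import xarray as xr

# Scaled-down stand-ins for the real sizes (the real data is x=754,
# y=277, deptht=200, time_counter=28 per file).
nt, nz, ny, nx = 28, 20, 27, 75

# Write five small source files with random float32 data.
for i in range(5):
    ds = xr.Dataset(
        {
            "toce": (("time_counter", "deptht", "y", "x"),
                     np.random.rand(nt, nz, ny, nx).astype("float32")),
            "soce": (("time_counter", "deptht", "y", "x"),
                     np.random.rand(nt, nz, ny, nx).astype("float32")),
        },
        coords={"time_counter": np.arange(i * nt, (i + 1) * nt)},
    )
    ds.to_netcdf(f"data_{i + 1}.nc")

# The workflow from the original report: merge, load, write back out.
data = xr.open_mfdataset("data_*.nc")
data.load()
data.to_netcdf("merged.nc")
```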
---

**user 720460 (NONE), 2022-12-22T08:41:21Z**
<https://github.com/pydata/xarray/issues/7397#issuecomment-1362562275>

Just tested with `to_zarr` and it goes through:

```
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 2
CPU Utilized: 00:07:55
CPU Efficiency: 63.00% of 00:12:34 core-walltime
Job Wall-clock time: 00:06:17
Memory Utilized: 164.89 GB
Memory Efficiency: 44.56% of 370.00 GB
```

I did an extra run using a memory profiler, as follows:

```python
import xarray as xr
import zarr
from memory_profiler import profile

@profile
def main():
    path = './data/data_*.nc'  # files are: data_1.nc data_2.nc data_3.nc data_4.nc data_5.nc
    data = xr.open_mfdataset(path)

    data = data.load()
    data = data.compute()

    data.to_zarr()

if __name__ == '__main__':
    main()
```

The profiled code was also completed with great success:

```
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 2
CPU Utilized: 00:07:52
CPU Efficiency: 63.61% of 00:12:22 core-walltime
Job Wall-clock time: 00:06:11
Memory Utilized: 165.53 GB
Memory Efficiency: 44.74% of 370.00 GB
```

Here is the outcome of the memory profiling:

```
Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     5    156.9 MiB    156.9 MiB           1   @profile
     6                                         def main():
     7    156.9 MiB      0.0 MiB           1       path = './data/data_*.nc'  # files are: data_1.nc data_2.nc data_3.nc data_4.nc data_5.nc
     8    209.3 MiB     52.4 MiB           1       data = xr.open_mfdataset(path)
     9
    10  82150.1 MiB  81940.8 MiB           1       data = data.load()
    11  82101.2 MiB    -49.0 MiB           1       data = data.compute()
    12
    13  90091.2 MiB   7990.0 MiB           1       data.to_zarr()
```

PS: in this test I just realized I loaded 8 files instead of 5.
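As a cross-check of the profiler numbers above, `Dataset.nbytes` reports the expected in-memory footprint without loading anything. A sketch, with the glob pattern taken from the script above:

```python
import xarray as xr

data = xr.open_mfdataset("./data/data_*.nc")
# Expected footprint once everything is in memory. With two float32
# variables of roughly (time: 280, deptht: 200, y: 277, x: 754), this is
# on the order of 2 x 43.6 GiB, the same order as the ~80 GiB increment
# that memory_profiler reports at the .load() step.
print(f"{data.nbytes / 1024**3:.1f} GiB")
```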
---

**user 720460 (NONE), 2022-12-22T08:21:31Z**
<https://github.com/pydata/xarray/issues/7397#issuecomment-1362544813>

A single file (from `ncdump -h`):

```
dimensions:
        axis_nbounds = 2 ;
        x = 754 ;
        y = 277 ;
        deptht = 200 ;
        time_counter = UNLIMITED ; // (28 currently)
variables:
        float nav_lat(y, x) ;
                nav_lat:standard_name = "latitude" ;
                nav_lat:long_name = "Latitude" ;
                nav_lat:units = "degrees_north" ;
        float nav_lon(y, x) ;
                nav_lon:standard_name = "longitude" ;
                nav_lon:long_name = "Longitude" ;
                nav_lon:units = "degrees_east" ;
        float deptht(deptht) ;
                deptht:name = "deptht" ;
                deptht:long_name = "Vertical T levels" ;
                deptht:units = "m" ;
                deptht:positive = "down" ;
                deptht:bounds = "deptht_bounds" ;
        float deptht_bounds(deptht, axis_nbounds) ;
                deptht_bounds:units = "m" ;
        double time_centered(time_counter) ;
                time_centered:standard_name = "time" ;
                time_centered:long_name = "Time axis" ;
                time_centered:calendar = "gregorian" ;
                time_centered:units = "seconds since 1900-01-01 00:00:00" ;
                time_centered:time_origin = "1900-01-01 00:00:00" ;
                time_centered:bounds = "time_centered_bounds" ;
        double time_centered_bounds(time_counter, axis_nbounds) ;
        double time_counter(time_counter) ;
                time_counter:axis = "T" ;
                time_counter:standard_name = "time" ;
                time_counter:long_name = "Time axis" ;
                time_counter:calendar = "gregorian" ;
                time_counter:units = "seconds since 1900-01-01 00:00:00" ;
                time_counter:time_origin = "1900-01-01 00:00:00" ;
                time_counter:bounds = "time_counter_bounds" ;
        double time_counter_bounds(time_counter, axis_nbounds) ;
        float toce(time_counter, deptht, y, x) ;
                toce:standard_name = "sea_water_potential_temperature" ;
                toce:long_name = "temperature" ;
                toce:units = "degC" ;
                toce:online_operation = "average" ;
                toce:interval_operation = "60 s" ;
                toce:interval_write = "6 h" ;
                toce:cell_methods = "time: mean (interval: 60 s)" ;
                toce:_FillValue = 1.e+20f ;
                toce:missing_value = 1.e+20f ;
                toce:coordinates = "time_centered nav_lat nav_lon" ;
        float soce(time_counter, deptht, y, x) ;
                soce:standard_name = "sea_water_practical_salinity" ;
                soce:long_name = "salinity" ;
                soce:units = "1e-3" ;
                soce:online_operation = "average" ;
                soce:interval_operation = "60 s" ;
                soce:interval_write = "6 h" ;
                soce:cell_methods = "time: mean (interval: 60 s)" ;
                soce:_FillValue = 1.e+20f ;
                soce:missing_value = 1.e+20f ;
                soce:coordinates = "time_centered nav_lat nav_lon" ;
        float taum(time_counter, y, x) ;
                taum:standard_name = "magnitude_of_surface_downward_stress" ;
                taum:long_name = "wind stress module" ;
                taum:units = "N/m2" ;
                taum:online_operation = "average" ;
                taum:interval_operation = "120 s" ;
                taum:interval_write = "6 h" ;
                taum:cell_methods = "time: mean (interval: 120 s)" ;
                taum:_FillValue = 1.e+20f ;
                taum:missing_value = 1.e+20f ;
                taum:coordinates = "time_centered nav_lat nav_lon" ;
        float wspd(time_counter, y, x) ;
                wspd:standard_name = "wind_speed" ;
                wspd:long_name = "wind speed module" ;
                wspd:units = "m/s" ;
                wspd:online_operation = "average" ;
                wspd:interval_operation = "120 s" ;
                wspd:interval_write = "6 h" ;
                wspd:cell_methods = "time: mean (interval: 120 s)" ;
                wspd:_FillValue = 1.e+20f ;
                wspd:missing_value = 1.e+20f ;
                wspd:coordinates = "time_centered nav_lat nav_lon" ;
```

After the merge, the only difference is the time dimension, which goes from 28 to 280 (or so).

---

**user 5821660 (MEMBER), 2022-12-22T07:33:39Z**
<https://github.com/pydata/xarray/issues/7397#issuecomment-1362507511>

IIUC the amount of memory is about what the dimensions suggest (assuming a 4-byte dtype): (280 * 200 * 277 * 754 * 4 bytes) / 1024³ = 43.57 GB.

I'm not that familiar with the data flow in `to_netcdf`, but it's clear that the whole data is read into memory for some reason. The error happens at backend level, so I'm assuming `engine="netcdf4"`. You might try `engine="h5netcdf"`, or consider @TomNicholas's suggestion of using `to_zarr`, to take the backends out of the equation.

Some questions, @benoitespinola:

- Can you show the reprs of the single-file Datasets and the repr of the combined one?
- Are your final data variables of that size (time: 280, depth: 200, lat: 277, lon: 754)?
- Did you do some processing with the data, changing attributes/encoding etc.?
- Is it possible to create your source data files from scratch with random data? An MCVE showing that would help.

Further suggestions:

- If you have multiple data variables, drop all but one prior to saving. Is the behaviour consistent for each of your variables?
- Try to be explicit in the call to `open_mfdataset` (e.g. adding the keyword `chunks` etc.).
- Try to open individual files and use `xr.merge`/`xr.concat` (sketched below).
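A sketch of those last suggestions, assuming the file layout from the comments above (chunk sizes and output paths are illustrative):

```python
import glob
import xarray as xr

# Be explicit about chunking when opening the multi-file dataset.
data = xr.open_mfdataset("./data/data_*.nc", chunks={"time_counter": 28})

# Drop all but one data variable before saving, to test whether the
# behaviour depends on the variable.
data[["toce"]].to_netcdf("toce_only.nc")

# Alternatively, bypass open_mfdataset: open the files individually and
# combine them along the record dimension with xr.concat.
files = sorted(glob.glob("./data/data_*.nc"))
parts = [xr.open_dataset(f, chunks={"time_counter": 28}) for f in files]
combined = xr.concat(parts, dim="time_counter")
combined.to_netcdf("merged.nc")
```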
---

**user 35968931 (MEMBER), 2022-12-22T01:04:39Z**
<https://github.com/pydata/xarray/issues/7397#issuecomment-1362271360>

Thanks for this bug report. FWIW I have also seen this bug recently when helping out a student. The question here is whether this is an xarray, numpy, or netCDF bug (or some combination). Can you reproduce the problem using `to_zarr()`? If so, that would rule out netCDF as the culprit.

---

**user 720460 (NONE), 2022-12-21T16:28:15Z**
<https://github.com/pydata/xarray/issues/7397#issuecomment-1361621826>

By the way, inspecting `.encoding` on my data shows `'complevel': 1`.
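For reference, a sketch of how to inspect the encoding carried over from the source files and, as an experiment, drop the compression settings before writing. Whether the inherited `complevel` is connected to the memory blow-up is an open question in this thread, so this is an assumption to test rather than an established fix:

```python
import xarray as xr

data = xr.open_mfdataset("./data/data_*.nc")

# Encoding is inherited from the source files and re-applied on write.
print(data["toce"].encoding)  # e.g. {'zlib': True, 'complevel': 1, ...}

# Experiment: clear compression-related settings, then write again.
for var in data.variables.values():
    for key in ("zlib", "complevel", "shuffle", "chunksizes", "contiguous"):
        var.encoding.pop(key, None)

data.to_netcdf("merged_nocompress.nc")
```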