id,node_id,number,title,user,state,locked,assignee,milestone,comments,created_at,updated_at,closed_at,author_association,active_lock_reason,draft,pull_request,body,reactions,performed_via_github_app,state_reason,repo,type
1368696980,I_kwDOAMm_X85RlKiU,7018,Writing netcdf after running xarray.dataset.reindex to fill gaps in a time series fails due to memory allocation error,64621312,open,0,,,3,2022-09-10T18:21:48Z,2022-09-15T19:59:39Z,,NONE,,,,"# Problem Summary

I am attempting to convert a .grib2 file, representing a single day's worth of gridded radar rainfall data spanning the continental US, into a netcdf. When a .grib2 is missing timesteps, I fill them in with NA values using `xarray.Dataset.reindex` before running `xarray.Dataset.to_netcdf`. However, after I've reindexed the dataset, the script fails with a memory allocation error. It succeeds if I don't reindex.

One clue may be that the dataset chunks are set to `(70, 3500, 7000)`, but when `ds.to_netcdf` is called, the script fails because it attempts to load a chunk with dimensions `(210, 3500, 7000)`.

# Accessing Full Reproducible Example

The code and data to reproduce my results can be downloaded from [this Dropbox link](https://www.dropbox.com/sh/w31kpx2u13ymg3j/AAB6Gzf6fqetgk1FViRbKm2Ba?dl=0). The code is also shown below, followed by the outputs. Potentially relevant OS and environment information is shown below as well.
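As a rough sanity check on the chunk shapes quoted in the problem summary, the sketch below computes the in-memory size of a dense float32 array for each shape. The helper name `nbytes_gib` and the labels are made up for illustration; they are not part of xarray or dask, and the calculation assumes uncompressed float32 data with no extra overhead.

```python
import math

FLOAT32_BYTES = 4  # size of one float32 element in bytes


def nbytes_gib(shape, itemsize=FLOAT32_BYTES):
    '''Return the size in GiB of a dense array with the given shape.'''
    return math.prod(shape) * itemsize / 2**30


# Shapes taken from the problem summary; the labels are illustrative only.
for label, shape in [
    ('chunk as configured', (70, 3500, 7000)),
    ('chunk requested at write time', (210, 3500, 7000)),
    ('full reindexed variable', (288, 3500, 7000)),
]:
    print(f'{label}: {shape} -> {nbytes_gib(shape):.1f} GiB')
```

The `(210, 3500, 7000)` case comes out at about 19.2 GiB, which matches the size reported in the `MemoryError`, and it is exactly three times the size of a single `(70, 3500, 7000)` chunk.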
# Code

```python
#%% Import libraries
import time
start_time = time.time()
import xarray as xr
import cfgrib
from glob import glob
import pandas as pd
import dask
dask.config.set(**{'array.slicing.split_large_chunks': False}) # to silence warnings of loading large slice into memory
dask.config.set(scheduler='synchronous') # this forces single threaded computations (netcdfs can only be written serially)

#%% parameters
chnk_sz = ""7000MB""

fl_out_nc = ""out_netcdfs/20010101.nc""
fldr_in_grib = ""in_gribs/20010101.grib2""

#%% loading and exporting dataset
ds = xr.open_dataset(fldr_in_grib, engine=""cfgrib"", chunks={""time"":chnk_sz},
                     backend_kwargs={'indexpath': ''})

# reindex
start_date = pd.to_datetime('2001-01-01')
tstep = pd.Timedelta('0 days 00:05:00')
new_index = pd.date_range(start=start_date, end=start_date + pd.Timedelta(1, ""day""),\
                          freq=tstep, inclusive='left')

ds = ds.reindex(indexers={""time"":new_index})
ds = ds.unify_chunks()
ds = ds.chunk(chunks={'time':chnk_sz})

print(""######## INSPECTING DATASET PRIOR TO WRITING TO NETCDF ########"")
print(ds)
print(' ')

print(""######## ERROR MESSAGE ########"")
ds.to_netcdf(fl_out_nc, encoding= {""unknown"":{""zlib"":True}})
```

# Outputs

```
######## INSPECTING DATASET PRIOR TO WRITING TO NETCDF ########
Dimensions:     (time: 288, latitude: 3500, longitude: 7000)
Coordinates:
  * time        (time) datetime64[ns] 2001-01-01 ... 2001-01-01T23:55:00
  * latitude    (latitude) float64 54.99 54.98 54.98 54.97 ... 20.03 20.02 20.01
  * longitude   (longitude) float64 230.0 230.0 230.0 ... 300.0 300.0 300.0
    step        timedelta64[ns] ...
    surface     float64 ...
    valid_time  (time) datetime64[ns] dask.array
Data variables:
    unknown     (time, latitude, longitude) float32 dask.array
Attributes:
    GRIB_edition:            2
    GRIB_centre:             161
    GRIB_centreDescription:  161
    GRIB_subCentre:          0
    Conventions:             CF-1.7
    institution:             161
    history:                 2022-09-10T14:50 GRIB to CDM+CF via cfgrib-0.9.1...

######## ERROR MESSAGE ########
Output exceeds the size limit.
Open the full output data in a text editor
---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
d:\Dropbox\_Sharing\reprex\2022-9-9_writing_ncdf_fails\reprex\exporting_netcdfs_reduced.py in ()
    160 print(' ')
    161 print(""######## ERROR MESSAGE ########"")
---> 162 ds.to_netcdf(fl_out_nc, encoding= {""unknown"":{""zlib"":True}})

File c:\Users\Daniel\anaconda3\envs\weather_gen_3\lib\site-packages\xarray\core\dataset.py:1882, in Dataset.to_netcdf(self, path, mode, format, group, engine, encoding, unlimited_dims, compute, invalid_netcdf)
   1879     encoding = {}
   1880 from ..backends.api import to_netcdf
-> 1882 return to_netcdf(  # type: ignore  # mypy cannot resolve the overloads:(
   1883     self,
   1884     path,
   1885     mode=mode,
   1886     format=format,
   1887     group=group,
   1888     engine=engine,
   1889     encoding=encoding,
   1890     unlimited_dims=unlimited_dims,
   1891     compute=compute,
   1892     multifile=False,
   1893     invalid_netcdf=invalid_netcdf,
   1894 )

File c:\Users\xxxxx\anaconda3\envs\weather_gen_3\lib\site-packages\xarray\backends\api.py:1219, in to_netcdf(dataset, path_or_file, mode, format, group, engine, encoding, unlimited_dims, compute, multifile, invalid_netcdf)
...
    121     return arg

File <__array_function__ internals>:180, in where(*args, **kwargs)

MemoryError: Unable to allocate 19.2 GiB for an array with shape (210, 3500, 7000) and data type float32
```

# Environment

```python
windows 11 Home
xarray 2022.3.0
cfgrib 0.9.10.1
dask 2022.7.0
```","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/7018/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,reopened,13221727,issue