id,node_id,number,title,user,state,locked,assignee,milestone,comments,created_at,updated_at,closed_at,author_association,active_lock_reason,draft,pull_request,body,reactions,performed_via_github_app,state_reason,repo,type
1340994913,I_kwDOAMm_X85P7fVh,6924,Memory Leakage Issue When Running to_netcdf,64621312,closed,0,,,2,2022-08-16T23:58:17Z,2023-01-17T18:38:40Z,2023-01-17T18:38:40Z,NONE,,,,"### What is your issue?

I have a zarr store that I'd like to convert to a netCDF file, but it is too large to fit in memory. My computer has 32GB of RAM, so writing ~5.5GB chunks shouldn't be a problem. However, within seconds of running the script below, memory usage climbs until it consumes the available ~20GB, and the script fails.

Data: [Dropbox link](https://www.dropbox.com/sh/xmcz93p53n1w3ft/AACjI9EskzwKsA8sp-WmM2BFa?dl=0) to a zarr store containing radar rainfall data for 6/28/2014 over the United States, around 1.8GB in total.

Code:
```python
import xarray as xr
import zarr

fpath_zarr = ""out_zarr_20140628.zarr""

ds_from_zarr = xr.open_zarr(store=fpath_zarr, chunks={'outlat':3500, 'outlon':7000, 'time':30})

ds_from_zarr.to_netcdf(""ds_zarr_to_nc.nc"", encoding={""rainrate"":{""zlib"":True}})
```

Output:
```python
MemoryError: Unable to allocate 5.48 GiB for an array with shape (30, 3500, 7000) and data type float64
```

Package versions:
```
dask    2022.7.0
xarray  2022.3.0
zarr    2.8.1
```

![memory_screenshot](https://user-images.githubusercontent.com/64621312/185004542-7c91bcbc-7e7b-4656-a306-732bc1d2e9c3.jpg)
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/6924/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue
1368696980,I_kwDOAMm_X85RlKiU,7018,Writing netcdf after running xarray.dataset.reindex to fill gaps in a time series fails due to memory allocation error,64621312,open,0,,,3,2022-09-10T18:21:48Z,2022-09-15T19:59:39Z,,NONE,,,,"# Problem Summary

I am attempting to convert a .grib2 file, representing a single day's worth of gridded radar rainfall data spanning the continental US, into a netCDF. When a .grib2 is missing timesteps, I fill them in with NA values using `xarray.Dataset.reindex` before running `xarray.Dataset.to_netcdf`. However, after I've reindexed the dataset, the script fails due to a memory allocation error; it succeeds if I don't reindex. One clue may be that the dataset chunks are set to `(70, 3500, 7000)`, but when `ds.to_netcdf` is called, the script fails while attempting to load a chunk with dimensions `(210, 3500, 7000)`, three times the expected chunk size along the time dimension.

# Accessing Full Reproducible Example

The code and data to reproduce my results can be downloaded from [this Dropbox link](https://www.dropbox.com/sh/w31kpx2u13ymg3j/AAB6Gzf6fqetgk1FViRbKm2Ba?dl=0). The code is also shown below, followed by the outputs. Potentially relevant OS and environment information is shown below as well.
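As a quick check before the full listing (a sketch, not part of the original script; it assumes the reindexed dataset `ds` and the data variable name `unknown` from the code and printout below), the dask chunk layout can be inspected just before the `to_netcdf` call, since an unexpectedly large chunk is exactly what the error message reports:

```python
import numpy as np

# to_netcdf with the synchronous scheduler materializes roughly one chunk
# at a time, so a single oversized chunk can exhaust memory; check the
# layout before writing.
da = ds['unknown'].data                  # underlying dask array
print(da.chunks)                         # per-dimension tuples of chunk sizes
print(da.chunksize)                      # shape of the largest chunk
print(np.prod(da.chunksize) * da.dtype.itemsize / 1e9, 'GB in largest chunk')
```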
# Code

```python
#%% Import libraries
import time
start_time = time.time()
import xarray as xr
import cfgrib
from glob import glob
import pandas as pd
import dask
dask.config.set(**{'array.slicing.split_large_chunks': False}) # to silence warnings of loading large slice into memory
dask.config.set(scheduler='synchronous') # this forces single threaded computations (netcdfs can only be written serially)

#%% parameters
chnk_sz = ""7000MB""
fl_out_nc = ""out_netcdfs/20010101.nc""
fldr_in_grib = ""in_gribs/20010101.grib2""

#%% loading and exporting dataset
ds = xr.open_dataset(fldr_in_grib, engine=""cfgrib"", chunks={""time"":chnk_sz}, backend_kwargs={'indexpath': ''})

# reindex
start_date = pd.to_datetime('2001-01-01')
tstep = pd.Timedelta('0 days 00:05:00')
new_index = pd.date_range(start=start_date, end=start_date + pd.Timedelta(1, ""day""),
                          freq=tstep, inclusive='left')
ds = ds.reindex(indexers={""time"":new_index})
ds = ds.unify_chunks()
ds = ds.chunk(chunks={'time':chnk_sz})

print(""######## INSPECTING DATASET PRIOR TO WRITING TO NETCDF ########"")
print(ds)
print(' ')
print(""######## ERROR MESSAGE ########"")
ds.to_netcdf(fl_out_nc, encoding={""unknown"":{""zlib"":True}})
```

# Outputs

```
######## INSPECTING DATASET PRIOR TO WRITING TO NETCDF ########
Dimensions:     (time: 288, latitude: 3500, longitude: 7000)
Coordinates:
  * time        (time) datetime64[ns] 2001-01-01 ... 2001-01-01T23:55:00
  * latitude    (latitude) float64 54.99 54.98 54.98 54.97 ... 20.03 20.02 20.01
  * longitude   (longitude) float64 230.0 230.0 230.0 ... 300.0 300.0 300.0
    step        timedelta64[ns] ...
    surface     float64 ...
    valid_time  (time) datetime64[ns] dask.array
Data variables:
    unknown     (time, latitude, longitude) float32 dask.array
Attributes:
    GRIB_edition:            2
    GRIB_centre:             161
    GRIB_centreDescription:  161
    GRIB_subCentre:          0
    Conventions:             CF-1.7
    institution:             161
    history:                 2022-09-10T14:50 GRIB to CDM+CF via cfgrib-0.9.1...

######## ERROR MESSAGE ########
Output exceeds the size limit. Open the full output data in a text editor
---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
d:\Dropbox\_Sharing\reprex\2022-9-9_writing_ncdf_fails\reprex\exporting_netcdfs_reduced.py in ()
    160 print(' ')
    161 print(""######## ERROR MESSAGE ########"")
--> 162 ds.to_netcdf(fl_out_nc, encoding={""unknown"":{""zlib"":True}})

File c:\Users\Daniel\anaconda3\envs\weather_gen_3\lib\site-packages\xarray\core\dataset.py:1882, in Dataset.to_netcdf(self, path, mode, format, group, engine, encoding, unlimited_dims, compute, invalid_netcdf)
   1879     encoding = {}
   1880 from ..backends.api import to_netcdf
-> 1882 return to_netcdf(  # type: ignore  # mypy cannot resolve the overloads:(
   1883     self,
   1884     path,
   1885     mode=mode,
   1886     format=format,
   1887     group=group,
   1888     engine=engine,
   1889     encoding=encoding,
   1890     unlimited_dims=unlimited_dims,
   1891     compute=compute,
   1892     multifile=False,
   1893     invalid_netcdf=invalid_netcdf,
   1894 )

File c:\Users\xxxxx\anaconda3\envs\weather_gen_3\lib\site-packages\xarray\backends\api.py:1219, in to_netcdf(dataset, path_or_file, mode, format, group, engine, encoding, unlimited_dims, compute, multifile, invalid_netcdf)
...
    121     return arg

File <__array_function__ internals>:180, in where(*args, **kwargs)

MemoryError: Unable to allocate 19.2 GiB for an array with shape (210, 3500, 7000) and data type float32
```

# Environment

```
windows 11 Home
xarray  2022.3.0
cfgrib  0.9.10.1
dask    2022.7.0
```
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/7018/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,reopened,13221727,issue
1340474484,I_kwDOAMm_X85P5gR0,6920,Writing a netCDF file is slow,64621312,closed,1,,,3,2022-08-16T14:48:37Z,2022-08-16T17:05:37Z,2022-08-16T17:05:37Z,NONE,,,,"### What is your issue?

This has been discussed in [another thread](https://github.com/pydata/xarray/issues/2912), but the proposed solution there (first `.load()` the dataset into memory before running `to_netcdf`) does not work for me, since my dataset is too large to fit into memory. The following code takes around 8 hours to run. You'll notice that I tried both `xr.open_mfdataset` and `xr.concat` in case it would make a difference, but it doesn't. I also tried profiling the code following [this example](https://docs.dask.org/en/latest/diagnostics-local.html#example). The results are in this [html](https://www.dropbox.com/sh/42gzmne9a06qo8m/AAB6qqiFFQOScg8Ou4hH5GoZa?dl=0) (Dropbox link), but I'm not really sure what I'm looking at.

Data: [Dropbox link](https://www.dropbox.com/sh/onr9l7g7n254848/AAD9vkvWFg1FbinZ-EHHC7L2a?dl=0) to 717 netCDF files containing radar rainfall data for 6/28/2014 over the United States, around 1GB in total.

Code:
```python
#%% Import libraries
import xarray as xr
from glob import glob
import pandas as pd
import time
import dask
dask.config.set(**{'array.slicing.split_large_chunks': False})

files = glob(""data/*.nc"")

#%% functions
def extract_file_timestep(fname):
    # parse the timestamp out of the filename (netCDF and grib2 use different formats)
    fname = fname.split('/')[-1]
    fname = fname.split(""."")
    ftype = fname.pop(-1)
    fname = ''.join(fname)
    str_tstep = fname.split(""_"")[-1]
    if ftype == ""nc"":
        date_format = '%Y%m%d%H%M'
    if ftype == ""grib2"":
        date_format = '%Y%m%d-%H%M%S'
    tstep = pd.to_datetime(str_tstep, format=date_format)
    return tstep

def ds_preprocessing(ds):
    # assign a time coordinate from the source filename, then normalize names and chunks
    tstamp = extract_file_timestep(ds.encoding['source'])
    ds.coords[""time""] = tstamp
    ds = ds.expand_dims({""time"":1})
    ds = ds.rename({""lon"":""longitude"", ""lat"":""latitude"", ""mrms_a2m"":""rainrate""})
    ds = ds.chunk(chunks={""latitude"":3500, ""longitude"":7000, ""time"":1})
    return ds

#%% Loading and formatting data
lst_ds = []
start_time = time.time()
for f in files:
    ds = xr.open_dataset(f, chunks={""latitude"":3500, ""longitude"":7000})
    ds = ds_preprocessing(ds)
    lst_ds.append(ds)
ds_comb_frm_lst = xr.concat(lst_ds, dim=""time"")
print(""Time to load dataset using concat on list of datasets: {}"".format(time.time() - start_time))

start_time = time.time()
ds_comb_frm_open_mfdataset = xr.open_mfdataset(files, chunks={""latitude"":3500, ""longitude"":7000},
                                               concat_dim=""time"", preprocess=ds_preprocessing,
                                               combine=""nested"")
print(""Time to load dataset using open_mfdataset: {}"".format(time.time() - start_time))

#%% exporting to netcdf
start_time = time.time()
ds_comb_frm_lst.to_netcdf(""ds_comb_frm_lst.nc"", encoding={""rainrate"":{""zlib"":True}})
print(""Time to export dataset created using concat on list of datasets: {}"".format(time.time() - start_time))

start_time = time.time()
ds_comb_frm_open_mfdataset.to_netcdf(""ds_comb_frm_open_mfdataset.nc"", encoding={""rainrate"":{""zlib"":True}})
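# Editorial aside, kept as comments since this sits inside the original code
# block: one way to see where the export time goes is to build the write
# lazily and attach dask's progress bar; to_netcdf(..., compute=False)
# returns a dask delayed object. The filename ""tmp.nc"" is a placeholder.
#   from dask.diagnostics import ProgressBar
#   delayed_write = ds_comb_frm_lst.to_netcdf(""tmp.nc"", compute=False)
#   with ProgressBar():
#       delayed_write.compute()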
print(""Time to export dataset created using open_mfdataset: {}"".format(time.time() - start_time)) ``` ","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/6920/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue 1332143835,I_kwDOAMm_X85PZubb,6892,2 Dimension Plot Producing Discontinuous Grid,64621312,closed,0,,,1,2022-08-08T16:59:14Z,2022-08-08T17:12:41Z,2022-08-08T17:11:44Z,NONE,,,,"### What is your issue? **Problem:** I'm expecting a plot that looks like the one [here](https://docs.xarray.dev/en/stable/user-guide/plotting.html#id2) (Plotting-->Two Dimensions-->Simple Example) with a continuous grid, but instead I'm getting the plot below which has a discontinuous grid. This could be due to different spacing in the x and y dimensions (0.005 spacing in the `outlat` dimension and 0.00328768 spacing in the `outlon` dimension), but I don't know what to do about it. ![image](https://user-images.githubusercontent.com/64621312/183471078-e2a76231-1f5e-4b13-8ca5-511af22bf792.png) **Data:** [Dropbox download link for 20 years of monthly rainfall totals covering Norfolk, VA in netcdf format (2.2MB)](https://www.dropbox.com/s/so61kkqosvru9q6/monthly_rainfall.nc?dl=0) **Reprex:** ```python import xarray as xr ds= xr.open_dataset(""monthly_rainfall.nc"") ds.rainrate.isel(time=100).plot() ```","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/6892/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue 1308176241,I_kwDOAMm_X85N-S9x,6805,PermissionError: [Errno 13] Permission denied,64621312,closed,0,,,5,2022-07-18T16:05:31Z,2022-07-18T17:58:38Z,2022-07-18T17:58:38Z,NONE,,,,"### What is your issue? This was raised about a year ago but still seems to be unresolved, so I'm hoping this will bring attention back to the issue. (https://github.com/pydata/xarray/issues/5488) **Data**: [dropbox sharing link](https://www.dropbox.com/sh/1jfwpzas0vfqd3o/AAAOaQsgjLBqYIc37ucshOMwa?dl=0) **Description**: This folder contains 2 files each containing 1 day's worth of 1kmx1km gridded precipitation rate data from the National Severe Storms Laboratory. Each is about a gig (sorry they're so big, but it's what I'm working with!) **Code**: ```python import xarray as xr f_in_ncs = ""data/"" f_in_nc = ""data/20190520.nc"" #%% works ds = xr.open_dataset(f_in_nc, chunks={'outlat':3500, 'outlon':7000, 'time':50}) #%% doesn't work mf_ds = xr.open_mfdataset(f_in_ncs, concat_dim = ""time"", chunks={'outlat':3500, 'outlon':7000, 'time':50}, combine = ""nested"", engine = 'netcdf4') ``` **Error**: ```Python Output exceeds the [size limit](command:workbench.action.openSettings?[). 
Open the full output data in a text editor
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File c:\Users\Daniel\anaconda3\envs\mrms\lib\site-packages\xarray\backends\file_manager.py:199, in CachingFileManager._acquire_with_cache_info(self, needs_lock)
    198 try:
--> 199     file = self._cache[self._key]
    200 except KeyError:

File c:\Users\Daniel\anaconda3\envs\mrms\lib\site-packages\xarray\backends\lru_cache.py:53, in LRUCache.__getitem__(self, key)
     52 with self._lock:
---> 53     value = self._cache[key]
     54 self._cache.move_to_end(key)

KeyError: [<class 'netCDF4._netCDF4.Dataset'>, ('d:\\mrms_processing\\_reprex\\2022-7-18_open_mfdataset\\data',), 'r', (('clobber', True), ('diskless', False), ('format', 'NETCDF4'), ('persist', False))]

During handling of the above exception, another exception occurred:

PermissionError                           Traceback (most recent call last)
Input In [4], in <cell line: 5>()
      1 import xarray as xr
      3 f_in_ncs = ""data/""
----> 5 ds = xr.open_mfdataset(f_in_ncs, concat_dim=""time"",
      6                        chunks={'outlat':3500, 'outlon':7000, 'time':50},
      7                        combine=""nested"", engine='netcdf4')

File c:\Users\Daniel\anaconda3\envs\mrms\lib\site-packages\xarray\backends\api.py:908, in open_mfdataset(paths, chunks, concat_dim, compat, preprocess, engine, data_vars, coords, combine, parallel, join, attrs_file, combine_attrs, **kwargs)
...
File src\netCDF4\_netCDF4.pyx:2307, in netCDF4._netCDF4.Dataset.__init__()
File src\netCDF4\_netCDF4.pyx:1925, in netCDF4._netCDF4._ensure_nc_success()

PermissionError: [Errno 13] Permission denied: b'd:\\mrms_processing\\_reprex\\2022-7-18_open_mfdataset\\data'
```
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/6805/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue
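Editorial note on issue 6805 above (a hedged aside, not part of the original report): `f_in_ncs = "data/"` points at a bare directory, while `xr.open_mfdataset` expects a glob pattern or an explicit list of paths, which is consistent with the netcdf4 engine raising this error when asked to open the directory itself. A minimal sketch of the likely fix, assuming the files live directly under `data/`:

```python
from glob import glob
import xarray as xr

# Pass the individual netCDF files rather than the bare directory path,
# which the netcdf4 engine cannot open directly.
files = sorted(glob('data/*.nc'))
mf_ds = xr.open_mfdataset(files, concat_dim='time', combine='nested',
                          chunks={'outlat': 3500, 'outlon': 7000, 'time': 50},
                          engine='netcdf4')
```

Equivalently, the glob string `'data/*.nc'` could be passed to `open_mfdataset` directly, since it accepts glob patterns as well as lists of paths.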