Certain dataset methods on chunked arrays seem to interfere with loading/writing files

pydata/xarray issue #4153 (id 638414463) · state: open · created 2020-06-14T19:26:32Z · updated 2022-04-18T05:31:54Z · 2 comments · author association: NONE

I have a large hourly dataset that I need to resample to a daily basis. I am trying to implement a faster version of that resampling because the data covers multiple timezones, so I can't simply use "xarray.Dataset.resample". In short, my two resampling functions are:

Function 1 (old one): I use an added time-offset variable (that I generated and added myself) whose values are simple floats. According to the offset, I re-assign my time coordinate with the timedelta applied, resample my dataset (hourly -> daily), and use xr.where to assign values.

Function 2 (new one I want to use): I reindex my dataset with a forward-fill method so it becomes 30-minute based (I do that because I have offsets of 2.5 hours and shift only takes integers). I then shift my dataset by offset*2, use xr.where to assign values, and resample at the end.
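The half-hour bookkeeping above reduces to simple arithmetic: shift() only accepts integer steps, so a fractional-hour offset can't be expressed on an hourly grid, but on a 30-minute grid it becomes an exact integer shift. A minimal sketch (the offset value is a hypothetical entry from the timeOffset variable):

```python
import pandas as pd

# An offset of -2.5 hours is not a whole number of hourly steps,
# but on a 30-minute grid it is exactly -5 steps.
offset_hours = -2.5                       # hypothetical timeOffset value
steps = int(offset_hours * 2)             # -5 half-hour steps
delta = steps * pd.Timedelta(minutes=30)  # recovers the original offset
print(steps, delta)
```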

I want to use the second function because I am opening my data with "open_mfdataset" and the second function runs about 5-10 times faster. The problem comes when I want to access/load/write my data: the result yielded by the first function (using only resample) loads into memory fine, but the result yielded by the second function (using shift) raises a memory error.

I don't know a lot about dask, but maybe the shifting method is triggering some kind of bug in the scheduler?

MCVE Code Sample

First function:

```python
import numpy as np
import pandas as pd
import xarray as xr
from tqdm import tqdm


def daily_mean_vars(ds, freq):
    """Compute the daily mean of an xarray dataset object.

    Args:
        ds (obj): dataset object
        freq (str): resampling frequency
    Returns:
        new_ds (obj): updated dataset object
    """
    first_year = ds.time[0].dt.year.values
    time = ds.time.to_index()
    to = np.unique(ds.timeOffset.values)

    new_ds = ds['d2m'].resample(time=freq, keep_attrs=True).mean()
    new_ds = new_ds.to_dataset()
    new_ds = new_ds.rename({'d2m': 'd2mday'})

    for l in tqdm(to):
        # convert time to timezone
        tt = time + pd.Timedelta(hours=l)

        # update ds index
        ds = ds.assign_coords(time=tt)

        # resample
        mean = ds['d2m'].resample(time=freq, keep_attrs=True).mean()

        # remove data before the first year and convert dataarray to dataset
        mean = mean.sel(time=~(mean.time.dt.year < first_year)).to_dataset()

        # update
        new_ds['d2mday'] = xr.where(ds.timeOffset == l, mean['d2m'], new_ds['d2mday'])

    return new_ds
```

Second function:

```python
def shift_to_timezone(ds):
    to = np.unique(ds.timeOffset.values)
    time_start = ds.time.values[0]
    time_end = ds.time.values[-1] + pd.Timedelta(minutes=30)

    # reindex to a 30-minute base, forward-filling the new half-hour steps
    ds = ds.reindex(time=pd.date_range(time_start, time_end, freq='30T'),
                    method='ffill')

    for offset in to:
        temp = ds.shift(time=int(offset * 2))
        ds = xr.where(ds.timeOffset == offset, temp, ds)

    return ds
```

Resampling after shifting the data:

```python
ds = ds.resample(time=freq, keep_attrs=True).mean()
```

After opening, subsetting (for testing) and adding my time offset variable, it looks like this:

```python
<xarray.Dataset>
Dimensions:     (latitude: 37, longitude: 193, time: 17544)
Coordinates:
  * latitude    (latitude) float32 50.0 49.75 49.5 49.25 ... 41.5 41.25 41.0
  * longitude   (longitude) float32 260.0 260.25 260.5 ... 307.5 307.75 308.0
  * time        (time) datetime64[ns] 1979-01-01 ... 1980-12-31T23:00:00
Data variables:
    d2m         (time, latitude, longitude) float32 dask.array<chunksize=(8760, 37, 193), meta=np.ndarray>
    timeOffset  (latitude, longitude) float64 -5.0 -5.0 -5.0 ... -3.0 -3.0 -3.0
```

The result from the first function:

```python
<xarray.Dataset>
Dimensions:     (latitude: 37, longitude: 193, time: 731)
Coordinates:
  * time        (time) datetime64[ns] 1979-01-01 1979-01-02 ... 1980-12-31
  * latitude    (latitude) float32 50.0 49.75 49.5 49.25 ... 41.5 41.25 41.0
  * longitude   (longitude) float32 260.0 260.25 260.5 ... 307.5 307.75 308.0
Data variables:
    d2mday      (latitude, longitude, time) float32 dask.array<chunksize=(37, 193, 1), meta=np.ndarray>
    timeOffset  (latitude, longitude) float64 -5.0 -5.0 -5.0 ... -3.0 -3.0 -3.0
```

The result from the second function:

```python
<xarray.Dataset>
Dimensions:     (latitude: 37, longitude: 193, time: 731)
Coordinates:
  * time        (time) datetime64[ns] 1979-01-01 1979-01-02 ... 1980-12-31
  * latitude    (latitude) float32 50.0 49.75 49.5 49.25 ... 41.5 41.25 41.0
  * longitude   (longitude) float32 260.0 260.25 260.5 ... 307.5 307.75 308.0
Data variables:
    d2m         (time, latitude, longitude) float32 dask.array<chunksize=(1, 37, 193), meta=np.ndarray>
    timeOffset  (time, latitude, longitude) float64 -5.0 -5.0 -5.0 ... -3.0 -3.0
```
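One detail visible in the reprs above: after the second function, timeOffset has acquired a time dimension, i.e. the dataset-level xr.where broadcast the static offset field along time. A toy sketch of the shift-based approach that applies xr.where only to the data variable, so the offset field stays time-free (all names, sizes, and offset values here are made up, not the real data):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Tiny stand-in for the hourly dataset: 48 hours, 2 grid cells,
# each cell with its own (hypothetical) UTC offset.
time = pd.date_range("2000-01-01", periods=48, freq="h")
ds = xr.Dataset(
    {"d2m": (("time", "x"), np.random.rand(48, 2))},
    coords={"time": time, "x": [0, 1]},
)
ds["timeOffset"] = ("x", [-5.0, -2.5])

# Reindex to a 30-minute base (forward fill) so half-hour offsets
# become integer shifts.
half = pd.date_range(time[0], time[-1] + pd.Timedelta(minutes=30), freq="30min")
ds = ds.reindex(time=half, method="ffill")

# Shift only the data variable; keeping timeOffset out of xr.where
# means it never gains a time dimension.
for offset in np.unique(ds.timeOffset.values):
    shifted = ds["d2m"].shift(time=int(offset * 2))
    ds["d2m"] = xr.where(ds.timeOffset == offset, shifted, ds["d2m"])

daily = ds.resample(time="1D").mean()
```

Whether this also avoids the memory blow-up on the real dask-backed data is untested here; it only demonstrates the broadcasting difference.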

Expected Output

Using ds.load() on both results would load the dataset into memory. There is enough memory on my computer to do so.

Problem Description

When I try to load the result from the second function, it looks like Python is trying to load the whole dataset before even subsetting it: in the MemoryError message, the size and shape of the object don't match what they are supposed to be. Here is the full traceback:

```python
Traceback (most recent call last):

  File "<ipython-input-38-4c86a97d7d21>", line 1, in <module>
    new2.load()

  File "C:\Users\Psybot\Anaconda3\lib\site-packages\xarray\core\dataset.py", line 651, in load
    evaluated_data = da.compute(*lazy_data.values(), **kwargs)

  File "C:\Users\Psybot\Anaconda3\lib\site-packages\dask\base.py", line 437, in compute
    results = schedule(dsk, keys, **kwargs)

  File "C:\Users\Psybot\Anaconda3\lib\site-packages\dask\threaded.py", line 84, in get
    **kwargs

  File "C:\Users\Psybot\Anaconda3\lib\site-packages\dask\local.py", line 486, in get_async
    raise_exception(exc, tb)

  File "C:\Users\Psybot\Anaconda3\lib\site-packages\dask\local.py", line 316, in reraise
    raise exc

  File "C:\Users\Psybot\Anaconda3\lib\site-packages\dask\local.py", line 222, in execute_task
    result = _execute_task(task, data)

  File "C:\Users\Psybot\Anaconda3\lib\site-packages\dask\core.py", line 118, in _execute_task
    args2 = [_execute_task(a, cache) for a in args]

  File "C:\Users\Psybot\Anaconda3\lib\site-packages\dask\core.py", line 118, in <listcomp>
    args2 = [_execute_task(a, cache) for a in args]

  File "C:\Users\Psybot\Anaconda3\lib\site-packages\dask\core.py", line 118, in _execute_task
    args2 = [_execute_task(a, cache) for a in args]

  File "C:\Users\Psybot\Anaconda3\lib\site-packages\dask\core.py", line 118, in <listcomp>
    args2 = [_execute_task(a, cache) for a in args]

  File "C:\Users\Psybot\Anaconda3\lib\site-packages\dask\core.py", line 119, in _execute_task
    return func(*args2)

  File "C:\Users\Psybot\Anaconda3\lib\site-packages\dask\array\core.py", line 106, in getter
    c = np.asarray(c)

  File "C:\Users\Psybot\Anaconda3\lib\site-packages\numpy\core\_asarray.py", line 85, in asarray
    return array(a, dtype, copy=False, order=order)

  File "C:\Users\Psybot\Anaconda3\lib\site-packages\xarray\core\indexing.py", line 491, in __array__
    return np.asarray(self.array, dtype=dtype)

  File "C:\Users\Psybot\Anaconda3\lib\site-packages\numpy\core\_asarray.py", line 85, in asarray
    return array(a, dtype, copy=False, order=order)

  File "C:\Users\Psybot\Anaconda3\lib\site-packages\xarray\core\indexing.py", line 653, in __array__
    return np.asarray(self.array, dtype=dtype)

  File "C:\Users\Psybot\Anaconda3\lib\site-packages\numpy\core\_asarray.py", line 85, in asarray
    return array(a, dtype, copy=False, order=order)

  File "C:\Users\Psybot\Anaconda3\lib\site-packages\xarray\core\indexing.py", line 557, in __array__
    return np.asarray(array[self.key], dtype=None)

  File "C:\Users\Psybot\Anaconda3\lib\site-packages\numpy\core\_asarray.py", line 85, in asarray
    return array(a, dtype, copy=False, order=order)

  File "C:\Users\Psybot\Anaconda3\lib\site-packages\xarray\coding\variables.py", line 72, in __array__
    return self.func(self.array)

  File "C:\Users\Psybot\Anaconda3\lib\site-packages\xarray\coding\variables.py", line 218, in _scale_offset_decoding
    data = np.array(data, dtype=dtype, copy=True)

  File "C:\Users\Psybot\Anaconda3\lib\site-packages\xarray\coding\variables.py", line 72, in __array__
    return self.func(self.array)

  File "C:\Users\Psybot\Anaconda3\lib\site-packages\xarray\coding\variables.py", line 138, in _apply_mask
    data = np.asarray(data, dtype=dtype)

  File "C:\Users\Psybot\Anaconda3\lib\site-packages\numpy\core\_asarray.py", line 85, in asarray
    return array(a, dtype, copy=False, order=order)

  File "C:\Users\Psybot\Anaconda3\lib\site-packages\xarray\core\indexing.py", line 557, in __array__
    return np.asarray(array[self.key], dtype=None)

  File "C:\Users\Psybot\Anaconda3\lib\site-packages\xarray\backends\netCDF4_.py", line 73, in __getitem__
    key, self.shape, indexing.IndexingSupport.OUTER, self._getitem

  File "C:\Users\Psybot\Anaconda3\lib\site-packages\xarray\core\indexing.py", line 837, in explicit_indexing_adapter
    result = raw_indexing_method(raw_key.tuple)

  File "C:\Users\Psybot\Anaconda3\lib\site-packages\xarray\backends\netCDF4_.py", line 85, in _getitem
    array = getitem(original_array, key)

  File "netCDF4\_netCDF4.pyx", line 4408, in netCDF4._netCDF4.Variable.__getitem__

  File "netCDF4\_netCDF4.pyx", line 5335, in netCDF4._netCDF4.Variable._get

MemoryError: Unable to allocate 17.0 GiB for an array with shape (8784, 721, 1440) and data type >i2
```
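For what it's worth, the failing allocation is exactly the size of one whole on-disk variable: a leap year of hourly steps (8784) on a full 721 x 1440 grid of 2-byte integers. That is consistent with a single dask task reading an entire file's variable before any subsetting is applied. A quick check of the arithmetic (the chunks= mitigation in the comment is an untested assumption, not something from the report):

```python
# Shape and dtype taken from the MemoryError above.
nbytes = 8784 * 721 * 1440 * 2      # hourly leap year x 721 x 1440, int16
gib = nbytes / 2**30
print(round(gib, 1))                # matches the 17.0 GiB in the error

# Hypothetical mitigation: request smaller chunks at open time so no single
# read task materialises a whole on-disk variable, e.g.
#   ds = xr.open_mfdataset(paths, chunks={"time": 744})
```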

Versions

Output of <tt>xr.show_versions()</tt>:

```
INSTALLED VERSIONS
------------------
commit: None
python: 3.7.3 (default, Mar 27 2019, 17:13:21) [MSC v.1915 64 bit (AMD64)]
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 142 Stepping 10, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en
LOCALE: None.None
libhdf5: 1.10.5
libnetcdf: 4.7.3

xarray: 0.15.0
pandas: 1.0.4
numpy: 1.18.5
scipy: 1.3.1
netCDF4: 1.5.3
pydap: None
h5netcdf: 0.8.0
h5py: 2.10.0
Nio: None
zarr: None
cftime: 1.1.3
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.1.2
cfgrib: 0.9.7.3
iris: None
bottleneck: 1.3.1
dask: 2.11.0
distributed: 2.18.0
matplotlib: 3.1.3
cartopy: None
seaborn: None
numbagg: None
setuptools: 47.1.1.post20200529
pip: 20.1.1
conda: 4.8.3
pytest: None
IPython: 7.15.0
sphinx: 3.1.0
```
