issues: 638414463
This data as json
id | node_id | number | title | user | state | locked | assignee | milestone | comments | created_at | updated_at | closed_at | author_association | active_lock_reason | draft | pull_request | body | reactions | performed_via_github_app | state_reason | repo | type |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
638414463 | MDU6SXNzdWU2Mzg0MTQ0NjM= | 4153 | Certain dataset methods on chunked arrays seem to interfere with loading/writing files | 66918146 | open | 0 | 2 | 2020-06-14T19:26:32Z | 2022-04-18T05:31:54Z | NONE | I have a big hourly dataset that I need to resample to a daily basis. I am trying to implement a faster version of that resampling because the data covers multiple timezones, so I can't just use a single `xarray.Dataset.resample` call. In short, my two resampling functions are: Function 1 (old one): I use an added time-offset variable (that I generated and added myself) containing simple floats. According to the offset, I re-assign my time coordinate with the timedelta applied, resample my dataset (hourly -> daily) and use `xr.where` to assign values. Function 2 (new one I want to use): I reindex my dataset with a forward-fill method so it becomes 30-minute based (I do that because I have offsets of 2.5 hours and `shift` only takes integers). I then shift my dataset by offset*2 steps, use `xr.where` to assign values and resample at the end. I want to use the second function because I am opening my data with `open_mfdataset` and the second function runs about 5-10 times faster. The problem comes when I want to access/load/write my data: the result yielded by the first function (using only `resample`) loads into memory fine, but the result yielded by the second function (using `shift`) raises a memory error. I don't know a lot about dask, but maybe the shifting method is creating some kind of bug with the scheduler?
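For concreteness, the shift-based approach described above can be sketched roughly like this. This is a minimal toy example, not the original code: the 48-hour random dataset, the variable name `d2m`, and the single fixed 2.5 h offset are all stand-ins for illustration.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical small stand-in for the hourly dataset described above
time = pd.date_range("2020-01-01", periods=48, freq="h")
ds = xr.Dataset({"d2m": ("time", np.random.rand(48))}, coords={"time": time})

# Function 2 idea: upsample to 30 min with a forward fill so a 2.5 h
# offset becomes an integer number of steps, shift, then resample daily
half_hourly = ds.reindex(
    time=pd.date_range(time[0], time[-1], freq="30min"), method="ffill"
)
offset_steps = int(2.5 * 2)          # 2.5 h -> 5 half-hour steps
shifted = half_hourly.shift(time=offset_steps)
daily = shifted.resample(time="1D").mean()
```

With a full dataset there would be one shift per timezone offset, combined via `xr.where`, as described above.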
MCVE Code Sample

First function:

```python
def daily_mean_vars(ds, freq):
    """Compute daily mean, max and min of an xarray dataset object.

    Args:
        ds (obj): dataset object
        freq (str): frequency

    Returns:
        new_ds (obj): updated dataset object
    """
    first_year = ds.time[0].dt.year.values
    time = ds.time.to_index()
    to = np.unique(ds.timeOffset.values)
    new_ds = ds['d2m'].resample(time=freq, keep_attrs=True).mean()
    new_ds = new_ds.to_dataset()
    new_ds = new_ds.rename({'d2m': 'd2mday'})
```
```python
# resampling after shifting data
ds = ds.resample(time=freq, keep_attrs=True).mean()
```
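One detail of the shifted path worth noting (a toy sketch with made-up sizes, not the original data): `shift` pads the vacated positions with NaN, so an integer-typed variable is upcast to a floating dtype, which multiplies the per-element memory footprint.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Toy stand-in: an integer variable, like the 2-byte integers on disk
time = pd.date_range("2020-01-01", periods=8, freq="h")
ds = xr.Dataset(
    {"d2m": ("time", np.arange(8, dtype="int16"))}, coords={"time": time}
)

shifted = ds.shift(time=2)

# shift fills with NaN, so the int16 variable becomes a float dtype
# (2 bytes/element before, 4 or 8 bytes/element after the promotion)
print(ds["d2m"].dtype)       # integer
print(shifted["d2m"].dtype)  # floating
```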
Expected Output

Using ds.load() on both solutions would load the dataset into memory. There is enough memory on my computer to do so...

Problem Description

When I try to load the result from the second function, it looks like python is trying to load the dataset before even subsetting it. In the memory error message the size of the object and its shape don't match what they are supposed to be. Here is the full traceback:

```python
Traceback (most recent call last):
  File "<ipython-input-38-4c86a97d7d21>", line 1, in <module>
    new2.load()
  File "C:\Users\Psybot\Anaconda3\lib\site-packages\xarray\core\dataset.py", line 651, in load
    evaluated_data = da.compute(*lazy_data.values(), **kwargs)
  File "C:\Users\Psybot\Anaconda3\lib\site-packages\dask\base.py", line 437, in compute
    results = schedule(dsk, keys, **kwargs)
  File "C:\Users\Psybot\Anaconda3\lib\site-packages\dask\threaded.py", line 84, in get
    **kwargs
  File "C:\Users\Psybot\Anaconda3\lib\site-packages\dask\local.py", line 486, in get_async
    raise_exception(exc, tb)
  File "C:\Users\Psybot\Anaconda3\lib\site-packages\dask\local.py", line 316, in reraise
    raise exc
  File "C:\Users\Psybot\Anaconda3\lib\site-packages\dask\local.py", line 222, in execute_task
    result = _execute_task(task, data)
  File "C:\Users\Psybot\Anaconda3\lib\site-packages\dask\core.py", line 118, in _execute_task
    args2 = [_execute_task(a, cache) for a in args]
  File "C:\Users\Psybot\Anaconda3\lib\site-packages\dask\core.py", line 118, in <listcomp>
    args2 = [_execute_task(a, cache) for a in args]
  File "C:\Users\Psybot\Anaconda3\lib\site-packages\dask\core.py", line 118, in _execute_task
    args2 = [_execute_task(a, cache) for a in args]
  File "C:\Users\Psybot\Anaconda3\lib\site-packages\dask\core.py", line 118, in <listcomp>
    args2 = [_execute_task(a, cache) for a in args]
  File "C:\Users\Psybot\Anaconda3\lib\site-packages\dask\core.py", line 119, in _execute_task
    return func(*args2)
  File "C:\Users\Psybot\Anaconda3\lib\site-packages\dask\array\core.py", line 106, in getter
    c = np.asarray(c)
  File "C:\Users\Psybot\Anaconda3\lib\site-packages\numpy\core\_asarray.py", line 85, in asarray
    return array(a, dtype, copy=False, order=order)
  File "C:\Users\Psybot\Anaconda3\lib\site-packages\xarray\core\indexing.py", line 491, in __array__
    return np.asarray(self.array, dtype=dtype)
  File "C:\Users\Psybot\Anaconda3\lib\site-packages\numpy\core\_asarray.py", line 85, in asarray
    return array(a, dtype, copy=False, order=order)
  File "C:\Users\Psybot\Anaconda3\lib\site-packages\xarray\core\indexing.py", line 653, in __array__
    return np.asarray(self.array, dtype=dtype)
  File "C:\Users\Psybot\Anaconda3\lib\site-packages\numpy\core\_asarray.py", line 85, in asarray
    return array(a, dtype, copy=False, order=order)
  File "C:\Users\Psybot\Anaconda3\lib\site-packages\xarray\core\indexing.py", line 557, in __array__
    return np.asarray(array[self.key], dtype=None)
  File "C:\Users\Psybot\Anaconda3\lib\site-packages\numpy\core\_asarray.py", line 85, in asarray
    return array(a, dtype, copy=False, order=order)
  File "C:\Users\Psybot\Anaconda3\lib\site-packages\xarray\coding\variables.py", line 72, in __array__
    return self.func(self.array)
  File "C:\Users\Psybot\Anaconda3\lib\site-packages\xarray\coding\variables.py", line 218, in _scale_offset_decoding
    data = np.array(data, dtype=dtype, copy=True)
  File "C:\Users\Psybot\Anaconda3\lib\site-packages\xarray\coding\variables.py", line 72, in __array__
    return self.func(self.array)
  File "C:\Users\Psybot\Anaconda3\lib\site-packages\xarray\coding\variables.py", line 138, in _apply_mask
    data = np.asarray(data, dtype=dtype)
  File "C:\Users\Psybot\Anaconda3\lib\site-packages\numpy\core\_asarray.py", line 85, in asarray
    return array(a, dtype, copy=False, order=order)
  File "C:\Users\Psybot\Anaconda3\lib\site-packages\xarray\core\indexing.py", line 557, in __array__
    return np.asarray(array[self.key], dtype=None)
  File "C:\Users\Psybot\Anaconda3\lib\site-packages\xarray\backends\netCDF4_.py", line 73, in __getitem__
    key, self.shape, indexing.IndexingSupport.OUTER, self._getitem
  File "C:\Users\Psybot\Anaconda3\lib\site-packages\xarray\core\indexing.py", line 837, in explicit_indexing_adapter
    result = raw_indexing_method(raw_key.tuple)
  File "C:\Users\Psybot\Anaconda3\lib\site-packages\xarray\backends\netCDF4_.py", line 85, in _getitem
    array = getitem(original_array, key)
  File "netCDF4\_netCDF4.pyx", line 4408, in netCDF4._netCDF4.Variable.__getitem__
  File "netCDF4\_netCDF4.pyx", line 5335, in netCDF4._netCDF4.Variable._get
MemoryError: Unable to allocate 17.0 GiB for an array with shape (8784, 721, 1440) and data type >i2
```

Versions

Output of <tt>xr.show_versions()</tt>

INSTALLED VERSIONS
------------------
commit: None
python: 3.7.3 (default, Mar 27 2019, 17:13:21) [MSC v.1915 64 bit (AMD64)]
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 142 Stepping 10, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en
LOCALE: None.None
libhdf5: 1.10.5
libnetcdf: 4.7.3
xarray: 0.15.0
pandas: 1.0.4
numpy: 1.18.5
scipy: 1.3.1
netCDF4: 1.5.3
pydap: None
h5netcdf: 0.8.0
h5py: 2.10.0
Nio: None
zarr: None
cftime: 1.1.3
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.1.2
cfgrib: 0.9.7.3
iris: None
bottleneck: 1.3.1
dask: 2.11.0
distributed: 2.18.0
matplotlib: 3.1.3
cartopy: None
seaborn: None
numbagg: None
setuptools: 47.1.1.post20200529
pip: 20.1.1
conda: 4.8.3
pytest: None
IPython: 7.15.0
sphinx: 3.1.0 |
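As a cross-check on the error message (my arithmetic, not from the issue itself): the reported shape and dtype correspond to an entire on-disk variable, 8784 time steps (a leap year of hourly data) on a 721 × 1440 grid of 2-byte big-endian integers (`>i2`), which works out to exactly the 17.0 GiB being requested. That is consistent with the full variable being read before any subsetting is applied.

```python
# Shape and dtype from the MemoryError: (8784, 721, 1440), >i2 = 2 bytes/element
nbytes = 8784 * 721 * 1440 * 2
gib = nbytes / 2**30
print(round(gib, 1))  # 17.0, matching "Unable to allocate 17.0 GiB"
```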
{ "url": "https://api.github.com/repos/pydata/xarray/issues/4153/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
13221727 | issue |