Certain dataset methods on chunked arrays seem to interfere with loading/writing files

pydata/xarray issue #4153 (id 638414463) · state: open · created 2020-06-14T19:26:32Z · updated 2022-04-18T05:31:54Z · 2 comments · author association: NONE

I have a large hourly dataset that I need to resample to a daily basis. I am trying to implement a faster version of that resampling because the data covers multiple timezones, so I can't simply use "xarray.Dataset.resample". In short, my two resampling functions are:

Function 1 (old one): I use an added time-offset variable (that I generated and added myself) whose values are simple floats. According to the offset, I re-assign my time coordinate with the timedelta applied, resample my dataset (hourly -> daily), and use xr.where to assign values.

Function 2 (new one I want to use): I reindex my dataset with a forward-fill method so it becomes 30-minute based (I do that because I have offsets of 2.5 hours and shift only takes integers). I then shift my dataset by offset*2, use xr.where to assign values, and resample at the end.
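The half-hour bookkeeping above reduces to simple arithmetic: shift() only accepts integer steps, so a fractional-hour offset can't be expressed on an hourly grid, but on a 30-minute grid it becomes an exact integer shift. A minimal sketch (the offset value is a hypothetical entry from the timeOffset variable):

```python
import pandas as pd

# An offset of -2.5 hours is not a whole number of hourly steps,
# but on a 30-minute grid it is exactly -5 steps.
offset_hours = -2.5                       # hypothetical timeOffset value
steps = int(offset_hours * 2)             # -5 half-hour steps
delta = steps * pd.Timedelta(minutes=30)  # recovers the original offset
print(steps, delta)
```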

I want to use the second function because I am opening my data with "open_mfdataset" and the second function runs about 5-10 times faster. The problem comes when I want to access/load/write my data: the result yielded by the first function (using only resample) loads into memory fine, but the result yielded by the second function (using shift) raises a memory error.

I don't know a lot about dask, but maybe the shifting method is triggering some kind of bug in the scheduler?

MCVE Code Sample

First function:

```python
import numpy as np
import pandas as pd
import xarray as xr
from tqdm import tqdm


def daily_mean_vars(ds, freq):
    """Compute the daily mean of an xarray dataset object.

    Args:
        ds (obj): dataset object
        freq (str): resampling frequency
    Returns:
        new_ds (obj): updated dataset object
    """
    first_year = ds.time[0].dt.year.values
    time = ds.time.to_index()
    to = np.unique(ds.timeOffset.values)

    new_ds = ds['d2m'].resample(time=freq, keep_attrs=True).mean()
    new_ds = new_ds.to_dataset()
    new_ds = new_ds.rename({'d2m': 'd2mday'})

    for l in tqdm(to):
        # convert time to timezone
        tt = time + pd.Timedelta(hours=l)

        # update ds index
        ds = ds.assign_coords(time=tt)

        # resample
        mean = ds['d2m'].resample(time=freq, keep_attrs=True).mean()

        # remove data before the first year and convert dataarray to dataset
        mean = mean.sel(time=~(mean.time.dt.year < first_year)).to_dataset()

        # update
        new_ds['d2mday'] = xr.where(ds.timeOffset == l, mean['d2m'], new_ds['d2mday'])

    return new_ds
```

Second function:

```python
def shift_to_timezone(ds):
    to = np.unique(ds.timeOffset.values)
    time_start = ds.time.values[0]
    time_end = ds.time.values[-1] + pd.Timedelta(minutes=30)

    # reindex to a 30-minute base, forward-filling the new half-hour steps
    ds = ds.reindex(time=pd.date_range(time_start, time_end, freq='30T'),
                    method='ffill')

    for offset in to:
        temp = ds.shift(time=int(offset * 2))
        ds = xr.where(ds.timeOffset == offset, temp, ds)

    return ds
```

Resampling after shifting the data:

```python
ds = ds.resample(time=freq, keep_attrs=True).mean()
```

After opening, subsetting (for testing) and adding my time offset variable, it looks like this:

```python
<xarray.Dataset>
Dimensions:     (latitude: 37, longitude: 193, time: 17544)
Coordinates:
  * latitude    (latitude) float32 50.0 49.75 49.5 49.25 ... 41.5 41.25 41.0
  * longitude   (longitude) float32 260.0 260.25 260.5 ... 307.5 307.75 308.0
  * time        (time) datetime64[ns] 1979-01-01 ... 1980-12-31T23:00:00
Data variables:
    d2m         (time, latitude, longitude) float32 dask.array<chunksize=(8760, 37, 193), meta=np.ndarray>
    timeOffset  (latitude, longitude) float64 -5.0 -5.0 -5.0 ... -3.0 -3.0 -3.0
```

The result from the first function:

```python
<xarray.Dataset>
Dimensions:     (latitude: 37, longitude: 193, time: 731)
Coordinates:
  * time        (time) datetime64[ns] 1979-01-01 1979-01-02 ... 1980-12-31
  * latitude    (latitude) float32 50.0 49.75 49.5 49.25 ... 41.5 41.25 41.0
  * longitude   (longitude) float32 260.0 260.25 260.5 ... 307.5 307.75 308.0
Data variables:
    d2mday      (latitude, longitude, time) float32 dask.array<chunksize=(37, 193, 1), meta=np.ndarray>
    timeOffset  (latitude, longitude) float64 -5.0 -5.0 -5.0 ... -3.0 -3.0 -3.0
```

The result from the second function:

```python
<xarray.Dataset>
Dimensions:     (latitude: 37, longitude: 193, time: 731)
Coordinates:
  * time        (time) datetime64[ns] 1979-01-01 1979-01-02 ... 1980-12-31
  * latitude    (latitude) float32 50.0 49.75 49.5 49.25 ... 41.5 41.25 41.0
  * longitude   (longitude) float32 260.0 260.25 260.5 ... 307.5 307.75 308.0
Data variables:
    d2m         (time, latitude, longitude) float32 dask.array<chunksize=(1, 37, 193), meta=np.ndarray>
    timeOffset  (time, latitude, longitude) float64 -5.0 -5.0 -5.0 ... -3.0 -3.0
```
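One detail visible in the reprs above: after the second function, timeOffset has acquired a time dimension, i.e. the dataset-level xr.where broadcast the static offset field along time. A toy sketch of the shift-based approach that applies xr.where only to the data variable, so the offset field stays time-free (all names, sizes, and offset values here are made up, not the real data):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Tiny stand-in for the hourly dataset: 48 hours, 2 grid cells,
# each cell with its own (hypothetical) UTC offset.
time = pd.date_range("2000-01-01", periods=48, freq="h")
ds = xr.Dataset(
    {"d2m": (("time", "x"), np.random.rand(48, 2))},
    coords={"time": time, "x": [0, 1]},
)
ds["timeOffset"] = ("x", [-5.0, -2.5])

# Reindex to a 30-minute base (forward fill) so half-hour offsets
# become integer shifts.
half = pd.date_range(time[0], time[-1] + pd.Timedelta(minutes=30), freq="30min")
ds = ds.reindex(time=half, method="ffill")

# Shift only the data variable; keeping timeOffset out of xr.where
# means it never gains a time dimension.
for offset in np.unique(ds.timeOffset.values):
    shifted = ds["d2m"].shift(time=int(offset * 2))
    ds["d2m"] = xr.where(ds.timeOffset == offset, shifted, ds["d2m"])

daily = ds.resample(time="1D").mean()
```

Whether this also avoids the memory blow-up on the real dask-backed data is untested here; it only demonstrates the broadcasting difference.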

Expected Output

Using ds.load() on both results would load the dataset into memory. There is enough memory on my computer to do so.

Problem Description

When I try to load the result from the second function, it looks like Python is trying to load the whole dataset before even subsetting it: in the MemoryError message, the size and shape of the object don't match what they are supposed to be. Here is the full traceback:

```python
Traceback (most recent call last):

  File "<ipython-input-38-4c86a97d7d21>", line 1, in <module>
    new2.load()

  File "C:\Users\Psybot\Anaconda3\lib\site-packages\xarray\core\dataset.py", line 651, in load
    evaluated_data = da.compute(*lazy_data.values(), **kwargs)

  File "C:\Users\Psybot\Anaconda3\lib\site-packages\dask\base.py", line 437, in compute
    results = schedule(dsk, keys, **kwargs)

  File "C:\Users\Psybot\Anaconda3\lib\site-packages\dask\threaded.py", line 84, in get
    **kwargs

  File "C:\Users\Psybot\Anaconda3\lib\site-packages\dask\local.py", line 486, in get_async
    raise_exception(exc, tb)

  File "C:\Users\Psybot\Anaconda3\lib\site-packages\dask\local.py", line 316, in reraise
    raise exc

  File "C:\Users\Psybot\Anaconda3\lib\site-packages\dask\local.py", line 222, in execute_task
    result = _execute_task(task, data)

  File "C:\Users\Psybot\Anaconda3\lib\site-packages\dask\core.py", line 118, in _execute_task
    args2 = [_execute_task(a, cache) for a in args]

  File "C:\Users\Psybot\Anaconda3\lib\site-packages\dask\core.py", line 118, in <listcomp>
    args2 = [_execute_task(a, cache) for a in args]

  File "C:\Users\Psybot\Anaconda3\lib\site-packages\dask\core.py", line 118, in _execute_task
    args2 = [_execute_task(a, cache) for a in args]

  File "C:\Users\Psybot\Anaconda3\lib\site-packages\dask\core.py", line 118, in <listcomp>
    args2 = [_execute_task(a, cache) for a in args]

  File "C:\Users\Psybot\Anaconda3\lib\site-packages\dask\core.py", line 119, in _execute_task
    return func(*args2)

  File "C:\Users\Psybot\Anaconda3\lib\site-packages\dask\array\core.py", line 106, in getter
    c = np.asarray(c)

  File "C:\Users\Psybot\Anaconda3\lib\site-packages\numpy\core\_asarray.py", line 85, in asarray
    return array(a, dtype, copy=False, order=order)

  File "C:\Users\Psybot\Anaconda3\lib\site-packages\xarray\core\indexing.py", line 491, in __array__
    return np.asarray(self.array, dtype=dtype)

  File "C:\Users\Psybot\Anaconda3\lib\site-packages\numpy\core\_asarray.py", line 85, in asarray
    return array(a, dtype, copy=False, order=order)

  File "C:\Users\Psybot\Anaconda3\lib\site-packages\xarray\core\indexing.py", line 653, in __array__
    return np.asarray(self.array, dtype=dtype)

  File "C:\Users\Psybot\Anaconda3\lib\site-packages\numpy\core\_asarray.py", line 85, in asarray
    return array(a, dtype, copy=False, order=order)

  File "C:\Users\Psybot\Anaconda3\lib\site-packages\xarray\core\indexing.py", line 557, in __array__
    return np.asarray(array[self.key], dtype=None)

  File "C:\Users\Psybot\Anaconda3\lib\site-packages\numpy\core\_asarray.py", line 85, in asarray
    return array(a, dtype, copy=False, order=order)

  File "C:\Users\Psybot\Anaconda3\lib\site-packages\xarray\coding\variables.py", line 72, in __array__
    return self.func(self.array)

  File "C:\Users\Psybot\Anaconda3\lib\site-packages\xarray\coding\variables.py", line 218, in _scale_offset_decoding
    data = np.array(data, dtype=dtype, copy=True)

  File "C:\Users\Psybot\Anaconda3\lib\site-packages\xarray\coding\variables.py", line 72, in __array__
    return self.func(self.array)

  File "C:\Users\Psybot\Anaconda3\lib\site-packages\xarray\coding\variables.py", line 138, in _apply_mask
    data = np.asarray(data, dtype=dtype)

  File "C:\Users\Psybot\Anaconda3\lib\site-packages\numpy\core\_asarray.py", line 85, in asarray
    return array(a, dtype, copy=False, order=order)

  File "C:\Users\Psybot\Anaconda3\lib\site-packages\xarray\core\indexing.py", line 557, in __array__
    return np.asarray(array[self.key], dtype=None)

  File "C:\Users\Psybot\Anaconda3\lib\site-packages\xarray\backends\netCDF4_.py", line 73, in __getitem__
    key, self.shape, indexing.IndexingSupport.OUTER, self._getitem

  File "C:\Users\Psybot\Anaconda3\lib\site-packages\xarray\core\indexing.py", line 837, in explicit_indexing_adapter
    result = raw_indexing_method(raw_key.tuple)

  File "C:\Users\Psybot\Anaconda3\lib\site-packages\xarray\backends\netCDF4_.py", line 85, in _getitem
    array = getitem(original_array, key)

  File "netCDF4\_netCDF4.pyx", line 4408, in netCDF4._netCDF4.Variable.__getitem__

  File "netCDF4\_netCDF4.pyx", line 5335, in netCDF4._netCDF4.Variable._get

MemoryError: Unable to allocate 17.0 GiB for an array with shape (8784, 721, 1440) and data type >i2
```
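For what it's worth, the failing allocation is exactly the size of one whole on-disk variable: a leap year of hourly steps (8784) on a full 721 x 1440 grid of 2-byte integers. That is consistent with a single dask task reading an entire file's variable before any subsetting is applied. A quick check of the arithmetic (the chunks= mitigation in the comment is an untested assumption, not something from the report):

```python
# Shape and dtype taken from the MemoryError above.
nbytes = 8784 * 721 * 1440 * 2      # hourly leap year x 721 x 1440, int16
gib = nbytes / 2**30
print(round(gib, 1))                # matches the 17.0 GiB in the error

# Hypothetical mitigation: request smaller chunks at open time so no single
# read task materialises a whole on-disk variable, e.g.
#   ds = xr.open_mfdataset(paths, chunks={"time": 744})
```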

Versions

Output of <tt>xr.show_versions()</tt>:

```
INSTALLED VERSIONS
------------------
commit: None
python: 3.7.3 (default, Mar 27 2019, 17:13:21) [MSC v.1915 64 bit (AMD64)]
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 142 Stepping 10, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en
LOCALE: None.None
libhdf5: 1.10.5
libnetcdf: 4.7.3

xarray: 0.15.0
pandas: 1.0.4
numpy: 1.18.5
scipy: 1.3.1
netCDF4: 1.5.3
pydap: None
h5netcdf: 0.8.0
h5py: 2.10.0
Nio: None
zarr: None
cftime: 1.1.3
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.1.2
cfgrib: 0.9.7.3
iris: None
bottleneck: 1.3.1
dask: 2.11.0
distributed: 2.18.0
matplotlib: 3.1.3
cartopy: None
seaborn: None
numbagg: None
setuptools: 47.1.1.post20200529
pip: 20.1.1
conda: 4.8.3
pytest: None
IPython: 7.15.0
sphinx: 3.1.0
```
