Issue #2499: Tremendous slowdown when using dask integration

  • id: 372244156
  • repo: pydata/xarray
  • state: closed (completed)
  • author association: NONE
  • comments: 5
  • created: 2018-10-20T19:19:08Z
  • closed: 2019-01-13T01:53:09Z

Code Sample, a copy-pastable example if possible

```python
import numpy as np
import xarray as xr

# `indices` and `compute` provide the SPI implementation; the import path
# below is assumed (the original snippet omits its imports). NOTE: as stated
# below, this is not a self-contained example -- netcdf_precip, arguments,
# var_name_precip, scale_increment, and output_file_base are defined
# elsewhere in the full script.
from climate_indices import compute, indices


def spi_gamma(data_array,
              scale,
              start_year,
              calibration_year_initial,
              calibration_year_final,
              periodicity):

    original_shape = data_array.shape
    spi = indices.spi(data_array.values.squeeze(),
                      scale,
                      indices.Distribution.gamma,
                      start_year,
                      calibration_year_initial,
                      calibration_year_final,
                      periodicity)
    data_array.values = np.reshape(spi, newshape=original_shape)

    return data_array


# open the precipitation NetCDF as an xarray Dataset object
dataset = xr.open_dataset(netcdf_precip, chunks={'lat': 1})

# trim out all data variables from the dataset except the precipitation
for var in dataset.data_vars:
    if var not in arguments.var_name_precip:
        dataset = dataset.drop(var)

# get the precipitation variable as an xarray DataArray object
da_precip = dataset[var_name_precip]

# get the initial year of the data
data_start_year = int(str(da_precip['time'].values[0])[0:4])

# stack the lat and lon dimensions into a new dimension named point, so at
# each lat/lon we'll have a time series for the geospatial point
da_precip = da_precip.stack(point=('lat', 'lon'))

timestep_scale = 3

# group the data by lat/lon point and apply the SPI/Gamma function
# to each time series
da_spi = da_precip.groupby('point').apply(spi_gamma,
                                          scale=timestep_scale,
                                          start_year=data_start_year,
                                          calibration_year_initial=1951,
                                          calibration_year_final=2010,
                                          periodicity=compute.Periodicity.monthly)

# unstack the array back into original dimensions
da_spi = da_spi.unstack('point')

# copy the original dataset since we'll be able to reuse most of the
# coordinates, attributes, etc.
index_dataset = dataset.copy()

# remove all data variables from copied dataset
for var_name in index_dataset.data_vars:
    index_dataset = index_dataset.drop(var_name)

# TODO set global attributes accordingly for this new dataset

# create a new variable to contain the SPI for the scale, assign into the dataset
long_name = "Standardized Precipitation Index (Gamma distribution), " \
            "{scale}-{increment}".format(scale=timestep_scale,
                                         increment=scale_increment)
spi_var = xr.Variable(dims=da_spi.dims,
                      data=da_spi,
                      attrs={'long_name': long_name,
                             'valid_min': -3.09,
                             'valid_max': 3.09})
var_name = "spi_gamma_" + str(timestep_scale).zfill(2)
index_dataset[var_name] = spi_var

# write the dataset as NetCDF
index_dataset.to_netcdf(output_file_base + var_name + ".nc")
```
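To show the stack/unstack round trip used above in isolation, here is a toy sketch (the array shape and values are illustrative only, not from my actual data):

```python
import numpy as np
import xarray as xr

# toy array: 4 time steps over a 2 x 3 lat/lon grid
da = xr.DataArray(np.random.rand(4, 2, 3), dims=('time', 'lat', 'lon'))

# collapse lat/lon into a single 'point' dimension; each point is one time series
stacked = da.stack(point=('lat', 'lon'))
print(stacked.dims)    # ('time', 'point') -- 6 points

# restore the original dimensions
roundtrip = stacked.unstack('point')
print(roundtrip.dims)  # ('time', 'lat', 'lon')
```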

Problem description

When I use groupby for split-apply-combine, it works well if I don't specify a chunks argument, i.e. without dask parallelization. However, when I do pass a chunks argument it runs very slowly (see the sketch below). I assume this is because I don't yet understand how to set the chunking parameters optimally, rather than something under the covers gumming up the processing with dask arrays (i.e. I doubt this is a bug in xarray/dask).

I have tried modifying my code to replace all numpy arrays with dask arrays, but this has been problematic since some of the numpy functions used therein have no dask equivalents. Before I go much further down that path, I wanted to post here to see if there is something else I may be overlooking that would make that effort unnecessary. My apologies if this is better suited to StackOverflow than to an issue; if so, I can post there instead.
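Concretely, the only difference between the fast and slow cases is the chunks argument at open time; a minimal sketch (the file path here is a placeholder):

```python
import xarray as xr

netcdf_precip = "precip.nc"  # placeholder input path

# works well: eager, numpy-backed arrays (no dask involved)
dataset = xr.open_dataset(netcdf_precip)

# runs very slowly for me: dask-backed arrays, one chunk per latitude row
dataset = xr.open_dataset(netcdf_precip, chunks={'lat': 1})
```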

My attempt at making this work is in a feature branch of my project's Git repository. I mention this because the above code is not a minimal working example, but it is included nevertheless to summarize what's happening at the top layer, where I'm using xarray explicitly. If more code is required after a cursory look, then I will provide it, but hopefully I'm making a rookie mistake that, once rectified, will fix this.

In case it matters, I have been launching my code from within PyCharm (both run and debug, with the same results), but my assumption has been that this is irrelevant and that it would work the same from the command line.

Thanks in advance for any suggestions or insight.

Expected Output

Output of xr.show_versions()

```
INSTALLED VERSIONS
------------------
commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 58 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

xarray: 0.10.9
pandas: 0.23.4
numpy: 1.15.2
scipy: 1.1.0
netCDF4: 1.4.1
h5netcdf: None
h5py: 2.8.0
Nio: None
zarr: None
cftime: 1.0.0b1
PseudonetCDF: None
rasterio: None
iris: None
bottleneck: None
cyordereddict: None
dask: 0.19.3
distributed: 1.23.3
matplotlib: None
cartopy: None
seaborn: None
setuptools: 39.2.0
pip: 10.0.1
conda: 4.5.11
pytest: None
IPython: 7.0.1
sphinx: None
```
