Issue #2499: Tremendous slowdown when using dask integration

  • id: 372244156
  • repo: pydata/xarray
  • state: closed (completed)
  • author association: NONE
  • comments: 5
  • created: 2018-10-20T19:19:08Z
  • closed: 2019-01-13T01:53:09Z

Code Sample, a copy-pastable example if possible

```python
import numpy as np
import xarray as xr

# `indices` and `compute` provide the SPI implementation; the import path
# below is assumed (the original snippet omits its imports). NOTE: as stated
# below, this is not a self-contained example -- netcdf_precip, arguments,
# var_name_precip, scale_increment, and output_file_base are defined
# elsewhere in the full script.
from climate_indices import compute, indices


def spi_gamma(data_array,
              scale,
              start_year,
              calibration_year_initial,
              calibration_year_final,
              periodicity):

    original_shape = data_array.shape
    spi = indices.spi(data_array.values.squeeze(),
                      scale,
                      indices.Distribution.gamma,
                      start_year,
                      calibration_year_initial,
                      calibration_year_final,
                      periodicity)
    data_array.values = np.reshape(spi, newshape=original_shape)

    return data_array


# open the precipitation NetCDF as an xarray Dataset object
dataset = xr.open_dataset(netcdf_precip, chunks={'lat': 1})

# trim out all data variables from the dataset except the precipitation
for var in dataset.data_vars:
    if var not in arguments.var_name_precip:
        dataset = dataset.drop(var)

# get the precipitation variable as an xarray DataArray object
da_precip = dataset[var_name_precip]

# get the initial year of the data
data_start_year = int(str(da_precip['time'].values[0])[0:4])

# stack the lat and lon dimensions into a new dimension named point, so at
# each lat/lon we'll have a time series for the geospatial point
da_precip = da_precip.stack(point=('lat', 'lon'))

timestep_scale = 3

# group the data by lat/lon point and apply the SPI/Gamma function
# to each time series
da_spi = da_precip.groupby('point').apply(spi_gamma,
                                          scale=timestep_scale,
                                          start_year=data_start_year,
                                          calibration_year_initial=1951,
                                          calibration_year_final=2010,
                                          periodicity=compute.Periodicity.monthly)

# unstack the array back into original dimensions
da_spi = da_spi.unstack('point')

# copy the original dataset since we'll be able to reuse most of the
# coordinates, attributes, etc.
index_dataset = dataset.copy()

# remove all data variables from copied dataset
for var_name in index_dataset.data_vars:
    index_dataset = index_dataset.drop(var_name)

# TODO set global attributes accordingly for this new dataset

# create a new variable to contain the SPI for the scale, assign into the dataset
long_name = "Standardized Precipitation Index (Gamma distribution), " \
            "{scale}-{increment}".format(scale=timestep_scale,
                                         increment=scale_increment)
spi_var = xr.Variable(dims=da_spi.dims,
                      data=da_spi,
                      attrs={'long_name': long_name,
                             'valid_min': -3.09,
                             'valid_max': 3.09})
var_name = "spi_gamma_" + str(timestep_scale).zfill(2)
index_dataset[var_name] = spi_var

# write the dataset as NetCDF
index_dataset.to_netcdf(output_file_base + var_name + ".nc")
```
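To show the stack/unstack round trip used above in isolation, here is a toy sketch (the array shape and values are illustrative only, not from my actual data):

```python
import numpy as np
import xarray as xr

# toy array: 4 time steps over a 2 x 3 lat/lon grid
da = xr.DataArray(np.random.rand(4, 2, 3), dims=('time', 'lat', 'lon'))

# collapse lat/lon into a single 'point' dimension; each point is one time series
stacked = da.stack(point=('lat', 'lon'))
print(stacked.dims)    # ('time', 'point') -- 6 points

# restore the original dimensions
roundtrip = stacked.unstack('point')
print(roundtrip.dims)  # ('time', 'lat', 'lon')
```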

Problem description

When I use groupby for split-apply-combine, it works well if I don't specify a chunks argument, i.e. without dask parallelization. However, when I do pass a chunks argument it runs very slowly (see the sketch below). I assume this is because I don't yet understand how to set the chunking parameters optimally, rather than something under the covers gumming up the processing with dask arrays (i.e. I doubt this is a bug in xarray/dask).

I have tried modifying my code to replace all numpy arrays with dask arrays, but this has been problematic since some of the numpy functions used therein have no dask equivalents. Before I go much further down that path, I wanted to post here to see if there is something else I may be overlooking that would make that effort unnecessary. My apologies if this is better suited to StackOverflow than to an issue; if so, I can post there instead.
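Concretely, the only difference between the fast and slow cases is the chunks argument at open time; a minimal sketch (the file path here is a placeholder):

```python
import xarray as xr

netcdf_precip = "precip.nc"  # placeholder input path

# works well: eager, numpy-backed arrays (no dask involved)
dataset = xr.open_dataset(netcdf_precip)

# runs very slowly for me: dask-backed arrays, one chunk per latitude row
dataset = xr.open_dataset(netcdf_precip, chunks={'lat': 1})
```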

My attempt at making this work is in a feature branch of my project's Git repository. I mention this because the above code is not a minimal working example, but it is included nevertheless to summarize what's happening at the top layer, where I'm using xarray explicitly. If more code is required after a cursory look, then I will provide it, but hopefully I'm making a rookie mistake that, once rectified, will fix this.

In case it matters, I have been launching my code from within PyCharm (both run and debug, with the same results), but my assumption has been that this is irrelevant and that it would work the same from the command line.

Thanks in advance for any suggestions or insight.

Expected Output

Output of xr.show_versions()

```
INSTALLED VERSIONS
------------------
commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 58 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

xarray: 0.10.9
pandas: 0.23.4
numpy: 1.15.2
scipy: 1.1.0
netCDF4: 1.4.1
h5netcdf: None
h5py: 2.8.0
Nio: None
zarr: None
cftime: 1.0.0b1
PseudonetCDF: None
rasterio: None
iris: None
bottleneck: None
cyordereddict: None
dask: 0.19.3
distributed: 1.23.3
matplotlib: None
cartopy: None
seaborn: None
setuptools: 39.2.0
pip: 10.0.1
conda: 4.5.11
pytest: None
IPython: 7.0.1
sphinx: None
```
