issues: 372244156
field | value
---|---
id | 372244156
node_id | MDU6SXNzdWUzNzIyNDQxNTY=
number | 2499
title | Tremendous slowdown when using dask integration
user | 1328158
state | closed
locked | 0
assignee |
milestone |
comments | 5
created_at | 2018-10-20T19:19:08Z
updated_at | 2019-01-13T01:53:09Z
closed_at | 2019-01-13T01:53:09Z
author_association | NONE
active_lock_reason |
draft |
pull_request |
body | (rendered below)
reactions | { "url": "https://api.github.com/repos/pydata/xarray/issues/2499/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }
performed_via_github_app |
state_reason | completed
repo | 13221727
type | issue

body:

#### Code Sample, a copy-pastable example if possible

```python
import xarray as xr


def spi_gamma(data_array,
              scale,
              start_year,
              calibration_year_initial,
              calibration_year_final,
              periodicity):
    ...

# netcdf_precip, arguments, var_name_precip, compute, scale_increment and
# output_file_base are defined elsewhere in the full script; this is not a
# minimal working example

# open the precipitation NetCDF as an xarray DataSet object
dataset = xr.open_dataset(netcdf_precip, chunks={'lat': 1})

# trim out all data variables from the dataset except the precipitation
for var in dataset.data_vars:
    if var not in arguments.var_name_precip:
        dataset = dataset.drop(var)

# get the precipitation variable as an xarray DataArray object
da_precip = dataset[var_name_precip]

# get the initial year of the data
data_start_year = int(str(da_precip['time'].values[0])[0:4])

# stack the lat and lon dimensions into a new dimension named point, so at each
# lat/lon we'll have a time series for the geospatial point
da_precip = da_precip.stack(point=('lat', 'lon'))

timestep_scale = 3

# group the data by lat/lon point and apply the SPI/Gamma function to each time series
da_spi = da_precip.groupby('point').apply(spi_gamma,
                                          scale=timestep_scale,
                                          start_year=data_start_year,
                                          calibration_year_initial=1951,
                                          calibration_year_final=2010,
                                          periodicity=compute.Periodicity.monthly)

# unstack the array back into original dimensions
da_spi = da_spi.unstack('point')

# copy the original dataset since we'll be able to
# reuse most of the coordinates, attributes, etc.
index_dataset = dataset.copy()

# remove all data variables from copied dataset
for var_name in index_dataset.data_vars:
    index_dataset = index_dataset.drop(var_name)

# TODO set global attributes accordingly for this new dataset

# create a new variable to contain the SPI for the scale, assign into the dataset
long_name = "Standardized Precipitation Index (Gamma distribution), " \
            "{scale}-{increment}".format(scale=timestep_scale, increment=scale_increment)
spi_var = xr.Variable(dims=da_spi.dims,
                      data=da_spi,
                      attrs={'long_name': long_name,
                             'valid_min': -3.09,
                             'valid_max': 3.09})
var_name = "spi_gamma_" + str(timestep_scale).zfill(2)
index_dataset[var_name] = spi_var

# write the dataset as NetCDF
index_dataset.to_netcdf(output_file_base + var_name + ".nc")
```

#### Problem description

When I use `GroupBy` for split-apply-combine, it works well if I don't specify a `chunks` argument, i.e. without dask parallelization. However, when I do specify a `chunks` argument it runs very slowly. I assume this is because I don't yet understand how to set the chunking parameters well, rather than dask arrays mangling the processing under the covers (i.e. I doubt this is a bug in xarray/dask).

I have tried modifying my code to replace all numpy arrays with dask arrays, but this has been problematic since some of the numpy functions used have no dask equivalents. Before I go much further down that path, I wanted to post here to see whether I am overlooking something that would make that effort unnecessary. My apologies if this is better posted to StackOverflow than as an issue; if so, I can post there instead.

My attempt at making this work is in a feature branch of my project's Git repository. I mention this because the code above is not a minimal working example; it is included to summarize what happens at the top layer, where I use xarray explicitly. If more code is needed after a cursory look, I will provide it, but hopefully I'm making a rookie mistake that, once rectified, will fix this.

In case it matters, I have been launching the code from within PyCharm (both run and debug, with the same results), but my assumption is that this is irrelevant and it behaves the same at the command line. Thanks in advance for any suggestions or insight.
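A likely culprit in the code sample is `chunks={'lat': 1}` combined with the per-point `groupby`/`apply`: it creates one dask chunk per latitude row, and the stacked groupby then schedules work point by point, so scheduler overhead can swamp the actual computation. A minimal sketch of inspecting the chunk layout before committing to a chunking scheme (the file name `precip.nc` and variable name `prcp` are assumptions, not from the issue):

```python
import xarray as xr

# One chunk per latitude row: a very large number of tiny chunks.
tiny = xr.open_dataset("precip.nc", chunks={"lat": 1})["prcp"]

# Fewer, larger spatial chunks; 'time' stays unchunked so each grid
# point's full time series lives in a single block.
coarse = xr.open_dataset("precip.nc", chunks={"lat": 10, "lon": 10})["prcp"]

for label, da in [("lat=1", tiny), ("lat/lon=10", coarse)]:
    # .data is the underlying dask array; fewer blocks means a
    # smaller task graph and less scheduling overhead.
    print(label, "chunk shape:", da.data.chunksize,
          "blocks:", da.data.numblocks)
```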
#### Expected Output

#### Output of `xr.show_versions()`
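For the pattern in the code sample, applying a 1-D time-series function at every lat/lon point, a commonly suggested alternative to stacking and grouping by point is `xr.apply_ufunc` with `vectorize=True` and `dask="parallelized"`, which parallelizes per chunk rather than scheduling per point. A minimal sketch, with a toy `standardize` function standing in for the real SPI/Gamma computation and with file and variable names assumed:

```python
import numpy as np
import xarray as xr

def standardize(series):
    # Toy stand-in for spi_gamma: any function mapping a 1-D time
    # series to a 1-D series of the same length works here.
    return (series - np.nanmean(series)) / np.nanstd(series)

# 'time' must remain a single chunk because the function needs each
# point's whole time series; chunk only the spatial dimensions.
da_precip = xr.open_dataset("precip.nc", chunks={"lat": 10, "lon": 10})["prcp"]

da_spi = xr.apply_ufunc(
    standardize,
    da_precip,
    input_core_dims=[["time"]],   # hand the function whole time series
    output_core_dims=[["time"]],  # it returns series of the same length
    vectorize=True,               # loop over every lat/lon point
    dask="parallelized",          # one task per chunk, not per point
    output_dtypes=[np.float64],
)
da_spi.to_netcdf("spi_gamma_03.nc")
```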