issue_comments

5 rows where issue = 372244156 (Tremendous slowdown when using dask integration), sorted by updated_at descending

Comment 432846749 · monocongo (NONE) · 2018-10-24T22:14:08Z
https://github.com/pydata/xarray/issues/2499#issuecomment-432846749

I have had some success using apply_ufunc in tandem with multiprocessing. Apparently, I can't (seamlessly) use dask arrays in place of numpy arrays within the functions where I am performing my computations, as it's not possible to assign values into dask arrays using integer indexing.

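The apply_ufunc route mentioned above keeps the per-point computation in plain NumPy (where integer-index assignment works) while dask parallelises over chunks. A minimal sketch, assuming a placeholder computation in place of the real SPI code and the file name that appears later in the thread:

```python
import numpy as np
import xarray as xr

def compute_index(series):
    """Placeholder for a per-gridcell computation over a full 1-D time series.
    Inside this function, `series` is an ordinary NumPy array, so integer-index
    assignment works as usual."""
    out = np.empty_like(series)
    out[:] = series - np.nanmean(series)  # stand-in arithmetic only
    return out

# Leave `time` unchunked so every chunk holds complete time series.
da = xr.open_dataset("nclimgrid_prcp.nc", chunks={"lat": 100, "lon": 100})["prcp"]

result = xr.apply_ufunc(
    compute_index,
    da,
    input_core_dims=[["time"]],
    output_core_dims=[["time"]],
    vectorize=True,           # call compute_index once per (lat, lon) point
    dask="parallelized",      # one task per dask chunk
    output_dtypes=[da.dtype],
)
result = result.compute()
```
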
Comment 431718845 · Zac-HD (CONTRIBUTOR) · 2018-10-22T00:50:22Z
https://github.com/pydata/xarray/issues/2499#issuecomment-431718845

I'd also try to find a way to use a groupby or apply_along_axis without stacking and unstacking the data, and to choose chunks that match the layout on disk - i.e. try lon=1 if the order is time, lat, lon. If the time observations are not contiguous in memory, it's probably worth reshaping the whole array and writing it back to disk up front.

Reactions: 👍 1
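
A minimal sketch of the chunking suggestion above, assuming the variable really is stored in (time, lat, lon) order (as the ncdump output later in the thread confirms); the exact chunk choice is illustrative only:

```python
import xarray as xr

# Chunk along lon only, as suggested for (time, lat, lon) ordering; dimensions
# not listed in `chunks` (time, lat) stay as single chunks, so each chunk
# already holds complete time series and no stacking/unstacking is needed.
ds = xr.open_dataset("nclimgrid_prcp.nc", chunks={"lon": 1})
print(ds["prcp"].chunks)  # ((1481,), (596,), (1, 1, ..., 1))
```
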
Comment 431717076 · shoyer (MEMBER) · 2018-10-22T00:25:37Z
https://github.com/pydata/xarray/issues/2499#issuecomment-431717076

I think the problem is actually this line: da_precip.groupby('point').apply(spi_gamma

The problem is that spi_gamma accesses/sets data_array.values, which means it's loading data into NumPy arrays. You need to work entirely with .data / dask arrays in order for dask to provide useful speedups. I don't know what indices.spi is doing, but you can probably parallelize it either with dask.array.map_blocks or xarray's apply_ufunc.

Reactions: 👍 1
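
A minimal sketch of the `.data` / `dask.array.map_blocks` route described above, with a placeholder standing in for whatever `indices.spi` actually computes and the file name taken from later in the thread:

```python
import dask.array as dsa
import xarray as xr

da = xr.open_dataset("nclimgrid_prcp.nc", chunks={"lat": 100, "lon": 100})["prcp"]

# da.values would materialise the whole variable as one NumPy array (the
# slowdown described above); da.data is the underlying lazy dask array.

def per_block(block):
    """Placeholder for the real per-block computation (not shown here)."""
    return block * 0.5  # stand-in arithmetic only

lazy = dsa.map_blocks(per_block, da.data, dtype=da.dtype)
result = xr.DataArray(lazy, coords=da.coords, dims=da.dims)
result = result.compute()  # computation happens here, one task per chunk
```
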
Comment 431684522 · monocongo (NONE) · 2018-10-21T16:49:35Z (edited 2018-10-21T19:43:27Z)
https://github.com/pydata/xarray/issues/2499#issuecomment-431684522

Thanks, Zac.

I have used various options with the chunks argument, e.g. chunks={'lat': 10, 'lon': 10}, all of which appear to have a similar effect. Maybe I just haven't yet hit upon the sweet spot chunk sizes?

Is there a rule-of-thumb approach to determining the chunk sizes for a dataset? Perhaps before setting the chunk sizes I could open the dataset to poll the dimensions of its variables and, based on those, come up with reasonable chunk sizes (or no chunking at all if the dataset is reasonably small)?

My computations typically use a full time series per lat/lon point, so my assumption has been that I don't want to use chunking on the time dimension -- is this correct?

I have been testing this code with two versions of a precipitation dataset: the full-resolution version is (time=1481, lat=596, lon=1385) and the low-resolution version (for faster tests) is (time=1466, lat=38, lon=87). The output of ncdump and repr(xr.open_dataset(netcdf_precip)) is below.

```
$ ncdump -h nclimgrid_prcp.nc
netcdf nclimgrid_prcp {
dimensions:
        time = UNLIMITED ; // (1481 currently)
        lat = 596 ;
        lon = 1385 ;
variables:
        int time(time) ;
                time:long_name = "Time, in monthly increments" ;
                time:standard_name = "time" ;
                time:calendar = "gregorian" ;
                time:units = "days since 1800-01-01 00:00:00" ;
                time:axis = "T" ;
        float lat(lat) ;
                lat:standard_name = "latitude" ;
                lat:long_name = "Latitude" ;
                lat:units = "degrees_north" ;
                lat:axis = "Y" ;
                lat:valid_min = 24.56253f ;
                lat:valid_max = 49.3542f ;
        float lon(lon) ;
                lon:standard_name = "longitude" ;
                lon:long_name = "Longitude" ;
                lon:units = "degrees_east" ;
                lon:axis = "X" ;
                lon:valid_min = -124.6875f ;
                lon:valid_max = -67.02084f ;
        float prcp(time, lat, lon) ;
                prcp:_FillValue = NaNf ;
                prcp:least_significant_digit = 3LL ;
                prcp:valid_min = 0.f ;
                prcp:coordinates = "time lat lon" ;
                prcp:long_name = "Precipitation, monthly total" ;
                prcp:standard_name = "precipitation_amount" ;
                prcp:references = "GHCN-Monthly Version 3 (Vose et al. 2011), NCEI/NOAA, https://www.ncdc.noaa.gov/ghcnm/v3.php" ;
                prcp:units = "millimeter" ;
                prcp:valid_max = 2000.f ;

// global attributes:
                :date_created = "2018-02-15 10:29:25.485927" ;
                :date_modified = "2018-02-15 10:29:25.486042" ;
                :Conventions = "CF-1.6, ACDD-1.3" ;
                :ncei_template_version = "NCEI_NetCDF_Grid_Template_v2.0" ;
                :title = "nClimGrid" ;
                :naming_authority = "gov.noaa.ncei" ;
                :standard_name_vocabulary = "Standard Name Table v35" ;
                :institution = "National Centers for Environmental Information (NCEI), NOAA, Department of Commerce" ;
                :geospatial_lat_min = 24.56253f ;
                :geospatial_lat_max = 49.3542f ;
                :geospatial_lon_min = -124.6875f ;
                :geospatial_lon_max = -67.02084f ;
                :geospatial_lat_units = "degrees_north" ;
                :geospatial_lon_units = "degrees_east" ;
}

/* repr(ds) below: */
<xarray.Dataset>
Dimensions:  (lat: 596, lon: 1385, time: 1481)
Coordinates:
  * time     (time) datetime64[ns] 1895-01-01 1895-02-01 ... 2018-05-01
  * lat      (lat) float32 49.3542 49.312534 49.270866 ... 24.6042 24.562532
  * lon      (lon) float32 -124.6875 -124.645836 ... -67.0625 -67.020836
Data variables:
    prcp     (time, lat, lon) float32 ...
Attributes:
    date_created:              2018-02-15 10:29:25.485927
    date_modified:             2018-02-15 10:29:25.486042
    Conventions:               CF-1.6, ACDD-1.3
    ncei_template_version:     NCEI_NetCDF_Grid_Template_v2.0
    title:                     nClimGrid
    naming_authority:          gov.noaa.ncei
    standard_name_vocabulary:  Standard Name Table v35
    institution:               National Centers for Environmental Information...
    geospatial_lat_min:        24.562532
    geospatial_lat_max:        49.3542
    geospatial_lon_min:        -124.6875
    geospatial_lon_max:        -67.020836
    geospatial_lat_units:      degrees_north
    geospatial_lon_units:      degrees_east
```
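
One way to approach the rule-of-thumb question above: open the dataset lazily, leave `time` out of the chunks dict (so each chunk keeps a complete time series per lat/lon block), and check the resulting chunk size before running anything. The lat/lon values here are illustrative only:

```python
import numpy as np
import xarray as xr

# Illustrative chunk sizes; `time` is deliberately omitted so it stays whole.
ds = xr.open_dataset("nclimgrid_prcp.nc", chunks={"lat": 38, "lon": 44})
prcp = ds["prcp"]

print(prcp.data.chunksize)  # (1481, 38, 44): a full time series per block
mb = prcp.dtype.itemsize * np.prod(prcp.data.chunksize) / 1e6
print(f"~{mb:.1f} MB per chunk")  # roughly 10 MB for float32
```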

Comment 431657200 · Zac-HD (CONTRIBUTOR) · 2018-10-21T10:30:23Z
https://github.com/pydata/xarray/issues/2499#issuecomment-431657200

dataset = xr.open_dataset(netcdf_precip, chunks={'lat': 1})

This makes me really suspicious - lat=1 is a very small chunk size, and the data is completely unchunked in time and lon. Without knowing anything else, I'd try chunks=dict(lat=200, lon=200) or higher, depending on the time dim - Dask is most efficient with chunks of around 10MB for most workloads.

This also depends on the data layout on disk - can you share repr(xr.open_dataset(netcdf_precip))? What does ncdump say?

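A back-of-envelope version of the sizing advice above, assuming float32 data, the full time dimension (1481 steps) kept in one chunk, and the ~10MB guideline; the numbers are illustrative, not a recommendation from the thread:

```python
# Solve for a lat/lon chunk edge that gives ~10 MB chunks with time unchunked.
target_bytes = 10e6
itemsize = 4          # float32
ntime = 1481          # full-resolution time dimension from the comment above

points_per_chunk = target_bytes / (itemsize * ntime)  # ~1688 lat*lon points
edge = int(points_per_chunk ** 0.5)                   # ~41
print(edge)  # e.g. chunks={'lat': 41, 'lon': 41} lands near the 10 MB target
```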


CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);