
issue_comments: 286502400

html_url: https://github.com/pydata/xarray/issues/1308#issuecomment-286502400
issue_url: https://api.github.com/repos/pydata/xarray/issues/1308
user: 1217238
created_at: 2017-03-14T17:43:13Z
author_association: MEMBER

We currently do all the groupby handling ourselves, which means that when you group over smaller units, the dask graph gets bigger and each of the tasks gets smaller. Given that each chunk in the grouped data is only ~250,000 elements, it's not surprising that things get a bit slower -- that's near the point where Python overhead starts to become significant.
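To make that concrete, here's a minimal pure-dask sketch (the shapes and chunking are invented for illustration) of how a per-group loop, roughly what xarray's groupby does today, inflates the graph while shrinking each task:

```python
import dask.array as da

# Hypothetical data: one year of hourly values at 1,000 points,
# chunked into 73-day blocks along time.
x = da.random.random((8760, 1000), chunks=(24 * 73, 1000))

# Roughly what a groupby-over-days does outside dask: one slice plus one
# reduction per group, so the graph grows with the 365 groups while each
# task shrinks to a single ~24,000-element day.
looped = da.stack([x[i * 24:(i + 1) * 24].mean(axis=0) for i in range(365)])

print(len(dict(x.__dask_graph__())))       # one task per chunk
print(len(dict(looped.__dask_graph__())))  # hundreds of small tasks
```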

It would be useful to benchmark graph creation and execution separately (especially using dask.distributed's profiling tools) to understand where the slowdown is.
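For example, a rough sketch of separating the two costs with plain timers (the dataset, variable name, and chunking here are hypothetical):

```python
import time
import numpy as np
import xarray as xr

# Hypothetical dataset: hourly data for a year, chunked along time.
ds = xr.Dataset(
    {"t2m": (("time", "x"), np.random.rand(8760, 1000))}
).chunk({"time": 24 * 30})

# Group label: which day each hourly step belongs to.
day = xr.DataArray(np.arange(8760) // 24, dims="time", name="day")

t0 = time.perf_counter()
lazy = ds["t2m"].groupby(day).mean("time")  # builds the dask graph only
t1 = time.perf_counter()
result = lazy.compute()                     # executes the graph
t2 = time.perf_counter()

print(f"graph construction: {t1 - t0:.3f}s, execution: {t2 - t1:.3f}s")
```

Running the `.compute()` step under a dask.distributed `Client` would additionally expose the scheduler dashboard's task stream, which helps attribute execution time to tasks versus scheduler overhead.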

One thing that might help quite a bit in cases like this, where the individual groups are small, is to rewrite xarray's groupby to do some groupby operations inside dask rather than in a Python loop outside of dask. That would allow executing tasks on bigger chunks of arrays at once, which could significantly reduce scheduler overhead.
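As a hedged sketch of that idea, not a proposed implementation: for the special case of equal-sized contiguous groups (hours into days, with the same invented shapes as in the earlier sketch), the whole grouped reduction can be expressed inside dask as a single reshape plus reduction:

```python
import dask.array as da

# Same hypothetical array as in the earlier sketch.
x = da.random.random((8760, 1000), chunks=(24 * 73, 1000))

# Grouped daily mean expressed inside dask as one reshape + reduction:
# each task now aggregates a whole 73-day chunk instead of a single day.
fused = x.reshape(365, 24, 1000).mean(axis=1)

# The graph stays close to the size of x's own graph, instead of
# growing with the 365 groups as the per-group loop does.
print(len(dict(fused.__dask_graph__())), len(dict(x.__dask_graph__())))
```

The general case (unequal or non-contiguous groups) is harder, but the same principle applies: push the group-wise aggregation into a small number of blockwise tasks rather than one task per group.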

reactions: +1 × 2