
issue_comments: 789078512


html_url issue_url id node_id user created_at updated_at author_association body reactions performed_via_github_app issue
https://github.com/pydata/xarray/issues/2237#issuecomment-789078512 https://api.github.com/repos/pydata/xarray/issues/2237 789078512 MDEyOklzc3VlQ29tbWVudDc4OTA3ODUxMg== 2448579 2021-03-02T17:29:51Z 2021-03-02T18:03:17Z MEMBER

I think the behaviour in Ryan's most recent comment is a consequence of `groupby.mean` being, schematically:

```python
import xarray as xr

def groupby_mean(ds, group_indices):  # schematic wrapper; not xarray's actual code
    results = []
    for group_idx in group_indices:  # one group per year
        group = ds.isel(group_idx)  # (SPLIT)
        results.append(group.mean())  # (APPLY)
    # COMBINE: results in one chunk per year (one chunk per element in results)
    return xr.concat(results, dim="year")
```
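To make the chunk bookkeeping concrete, here is a toy model in plain Python, with no xarray or dask involved. Plain lists stand in for arrays, and `groupby_mean_chunks` is a hypothetical helper name, not an xarray function. It assumes each group reduces to a single value, so combining N results gives N chunks of length 1 along the group dimension.

```python
from statistics import mean

def groupby_mean_chunks(values, labels):
    """Toy split-apply-combine: return (results, chunks along the group dim)."""
    # SPLIT: collect positional indices for each label, in first-seen order
    group_indices = {}
    for i, label in enumerate(labels):
        group_indices.setdefault(label, []).append(i)

    # APPLY: one reduced value per group
    results = [mean(values[i] for i in idx) for idx in group_indices.values()]

    # COMBINE: concatenating N separate results gives N chunks of length 1
    chunks = tuple(1 for _ in results)
    return results, chunks

values = [1.0, 3.0, 10.0, 20.0, 5.0, 7.0]
labels = [2000, 2000, 2001, 2001, 2002, 2002]
results, chunks = groupby_mean_chunks(values, labels)
print(results)  # [2.0, 15.0, 6.0]
print(chunks)   # (1, 1, 1)
```

The point of the sketch is only the last step: the number of output chunks tracks the number of groups, regardless of how the input was chunked.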

I think the fundamental question is: is it really possible for dask to recognize that the chunk structure after the combine step could be consolidated, with an arbitrary number of apply steps in the middle? Or: when a computation maps a single chunk to many chunks, should dask consolidate the output chunks (using `array.chunk-size`)?
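As a rough picture of what "consolidating the output chunks" would mean, the sketch below merges adjacent chunk sizes up to a target length. This is a toy in plain Python under my own assumptions: `consolidate_chunks` is a hypothetical name, and the greedy rule is an illustration, not dask's rechunking algorithm.

```python
def consolidate_chunks(chunks, target):
    """Greedily merge adjacent chunks up to `target` elements each.

    Existing chunks are never split, so a chunk already larger than
    `target` is passed through unchanged.
    """
    merged = []
    current = 0
    for size in chunks:
        # Close the running chunk if adding this one would exceed the target
        if current and current + size > target:
            merged.append(current)
            current = 0
        current += size
    if current:
        merged.append(current)
    return tuple(merged)

# Twelve length-1 chunks (one per year) consolidated to at most 5 per chunk
print(consolidate_chunks((1,) * 12, 5))  # (5, 5, 2)
```

The open question in the comment is how a scheduler would pick `target` on its own; here it is simply passed in.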

We can explicitly ask for consolidation of chunks by saying the output should be chunked 5 along `year`:

```python
import dask

dask.config.set({"optimization.fuse.ave-width": 6})  # note > 5
(
    ds.foo.groupby("year")
    .mean(dim="time")
    .chunk({"year": 5})  # really important: why and how would dask choose this automatically?
    .data.visualize(optimize_graph=False)
)
```

Then, if we set `optimization.fuse.ave-width` appropriately, we get the graph we want after optimization:

```python
dask.config.set({"optimization.fuse.ave-width": 6})
(
    ds.foo.groupby("year")
    .mean(dim="time")
    .chunk({"year": 5})  # really important
    .data.visualize(optimize_graph=True)
)
```

Can we make dask recognize that the 5 getitem tasks from input-chunk-0, at the bottom of each tower, can be fused to a single task? In that case, fuse the 5 getitem tasks and "propagate" that fusion up the tower.
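To illustrate what fusing the `getitem` tasks would amount to, here is a toy graph in the dict-of-tasks style that dask graphs use, evaluated by a minimal hand-written `get`. Everything here is an assumption for illustration: the key names, the `getitem_many` helper, and the evaluator are hypothetical, and this is not how dask's own fusion pass is implemented.

```python
from operator import getitem

def get(graph, key):
    """Minimal recursive evaluator for a dask-style graph (key -> task)."""
    task = graph[key]
    if isinstance(task, tuple) and task and callable(task[0]):
        func, *args = task
        # String arguments that name another key are evaluated first
        resolved = [get(graph, a) if isinstance(a, str) and a in graph else a
                    for a in args]
        return func(*resolved)
    return task  # plain data

input_chunk = list(range(10))  # stands in for input-chunk-0
slices = [slice(2 * year, 2 * year + 2) for year in range(5)]

# Unfused: five getitem tasks, one per year, all reading the same input chunk
unfused = {"input-chunk-0": input_chunk}
for year, s in enumerate(slices):
    unfused[f"getitem-{year}"] = (getitem, "input-chunk-0", s)

def getitem_many(chunk, many_slices):
    """Hypothetical fused task: produce every slice in a single call."""
    return [chunk[s] for s in many_slices]

# Fused: one task replaces the five
fused = {
    "input-chunk-0": input_chunk,
    "getitem-fused": (getitem_many, "input-chunk-0", slices),
}

unfused_out = [get(unfused, f"getitem-{year}") for year in range(5)]
fused_out = get(fused, "getitem-fused")
print(unfused_out == fused_out)             # True
print(len(unfused) - 1, len(fused) - 1)     # 5 1
```

Both graphs compute the same slices; the fused one does so with a single task per input chunk, which is the reduction in task count the question is after.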

I guess another failure here is that when `fuse.ave-width` is 3 (< width of tower), why isn't dask fusing to make three "sub-towers" per tower? Even that would help reduce the number of tasks.

```python
dask.config.set({"optimization.fuse.ave-width": 3})
(
    ds.foo.groupby("year")
    .mean(dim="time")
    .chunk({"year": 5})  # really important
    .data.visualize(optimize_graph=True)
)
```

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  333312849