issue_comments: 789078512

This data as json

html_url	issue_url	id	node_id	user	created_at	updated_at	author_association	body	reactions	performed_via_github_app	issue
https://github.com/pydata/xarray/issues/2237#issuecomment-789078512	https://api.github.com/repos/pydata/xarray/issues/2237	789078512	MDEyOklzc3VlQ29tbWVudDc4OTA3ODUxMg==	2448579	2021-03-02T17:29:51Z	2021-03-02T18:03:17Z	MEMBER	I think the behaviour in Ryan's most recent comment is a consequence of groupby.mean being `python results = [] for group_idx in group_indices: # one group per year group = ds.isel(group_idx) # (SPLIT) results.append(group.mean()) # (APPLY) return xr.concat(results, dim="year") # COMBINE results in one chunk per year (one chunk per element in results)` I think the fundamental question is: Is it really possible for dask to recognize that the chunk structure after the `combine` step could be consolidated with an arbitrary number of `apply` steps in the middle ? OR When a computation maps a single chunk to many chunks, should dask consolidate the output chunks (using `array.chunk-size`)? We can explicitly ask for consolidation of chunks by saying the output should be chunked `5` along `year` `python dask.config.set({"optimization.fuse.ave-width": 6}) # note > 5 ( ds.foo.groupby("year") .mean(dim="time") .chunk({"year": 5}) # really important, why and how would dask choose this automatically/ .data.visualize(optimize_graph=False) )` Then if we set `optimization.fuse.ave-width` appropriately, we get the graph we want after optimization `python dask.config.set({"optimization.fuse.ave-width": 6}) ( ds.foo.groupby("year") .mean(dim="time") .chunk({"year": 5}) # really important .data.visualize(optimize_graph=True) )` Can we make dask recognize that the 5 getitem tasks from input-chunk-0, at the bottom of each tower, can be fused to a single task? In that case, fuse the 5 getitem tasks and "propagate" that fusion up the tower. I guess another failure here is that when `fuse.ave-width` is 3 (< width of tower), why isn't dask fusing to make three "sub-towers" per-tower? Even that would help reduce number of tasks. `dask.config.set({"optimization.fuse.ave-width": 3}) ( ds.foo.groupby("year") .mean(dim="time") .chunk({"year": 5}) # really important .data.visualize(optimize_graph=True) )`	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }		333312849