issue_comments: 1423144049


html_url	issue_url	id	node_id	user	created_at	updated_at	author_association	body	reactions	performed_via_github_app	issue
https://github.com/pydata/xarray/issues/4610#issuecomment-1423144049	https://api.github.com/repos/pydata/xarray/issues/4610	1423144049	IC_kwDOAMm_X85U03Rx	35968931	2023-02-08T19:40:58Z	2023-02-08T20:25:04Z	MEMBER	Q: Use xhistogram approach or flox-powered approach? @dcherian recently showed how his flox package can perform histograms as groupby-like reductions. This raises the question of which approach would be better to use in a histogram function in xarray. (This is related to but better than what we had tried previously with xarray groupby and numpy_groupies.) Here's a WIP notebook comparing the two approaches. Both approaches can feasibly do: - Histograms which leave some dimensions excluded (broadcast over), - Multi-dimensional histograms (e.g. binning two different variables into one 2D bin), - Normalized histograms (return PDFs instead of counts), - Weighted histograms, - Multi-dimensional bins (as @aaronspring asks for above - but it requires work - see how to do it in flox, and my stalled PR to xhistogram). Pros of using flox-powered reductions: Much less code - the flox approach is basically one call to flox. Fewer codepaths, with groupby logic and all histogram functionality flowing through the `flox.xarray_reduce` codepath. Likely clearer code than the kinda impenetrable reshaped bincount logic lurking in the depths of xhistogram. Supporting new features (e.g. multidimensional bins) should be simpler in flox because the options don't have to be propagated all the way down to the level of the `np.bincount` caller. Pros of using xhistogram's blockwise bincount approach: Absolute speed of xhistogram appears to be 3-4x higher, and that's using `numpy_groupies` in flox. Possibly flox could be faster if using numba but not sure yet. Dask graph simplicity. Xhistogram literally uses `blockwise`, whereas the flox graphs IIUC are blockwise-like but actually a specially-constructed HLG right now. (Also important for supporting other parallel backends.) 
I suspect that in practice both perform similarly well after graph optimization but I have not tested this at scale, and flox's graph might be more sensitive to extra steps in the calculation like adding weights or normalising the result. Other thoughts: Flox has various clever schemes for making general chunked groupby operations run more efficiently, but I don't think histogramming would really benefit from those unless there is a strong pattern, known a priori, to which values likely fall in which bins. Deepak's example using flox uses `pandas.IntervalIndex` to represent the bins on the result object, whereas xhistogram just returns the mid-points of the bins, throwing that info away. This seems like a cool idea on its own, but probably requires some extra work to make sure it's handled by the indexes refactor and the plotting code. In my comparison notebook there's something I'm missing that's causing my "real example" (from xhistogram docs) to not actually use the provided weights. I suspect it's something simple, any idea @dcherian? xref https://github.com/xgcm/xhistogram/issues/60, https://github.com/xgcm/xhistogram/issues/28	{ "total_count": 1, "+1": 1, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }		750985364
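
For readers unfamiliar with the "reshaped bincount" idea mentioned in the comment, here is a rough NumPy-only sketch of the general technique: flatten the leading (broadcast) dimensions into rows, shift each row's bin indices by a per-row offset, and count everything with a single `np.bincount` call. This is not xhistogram's or flox's actual code; the function name `histogram_last_axis` and its signature are made up for illustration, and it only handles 1D bin edges with the histogram taken over the last axis.

```python
import numpy as np


def histogram_last_axis(data, bin_edges, weights=None):
    """Histogram over the last axis, broadcasting over all leading axes.

    Hypothetical helper for illustration only (not part of xhistogram or flox).
    """
    data = np.asarray(data)
    bin_edges = np.asarray(bin_edges)
    nbins = len(bin_edges) - 1
    lead_shape = data.shape[:-1]

    # Collapse all leading dimensions into a single "row" dimension.
    flat = data.reshape(-1, data.shape[-1])
    nrows = flat.shape[0]

    # Bin index of each value. Like np.histogram, the last bin is closed
    # on the right, so values equal to the final edge go into it.
    idx = np.searchsorted(bin_edges, flat, side="right") - 1
    idx[flat == bin_edges[-1]] = nbins - 1

    # Out-of-range values (and NaNs) are dropped.
    valid = (idx >= 0) & (idx < nbins)

    # Offset each row so that one flat bincount covers every row at once.
    offsets = np.arange(nrows)[:, None] * nbins
    linear = (idx + offsets)[valid]

    w = None
    if weights is not None:
        w = np.broadcast_to(weights, data.shape).reshape(flat.shape)[valid]

    counts = np.bincount(linear, weights=w, minlength=nrows * nbins)
    return counts.reshape(lead_shape + (nbins,))


# Small demo: a (3, 4, 100) array histogrammed over its last axis.
rng = np.random.default_rng(0)
demo_data = rng.normal(size=(3, 4, 100))
demo_weights = rng.uniform(size=(3, 4, 100))
demo_edges = np.linspace(-3, 3, 7)

demo_counts = histogram_last_axis(demo_data, demo_edges)
demo_weighted = histogram_last_axis(demo_data, demo_edges, weights=demo_weights)
```

The per-row offset is what lets a single 1D `np.bincount` stand in for a loop over the broadcast dimensions. It also shows why every option (weights, bin edges, dropped values) has to be threaded down to that one call, which is the propagation cost the comment attributes to the xhistogram approach.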
