issue_comments: 1423144049

html_url: https://github.com/pydata/xarray/issues/4610#issuecomment-1423144049
issue_url: https://api.github.com/repos/pydata/xarray/issues/4610
id: 1423144049
node_id: IC_kwDOAMm_X85U03Rx
user: 35968931
created_at: 2023-02-08T19:40:58Z
updated_at: 2023-02-08T20:25:04Z
author_association: MEMBER
body:

Q: Use xhistogram approach or flox-powered approach?

@dcherian recently showed how his flox package can perform histograms as groupby-like reductions. This raises the question of which approach would be better to use in a histogram function in xarray.

(This is related to but better than what we had tried previously with xarray groupby and numpy_groupies.)
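
To make "histogram as a groupby-like reduction" concrete, here is a minimal NumPy-only sketch. It is not flox's implementation or API; the function name `histogram_as_groupby` is made up for illustration, and it assumes half-open bins `[left, right)`, so a value exactly on the last edge is dropped (unlike `np.histogram`).

```python
import numpy as np


def histogram_as_groupby(data, bins):
    """Count values per bin by using the bin index as a group label.

    Hypothetical helper for illustration only, not part of flox or xarray.
    """
    data = np.asarray(data).ravel()
    bins = np.asarray(bins)
    nbins = len(bins) - 1

    # Step 1: label every value with the index of the bin it falls into.
    labels = np.searchsorted(bins, data, side="right") - 1

    # Values outside the bin range belong to no group and are dropped.
    in_range = (labels >= 0) & (labels < nbins)

    # Step 2: "group by label, then count", i.e. sum a 1 into each group.
    counts = np.zeros(nbins, dtype=np.int64)
    np.add.at(counts, labels[in_range], 1)
    return counts


rng = np.random.default_rng(0)
data = rng.normal(size=1000)
bins = np.linspace(-4, 4, 9)
counts = histogram_as_groupby(data, bins)
```

Swapping the "sum of ones" in step 2 for a sum of weights would give a weighted histogram, which is part of why the groupby framing is attractive.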

Here's a WIP notebook comparing the two approaches.

Both approaches can feasibly do:

  • Histograms which leave some dimensions excluded (broadcast over),
  • Multi-dimensional histograms (e.g. binning two different variables into one 2D bin),
  • Normalized histograms (return PDFs instead of counts),
  • Weighted histograms,
  • Multi-dimensional bins (as @aaronspring asks for above, but it requires work: see how to do it in flox, and my stalled PR to xhistogram).
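
For reference, this is what several of those features look like for plain 1D arrays in NumPy. It is only a baseline sketch of the semantics (neither xhistogram nor flox is used here), and the variable names and random data are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(size=500)
y = rng.normal(size=500)
w = rng.uniform(size=500)  # illustrative weights, one per sample
bins = np.linspace(-5, 5, 11)

# Plain counts per bin.
counts, _ = np.histogram(x, bins=bins)

# Weighted histogram: each sample contributes its weight instead of 1.
weighted, _ = np.histogram(x, bins=bins, weights=w)

# Normalized histogram: a PDF estimate that integrates to 1 over the bins.
pdf, _ = np.histogram(x, bins=bins, density=True)

# Multi-dimensional histogram: two variables binned jointly into 2D bins.
joint, _, _ = np.histogram2d(x, y, bins=[bins, bins])
```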

Pros of using flox-powered reductions:

  • Much less code - the flox approach is basically one call to flox.
  • Fewer codepaths, with groupby logic and all histogram functionality flowing through the flox.xarray_reduce codepath.
  • Likely clearer code than the kinda impenetrable reshaped bincount logic lurking in the depths of xhistogram.
  • Supporting new features (e.g. multidimensional bins) should be simpler in flox because the options don't have to be propagated all the way down to the level of the np.bincount caller.
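
For readers who have not looked inside xhistogram, the general "reshaped bincount" trick can be sketched as below. This is my own simplified illustration of the idea, not xhistogram's actual code: the function name is hypothetical, it only handles a 2D array histogrammed along its last axis, and it assumes half-open bins `[left, right)`.

```python
import numpy as np


def bincount_histogram_last_axis(data, bins):
    """Histogram a 2D array along its last axis, keeping the first axis.

    Hypothetical helper for illustration only.
    """
    data = np.asarray(data)
    bins = np.asarray(bins)
    nbins = len(bins) - 1
    nrows = data.shape[0]

    # Bin index of every element; valid indices are 0..nbins-1.
    idx = np.searchsorted(bins, data, side="right") - 1
    valid = (idx >= 0) & (idx < nbins)

    # Shift each row's indices into its own disjoint range so that a single
    # flat bincount call can count all rows at once.
    row_offsets = np.arange(nrows)[:, np.newaxis] * nbins
    flat = (idx + row_offsets)[valid]

    counts = np.bincount(flat, minlength=nrows * nbins)
    # Undo the flattening: one row of bin counts per input row.
    return counts.reshape(nrows, nbins)


rng = np.random.default_rng(0)
data = rng.normal(size=(3, 200))
bins = np.linspace(-4, 4, 9)
hist = bincount_histogram_last_axis(data, bins)
```

Every extra option (weights, density, more binned variables) has to be threaded through this index arithmetic, which is the propagation cost mentioned in the last bullet.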

Pros of using xhistogram's blockwise bincount approach:

  • Absolute speed of xhistogram appears to be 3-4x higher, and that's using numpy_groupies in flox. Possibly flox could be faster if using numba but not sure yet.
  • Dask graph simplicity. xhistogram literally uses blockwise, whereas the flox graphs are, IIUC, blockwise-like but currently a specially-constructed HLG. (This also matters for supporting other parallel backends.) I suspect that in practice both perform similarly well after graph optimization, but I have not tested this at scale, and flox's graph might be more sensitive to extra steps in the calculation such as adding weights or normalising the result.
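
The property that makes a blockwise approach possible is that counts are additive across chunks. A small NumPy sketch of that property, with `np.array_split` standing in for dask chunks (this is an assumption for illustration, not how either library builds its graph):

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(size=10_000)
bins = np.linspace(-4, 4, 17)

# Pretend each piece is one chunk of a chunked array.
chunks = np.array_split(data, 8)

# Histogram every chunk independently (the "blockwise" step)...
partial = [np.histogram(chunk, bins=bins)[0] for chunk in chunks]

# ...then combine by summing the per-chunk counts.
combined = np.sum(partial, axis=0)
```

Note that normalising has to happen after the combine step, since per-chunk PDFs cannot simply be summed.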

Other thoughts:

  • Flox has various clever schemes for making general chunked groupby operations run more efficiently, but I don't think histogramming would really benefit from those unless there is a strong pattern to which values likely fall in which bins, that is known a priori.
  • Deepak's example using flox uses pandas.IntervalIndex to represent the bins on the result object, whereas xhistogram just returns the mid-points of the bins, throwing that info away. This seems like a cool idea on its own, but probably requires some extra work to make sure it's handled by the indexes refactor and the plotting code.
  • In my comparison notebook there's something I'm missing that's causing my "real example" (from the xhistogram docs) to not actually use the provided weights. I suspect it's something simple; any idea, @dcherian?
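
On the pandas.IntervalIndex point above, a small sketch of what is kept versus thrown away. The bin edges are made up for illustration, and this uses pandas directly rather than either library's output:

```python
import numpy as np
import pandas as pd

# Deliberately unequal bin widths (illustrative values).
bins = np.array([0.0, 1.0, 2.0, 4.0])

# Interval representation: keeps both edges of every bin
# (closed on the right by default).
intervals = pd.IntervalIndex.from_breaks(bins)

# Mid-point representation: one number per bin.
midpoints = 0.5 * (bins[:-1] + bins[1:])

# The mid-points can always be derived from the intervals...
derived_mid = np.asarray(intervals.mid)
# ...and the edges and widths are still available too.
left_edges = np.asarray(intervals.left)
widths = np.asarray(intervals.length)
```

Going the other way does not work in general: with unequal widths, the mid-points alone do not determine the bin edges.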

xref https://github.com/xgcm/xhistogram/issues/60, https://github.com/xgcm/xhistogram/issues/28

reactions:
{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
issue: 750985364