issue_comments: 844322268

html_url issue_url id node_id user created_at updated_at author_association body reactions performed_via_github_app issue
https://github.com/pydata/xarray/issues/4610#issuecomment-844322268 https://api.github.com/repos/pydata/xarray/issues/4610 844322268 MDEyOklzc3VlQ29tbWVudDg0NDMyMjI2OA== 35968931 2021-05-19T17:37:49Z 2021-05-28T14:07:25Z MEMBER

Update on this: in a PR to xhistogram we have a rough proof-of-principle for a dask-parallelized, axis-aware implementation of N-dimensional histogram calculations, suitable for eventually integrating into xarray.

We still need to complete the work over in xhistogram, but for now I want to suggest what I think the eventual API should be for this functionality within xarray:

Top-level function

xhistogram's xarray API is essentially one histogram function, which accepts one or more xarray DataArrays. Therefore it makes sense to add histogram or hist as a top-level function, similar to the existing cov, corr, dot or polyval.

New methods

We could also add a dataarray.hist() method for 1-D histograms and possibly a dataset.hist(vars=['density', 'temperature']) for quickly making N-D histograms.
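
To make the N-D case concrete, here is a minimal sketch of the kind of result a call like dataset.hist(vars=['density', 'temperature']) might produce. It uses plain `numpy.histogram2d` as a stand-in, not the proposed xarray method, and the synthetic data and bin edges are assumptions made up for illustration.

```python
import numpy as np

# Synthetic stand-ins for two variables of a dataset (illustrative values only).
rng = np.random.default_rng(0)
density = rng.normal(1025.0, 1.0, size=1000)
temperature = rng.normal(10.0, 2.0, size=1000)

# One set of bin edges per variable, as a list-of-arrays `bins` argument would supply.
density_edges = np.linspace(1020.0, 1030.0, 11)   # 10 bins
temperature_edges = np.linspace(0.0, 20.0, 6)     # 5 bins

# Joint 2-D histogram: one output axis per binned variable.
counts, _, _ = np.histogram2d(
    density, temperature, bins=[density_edges, temperature_edges]
)
print(counts.shape)  # (10, 5)
```

Each input variable contributes one bin dimension to the output, which is why the number of inputs sets the dimensionality of the histogram.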

The existing plot.hist method (the tricky bit)

There is already a da.plot.hist() method, a paper-thin wrapper around matplotlib.pyplot.hist that flattens the dataarray before plotting the result. It would be nice if this internally dispatched to the new da.hist() method before plotting the result, but pyplot.hist does both the bincounting and the plotting, so it might not be simple to do that.

This is also potentially related to @dcherian's PR for facets and hue with hist, in that a totally consistent treatment would use the axis-aware histogram algorithm to calculate each separate facet histogram by looping over non-binned dimensions. Again the problem is that AFAIK matplotlib doesn't offer a quick way to plot a histogram without recomputing the bins and counts. Any suggestions here?
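
One possible workaround, offered as a sketch rather than a settled design: `hist` can be handed precomputed counts through its `weights` argument, by passing each bin's left edge as a single "sample". The counts still go through matplotlib's binning step, but it is trivial and reproduces the precomputed values. The data below is synthetic and `np.histogram` stands in for whatever da.hist() would compute.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=500)

# Step 1: bin counting, done once (the part a da.hist() method would own).
counts, edges = np.histogram(data, bins=20)

# Step 2: plotting only. Each bin's left edge acts as one sample that is
# weighted by the precomputed count of that bin.
fig, ax = plt.subplots()
plotted, plotted_edges, _ = ax.hist(edges[:-1], bins=edges, weights=counts)
```

The values returned by `ax.hist` match the precomputed `counts` and `edges`, so the plot reflects the earlier calculation. Recent matplotlib versions also have `Axes.stairs(values, edges)`, which draws directly from counts and edges.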

Adding an optional dim argument to da.plot.hist() doesn't really make sense unless we also add faceting, because otherwise the only shape of result that da.plot.hist() could actually plot is one where we have binned over the entire flattened array.
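
The difference in result shape can be sketched with plain numpy. The helper `hist_along_axis` below is hypothetical (it is not an xarray or xhistogram function) and uses `np.apply_along_axis` for clarity rather than speed; the array stands in for a dataarray with assumed dims ("x", "time").

```python
import numpy as np

def hist_along_axis(values, edges, axis):
    """Count values into `edges` along one axis, keeping the other axes."""
    def count_1d(vector):
        counts, _ = np.histogram(vector, bins=edges)
        return counts
    return np.apply_along_axis(count_1d, axis, values)

rng = np.random.default_rng(0)
values = rng.uniform(0.0, 1.0, size=(4, 250))  # think of dims ("x", "time")
edges = np.linspace(0.0, 1.0, 6)               # 5 bins

# Binning over the entire flattened array gives a single 1-D histogram.
flattened, _ = np.histogram(values, bins=edges)

# Binning over "time" only leaves one histogram per "x" value.
per_x = hist_along_axis(values, edges, axis=1)

print(flattened.shape)  # (5,)
print(per_x.shape)      # (4, 5)
```

The flattened result is a single curve that can be drawn directly, while the axis-aware result has an extra leading dimension that would need faceting (or summing over "x") before it could be plotted as one histogram.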

It would also be nice if da.hist().plot.hist() was identical to da.plot.hist(), which requires the format of the output of da.hist() to be compatible with da.plot.hist().

(We shouldn't need any kind of new da.plot.hist2d() method because as the xhistogram docs show you can already make very nice 2D histogram plots with da.plot().)

Signature

xhistogram adds bin coordinates (the bin centers) to the output dataarray, named after the quantities that were binned.

Following what we currently have, a minimal top-level function signature looks like

```python
def hist(*datarrays, bins=None, dim=None, weights=None, density=False):
    """
    Histogram applied along specified dimensions.

    If any of the supplied arguments are dask arrays it will use `dask.array.blockwise`
    internally to parallelize over all chunks.

    datarrays : xarray.DataArray objects
        Input data. The number of input arguments determines the dimensionality of
        the histogram. For example, two arguments produce a 2D histogram.
    bins : int or array_like or a list of ints or arrays, or list of DataArrays, optional
        If a list, there should be one entry for each item in ``datarrays``.
        The bin specification:

          * If int, the number of bins for all arguments in ``datarrays``.
          * If array_like, the bin edges for all arguments in ``datarrays``.
          * If a list of ints, the number of bins for every argument in ``datarrays``.
          * If a list of arrays, the bin edges for each argument in ``datarrays``
            (required format for Dask inputs).
          * A combination [int, array] or [array, int], where int
            is the number of bins and array is the bin edges.
          * If a list of DataArrays, the bins for each argument in ``datarrays``.
            The DataArrays can be multidimensional, but must not have any
            dimensions shared with the `dim` argument.

        When bin edges are specified, all but the last (righthand-most) bin include
        the left edge and exclude the right edge. The last bin includes both edges.

        A ``TypeError`` will be raised if ``datarrays`` contains dask arrays and
        ``bins`` are not specified explicitly as a list of arrays.
    dim : tuple of strings, optional
        Dimensions over which the histogram is computed. The default is to
        compute the histogram of the flattened array.
    weights : array_like, optional
        An array of weights, of the same shape as the input data. Each value in
        the input only contributes its associated weight towards the bin count
        (instead of 1). If `density` is True, the weights are
        normalized, so that the integral of the density over the range
        remains 1. NaNs in the weights input will fill the entire bin with
        NaNs. If there are NaNs in the weights input call ``.fillna(0.)``
        before running ``hist()``.
    density : bool, optional
        If ``False``, the result will contain the number of samples in
        each bin. If ``True``, the result is the value of the
        probability *density* function at the bin, normalized such that
        the *integral* over the range is 1. Note that the sum of the
        histogram values will not be equal to 1 unless bins of unity
        width are chosen; it is not a probability *mass* function.
    """
```
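
The bin-edge and `density` conventions in the docstring match those of `numpy.histogram`, so a small numpy example can illustrate them. This is only a sketch of the semantics, with made-up data, not the proposed xarray function.

```python
import numpy as np

edges = np.array([0.0, 1.0, 2.0, 4.0])   # bins: [0, 1), [1, 2), [2, 4]
data = np.array([0.5, 1.0, 1.5, 4.0])

# 1.0 lands in the second bin (left edge included, right edge excluded);
# 4.0 lands in the last bin because that bin includes both of its edges.
counts, _ = np.histogram(data, bins=edges)
print(counts)  # [1 2 1]

# With density=True each count is divided by (total count * bin width).
density, _ = np.histogram(data, bins=edges, density=True)
widths = np.diff(edges)
print(density.sum())             # 0.875
print((density * widths).sum())  # 1.0
```

Because the last bin is two units wide, the density values sum to 0.875 while their integral over the range is 1, which is the distinction between a probability density and a probability mass function that the docstring points out.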

Weights could also possibly be set via the .weighted() method that we already have for other operations.

Checklist

Desired features in order to fully deprecate xhistogram:

- [ ] axis-aware (ability to loop over dimensions instead of binning over them)
- [ ] optional dask parallelization across all dimensions
- [ ] weights
- [ ] ~~accept dask-aware bin arrays?~~
- [ ] accept multi-dimensional bins arguments? (see https://github.com/xgcm/xhistogram/issues/28)
- any others?

cc @dougiesquire @gjoseph92

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  750985364