home / github

Menu
  • GraphQL API
  • Search all tables

issue_comments

Table actions
  • GraphQL API for issue_comments

21 rows where issue = 750985364 sorted by updated_at descending

✎ View and edit SQL

This data as json, CSV (advanced)

Suggested facets: reactions, created_at (date), updated_at (date)

user 5

  • TomNicholas 10
  • aaronspring 6
  • Illviljan 3
  • dcherian 1
  • dougiesquire 1

author_association 3

  • MEMBER 14
  • CONTRIBUTOR 6
  • NONE 1

issue 1

  • Add histogram method · 21 ✖
id html_url issue_url node_id user created_at updated_at ▲ author_association body reactions performed_via_github_app issue
1433726618 https://github.com/pydata/xarray/issues/4610#issuecomment-1433726618 https://api.github.com/repos/pydata/xarray/issues/4610 IC_kwDOAMm_X85VdO6a TomNicholas 35968931 2023-02-16T21:17:56Z 2023-02-16T21:17:56Z MEMBER

it was triggering a load

Can we not just test the in-memory performance by .load()-ing first? Then worry about dask performance? That's what I was vaguely getting at in my comment, trying the in-memory performance but also plotting the dask graph.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Add histogram method 750985364
1433686861 https://github.com/pydata/xarray/issues/4610#issuecomment-1433686861 https://api.github.com/repos/pydata/xarray/issues/4610 IC_kwDOAMm_X85VdFNN Illviljan 14371165 2023-02-16T20:39:54Z 2023-02-16T20:39:54Z MEMBER

Nice, I was looking at the real example too, Temp_url = 'http://apdrc.soest.hawaii.edu:80/dods/public_data/WOA/WOA13/5_deg/annual/temp' etc.., and it was triggering a load in set_dims:

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Add histogram method 750985364
1433675446 https://github.com/pydata/xarray/issues/4610#issuecomment-1433675446 https://api.github.com/repos/pydata/xarray/issues/4610 IC_kwDOAMm_X85VdCa2 TomNicholas 35968931 2023-02-16T20:29:25Z 2023-02-16T20:29:25Z MEMBER

Could you show the example that's this slow, @TomNicholas ? So I can play around with it too.

I think I just timed the difference in the (unweighted) "real" example I gave in the notebook. (Not the weighted one because that didn't give the right answer with flox for some reason.)

One thing I noticed in your notebook is that you haven't used chunks={} on the open_dataset. Which seems to trigger data loading on strange places in xarray (places that calls self.data), but I'm not sure this is your actual problem.

Fair point, worth trying.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Add histogram method 750985364
1433670641 https://github.com/pydata/xarray/issues/4610#issuecomment-1433670641 https://api.github.com/repos/pydata/xarray/issues/4610 IC_kwDOAMm_X85VdBPx Illviljan 14371165 2023-02-16T20:24:51Z 2023-02-16T20:25:36Z MEMBER
  • Absolute speed of xhistogram appears to be 3-4x higher, and that's using numpy_groupies in flox. Possibly flox could be faster if using numba but not sure yet.

Could you show the example that's this slow, @TomNicholas ? So I can play around with it too.

One thing I noticed in your notebook is that you haven't used chunks={} on the open_dataset. Which seems to trigger data loading on strange places in xarray (places that calls self.data), but I'm not sure this is your actual problem.

{
    "total_count": 1,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 1,
    "rocket": 0,
    "eyes": 0
}
  Add histogram method 750985364
1425198851 https://github.com/pydata/xarray/issues/4610#issuecomment-1425198851 https://api.github.com/repos/pydata/xarray/issues/4610 IC_kwDOAMm_X85U8s8D dcherian 2448579 2023-02-10T05:38:13Z 2023-02-10T05:38:31Z MEMBER

Absolute speed of xhistogram appears to be 3-4x higher, and that's using numpy_groupies in flox. Possibly flox could be faster if using numba but not sure yet.

Nah, in my experience, the overhead is "factorizing" (pd.cut/np.digitize) or converting to integer bins, and then converting the nD problem to a 1D problem for bincount. numba doesn't really help.


3-4x is a lot bigger than I expected. I was hoping for under 2x because flox is more general.

I think the problem is pandas.cut is a lot slower than np.digitize

We could swap that out easily here: https://github.com/xarray-contrib/flox/blob/daebc868c13dad74a55d74f3e5d24e0f6bbbc118/flox/core.py#L473

I think the one special case to consider is binning datetimes, and that digitize and pd.cut have different defaults for side or closed.


Dask graphs simplicity. Xhistogram literally uses blockwise, whereas the flox graphs IIUC are blockwise-like but actually a specially-constructed HLG right now. (

blockwise and sum.

Ideallyflox would use a reduction that takes 2 array arguments (array to reduce, array to group by). Currently both cubed and dask onlt accept one argument.

As a workaround, we could replace dask.array._tree_reduce with dask.array.reduction(chunk=lambda x: x, ...) and then it would more or less all be public API that is common to dask and cubed.

Flox has various clever schemes for making general chunked groupby operations run more efficiently, but I don't think histogramming would really benefit from those unless there is a strong pattern to which values likely fall in which bins, that is known a priori.

Yup. unlikely to help here.

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Add histogram method 750985364
1423144049 https://github.com/pydata/xarray/issues/4610#issuecomment-1423144049 https://api.github.com/repos/pydata/xarray/issues/4610 IC_kwDOAMm_X85U03Rx TomNicholas 35968931 2023-02-08T19:40:58Z 2023-02-08T20:25:04Z MEMBER

Q: Use xhistogram approach or flox-powered approach?

@dcherian recently showed how his flox package can perform histograms as groupby-like reductions. This begs the question of which approach would be better to use in a histogram function in xarray.

(This is related to but better than what we had tried previously with xarray groupby and numpy_groupies.)

Here's a WIP notebook comparing the two approaches.

Both approaches can feasibly do: - Histograms which leave some dimensions excluded (broadcast over), - Multi-dimensional histograms (e.g. binning two different variables into one 2D bin), - Normalized histograms (return PDFs instead of counts), - Weighted histograms, - Multi-dimensional bins (as @aaronspring asks for above - but it requires work - see how to do it flox, and my stalled PR to xhistogram).

Pros of using flox-powered reductions:

  • Much less code - the flox approach is basically one call to flox.
  • Fewer codepaths, with groupby logic and all histogram functionality flowing through the flox.xarray_reduce codepath.
  • Likely clearer code than the kinda impenetrable reshaped bincount logic lurking in the depths of xhistogram.
  • Supporting new features (e.g. multidimensional bins) should be simpler in flox because the options don't have to be propagated all the way down to the level of the np.bincount caller.

Pros of using xhistogram's blockwise bincount approach:

  • Absolute speed of xhistogram appears to be 3-4x higher, and that's using numpy_groupies in flox. Possibly flox could be faster if using numba but not sure yet.
  • Dask graphs simplicity. Xhistogram literally uses blockwise, whereas the flox graphs IIUC are blockwise-like but actually a specially-constructed HLG right now. (Also important for supporting other parallel backends.) I suspect that in practice both perform similarly well after graph optimization but I have not tested this at scale, and flox's graph might be more sensitive to extra steps in the calculation like adding weights or normalising the result.

Other thoughts:

  • Flox has various clever schemes for making general chunked groupby operations run more efficiently, but I don't think histogramming would really benefit from those unless there is a strong pattern to which values likely fall in which bins, that is known a priori.
  • Deepak's example using flox uses pandas.IntervalIndex to represent the bins on the result object, whereas xhistogram just returns the mid-points of the bins, throwing that info away. This seems like a cool idea on it's own, but probably requires some extra work to make sure it's handled by the indexes refactor and the plotting code.
  • In my comparison notebook here's something I'm missing that's causing my "real example" (from xhistogram docs) to not actually use the provided weights. I suspect its something simple, any idea @dcherian?

xref https://github.com/xgcm/xhistogram/issues/60, https://github.com/xgcm/xhistogram/issues/28

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Add histogram method 750985364
858693187 https://github.com/pydata/xarray/issues/4610#issuecomment-858693187 https://api.github.com/repos/pydata/xarray/issues/4610 MDEyOklzc3VlQ29tbWVudDg1ODY5MzE4Nw== TomNicholas 35968931 2021-06-10T14:54:31Z 2021-06-10T14:54:31Z MEMBER

We may want to also reimplement it using numpy_groupies, which I think is smarter than our implementation in xhistogram.

Given the performance I found in https://github.com/xgcm/xhistogram/issues/60, I think we probably want to use the dask.blockwise approach instead of the numpy_groupies one.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Add histogram method 750985364
853143060 https://github.com/pydata/xarray/issues/4610#issuecomment-853143060 https://api.github.com/repos/pydata/xarray/issues/4610 MDEyOklzc3VlQ29tbWVudDg1MzE0MzA2MA== TomNicholas 35968931 2021-06-02T15:51:28Z 2021-06-02T15:51:36Z MEMBER

Okay great, thanks for the patient explanation @aaronspring ! Will tag you when this has progressed to the point that you can try it out.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Add histogram method 750985364
853138609 https://github.com/pydata/xarray/issues/4610#issuecomment-853138609 https://api.github.com/repos/pydata/xarray/issues/4610 MDEyOklzc3VlQ29tbWVudDg1MzEzODYwOQ== aaronspring 12237157 2021-06-02T15:45:45Z 2021-06-02T15:45:45Z CONTRIBUTOR

This makes sense, but it sounds like this suggestion (of accepting Datasets not just DataArrays) is mostly a convenience tool for applying histograms to particular variables across multiple datasets quickly. It's not fundamentally different to picking and choosing the variables you want from multiple datasets and feeding them in to histogram as dataarrays.

agree.

I think we should focus on including features that enable analyses that would otherwise be difficult or impossible, for example ND bins

looking forward to the PR

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Add histogram method 750985364
853123125 https://github.com/pydata/xarray/issues/4610#issuecomment-853123125 https://api.github.com/repos/pydata/xarray/issues/4610 MDEyOklzc3VlQ29tbWVudDg1MzEyMzEyNQ== TomNicholas 35968931 2021-06-02T15:26:06Z 2021-06-02T15:26:06Z MEMBER

my point about the bins is that if the inputs are two xr.datasets, then also the bins should be two xr.datasets.

This makes sense, but it sounds like this suggestion (of accepting Datasets not just DataArrays) is mostly a convenience tool for applying histograms to particular variables across multiple datasets quickly. It's not fundamentally different to picking and choosing the variables you want from multiple datasets and feeding them in to histogram as dataarrays. That kind of selecting and organising pre-calculation is in my opinion problem-specific and something that's more appropriate in the library that calls xarray (i.e. xskillscore).

I think we should focus on including features that enable analyses that would otherwise be difficult or impossible, for example ND bins: without allowing bins to be >1D at a low level internally then it would be fairly difficult to replicate the same functionality just by wrapping histogram.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Add histogram method 750985364
852944667 https://github.com/pydata/xarray/issues/4610#issuecomment-852944667 https://api.github.com/repos/pydata/xarray/issues/4610 MDEyOklzc3VlQ29tbWVudDg1Mjk0NDY2Nw== aaronspring 12237157 2021-06-02T11:22:07Z 2021-06-02T11:22:07Z CONTRIBUTOR

I like your explanation of the two different inputs @dougiesquire and for multi-dim datasets these must be xr.datasets. my point about the bins is that if the inputs are two xr.datasets, then also the bins should be two xr.datasets. If the bins were only two xr.Arrays, then how could these bins discriminate for the different variables in the inputs?

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Add histogram method 750985364
852836671 https://github.com/pydata/xarray/issues/4610#issuecomment-852836671 https://api.github.com/repos/pydata/xarray/issues/4610 MDEyOklzc3VlQ29tbWVudDg1MjgzNjY3MQ== dougiesquire 42455466 2021-06-02T08:12:06Z 2021-06-02T08:12:06Z NONE

We have a very thin wrapper of xhistogram in xskillscore for calculating histograms from Datasets. It simply calculates the histograms independently for all variables that exist in all Datasets. This makes sense in the context of calculating skill score where the first Dataset corresponds to observations and the second to forecasts, and we want to calculate the histograms between matched variables in each dataset. However, this might be quite a specific use case and is probably not what we'd want to do in the general case. I like @TomNicholas 's proposal for Dataset functionality.

Is this what you're getting at @aaronspring ? Or am I misunderstanding?

{
    "total_count": 2,
    "+1": 2,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Add histogram method 750985364
852428526 https://github.com/pydata/xarray/issues/4610#issuecomment-852428526 https://api.github.com/repos/pydata/xarray/issues/4610 MDEyOklzc3VlQ29tbWVudDg1MjQyODUyNg== aaronspring 12237157 2021-06-01T20:36:25Z 2021-06-01T20:36:25Z CONTRIBUTOR

I am unsure about this and cannot manage to put my Südasien down precisely.

Calculating a contingency table for instance between two multivar inputs:

ˋˋˋ xhistogram(ds_observations_multivar, ds_forecast_multivar, bins=[ds_obs_multivar_edges, ds_forecast_multivar_edges ]) ˋˋˋ

as in https://github.com/xarray-contrib/xskillscore/blob/6f7be06098eefa1cdb90f7319f577c274621301c/xskillscore/core/contingency.py#L156

maybe @dougiesquire can phrase this more precisely

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Add histogram method 750985364
852369352 https://github.com/pydata/xarray/issues/4610#issuecomment-852369352 https://api.github.com/repos/pydata/xarray/issues/4610 MDEyOklzc3VlQ29tbWVudDg1MjM2OTM1Mg== TomNicholas 35968931 2021-06-01T18:59:37Z 2021-06-01T18:59:37Z MEMBER

For each dataset in what? Do you mean for each input dataarray? I'm proposing an API in which you either pass multiple DataArrays as data (what xhistogram currently accepts), or you can call .histogram() as a method on a single dataset, equivalent to passing in all the data_vars of that one dataset. What would passing multiple datasets to histogram() allow you to do?

If bins is only a dataArray, I cannot have this. Can I?

If bins can be a list of multiple dataarrays then you can have this, right? i.e.

python histogram(da1, da2, bins=[bins_for_da1, bins_for_da2]) where bins_for_da1 is itself an ND xr.DataArray. With that you can have different bins for different data variables, as well as multidimensional bins that can vary over time/quantile etc.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Add histogram method 750985364
852364749 https://github.com/pydata/xarray/issues/4610#issuecomment-852364749 https://api.github.com/repos/pydata/xarray/issues/4610 MDEyOklzc3VlQ29tbWVudDg1MjM2NDc0OQ== aaronspring 12237157 2021-06-01T18:51:47Z 2021-06-01T18:51:47Z CONTRIBUTOR

What I'm unclear about is what you want to achieve by inputting an xarray.Dataset that couldn't be done with inputs of ND xr.DataArrays as both data and bins?

with dataset bins I want to have different bin_edges for each dataset. If bins is only a dataArray, I cannot have this. Can I?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Add histogram method 750985364
852257864 https://github.com/pydata/xarray/issues/4610#issuecomment-852257864 https://api.github.com/repos/pydata/xarray/issues/4610 MDEyOklzc3VlQ29tbWVudDg1MjI1Nzg2NA== TomNicholas 35968931 2021-06-01T16:21:39Z 2021-06-01T16:21:53Z MEMBER

I cannot find this in #5400. I should checkout and run the code locally.

5400 is right now just a skeleton, it won't compute anything other than a NotImplementedError.

defining bin edges based on quantiles of the climatology

One of the bullets above is for N-dimensional bins, passed as xr.DataArrays. If we allow multidimensional xr.DataArrays as bins, then you could pass bins which changed at each quantile in that way.

What I'm unclear about is what you want to achieve by inputting an xarray.Dataset that couldn't be done with inputs of ND xr.DataArrays as both data and bins?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Add histogram method 750985364
852231700 https://github.com/pydata/xarray/issues/4610#issuecomment-852231700 https://api.github.com/repos/pydata/xarray/issues/4610 MDEyOklzc3VlQ29tbWVudDg1MjIzMTcwMA== aaronspring 12237157 2021-06-01T15:45:00Z 2021-06-01T15:45:00Z CONTRIBUTOR

I tried to show in https://gist.github.com/aaronspring/251553f132202cc91aadde03f2a452f9 how I would like to use xr.Datasets as bins, e.g. defining bin edges based on quantiles of the climatology, i.e. bin edges depend on lon, lat, maybe weekofyear and variable.

I tried show in the gist that I could be also nice to allow xr.Datasets as bins if the inputs are xr.Datasets.

You can already choose different bins to use for each input variable`

I cannot find this in #5400. I should checkout and run the code locally.

Yep, the example xskillscore code posted doesnt allow nd bins. forgot that. correct. in my head thinking about the future it does. https://github.com/xarray-contrib/xskillscore/blob/6f7be06098eefa1cdb90f7319f577c274621301c/xskillscore/core/probabilistic.py#L498 takes xr.Datasets as bins and in a previous version we used xhist but then changed to make this run on nd arrays and xr.Datasets.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Add histogram method 750985364
852174014 https://github.com/pydata/xarray/issues/4610#issuecomment-852174014 https://api.github.com/repos/pydata/xarray/issues/4610 MDEyOklzc3VlQ29tbWVudDg1MjE3NDAxNA== TomNicholas 35968931 2021-06-01T14:31:01Z 2021-06-01T14:31:01Z MEMBER

@aaronspring I'm a bit confused by your comment.

The (proposed) API in #5400 does have a Dataset.hist() method, but it would just create an N-D histogram of the N variables in the dataset. The idea being that if I had only loaded the variables of interest I could immediately make the histogram of interest, e.g. python ds = open(file, vars=['temperature', 'salinity']) ds.hist() # creates 2D temperature-salinity histogram

That's not the same thing as using Datasets as bins though - but I'm not really sure I understand the use case for that or what that allows? You can already choose different bins to use for each input variable, are you saying it would be neater if you could assign bins to input variables via a dict-like dataset rather than the arguments being in the corresponding positions in a list?

The example you linked doesn't pass datasets as bins either, it just loops over multiple input datasets and assumes you want to calculate joint histograms between those datasets.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Add histogram method 750985364
852120100 https://github.com/pydata/xarray/issues/4610#issuecomment-852120100 https://api.github.com/repos/pydata/xarray/issues/4610 MDEyOklzc3VlQ29tbWVudDg1MjEyMDEwMA== aaronspring 12237157 2021-06-01T13:22:25Z 2021-06-01T13:24:19Z CONTRIBUTOR

what about a list of xarray.Datasets as bins? suppose you have an xr.Dataset you want to bin different variables for different bins (eg from xr.Dataset.quantile)

@dougiesquire implemented this in https://github.com/xarray-contrib/xskillscore/blob/2217b58c536ec1b3d2c42265ed6689a740c2b3bf/xskillscore/core/utils.py#L133

EDIT: seeing now that this issue and #5400 aims to implement xr.DataArray.hist only. xr.Dataset would be also nice :)

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Add histogram method 750985364
844322268 https://github.com/pydata/xarray/issues/4610#issuecomment-844322268 https://api.github.com/repos/pydata/xarray/issues/4610 MDEyOklzc3VlQ29tbWVudDg0NDMyMjI2OA== TomNicholas 35968931 2021-05-19T17:37:49Z 2021-05-28T14:07:25Z MEMBER

Update on this: in a PR to xhistogram we have a rough proof-of-principle for a dask-parallelized, axis-aware implementation of N-dimensional histogram calculations, suitable for eventually integrating into xarray.

We still need to complete the work over in xhistogram, but for now I want to suggest what I think the eventual API should be for this functionality within xarray:

Top-level function

xhistogram's xarray API is essentially one histogram function, which accepts one or more xarray DataArrays. Therefore we it makes sense to add histogram or hist as a top-level function, similar to the existing cov, corr, dot or polyval.

New methods

We could also add a datarray.hist() method for 1-D histograms and possibly a dataset.hist(vars=['density', 'temperature']) for quickly making N-D histograms.

The existing plot.hist method (the tricky bit)

There is already a da.plot.hist() method, which is a paper-thin wrapper around matplotlib.pyplot.hist, which flattens the dataarray before plotting the result. It would be nice if this internally dispatched to the new da.hist() method before plotting the result, but pyplot.hist does both the bincounting and the plotting, so it might not be simple to do that.

This is also potentially related to @dcherian 's PR for facets and hue with hist, in that a totally consistent treatment would use the axis-aware histogram algorithm to calculate each separate facet histogram by looping over non-binned dimensions. Again the problem is that AFAIK matplotlib doesn't offer a quick way to plot a histogram without recomputing the bins and counts. Any suggestions here?

Adding an optional dim argument to da.plot.hist() doesn't really make sense unless we also add faceting, because otherwise the only shape of result that da.plot.hist() could actually plot is one where we have binned over the entire flattened array.

It would also be nice if da.hist().plot.hist() was identical to da.plot.hist() which requires the format of the output of da.hist() to be compatible with da.plot.hist().

(We shouldn't need any kind of new da.plot.hist2d() method because as the xhistogram docs show you can already make very nice 2D histogram plots with da.plot().)

Signature

xhistogram adds bin coordinates (the bin centers) to the output dataarray, named after the quantities that were binned.

Following what we currently have, a minimal top-level function signature looks like

```python def hist(*datarrays, bins=None, dim=None, weights=None, density=False): """ Histogram applied along specified dimensions.

If any of the supplied arguments are dask arrays it will use `dask.array.blockwise`
internally to parallelize over all chunks.

datarrays : xarray.DataArray objects
    Input data. The number of input arguments determines the dimensionality of
    the histogram. For example, two arguments prodoce a 2D histogram.
bins :  int or array_like or a list of ints or arrays, or list of DataArrays, optional
    If a list, there should be one entry for each item in ``args``.
    The bin specification:

      * If int, the number of bins for all arguments in ``args``.
      * If array_like, the bin edges for all arguments in ``args``.
      * If a list of ints, the number of bins  for every argument in ``args``.
      * If a list arrays, the bin edges for each argument in ``args``
        (required format for Dask inputs).
      * A combination [int, array] or [array, int], where int
        is the number of bins and array is the bin edges.
      * If a list of DataArrays, the bins for each argument in ``args``
        The DataArrays can be multidimensional, but must not have any
        dimensions shared with the `dim` argument.

    When bin edges are specified, all but the last (righthand-most) bin include
    the left edge and exclude the right edge. The last bin includes both edges.

    A ``TypeError`` will be raised if ``args`` contains dask arrays and
    ``bins`` are not specified explicitly as a list of arrays.
dim : tuple of strings, optional
    Dimensions over which which the histogram is computed. The default is to
    compute the histogram of the flattened array.
weights : array_like, optional
    An array of weights, of the same shape as `a`.  Each value in
    `a` only contributes its associated weight towards the bin count
    (instead of 1). If `density` is True, the weights are
    normalized, so that the integral of the density over the range
    remains 1. NaNs in the weights input will fill the entire bin with
    NaNs. If there are NaNs in the weights input call ``.fillna(0.)``
    before running ``histogram()``.
density : bool, optional
    If ``False``, the result will contain the number of samples in
    each bin. If ``True``, the result is the value of the
    probability *density* function at the bin, normalized such that
    the *integral* over the range is 1. Note that the sum of the
    histogram values will not be equal to 1 unless bins of unity
    width are chosen; it is not a probability *mass* function.
"""

```

Weights could also possibly be set via the .weighted() method that we already have for other operations.

Checklist

Desired features in order to fully deprecate xhistogram: - [ ] axis-aware (ability to loop over dimensions instead of binning over them) - [ ] optional dask parallelization across all dimensions - [ ] weights - [ ] ~~accept dask-aware bin arrays?~~ - [ ] accept multi-dimensional bins arguments? (see https://github.com/xgcm/xhistogram/issues/28) - any others?

cc @dougiesquire @gjoseph92

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Add histogram method 750985364
846418243 https://github.com/pydata/xarray/issues/4610#issuecomment-846418243 https://api.github.com/repos/pydata/xarray/issues/4610 MDEyOklzc3VlQ29tbWVudDg0NjQxODI0Mw== Illviljan 14371165 2021-05-22T14:46:13Z 2021-05-22T14:46:13Z MEMBER

but pyplot.hist does both the bincounting and the plotting, so it might not be simple to do that.

Should be fine I think. Matplolib explains how to use np.histogram-like results in the weights-parameter: https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.hist.html python counts, bins = np.histogram(data) plt.hist(bins[:-1], bins, weights=counts)

Some reading if wanting to do the plot by hand: https://stackoverflow.com/questions/5328556/histogram-matplotlib https://stackoverflow.com/questions/33203645/how-to-plot-a-histogram-using-matplotlib-in-python-with-a-list-of-data

{
    "total_count": 1,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 1,
    "rocket": 0,
    "eyes": 0
}
  Add histogram method 750985364

Advanced export

JSON shape: default, array, newline-delimited, object

CSV options:

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);
Powered by Datasette · Queries took 16.566ms · About: xarray-datasette