
issue_comments


19 rows where issue = 218459353 sorted by updated_at descending


user 9

  • shoyer 4
  • lumbric 3
  • aquasync 2
  • dcherian 2
  • andrew-c-ross 2
  • matteodefelice 2
  • fmaussion 2
  • leifdenby 1
  • andersy005 1

author_association 3

  • MEMBER 9
  • CONTRIBUTOR 6
  • NONE 4

issue 1

  • bottleneck : Wrong mean for float32 array · 19
id html_url issue_url node_id user created_at updated_at ▲ author_association body reactions performed_via_github_app issue
1119787557 https://github.com/pydata/xarray/issues/1346#issuecomment-1119787557 https://api.github.com/repos/pydata/xarray/issues/1346 IC_kwDOAMm_X85Cvpol dcherian 2448579 2022-05-06T16:22:32Z 2022-05-06T16:22:32Z MEMBER

On second thought we should add this to a FAQ page.

{
    "total_count": 2,
    "+1": 2,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  bottleneck : Wrong mean for float32 array 218459353
1119786892 https://github.com/pydata/xarray/issues/1346#issuecomment-1119786892 https://api.github.com/repos/pydata/xarray/issues/1346 IC_kwDOAMm_X85CvpeM dcherian 2448579 2022-05-06T16:21:42Z 2022-05-06T16:21:42Z MEMBER

Yes that sounds right. Thanks!

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  bottleneck : Wrong mean for float32 array 218459353
1119770101 https://github.com/pydata/xarray/issues/1346#issuecomment-1119770101 https://api.github.com/repos/pydata/xarray/issues/1346 IC_kwDOAMm_X85CvlX1 andersy005 13301940 2022-05-06T16:01:44Z 2022-05-06T16:01:44Z MEMBER
  • https://github.com/pydata/xarray/pull/5560 introduced a `use_bottleneck` option to enable/disable using bottleneck. Can we close this issue or keep it open?
{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  bottleneck : Wrong mean for float32 array 218459353
464338041 https://github.com/pydata/xarray/issues/1346#issuecomment-464338041 https://api.github.com/repos/pydata/xarray/issues/1346 MDEyOklzc3VlQ29tbWVudDQ2NDMzODA0MQ== lumbric 691772 2019-02-16T11:20:20Z 2019-02-16T11:20:20Z CONTRIBUTOR

Oh yes, of course! I've underestimated the low precision of float32 values above 2**24. Thanks for the hint.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  bottleneck : Wrong mean for float32 array 218459353
463324373 https://github.com/pydata/xarray/issues/1346#issuecomment-463324373 https://api.github.com/repos/pydata/xarray/issues/1346 MDEyOklzc3VlQ29tbWVudDQ2MzMyNDM3Mw== lumbric 691772 2019-02-13T19:02:52Z 2019-02-16T10:53:51Z CONTRIBUTOR

I think (!) xarray is no longer affected, but pandas is. Bisecting the Git history leads to commit 0b9ab2d1, which means that xarray >= v0.10.9 should not be affected. Uninstalling bottleneck is also a valid workaround.

<s>Bottleneck's documentation explicitly mentions that no error is raised in case of an overflow. But it seems to be very evil behavior, so it might be worth reporting upstream.</s> What do you think? (I think kwgoodman/bottleneck#164 is something different, isn't it?) Edit: this is not an overflow. It's a numerical error from not applying pairwise summation.

A couple of minimal examples:

```python
>>> import numpy as np
>>> import pandas as pd
>>> import xarray as xr
>>> import bottleneck as bn

>>> bn.nanmean(np.ones(2**25, dtype=np.float32))
0.5
>>> pd.Series(np.ones(2**25, dtype=np.float32)).mean()
0.5
>>> xr.DataArray(np.ones(2**25, dtype=np.float32)).mean()  # not affected in this version
<xarray.DataArray ()>
array(1., dtype=float32)
```

Done with the following versions:

```bash
$ pip3 freeze
Bottleneck==1.2.1
numpy==1.16.1
pandas==0.24.1
xarray==0.11.3
...
```

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  bottleneck : Wrong mean for float32 array 218459353
464115604 https://github.com/pydata/xarray/issues/1346#issuecomment-464115604 https://api.github.com/repos/pydata/xarray/issues/1346 MDEyOklzc3VlQ29tbWVudDQ2NDExNTYwNA== shoyer 1217238 2019-02-15T16:39:08Z 2019-02-15T16:39:08Z MEMBER

The difference is that Bottleneck does the sum in the naive way, whereas NumPy uses the more numerically stable pairwise summation.
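The contrast between the two summation strategies can be reproduced without bottleneck at all. The sketch below uses `np.cumsum` as a stand-in for a naive sequential float32 sum (an assumption for illustration: cumsum must produce every partial sum, so it accumulates left to right in the array's dtype):

```python
import numpy as np

# At 2**24 the spacing between adjacent float32 values is 2.0, so adding
# 1.0 to a running float32 total no longer changes it:
s = np.float32(2**24)
print(s + np.float32(1.0) == s)  # True

# A naive left-to-right float32 sum of 2**25 ones therefore stalls at
# 2**24, giving a "mean" of 2**24 / 2**25 = 0.5. np.cumsum accumulates
# sequentially, so it reproduces the naive behaviour:
x = np.ones(2**25, dtype=np.float32)
naive_mean = np.cumsum(x)[-1] / x.size
print(naive_mean)  # 0.5

# NumPy's own mean uses pairwise summation, which keeps every partial
# sum small enough to stay exact here:
print(x.mean())  # 1.0
```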

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  bottleneck : Wrong mean for float32 array 218459353
464016154 https://github.com/pydata/xarray/issues/1346#issuecomment-464016154 https://api.github.com/repos/pydata/xarray/issues/1346 MDEyOklzc3VlQ29tbWVudDQ2NDAxNjE1NA== lumbric 691772 2019-02-15T11:41:36Z 2019-02-15T11:41:36Z CONTRIBUTOR

Oh hm, I think I didn't really understand what happens in bottleneck.nanmean(). I understand that integers can overflow and that float32 values have varying absolute precision. The float32 maximum of 3.4E+38 is not hit here. So how can the mean of a list of ones be 0.5?

Isn't this what bottleneck is doing? Summing up a bunch of float32 values and then dividing by the length?

```python
>>> d = np.ones(2**25, dtype=np.float32)
>>> d.sum() / np.float32(len(d))
1.0
```

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  bottleneck : Wrong mean for float32 array 218459353
464002579 https://github.com/pydata/xarray/issues/1346#issuecomment-464002579 https://api.github.com/repos/pydata/xarray/issues/1346 MDEyOklzc3VlQ29tbWVudDQ2NDAwMjU3OQ== aquasync 5469 2019-02-15T11:06:06Z 2019-02-15T11:06:06Z NONE

Ah ok, I suppose bottleneck is indeed now avoided for float32 in xarray. Yeah, that issue is for a different function, but the source of the problem and the proposed solution in the thread are the same - use higher-precision intermediates for float32 (double arithmetic); a small speed vs. accuracy trade-off.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  bottleneck : Wrong mean for float32 array 218459353
458427512 https://github.com/pydata/xarray/issues/1346#issuecomment-458427512 https://api.github.com/repos/pydata/xarray/issues/1346 MDEyOklzc3VlQ29tbWVudDQ1ODQyNzUxMg== aquasync 5469 2019-01-29T06:52:01Z 2019-01-29T06:52:01Z NONE

Is it worth changing bottleneck to use double for single precision reductions? AFAICT this is a matter of changing npy_DTYPE0 to double in the float{64,32} versions of functions in reduce_template.c.
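The payoff of double-precision intermediates can be sketched in NumPy, again using `np.cumsum` as a stand-in for a sequential reduction (an illustrative assumption, not bottleneck's actual C code):

```python
import numpy as np

x = np.ones(2**25, dtype=np.float32)

# Sequential accumulation in a float32 intermediate (what a naive
# single-precision reduction does) stalls once the total reaches 2**24:
mean32 = np.cumsum(x, dtype=np.float32)[-1] / x.size
print(mean32)  # 0.5

# The same sequential accumulation with a float64 intermediate is exact
# for this input; the only cost is the wider arithmetic:
mean64 = np.cumsum(x, dtype=np.float64)[-1] / x.size
print(mean64)  # 1.0
```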

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  bottleneck : Wrong mean for float32 array 218459353
456173428 https://github.com/pydata/xarray/issues/1346#issuecomment-456173428 https://api.github.com/repos/pydata/xarray/issues/1346 MDEyOklzc3VlQ29tbWVudDQ1NjE3MzQyOA== shoyer 1217238 2019-01-21T19:09:43Z 2019-01-21T19:09:43Z MEMBER

> Would it be worth adding a warning (until the right solution is found) if someone is doing .mean() on a DataArray which is float32?

I would rather pick option (1) above, that is, "Stop using bottleneck on float32 arrays"

{
    "total_count": 3,
    "+1": 3,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  bottleneck : Wrong mean for float32 array 218459353
456149964 https://github.com/pydata/xarray/issues/1346#issuecomment-456149964 https://api.github.com/repos/pydata/xarray/issues/1346 MDEyOklzc3VlQ29tbWVudDQ1NjE0OTk2NA== leifdenby 2405019 2019-01-21T17:33:31Z 2019-01-21T17:33:31Z CONTRIBUTOR

Sorry to unearth this issue again, but I just got bitten by this quite badly. I'm looking at absolute temperature perturbations, and bottleneck's implementation together with my data being loaded as float32 (correctly, as it's stored like that) causes an error of the same order as the perturbations I'm looking for.

Example:

```python
In [1]: import numpy as np
   ...: import bottleneck

In [2]: a = 300 * np.ones((800**2,), dtype=np.float32)

In [3]: np.mean(a)
Out[3]: 300.0

In [4]: bottleneck.nanmean(a)
Out[4]: 302.6018981933594
```

Would it be worth adding a warning (until the right solution is found) if someone is doing .mean() on a DataArray which is float32?

Based on a little experimentation (https://gist.github.com/leifdenby/8e874d3440a1ac96f96465a418f158ab), bottleneck's mean function builds up significant errors even with moderately sized arrays if they are float32, so I'm going to stop using .mean() as-is from now on and always pass in dtype=np.float64.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  bottleneck : Wrong mean for float32 array 218459353
290851733 https://github.com/pydata/xarray/issues/1346#issuecomment-290851733 https://api.github.com/repos/pydata/xarray/issues/1346 MDEyOklzc3VlQ29tbWVudDI5MDg1MTczMw== shoyer 1217238 2017-03-31T22:55:18Z 2017-03-31T22:55:18Z MEMBER

@matteodefelice you didn't decide on float32, but your data is stored that way. It's really hard to make choices about numerical precision for computations automatically: if we converted automatically to float64, somebody else would be complaining about unexpected memory usage :).

Looking at our options, we could:

  1. Stop using bottleneck on float32 arrays, or provide a flag or option to disable using bottleneck. This is not ideal, because bottleneck is much faster.
  2. Automatically convert float32 arrays to float64 before doing aggregations. This is not ideal, because it could significantly increase memory requirements.
  3. Add a dtype option for aggregations (like NumPy) and consider defaulting to dtype=np.float64 when doing aggregations on float32 arrays. I would generally be happy with this, but bottleneck doesn't currently provide the option.
  4. Write a higher precision algorithm for bottleneck's mean.
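For comparison, option (3) mirrors the dtype argument NumPy already accepts for its own aggregations:

```python
import numpy as np

x = np.ones(2**25, dtype=np.float32)

# NumPy lets the caller pick the accumulator dtype per call; option (3)
# would expose the same knob in xarray's aggregations:
m = x.mean(dtype=np.float64)
print(m)  # 1.0
```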
{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  bottleneck : Wrong mean for float32 array 218459353
290822179 https://github.com/pydata/xarray/issues/1346#issuecomment-290822179 https://api.github.com/repos/pydata/xarray/issues/1346 MDEyOklzc3VlQ29tbWVudDI5MDgyMjE3OQ== matteodefelice 6360066 2017-03-31T20:31:56Z 2017-03-31T20:31:56Z NONE

Thanks, everyone, for the replies. @Aegaeon I get the same results as you with bottleneck... @shoyer The point is that I didn't decide on float32, and — yes — using .astype(np.float64) solves the issue... but this is not expected behaviour: with such a standard dataset I would not expect any problems related to numerical precision...

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  bottleneck : Wrong mean for float32 array 218459353
290760342 https://github.com/pydata/xarray/issues/1346#issuecomment-290760342 https://api.github.com/repos/pydata/xarray/issues/1346 MDEyOklzc3VlQ29tbWVudDI5MDc2MDM0Mg== shoyer 1217238 2017-03-31T16:24:04Z 2017-03-31T16:24:04Z MEMBER

Yes, this is probably related to the fact that .mean() in xarray uses bottleneck if available, and bottleneck has a slightly different mean implementation, quite possibly with a less numerically stable algorithm.

The fact that the dtype is float32 is a sign that this is probably a numerical precision issue. Try casting with .astype(np.float64) and see if the problem goes away.

If you really care about performance using float32, the other thing you can do to improve conditioning is to subtract and then re-add a number close to the mean, e.g., (ds.var167 - 270).mean() + 270.
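The effect of such a shift can be sketched with synthetic data (the array `a`, the offset 300, and the helper `naive_mean` below are illustrative assumptions, not the dataset from this issue; `np.cumsum` stands in for a sequential float32 reduction like bottleneck's):

```python
import numpy as np

def naive_mean(x):
    # Sequential float32 accumulation, mimicking a naive single-precision
    # reduction; np.cumsum accumulates left to right in the given dtype.
    return np.cumsum(x, dtype=np.float32)[-1] / x.size

# Hypothetical float32 "temperature" field near 300.
rng = np.random.default_rng(0)
a = (300 + rng.standard_normal(2**25)).astype(np.float32)

exact = a.astype(np.float64).mean()              # double-precision reference
plain = naive_mean(a)                            # running sum grows to ~1e10, large rounding error
shifted = naive_mean(a - np.float32(300)) + 300  # running sum stays near 0, tiny error

print(abs(plain - exact) > abs(shifted - exact))  # True
```

The subtraction itself is exact here (the values lie within a factor of two of 300), so all the accuracy gained comes from keeping the running sum small.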

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  bottleneck : Wrong mean for float32 array 218459353
290755867 https://github.com/pydata/xarray/issues/1346#issuecomment-290755867 https://api.github.com/repos/pydata/xarray/issues/1346 MDEyOklzc3VlQ29tbWVudDI5MDc1NTg2Nw== andrew-c-ross 5852283 2017-03-31T16:07:56Z 2017-03-31T16:07:56Z CONTRIBUTOR

I think this might be a problem with bottleneck? My interpretation of _create_nan_agg_method in xarray/core/ops.py is that it may use bottleneck to get the mean unless you pass skipna=False or specify multiple axes. And,

```python
In [2]: import bottleneck

In [3]: bottleneck.__version__
Out[3]: '1.2.0'

In [6]: bottleneck.nanmean(ds.var167.data)
Out[6]: 261.6441345214844
```

Forgive me if I'm wrong, I'm still a bit new.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  bottleneck : Wrong mean for float32 array 218459353
290754443 https://github.com/pydata/xarray/issues/1346#issuecomment-290754443 https://api.github.com/repos/pydata/xarray/issues/1346 MDEyOklzc3VlQ29tbWVudDI5MDc1NDQ0Mw== fmaussion 10050469 2017-03-31T16:02:53Z 2017-03-31T16:02:53Z MEMBER

Does it make a difference if you load the data first (ds.var167.load().mean())? Or use Python 3?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  bottleneck : Wrong mean for float32 array 218459353
290747253 https://github.com/pydata/xarray/issues/1346#issuecomment-290747253 https://api.github.com/repos/pydata/xarray/issues/1346 MDEyOklzc3VlQ29tbWVudDI5MDc0NzI1Mw== andrew-c-ross 5852283 2017-03-31T15:38:12Z 2017-03-31T15:53:07Z CONTRIBUTOR

Also on macOS, and I can reproduce.

Using python 2.7.11, xarray 0.9.1, dask 0.14.1 installed through Anaconda. I get the same results with xarray 0.9.1-38-gc0178b7 from GitHub.

```python
In [3]: ds = xarray.open_dataset('ERAIN-t2m-1983-2012.seasmean.nc')

In [4]: ds.var167.mean()
Out[4]:
<xarray.DataArray 'var167' ()>
array(261.6441345214844, dtype=float32)
```

Curiously, I get the right results with skipna=False...

```python
In [10]: ds.var167.mean(skipna=False)
Out[10]:
<xarray.DataArray 'var167' ()>
array(278.6246643066406, dtype=float32)
```

... or by specifying coordinates to average over:

```python
In [5]: ds.var167.mean(('time', 'lat', 'lon'))
Out[5]:
<xarray.DataArray 'var167' ()>
array(278.6246643066406, dtype=float32)
```

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  bottleneck : Wrong mean for float32 array 218459353
290692479 https://github.com/pydata/xarray/issues/1346#issuecomment-290692479 https://api.github.com/repos/pydata/xarray/issues/1346 MDEyOklzc3VlQ29tbWVudDI5MDY5MjQ3OQ== matteodefelice 6360066 2017-03-31T11:53:12Z 2017-03-31T11:53:12Z NONE

Ok, I am on macOS:

  • Python 2.7.13 from MacPorts
  • Dask 0.14.1 from MacPorts
  • xarray from GitHub

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  bottleneck : Wrong mean for float32 array 218459353
290691941 https://github.com/pydata/xarray/issues/1346#issuecomment-290691941 https://api.github.com/repos/pydata/xarray/issues/1346 MDEyOklzc3VlQ29tbWVudDI5MDY5MTk0MQ== fmaussion 10050469 2017-03-31T11:50:05Z 2017-03-31T11:50:05Z MEMBER

I can't reproduce this:

```python
In [6]: ds = xr.open_dataset('./Downloads/ERAIN-t2m-1983-2012.seasmean.nc')

In [7]: ds.var167.mean()
Out[7]:
<xarray.DataArray 'var167' ()>
array(278.6246643066406, dtype=float32)

In [8]: ds.var167.data.mean()
Out[8]: 278.62466
```

Which versions of xarray, dask, and Python are you using?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  bottleneck : Wrong mean for float32 array 218459353

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);
Powered by Datasette · Queries took 1750.609ms · About: xarray-datasette