issues: 351000813


  • id: 351000813
  • node_id: MDU6SXNzdWUzNTEwMDA4MTM=
  • number: 2370
  • title: Inconsistent results when calculating sums on float32 arrays w/ bottleneck installed
  • user: 5179430
  • state: closed
  • locked: 0
  • comments: 6
  • created_at: 2018-08-15T23:18:41Z
  • updated_at: 2020-08-17T00:07:12Z
  • closed_at: 2020-08-17T00:07:12Z
  • author_association: CONTRIBUTOR

Code Sample, a copy-pastable example if possible

Data file used is here: test.nc.zip. Output from each statement is commented out.

```python
import xarray as xr

ds = xr.open_dataset('test.nc')

ds.cold_rad_cnts.min()   # 13038.
ds.cold_rad_cnts.max()   # 13143.
ds.cold_rad_cnts.mean()  # 12640.583984
ds.cold_rad_cnts.std()   # 455.035156
ds.cold_rad_cnts.sum()   # 4.472997e+10
```

Problem description

As you can see above, the mean falls outside the range of the data, and the standard deviation is nearly two orders of magnitude higher than it should be. This is because a significant loss of precision is occurring when using bottleneck's nansum() on data with a float32 dtype. I demonstrated this effect here: https://github.com/kwgoodman/bottleneck/issues/193.
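The precision loss can be reproduced without the data file. A minimal sketch, assuming (per the linked bottleneck issue) that bottleneck accumulates sequentially in the array's own dtype, while numpy's reductions use pairwise summation; `np.cumsum` stands in here for that sequential float32 accumulator, not for bottleneck's actual code:

```python
import numpy as np

# 20 million float32 ones: the exact sum is 20_000_000
a = np.ones(20_000_000, dtype=np.float32)

# cumsum accumulates sequentially in float32, mimicking a naive nansum:
# once the running total reaches 2**24 (16777216), adding 1.0 no longer
# changes it, so the sum stalls far below the true value
naive = a.cumsum()[-1]

# numpy's sum/nansum use pairwise summation and stay exact here
pairwise = a.sum()

print(naive)     # 16777216.0
print(pairwise)  # 20000000.0
```

The stall at 2\*\*24 is the extreme case; with values around 13100, as in this dataset, the error shows up as the kind of drift seen in the sums above.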

Naturally, this means that converting the data to float64 (or any int dtype) gives the correct result; so does using numpy's built-in functions directly, or uninstalling bottleneck. An example is shown below.

Expected Output

```python
In [8]: import numpy as np

In [9]: np.nansum(ds.cold_rad_cnts)
Out[9]: 46357123000.0

In [10]: np.nanmean(ds.cold_rad_cnts)
Out[10]: 13100.413

In [11]: np.nanstd(ds.cold_rad_cnts)
Out[11]: 8.158843
```
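Both workarounds can also be checked on synthetic data when test.nc is not at hand; a sketch using made-up float32 values:

```python
import numpy as np

# Synthetic stand-in for the float32 variable in test.nc
a = np.ones(20_000_000, dtype=np.float32)

# Workaround 1: upcast before reducing, so the accumulator is float64
upcast = a.astype(np.float64).sum()

# Workaround 2: numpy's nansum uses pairwise summation, which stays
# accurate even with a float32 accumulator
pairwise = np.nansum(a)

print(upcast)    # 20000000.0
print(pairwise)  # 20000000.0
```

In xarray the upcast is a one-liner, e.g. `ds.cold_rad_cnts.astype('float64').sum()`.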

Output of xr.show_versions()

```
INSTALLED VERSIONS
------------------
commit: None
python: 3.6.6.final.0
python-bits: 64
OS: Darwin
OS-release: 15.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

xarray: 0.10.8
pandas: 0.23.4
numpy: 1.15.0
scipy: 1.1.0
netCDF4: 1.4.1
h5netcdf: 0.6.1
h5py: 2.8.0
Nio: None
zarr: None
bottleneck: 1.2.1
cyordereddict: None
dask: 0.18.2
distributed: 1.22.1
matplotlib: None
cartopy: None
seaborn: None
setuptools: 40.0.0
pip: 10.0.1
conda: None
pytest: None
IPython: 6.5.0
sphinx: None
```

Unfortunately this will probably not be fixed downstream anytime soon, so I think it would be nice if xarray provided some sort of automatic workaround rather than my having to remember to manually convert float32 data. Making float64 the default (as discussed in #2304) would be nice, but at a minimum it might be good to emit a warning whenever bottleneck's nansum() is used on float32 arrays.
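A warn-and-upcast guard along those lines could look like the following sketch; `guarded_nansum` is a hypothetical helper, not an existing xarray or bottleneck API:

```python
import warnings

import numpy as np


def guarded_nansum(arr):
    """Warn and upcast when a float32 array is about to be summed.

    Hypothetical sketch of the safeguard proposed above; not part of
    xarray's API.
    """
    arr = np.asarray(arr)
    if arr.dtype == np.float32:
        warnings.warn(
            "nansum on float32 data may lose precision with bottleneck "
            "installed; upcasting to float64",
            UserWarning,
        )
        arr = arr.astype(np.float64)
    return np.nansum(arr)


# Triggers the warning and returns the exact sum in float64
total = guarded_nansum(np.ones(1_000_000, dtype=np.float32))
print(total)  # 1000000.0
```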

  • state_reason: completed · repo: 13221727 · type: issue
