
issue_comments


12 rows where author_association = "CONTRIBUTOR" and issue = 343659822 sorted by updated_at descending


user 3

  • mankoff 5
  • dopplershift 4
  • Thomas-Z 3

issue 1

  • float32 instead of float64 when decoding int16 with scale_factor netcdf var using xarray · 12 ✖

author_association 1

  • CONTRIBUTOR · 12 ✖
id html_url issue_url node_id user created_at updated_at ▲ author_association body reactions performed_via_github_app issue
1201464999 https://github.com/pydata/xarray/issues/2304#issuecomment-1201464999 https://api.github.com/repos/pydata/xarray/issues/2304 IC_kwDOAMm_X85HnOan mankoff 145117 2022-08-01T16:56:01Z 2022-08-01T16:56:01Z CONTRIBUTOR

Packing Qs

  • If "the variable containing the packed data must be of type byte, short or int", how do we choose what size int?
  • What to do if scale_factor and add_offset are not float or double? What if they are different types?
    • I assume issue a warning and continue?

Unpacking Qs

  • Should the unpacked data just be np.find_common_type([data, add_offset, scale_factor], []), or should we then bump the type up by 1 level (float16->32, 32->64, 64->128, etc.) to cover overflow?
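
To make the first option concrete, here is a minimal sketch of what "common type" would give under NumPy's promotion rules. It is not xarray's implementation: `common_unpacked_dtype` is a hypothetical helper name, and it uses `np.promote_types` rather than `np.find_common_type`, which is deprecated as of NumPy 1.25.

```python
import functools

import numpy as np


def common_unpacked_dtype(packed_dtype, scale_factor=None, add_offset=None):
    """Hypothetical helper: promote the packed dtype with the attribute dtypes."""
    dtypes = [np.dtype(packed_dtype)]
    for attr in (scale_factor, add_offset):
        if attr is not None:
            # np.asarray keeps the dtype of NumPy scalars (e.g. np.float32)
            # and maps plain Python floats to float64.
            dtypes.append(np.asarray(attr).dtype)
    # Pairwise promotion on dtypes only, so the values themselves play no role.
    return functools.reduce(np.promote_types, dtypes)


small = common_unpacked_dtype(np.int16, np.float32(0.01), np.float32(0.0))
wide = common_unpacked_dtype(np.int64, np.float16(0.5))
```

Under these rules `small` is float32 and `wide` is float64, so plain promotion already avoids the int64/float16 case collapsing to float16. It does not add the extra "bump up one level" headroom asked about above.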
{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  float32 instead of float64 when decoding int16 with scale_factor netcdf var using xarray  343659822
1201461626 https://github.com/pydata/xarray/issues/2304#issuecomment-1201461626 https://api.github.com/repos/pydata/xarray/issues/2304 IC_kwDOAMm_X85HnNl6 mankoff 145117 2022-08-01T16:52:47Z 2022-08-01T16:52:47Z CONTRIBUTOR
  • From: https://cfconventions.org/Data/cf-conventions/cf-conventions-1.7/build/ch08.html

This standard is more restrictive than the NUG with respect to the use of the scale_factor and add_offset attributes; ambiguities and precision problems related to data type conversions are resolved by these restrictions.

If the scale_factor and add_offset attributes are of the same data type as the associated variable, the unpacked data is assumed to be of the same data type as the packed data.

  • What if the result of the operation leads to overflow?

However, if the scale_factor and add_offset attributes are of a different data type from the variable (containing the packed data) then the unpacked data should match the type of these attributes, which must both be of type float or both be of type double.

  • What if they are not of the same type?

    • Presumably, use the largest of the three types.
  • Again, this may lead to loss of precision. What if the packed data is type int64 and scale_factor is type float16? It seems the result should be float64, not float16.

An additional restriction in this case is that the variable containing the packed data must be of type byte, short or int.

  • What to do if packed data is type float or double?

It is not advised to unpack an int into a float as there is a potential precision loss.

I think this means double is advised? If so, this should be stated. Should be rephrased to advise what to do (if there is one or only a few choices) rather than what not to do, or at least include that if not replacing current wording.
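
For reference, the rule as quoted can be written down literally. This is only a sketch of one reading of the spec text, not xarray code: `cf_unpacked_dtype` is a hypothetical name, and it assumes the netCDF meanings float = float32, double = float64, and byte/short/int = int8/int16/int32. The cases the questions above ask about are raised as errors here because the quoted text does not cover them.

```python
import numpy as np

# Assumption: netCDF "float"/"double" and "byte"/"short"/"int" map to these dtypes.
CF_FLOAT_TYPES = (np.dtype(np.float32), np.dtype(np.float64))
CF_PACKED_TYPES = (np.dtype(np.int8), np.dtype(np.int16), np.dtype(np.int32))


def cf_unpacked_dtype(packed_dtype, scale_factor, add_offset):
    """Hypothetical helper: literal reading of the quoted CF packing rule."""
    packed_dtype = np.dtype(packed_dtype)
    scale_dtype = np.asarray(scale_factor).dtype
    offset_dtype = np.asarray(add_offset).dtype
    if scale_dtype != offset_dtype:
        raise ValueError("scale_factor and add_offset have different types")
    if scale_dtype == packed_dtype:
        # Same type as the variable: unpacked data keeps the packed type.
        return packed_dtype
    if scale_dtype not in CF_FLOAT_TYPES:
        raise ValueError("attributes must both be float or both be double")
    if packed_dtype not in CF_PACKED_TYPES:
        raise ValueError("packed data must be byte, short or int")
    return scale_dtype


unpacked = cf_unpacked_dtype(np.int16, np.float64(0.01), np.float64(0.0))
```

With short data and double attributes, as in the test file from this issue, `unpacked` is float64.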

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  float32 instead of float64 when decoding int16 with scale_factor netcdf var using xarray  343659822
1200627783 https://github.com/pydata/xarray/issues/2304#issuecomment-1200627783 https://api.github.com/repos/pydata/xarray/issues/2304 IC_kwDOAMm_X85HkCBH mankoff 145117 2022-08-01T02:49:28Z 2022-08-01T05:55:15Z CONTRIBUTOR

Current algorithm

```python
def _choose_float_dtype(dtype, has_offset):
    """Return a float dtype that can losslessly represent `dtype` values."""
    # Keep float32 as-is. Upcast half-precision to single-precision,
    # because float16 is "intended for storage but not computation"
    if dtype.itemsize <= 4 and np.issubdtype(dtype, np.floating):
        return np.float32
    # float32 can exactly represent all integers up to 24 bits
    if dtype.itemsize <= 2 and np.issubdtype(dtype, np.integer):
        # A scale factor is entirely safe (vanishing into the mantissa),
        # but a large integer offset could lead to loss of precision.
        # Sensitivity analysis can be tricky, so we just use a float64
        # if there's any offset at all - better unoptimised than wrong!
        if not has_offset:
            return np.float32
    # For all other types and circumstances, we just use float64.
    # (safe because eg. complex numbers are not supported in NetCDF)
    return np.float64
```

Due to a calling bug, has_offset is always None, so this can be simplified to:

```python
def _choose_float_dtype(dtype):
    if dtype.itemsize <= 4 and np.issubdtype(dtype, np.floating):
        return np.float32
    if dtype.itemsize <= 2 and np.issubdtype(dtype, np.integer):
        return np.float32
    return np.float64
```

Here I call the function twice, once with has_offset False, then True.

```python
import numpy as np


def _choose_float_dtype(dtype, has_offset):
    if dtype.itemsize <= 4 and np.issubdtype(dtype, np.floating):
        return np.float32
    if dtype.itemsize <= 2 and np.issubdtype(dtype, np.integer):
        if not has_offset:
            return np.float32
    return np.float64


# generic types
for dtype in [np.byte, np.ubyte, np.short, np.ushort, np.intc, np.uintc,
              np.int_, np.uint, np.longlong, np.ulonglong,
              np.half, np.float16, np.single, np.double, np.longdouble,
              np.csingle, np.cdouble, np.clongdouble,
              np.int8, np.int16, np.int32, np.int64,
              np.uint8, np.uint16, np.uint32, np.uint64,
              np.float16, np.float32, np.float64]:
    print("|", dtype,
          "|", _choose_float_dtype(np.dtype(dtype), False),
          "|", _choose_float_dtype(np.dtype(dtype), True), "|")
```

| Input | Output as called | Output as written |
|-----------------------------|---------------------------|--------------------------|
| <class 'numpy.int8'> | <class 'numpy.float32'> | <class 'numpy.float64'> |
| <class 'numpy.uint8'> | <class 'numpy.float32'> | <class 'numpy.float64'> |
| <class 'numpy.int16'> | <class 'numpy.float32'> | <class 'numpy.float64'> |
| <class 'numpy.uint16'> | <class 'numpy.float32'> | <class 'numpy.float64'> |
| <class 'numpy.int32'> | <class 'numpy.float64'> | <class 'numpy.float64'> |
| <class 'numpy.uint32'> | <class 'numpy.float64'> | <class 'numpy.float64'> |
| <class 'numpy.int64'> | <class 'numpy.float64'> | <class 'numpy.float64'> |
| <class 'numpy.uint64'> | <class 'numpy.float64'> | <class 'numpy.float64'> |
| <class 'numpy.longlong'> | <class 'numpy.float64'> | <class 'numpy.float64'> |
| <class 'numpy.ulonglong'> | <class 'numpy.float64'> | <class 'numpy.float64'> |
| <class 'numpy.float16'> | <class 'numpy.float32'> | <class 'numpy.float32'> |
| <class 'numpy.float16'> | <class 'numpy.float32'> | <class 'numpy.float32'> |
| <class 'numpy.float32'> | <class 'numpy.float32'> | <class 'numpy.float32'> |
| <class 'numpy.float64'> | <class 'numpy.float64'> | <class 'numpy.float64'> |
| <class 'numpy.float128'> | <class 'numpy.float64'> | <class 'numpy.float64'> |
| <class 'numpy.complex64'> | <class 'numpy.float64'> | <class 'numpy.float64'> |
| <class 'numpy.complex128'> | <class 'numpy.float64'> | <class 'numpy.float64'> |
| <class 'numpy.complex256'> | <class 'numpy.float64'> | <class 'numpy.float64'> |
| <class 'numpy.int8'> | <class 'numpy.float32'> | <class 'numpy.float64'> |
| <class 'numpy.int16'> | <class 'numpy.float32'> | <class 'numpy.float64'> |
| <class 'numpy.int32'> | <class 'numpy.float64'> | <class 'numpy.float64'> |
| <class 'numpy.int64'> | <class 'numpy.float64'> | <class 'numpy.float64'> |
| <class 'numpy.uint8'> | <class 'numpy.float32'> | <class 'numpy.float64'> |
| <class 'numpy.uint16'> | <class 'numpy.float32'> | <class 'numpy.float64'> |
| <class 'numpy.uint32'> | <class 'numpy.float64'> | <class 'numpy.float64'> |
| <class 'numpy.uint64'> | <class 'numpy.float64'> | <class 'numpy.float64'> |
| <class 'numpy.float16'> | <class 'numpy.float32'> | <class 'numpy.float32'> |
| <class 'numpy.float32'> | <class 'numpy.float32'> | <class 'numpy.float32'> |
| <class 'numpy.float64'> | <class 'numpy.float64'> | <class 'numpy.float64'> |
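
The "float32 can exactly represent all integers up to 24 bits" comment in the algorithm can be checked directly. The snippet below is an illustrative sketch only (the variable names are made up here): it round-trips every int16 value through float32, then shows the first positive integer beyond 24 bits that does not survive.

```python
import numpy as np

# Every int16 value, built via int32 so the exclusive stop (32768) is in range.
all_int16 = np.arange(-32768, 32768, dtype=np.int32).astype(np.int16)
int16_round_trips = bool(
    (all_int16.astype(np.float32).astype(np.int16) == all_int16).all()
)

# 2**24 + 1 needs 25 significant bits, one more than float32 provides.
first_lossy = 2**24 + 1
int32_round_trips = int(np.float32(first_lossy)) == first_lossy

print(int16_round_trips, int32_round_trips)
```

This only covers the integer values themselves; it says nothing about the rounding introduced by multiplying with a non-representable scale_factor such as 0.01.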

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  float32 instead of float64 when decoding int16 with scale_factor netcdf var using xarray  343659822
1200266255 https://github.com/pydata/xarray/issues/2304#issuecomment-1200266255 https://api.github.com/repos/pydata/xarray/issues/2304 IC_kwDOAMm_X85HipwP mankoff 145117 2022-07-30T17:58:51Z 2022-07-30T17:58:51Z CONTRIBUTOR

This issue, based on its title and initial post, is fixed by PR #6851. The code to select dtype was already correct, but the outer function that called it had a bug in the call.

Per the CF spec,

the unpacked data should match the type of these attributes, which must both be of type float or both be of type double. An additional restriction in this case is that the variable containing the packed data must be of type byte, short or int. It is not advised to unpack an int into a float as there is a potential precision loss.

I find this ambiguous. Is float above referring to float16 or float32? Is double referring to float64? If so, then they do recommend float64, as requested by the OP, because the test data is short and the scale_factor is float64 (a.k.a. double?).

The broader discussion here is about CF compliance. I find the spec ambiguous and xarray non-compliant. So many tests rely on the existing behavior, that I am unsure how best to proceed to improve compliance. I worry it may be a major refactor, and possibly break things relying on the existing behavior. I'd like to discuss architecture. Should this be in a new issue, if this closes with PR #6851? Should there be a new keyword for cf_strict or something?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  float32 instead of float64 when decoding int16 with scale_factor netcdf var using xarray  343659822
1188529343 https://github.com/pydata/xarray/issues/2304#issuecomment-1188529343 https://api.github.com/repos/pydata/xarray/issues/2304 IC_kwDOAMm_X85G14S_ mankoff 145117 2022-07-19T02:35:30Z 2022-07-19T03:20:51Z CONTRIBUTOR

I've run into this issue too, and the xarray decision to use float32 is causing problems. I recognize this is a generic floating-point representation issue, but it could be avoided with float64.

The data value is 1395. The scale is 0.0001.

```python
import numpy as np

val = int(1395)
scale = 0.0001
print(val * scale)                                 # 0.1395
print(val * np.array(scale).astype(float))         # 0.1395
print(val * np.array(scale).astype(np.float16))    # 0.1395213...
print(val * np.array(scale).astype(np.float32))    # 0.13949999...
print(val * np.array(scale).astype(np.float64))    # 0.1395
```

Because we are using *1E3 * round(), the difference between 0.1395 and 0.1394999 (or 139.5 and 139.49) ends up being quite large in the downstream product.
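
The size of the effect can be measured rather than read off printed digits. The following is a minimal sketch with made-up variable names, decoding the same raw value once in float32 and once in float64 and comparing each result against the intended decimal value 0.1395. It mimics the decode step by hand and does not call xarray.

```python
import numpy as np

raw = np.array([1395], dtype=np.int16)   # packed value from the example above
scale = 0.0001

# Decode entirely in single precision, then entirely in double precision.
decoded32 = raw.astype(np.float32) * np.float32(scale)
decoded64 = raw.astype(np.float64) * np.float64(scale)

# Absolute distance from the intended decimal value, measured in double precision.
err32 = abs(float(decoded32[0]) - 0.1395)
err64 = abs(float(decoded64[0]) - 0.1395)

print(decoded32.dtype, err32)
print(decoded64.dtype, err64)
```

The float32 error is many orders of magnitude larger than the float64 one. Whether that matters depends on what happens downstream, for example rounding a value that sits right on a .5 boundary after rescaling.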

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  float32 instead of float64 when decoding int16 with scale_factor netcdf var using xarray  343659822
410792506 https://github.com/pydata/xarray/issues/2304#issuecomment-410792506 https://api.github.com/repos/pydata/xarray/issues/2304 MDEyOklzc3VlQ29tbWVudDQxMDc5MjUwNg== Thomas-Z 1492047 2018-08-06T17:47:23Z 2019-01-09T15:18:36Z CONTRIBUTOR

To explain the full context and why it became some kind of a problem to us :

We're experimenting with the parquet format (via pyarrow) and we first did something like : netcdf file -> netcdf4 -> pandas -> pyarrow -> pandas (when read later on).

We're now looking at xarray and the huge ease of access it offers to netcdf like data and we tried something similar : netcdf file -> xarray -> pandas -> pyarrow -> pandas (when read later on).

Our problem appears when we're reading and comparing the data stored with these 2 approaches. The difference between the 2 was - sometimes - larger than expected/acceptable (10e-6 for float32 if I'm not mistaken). We're not constraining any type and letting the system and modules decide how to encode what and in the end we have significantly different values.

There might be something wrong in our process but it originates here with this float32/float64 choice so we thought it might be a problem.

Thanks for taking the time to look into this.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  float32 instead of float64 when decoding int16 with scale_factor netcdf var using xarray  343659822
411385081 https://github.com/pydata/xarray/issues/2304#issuecomment-411385081 https://api.github.com/repos/pydata/xarray/issues/2304 MDEyOklzc3VlQ29tbWVudDQxMTM4NTA4MQ== Thomas-Z 1492047 2018-08-08T12:18:02Z 2018-08-22T07:14:58Z CONTRIBUTOR

So, a more complete example showing this problem. NetCDF file used in the example : test.nc.zip

````python
from netCDF4 import Dataset
import xarray as xr
import numpy as np
import pandas as pd

d = Dataset("test.nc")
v = d.variables['var']

print(v)
#<class 'netCDF4._netCDF4.Variable'>
#int16 var(idx)
#    _FillValue: 32767
#    scale_factor: 0.01
#unlimited dimensions:
#current shape = (2,)
#filling on

df_nc = pd.DataFrame(data={'var': v[:]})

print(df_nc)
#     var
#0  21.94
#1  27.04

ds = xr.open_dataset("test.nc")
df_xr = ds['var'].to_dataframe()

# Comparing both dataframes with float32 precision (1e-6)
mask = np.isclose(df_nc['var'], df_xr['var'], rtol=0, atol=1e-6)

print(mask)
#[False  True]

print(df_xr)
#           var
#idx
#0    21.939999
#1    27.039999

# Changing the type and rounding the xarray dataframe
df_xr2 = df_xr.astype(np.float64).round(int(np.ceil(-np.log10(ds['var'].encoding['scale_factor']))))
mask = np.isclose(df_nc['var'], df_xr2['var'], rtol=0, atol=1e-6)

print(mask)
#[ True  True]

print(df_xr2)
#       var
#idx
#0    21.94
#1    27.04
````

As you can see, the problem appears early in the process (not related to the way data are stored in parquet later on) and yes, rounding values does solve it.
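
The rounding workaround can be pulled into a small helper that runs without the NetCDF file. This is a sketch under assumptions: `round_to_scale` is a hypothetical name, the input array stands in for what a float32 decode would produce, and the approach is only sensible when scale_factor is a power of ten and there is no add_offset with more decimals.

```python
import numpy as np


def round_to_scale(values, scale_factor):
    """Hypothetical helper: widen to float64, then round to the scale's decimals."""
    # Same formula as in the example above: 0.01 gives 2 decimals.
    decimals = int(np.ceil(-np.log10(scale_factor)))
    return np.round(np.asarray(values, dtype=np.float64), decimals)


# Stand-in for values decoded as float32 (assumption: no real file is read here).
decoded32 = np.array([21.94, 27.04], dtype=np.float32)
rounded = round_to_scale(decoded32, 0.01)

print(rounded.dtype, rounded)
```

Widening alone keeps the float32 rounding error; it is the rounding step that snaps the values back to the nearest float64 representation of the decimal numbers.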

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  float32 instead of float64 when decoding int16 with scale_factor netcdf var using xarray  343659822
410782982 https://github.com/pydata/xarray/issues/2304#issuecomment-410782982 https://api.github.com/repos/pydata/xarray/issues/2304 MDEyOklzc3VlQ29tbWVudDQxMDc4Mjk4Mg== dopplershift 221526 2018-08-06T17:17:38Z 2018-08-06T17:17:38Z CONTRIBUTOR

Ah, ok, not scaling per-se (i.e. * 0.01), but a second round of value conversion.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  float32 instead of float64 when decoding int16 with scale_factor netcdf var using xarray  343659822
410779271 https://github.com/pydata/xarray/issues/2304#issuecomment-410779271 https://api.github.com/repos/pydata/xarray/issues/2304 MDEyOklzc3VlQ29tbWVudDQxMDc3OTI3MQ== dopplershift 221526 2018-08-06T17:06:22Z 2018-08-06T17:06:22Z CONTRIBUTOR

I'm not following why the data are scaled twice.

Your point about the rounding being different is well-taken, though.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  float32 instead of float64 when decoding int16 with scale_factor netcdf var using xarray  343659822
410774955 https://github.com/pydata/xarray/issues/2304#issuecomment-410774955 https://api.github.com/repos/pydata/xarray/issues/2304 MDEyOklzc3VlQ29tbWVudDQxMDc3NDk1NQ== dopplershift 221526 2018-08-06T16:52:42Z 2018-08-06T16:52:53Z CONTRIBUTOR

@shoyer But since it's a downstream calculation issue, and does not impact the actual precision of what's being read from the file, what's wrong with saying "Use data.astype(np.float64)". It's completely identical to doing it internally to xarray.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  float32 instead of float64 when decoding int16 with scale_factor netcdf var using xarray  343659822
410769706 https://github.com/pydata/xarray/issues/2304#issuecomment-410769706 https://api.github.com/repos/pydata/xarray/issues/2304 MDEyOklzc3VlQ29tbWVudDQxMDc2OTcwNg== dopplershift 221526 2018-08-06T16:34:44Z 2018-08-06T16:36:16Z CONTRIBUTOR

A float32 value has 24 bits of precision in the significand, which is more than enough to store the 16 bits in the original data; the exponent (8 bits) will more or less take care of the * 0.01:

```python
>>> import numpy as np
>>> np.float32(2194 * 0.01)
21.94
```

What you're seeing is an artifact of printing out the values. I have no idea why something is printing out a float (only 7 decimal digits) out to 17 digits. Even float64 only has 16 digits (which is overkill for this application).

The difference from subtracting the 32- and 64-bit values above is in the 8th decimal place, which is beyond the actual precision of the data; what you've just demonstrated is the difference in precision between 32-bit and 64-bit values, but it has nothing to do whatsoever with the data.

If you're really worried about precision round-off for things like std. dev, you should probably calculate it using the raw integer values and scale afterwards. (I don't actually think this is necessary, though.)
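
A minimal sketch of that suggestion, using the two raw values from the example file (2194 and 2704) and its scale_factor of 0.01; the variable names are illustrative. It relies on the standard deviation scaling linearly with the data and being unaffected by a constant offset.

```python
import numpy as np

raw = np.array([2194, 2704], dtype=np.int16)   # packed integers as stored
scale_factor = 0.01

# Accumulate the statistic on the integers in double precision...
std_raw = raw.std(dtype=np.float64)
# ...and apply the scale once at the end (an add_offset would not change it).
std_scaled = std_raw * scale_factor

print(std_raw, std_scaled)
```

Statistics that are not linear in the data, or that depend on the offset (the mean, for instance), need the corresponding adjustment applied afterwards.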

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  float32 instead of float64 when decoding int16 with scale_factor netcdf var using xarray  343659822
410675562 https://github.com/pydata/xarray/issues/2304#issuecomment-410675562 https://api.github.com/repos/pydata/xarray/issues/2304 MDEyOklzc3VlQ29tbWVudDQxMDY3NTU2Mg== Thomas-Z 1492047 2018-08-06T11:19:30Z 2018-08-06T11:19:30Z CONTRIBUTOR

You're right when you say

Note that it's very easy to later convert from float32 to float64, e.g., by writing ds.astype(np.float64).

You'll have a float64 in the end but you won't get your precision back and it might be a problem in some cases.
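
This point can be shown in a few lines. The sketch below uses illustrative names and a hand-written decode, not xarray's: it widens a float32 result to float64 and compares it with a value decoded in float64 from the start.

```python
import numpy as np

# Decode in float32, then widen afterwards (the astype(np.float64) route).
value32 = np.float32(2194) * np.float32(0.01)
widened = np.float64(value32)

# Decode in float64 from the start.
direct = np.float64(2194) * np.float64(0.01)

print(repr(widened), repr(direct))
```

`widened` holds exactly the same number as `value32`, only stored in more bits, so its distance from 21.94 stays at the float32 level while `direct` is accurate to float64 precision.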

I understand the benefits of using float32 on the memory side but it is kind of a problem for us each time we have variables using scale factors.

I'm surprised this issue (if considered as one) does not bother more people.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  float32 instead of float64 when decoding int16 with scale_factor netcdf var using xarray  343659822


CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);