issue_comments


30 rows where issue = 343659822 sorted by updated_at descending




user 10

  • shoyer 8
  • mankoff 5
  • dopplershift 4
  • DevDaoud 4
  • Thomas-Z 3
  • fmaussion 2
  • magau 1
  • dcherian 1
  • ACHMartin 1
  • psybot-ca 1

author_association 3

  • CONTRIBUTOR 12
  • MEMBER 11
  • NONE 7

issue 1

  • float32 instead of float64 when decoding int16 with scale_factor netcdf var using xarray · 30
id html_url issue_url node_id user created_at updated_at ▲ author_association body reactions performed_via_github_app issue
1201464999 https://github.com/pydata/xarray/issues/2304#issuecomment-1201464999 https://api.github.com/repos/pydata/xarray/issues/2304 IC_kwDOAMm_X85HnOan mankoff 145117 2022-08-01T16:56:01Z 2022-08-01T16:56:01Z CONTRIBUTOR

Packing Qs

  • If "the variable containing the packed data must be of type byte, short or int", how do we choose what size int?
  • What to do if scale_factor and add_offset are not float or double? What if they are different types?
    • I assume we issue a warning and continue?

Unpacking Qs

  • Should the unpacked data just be np.find_common_type([data, add_offset, scale_factor], []), or should we then bump the type up by 1 level (float16->32, 32->64, 64->128, etc.) to cover overflow? (See the sketch below.)
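A minimal sketch of the promotion question above, assuming an int16 packed variable with float64 scale_factor/add_offset attributes (illustrative values only; np.result_type mirrors what np.find_common_type would report):

```python
import numpy as np

packed_dtype = np.dtype(np.int16)
attr_dtype = np.dtype(np.float64)  # dtype of scale_factor / add_offset

# NumPy's promotion rules across the packed dtype and the attribute dtype.
unpacked = np.result_type(packed_dtype, attr_dtype)
print(unpacked)  # float64

# The "bump up one level" variant floated above, to guard against overflow.
bump = {np.dtype(np.float16): np.dtype(np.float32),
        np.dtype(np.float32): np.dtype(np.float64),
        np.dtype(np.float64): np.dtype(np.longdouble)}
print(bump.get(unpacked, unpacked))  # longdouble (shown as float128 on most Linux builds)
```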
{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  float32 instead of float64 when decoding int16 with scale_factor netcdf var using xarray  343659822
1201461626 https://github.com/pydata/xarray/issues/2304#issuecomment-1201461626 https://api.github.com/repos/pydata/xarray/issues/2304 IC_kwDOAMm_X85HnNl6 mankoff 145117 2022-08-01T16:52:47Z 2022-08-01T16:52:47Z CONTRIBUTOR
  • From: https://cfconventions.org/Data/cf-conventions/cf-conventions-1.7/build/ch08.html

This standard is more restrictive than the NUG with respect to the use of the scale_factor and add_offset attributes; ambiguities and precision problems related to data type conversions are resolved by these restrictions.

If the scale_factor and add_offset attributes are of the same data type as the associated variable, the unpacked data is assumed to be of the same data type as the packed data.

  • What if the result of the operation leads to overflow?

However, if the scale_factor and add_offset attributes are of a different data type from the variable (containing the packed data) then the unpacked data should match the type of these attributes, which must both be of type float or both be of type double.

  • What if they are not of the same type?

    • Presumably, use the largest of the three types.
  • Again, this may lead to loss of precision. What if the packed data is type int64 and scale_factor is type float16? It seems like the result should be float64, not float16.

An additional restriction in this case is that the variable containing the packed data must be of type byte, short or int.

  • What to do if packed data is type float or double?

It is not advised to unpack an int into a float as there is a potential precision loss.

I think this means double is advised? If so, this should be stated. The sentence should be rephrased to advise what to do (if there is one or only a few choices) rather than what not to do, or at least add that alongside the current wording.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  float32 instead of float64 when decoding int16 with scale_factor netcdf var using xarray  343659822
1200627783 https://github.com/pydata/xarray/issues/2304#issuecomment-1200627783 https://api.github.com/repos/pydata/xarray/issues/2304 IC_kwDOAMm_X85HkCBH mankoff 145117 2022-08-01T02:49:28Z 2022-08-01T05:55:15Z CONTRIBUTOR

Current algorithm

```python
def _choose_float_dtype(dtype, has_offset):
    """Return a float dtype that can losslessly represent `dtype` values."""
    # Keep float32 as-is. Upcast half-precision to single-precision,
    # because float16 is "intended for storage but not computation"
    if dtype.itemsize <= 4 and np.issubdtype(dtype, np.floating):
        return np.float32
    # float32 can exactly represent all integers up to 24 bits
    if dtype.itemsize <= 2 and np.issubdtype(dtype, np.integer):
        # A scale factor is entirely safe (vanishing into the mantissa),
        # but a large integer offset could lead to loss of precision.
        # Sensitivity analysis can be tricky, so we just use a float64
        # if there's any offset at all - better unoptimised than wrong!
        if not has_offset:
            return np.float32
    # For all other types and circumstances, we just use float64.
    # (safe because eg. complex numbers are not supported in NetCDF)
    return np.float64
```

Due to a calling bug, has_offset is always None, so this can be simplified to:

```python
def _choose_float_dtype(dtype):
    if dtype.itemsize <= 4 and np.issubdtype(dtype, np.floating):
        return np.float32
    if dtype.itemsize <= 2 and np.issubdtype(dtype, np.integer):
        return np.float32
    return np.float64
```

Here I call the function twice, once with has_offset False, then True.

```python
import numpy as np

def _choose_float_dtype(dtype, has_offset):
    if dtype.itemsize <= 4 and np.issubdtype(dtype, np.floating):
        return np.float32
    if dtype.itemsize <= 2 and np.issubdtype(dtype, np.integer):
        if not has_offset:
            return np.float32
    return np.float64

# generic types
for dtype in [np.byte, np.ubyte, np.short, np.ushort, np.intc, np.uintc,
              np.int_, np.uint, np.longlong, np.ulonglong,
              np.half, np.float16, np.single, np.double, np.longdouble,
              np.csingle, np.cdouble, np.clongdouble,
              np.int8, np.int16, np.int32, np.int64,
              np.uint8, np.uint16, np.uint32, np.uint64,
              np.float16, np.float32, np.float64]:
    print("|", dtype, "|",
          _choose_float_dtype(np.dtype(dtype), False), "|",
          _choose_float_dtype(np.dtype(dtype), True), "|")
```

| Input | Output as called | Output as written |
|-------|------------------|-------------------|
| <class 'numpy.int8'> | <class 'numpy.float32'> | <class 'numpy.float64'> |
| <class 'numpy.uint8'> | <class 'numpy.float32'> | <class 'numpy.float64'> |
| <class 'numpy.int16'> | <class 'numpy.float32'> | <class 'numpy.float64'> |
| <class 'numpy.uint16'> | <class 'numpy.float32'> | <class 'numpy.float64'> |
| <class 'numpy.int32'> | <class 'numpy.float64'> | <class 'numpy.float64'> |
| <class 'numpy.uint32'> | <class 'numpy.float64'> | <class 'numpy.float64'> |
| <class 'numpy.int64'> | <class 'numpy.float64'> | <class 'numpy.float64'> |
| <class 'numpy.uint64'> | <class 'numpy.float64'> | <class 'numpy.float64'> |
| <class 'numpy.longlong'> | <class 'numpy.float64'> | <class 'numpy.float64'> |
| <class 'numpy.ulonglong'> | <class 'numpy.float64'> | <class 'numpy.float64'> |
| <class 'numpy.float16'> | <class 'numpy.float32'> | <class 'numpy.float32'> |
| <class 'numpy.float16'> | <class 'numpy.float32'> | <class 'numpy.float32'> |
| <class 'numpy.float32'> | <class 'numpy.float32'> | <class 'numpy.float32'> |
| <class 'numpy.float64'> | <class 'numpy.float64'> | <class 'numpy.float64'> |
| <class 'numpy.float128'> | <class 'numpy.float64'> | <class 'numpy.float64'> |
| <class 'numpy.complex64'> | <class 'numpy.float64'> | <class 'numpy.float64'> |
| <class 'numpy.complex128'> | <class 'numpy.float64'> | <class 'numpy.float64'> |
| <class 'numpy.complex256'> | <class 'numpy.float64'> | <class 'numpy.float64'> |
| <class 'numpy.int8'> | <class 'numpy.float32'> | <class 'numpy.float64'> |
| <class 'numpy.int16'> | <class 'numpy.float32'> | <class 'numpy.float64'> |
| <class 'numpy.int32'> | <class 'numpy.float64'> | <class 'numpy.float64'> |
| <class 'numpy.int64'> | <class 'numpy.float64'> | <class 'numpy.float64'> |
| <class 'numpy.uint8'> | <class 'numpy.float32'> | <class 'numpy.float64'> |
| <class 'numpy.uint16'> | <class 'numpy.float32'> | <class 'numpy.float64'> |
| <class 'numpy.uint32'> | <class 'numpy.float64'> | <class 'numpy.float64'> |
| <class 'numpy.uint64'> | <class 'numpy.float64'> | <class 'numpy.float64'> |
| <class 'numpy.float16'> | <class 'numpy.float32'> | <class 'numpy.float32'> |
| <class 'numpy.float32'> | <class 'numpy.float32'> | <class 'numpy.float32'> |
| <class 'numpy.float64'> | <class 'numpy.float64'> | <class 'numpy.float64'> |

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  float32 instead of float64 when decoding int16 with scale_factor netcdf var using xarray  343659822
1200314984 https://github.com/pydata/xarray/issues/2304#issuecomment-1200314984 https://api.github.com/repos/pydata/xarray/issues/2304 IC_kwDOAMm_X85Hi1po shoyer 1217238 2022-07-30T23:55:04Z 2022-07-30T23:55:04Z MEMBER

the unpacked data should match the type of these attributes, which must both be of type float or both be of type double. An additional restriction in this case is that the variable containing the packed data must be of type byte, short or int. It is not advised to unpack an int into a float as there is a potential precision loss.

I find this ambiguous. Is "float" above referring to float16 or float32? Is "double" referring to float64?

Yes, I'm pretty sure "float" means single precision (np.float32), given that "double" certainly means double precision (np.float64).

If so, then they do recommend float64, as requested by the OP, because the test data is short and the scale_factor is float64 (a.k.a double?)

Yes, I believe so.

The broader discussion here is about CF compliance. I find the spec ambiguous and xarray non-compliant. So many tests rely on the existing behavior, that I am unsure how best to proceed to improve compliance. I worry it may be a major refactor, and possibly break things relying on the existing behavior. I'd like to discuss architecture. Should this be in a new issue, if this closes with PR #6851? Should there be a new keyword for cf_strict or something?

I think we can treat this as a bug fix and just go forward with it. Yes, some people are going to be surprised, but I don't think it's disruptive enough that we need to go to a major effort to preserve backwards compatibility. It should already be straightforward to work around by setting decode_cf=False when opening a file and then explicitly calling xarray.decode_cf().
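For reference, a minimal sketch of that workaround, assuming a hypothetical packed.nc file: opening with decode_cf=False and then calling xarray.decode_cf() are the two real xarray calls mentioned above, and the intermediate astype is just one illustrative place to force the dtype you want before decoding.

```python
import numpy as np
import xarray as xr

# Open without CF decoding, adjust dtypes explicitly, then decode.
ds_raw = xr.open_dataset("packed.nc", decode_cf=False)   # hypothetical file name
ds_raw["var"] = ds_raw["var"].astype(np.float64)          # force the working dtype
ds = xr.decode_cf(ds_raw)                                  # scale_factor/add_offset applied here
```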

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  float32 instead of float64 when decoding int16 with scale_factor netcdf var using xarray  343659822
1200266255 https://github.com/pydata/xarray/issues/2304#issuecomment-1200266255 https://api.github.com/repos/pydata/xarray/issues/2304 IC_kwDOAMm_X85HipwP mankoff 145117 2022-07-30T17:58:51Z 2022-07-30T17:58:51Z CONTRIBUTOR

This issue, based on its title and initial post, is fixed by PR #6851. The code to select dtype was already correct, but the outer function that called it had a bug in the call.

Per the CF spec,

the unpacked data should match the type of these attributes, which must both be of type float or both be of type double. An additional restriction in this case is that the variable containing the packed data must be of type byte, short or int. It is not advised to unpack an int into a float as there is a potential precision loss.

I find this ambiguous. Is "float" above referring to float16 or float32? Is "double" referring to float64? If so, then they do recommend float64, as requested by the OP, because the test data is short and the scale_factor is float64 (a.k.a. double?)

The broader discussion here is about CF compliance. I find the spec ambiguous and xarray non-compliant. So many tests rely on the existing behavior, that I am unsure how best to proceed to improve compliance. I worry it may be a major refactor, and possibly break things relying on the existing behavior. I'd like to discuss architecture. Should this be in a new issue, if this closes with PR #6851? Should there be a new keyword for cf_strict or something?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  float32 instead of float64 when decoding int16 with scale_factor netcdf var using xarray  343659822
1189151229 https://github.com/pydata/xarray/issues/2304#issuecomment-1189151229 https://api.github.com/repos/pydata/xarray/issues/2304 IC_kwDOAMm_X85G4QH9 dcherian 2448579 2022-07-19T14:49:34Z 2022-07-19T14:49:34Z MEMBER

We'd happily take a PR implementing the suggestion above following CF-conventions.

Looking at the dtype for add_offset and scale_factor does seem like a much cleaner way to handle this issue. I think we should give that a try!

IIUC the change should be made here in _choose_float_dtype: https://github.com/pydata/xarray/blob/392a61484e80e6ccfd5774b68be51578077d4292/xarray/coding/variables.py#L266-L283
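For discussion, a hedged sketch (not xarray's actual code) of what a CF-aware version of that function might look like; the scale_offset_dtypes parameter is hypothetical and would carry the dtypes of any scale_factor/add_offset attributes found on the variable:

```python
import numpy as np

def _choose_float_dtype_cf(dtype, scale_offset_dtypes=()):
    """Pick the unpacked dtype, following the CF packed-data convention."""
    # CF: the unpacked data should match the dtype of scale_factor/add_offset
    # when those attributes are floating point.
    float_attr_dtypes = [d for d in scale_offset_dtypes
                         if np.issubdtype(d, np.floating)]
    if float_attr_dtypes:
        return np.result_type(*float_attr_dtypes)
    # Fall back to the current behaviour when no packing attributes exist.
    if dtype.itemsize <= 4 and np.issubdtype(dtype, np.floating):
        return np.float32
    if dtype.itemsize <= 2 and np.issubdtype(dtype, np.integer):
        return np.float32
    return np.float64
```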

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  float32 instead of float64 when decoding int16 with scale_factor netcdf var using xarray  343659822
1188529343 https://github.com/pydata/xarray/issues/2304#issuecomment-1188529343 https://api.github.com/repos/pydata/xarray/issues/2304 IC_kwDOAMm_X85G14S_ mankoff 145117 2022-07-19T02:35:30Z 2022-07-19T03:20:51Z CONTRIBUTOR

I've run into this issue too, and the xarray decision to use float32 is causing problems. I recognize this is a generic floating-point representation issue, but it could be avoided with float64.

The data value is 1395. The scale is 0.0001.

```python
val = int(1395)
scale = 0.0001
print(val * scale)                               # 0.1395
print(val * np.array(scale).astype(float))       # 0.1395
print(val * np.array(scale).astype(np.float16))  # 0.1395213...
print(val * np.array(scale).astype(np.float32))  # 0.13949999...
print(val * np.array(scale).astype(np.float64))  # 0.1395
```

Because we then apply * 1E3 and round(), the difference between 0.1395 and 0.1394999 (or 139.5 and 139.49) ends up being quite large in the downstream product.
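A minimal sketch of that downstream amplification, reusing the numbers above; after the * 1e3 and round() step the float32 and float64 paths differ by a whole unit:

```python
import numpy as np

val, scale = 1395, 0.0001
x32 = val * np.float32(scale)   # ~0.1394999...
x64 = val * np.float64(scale)   # 0.1395

print(round(float(x32) * 1e3))  # 139  (float32 path rounds down)
print(round(float(x64) * 1e3))  # 140  (float64 path rounds up)
```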

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  float32 instead of float64 when decoding int16 with scale_factor netcdf var using xarray  343659822
852069023 https://github.com/pydata/xarray/issues/2304#issuecomment-852069023 https://api.github.com/repos/pydata/xarray/issues/2304 MDEyOklzc3VlQ29tbWVudDg1MjA2OTAyMw== ACHMartin 18679148 2021-06-01T12:03:55Z 2021-06-07T20:48:00Z NONE

Dear all and thank you for your work on Xarray,

Related to @magau's comment, I have a netcdf with multiple variables in different formats (float, short, byte). Using open_mfdataset, 'short' and 'byte' are converted to 'float64' (no scaling, but some masking for the float data). It doesn't raise a major issue for me, but it is taking plenty of memory space for nothing.

Below is an example of the 3 formats (from ncdump -h):

```
short total_nobs(time, lat, lon) ;
    total_nobs:long_name = "Number of SSS in the time interval" ;
    total_nobs:valid_min = 0s ;
    total_nobs:valid_max = 10000s ;
float pct_var(time, lat, lon) ;
    pct_var:_FillValue = NaNf ;
    pct_var:long_name = "Percentage of SSS_variability that is expected to be not explained by the products" ;
    pct_var:units = "%" ;
    pct_var:valid_min = 0. ;
    pct_var:valid_max = 100. ;
byte sss_qc(time, lat, lon) ;
    sss_qc:long_name = "Sea Surface Salinity Quality, 0=Good; 1=Bad" ;
    sss_qc:valid_min = 0b ;
    sss_qc:valid_max = 1b ;
```

And how they appear after opening as an xarray Dataset using open_mfdataset:

```
total_nobs  (time, lat, lon) float64 dask.array<chunksize=(48, 584, 1388), meta=np.ndarray>
pct_var     (time, lat, lon) float32 dask.array<chunksize=(48, 584, 1388), meta=np.ndarray>
sss_qc      (time, lat, lon) float64 dask.array<chunksize=(48, 584, 1388), meta=np.ndarray>
```

Is there any recommendation? Regards

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  float32 instead of float64 when decoding int16 with scale_factor netcdf var using xarray  343659822
731253022 https://github.com/pydata/xarray/issues/2304#issuecomment-731253022 https://api.github.com/repos/pydata/xarray/issues/2304 MDEyOklzc3VlQ29tbWVudDczMTI1MzAyMg== psybot-ca 66918146 2020-11-20T15:59:13Z 2020-11-20T15:59:13Z NONE

Hey everyone, I stumbled on this while searching for approximately the same problem. Thought I'd share since the issue is still open. On my part, there are two situations that seem buggy. I haven't been using xarray for that long yet, so maybe there is something I'm missing here...

My first problem relates to the data types of dimensions with float notation. To give another answer to @shoyer's question:

To clarify: why is it a problem for you

It is a problem in my case because I would like to perform slicing operations on a dataset using longitude values from another dataset. This operation raises a "KeyError: not all values found in index 'longitude'", either because one dataset's longitude is float32 and the other's is float64, or because the float32 approximations are not exactly the same value in each dataset. I can work around this and assign new float64 coords after reading, and it works, though it is kind of a hassle considering I have to perform this thousands of times. This situation also creates a problem when concatenating multiple netCDF files together (along the time dim in my case): the discrepancies between the approximations of float32 values, or the float32 vs float64 situation, will add new dimension values where it shouldn't.

On the second part of my problem, it comes with writing/reading netCDF files (maybe more related to @daoudjahdou's problem). I tried to change the data type to float64 for all my files, save them, and then perform what I need to do, but for some reason, even though the dtype is float64 for all my dimensions when writing the files (using default args), it will sometimes be float32, sometimes float64 when reading the files (with default arg values) previously saved with float64 dtype. If using the default args, shouldn't the decoding make the dtype of dimensions the same for all files I read?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  float32 instead of float64 when decoding int16 with scale_factor netcdf var using xarray  343659822
462614107 https://github.com/pydata/xarray/issues/2304#issuecomment-462614107 https://api.github.com/repos/pydata/xarray/issues/2304 MDEyOklzc3VlQ29tbWVudDQ2MjYxNDEwNw== shoyer 1217238 2019-02-12T04:46:18Z 2019-02-12T04:46:47Z MEMBER

@magau thanks for pointing this out -- I think we simply missed this part of the CF conventions document!

Looking at the dtype for add_offset and scale_factor does seem like a much cleaner way to handle this issue. I think we should give that a try!

We will still need some fall-back choice for CFMaskCoder if neither an add_offset nor a scale_factor attribute is provided (due to xarray's representation of missing values as NaN), but this is a relatively uncommon situation.
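For illustration, a small sketch (with made-up values) of why a fall-back float dtype is still needed: replacing an integer _FillValue with NaN requires a floating-point result even when no scale_factor or add_offset is present.

```python
import numpy as np

packed = np.array([2194, 32767], dtype=np.int16)   # 32767 plays the _FillValue role
unpacked = packed.astype(np.float32)                 # fall-back float choice
unpacked[packed == 32767] = np.nan                   # NaN needs a float dtype
print(unpacked)                                      # [2194.   nan]
```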

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  float32 instead of float64 when decoding int16 with scale_factor netcdf var using xarray  343659822
462592638 https://github.com/pydata/xarray/issues/2304#issuecomment-462592638 https://api.github.com/repos/pydata/xarray/issues/2304 MDEyOklzc3VlQ29tbWVudDQ2MjU5MjYzOA== magau 791145 2019-02-12T02:48:00Z 2019-02-12T02:48:00Z NONE

Hi everyone, I've started using xarray recently, so I apologize if I'm saying something wrong... I've also faced the issue reported here, so I have tried to find some answers.

Unpacking netcdf files with respect to the NUG attributes (scale_factor and add_offset) is covered by the CF-Conventions directives, which are clear about which data type should be applied to the unpacked data: cf-conventions-1.7/packed-data. In this chapter you can read that: "If the scale_factor and add_offset attributes are of the same data type as the associated variable, the unpacked data is assumed to be of the same data type as the packed data. However, if the scale_factor and add_offset attributes are of a different data type from the variable (containing the packed data) then the unpacked data should match the type of these attributes".

In my opinion this should be the default behavior of the xarray.decode_cf function, which doesn't invalidate the idea of forcing the unpacked data dtype. However, neither the CFScaleOffsetCoder nor the CFMaskCoder de/encoder class seems to follow these CF directives, since the first one doesn't look at the scale_factor or add_offset dtypes, and the second one also changes the unpacked data dtype (maybe because nan values are being used to replace the fill values).

Sorry for such an extensive comment, without any solutions proposal... Regards! :+1:

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  float32 instead of float64 when decoding int16 with scale_factor netcdf var using xarray  343659822
410792506 https://github.com/pydata/xarray/issues/2304#issuecomment-410792506 https://api.github.com/repos/pydata/xarray/issues/2304 MDEyOklzc3VlQ29tbWVudDQxMDc5MjUwNg== Thomas-Z 1492047 2018-08-06T17:47:23Z 2019-01-09T15:18:36Z CONTRIBUTOR

To explain the full context and why it became some kind of a problem to us :

We're experimenting with the parquet format (via pyarrow) and we first did something like : netcdf file -> netcdf4 -> pandas -> pyarrow -> pandas (when read later on).

We're now looking at xarray and the huge ease of access it offers to netcdf like data and we tried something similar : netcdf file -> xarray -> pandas -> pyarrow -> pandas (when read later on).

Our problem appears when we're reading and comparing the data stored with these 2 approaches. The difference between the two was - sometimes - larger than what we expected/considered acceptable (10e-6 for float32 if I'm not mistaken). We're not constraining any type, letting the system and modules decide how to encode what, and in the end we have significantly different values.

There might be something wrong in our process, but it originates here with this float32/float64 choice, so we thought it might be a problem.

Thanks for taking the time to look into this.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  float32 instead of float64 when decoding int16 with scale_factor netcdf var using xarray  343659822
451984471 https://github.com/pydata/xarray/issues/2304#issuecomment-451984471 https://api.github.com/repos/pydata/xarray/issues/2304 MDEyOklzc3VlQ29tbWVudDQ1MTk4NDQ3MQ== DevDaoud 971382 2019-01-07T16:04:11Z 2019-01-07T16:04:11Z NONE

Hi, thank you for your effort in making xarray a great library. As mentioned in the issue, the discussion moved to a PR in order to make xr.open_dataset parameterizable. This post is to ask for your recommendations regarding our PR.

In this case we would add a parameter to the open_dataset function called "force_promote", a boolean that is False by default and thus not mandatory, and then propagate that parameter down to the function maybe_promote in dtypes.py, where we say the following:

```python
if dtype.itemsize <= 2 and not force_promote:
    dtype = np.float32
else:
    dtype = np.float64
```

The downside of that is that we somehow pollute the code with a parameter that is used in a specific case.

The second approach would check the value of an environment variable called "XARRAY_FORCE_PROMOTE"; if it exists and is set to true, it would force promoting the type to float64.

Please tell us which approach best suits your vision of xarray.

Regards.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  float32 instead of float64 when decoding int16 with scale_factor netcdf var using xarray  343659822
411385081 https://github.com/pydata/xarray/issues/2304#issuecomment-411385081 https://api.github.com/repos/pydata/xarray/issues/2304 MDEyOklzc3VlQ29tbWVudDQxMTM4NTA4MQ== Thomas-Z 1492047 2018-08-08T12:18:02Z 2018-08-22T07:14:58Z CONTRIBUTOR

So, a more complete example showing this problem. NetCDF file used in the example: test.nc.zip

````python
from netCDF4 import Dataset
import xarray as xr
import numpy as np
import pandas as pd

d = Dataset("test.nc")
v = d.variables['var']

print(v)
# <class 'netCDF4._netCDF4.Variable'>
# int16 var(idx)
#     _FillValue: 32767
#     scale_factor: 0.01
# unlimited dimensions:
# current shape = (2,)
# filling on

df_nc = pd.DataFrame(data={'var': v[:]})
print(df_nc)
#      var
# 0  21.94
# 1  27.04

ds = xr.open_dataset("test.nc")
df_xr = ds['var'].to_dataframe()

# Comparing both dataframes with float32 precision (1e-6)
mask = np.isclose(df_nc['var'], df_xr['var'], rtol=0, atol=1e-6)
print(mask)
# [False  True]

print(df_xr)
#            var
# idx
# 0    21.939999
# 1    27.039999

# Changing the type and rounding the xarray dataframe
df_xr2 = df_xr.astype(np.float64).round(int(np.ceil(-np.log10(ds['var'].encoding['scale_factor']))))
mask = np.isclose(df_nc['var'], df_xr2['var'], rtol=0, atol=1e-6)
print(mask)
# [ True  True]

print(df_xr2)
#        var
# idx
# 0    21.94
# 1    27.04
````

As you can see, the problem appears early in the process (not related to the way data are stored in parquet later on) and yes, rounding values does solve it.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  float32 instead of float64 when decoding int16 with scale_factor netcdf var using xarray  343659822
412495621 https://github.com/pydata/xarray/issues/2304#issuecomment-412495621 https://api.github.com/repos/pydata/xarray/issues/2304 MDEyOklzc3VlQ29tbWVudDQxMjQ5NTYyMQ== fmaussion 10050469 2018-08-13T12:04:10Z 2018-08-13T12:04:10Z MEMBER

I think we are still talking about different things. In the example by @Thomas-Z above there is still a problem at the line:

```python
# Comparing both dataframes with float32 precision (1e-6)
mask = np.isclose(df_nc['var'], df_xr['var'], rtol=0, atol=1e-6)
```

As discussed several times above, this test is misleading: it should assert for atol=0.01, which is the real accuracy of the underlying data. For this purpose float32 is more than good enough.
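For example, redoing the comparison at the data's real accuracy (atol=0.01, the scale_factor) makes the float32 and float64 decodings agree; the first value below is the float32 result quoted elsewhere in this thread:

```python
import numpy as np

print(np.isclose(21.939998626708984, 21.94, rtol=0, atol=1e-6))  # False
print(np.isclose(21.939998626708984, 21.94, rtol=0, atol=0.01))  # True
```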

@shoyer said:

I would be happy to add options for whether to default to float32 or float64 precision.

so we would welcome a PR in this direction! I don't think we need to change the default behavior though, as there is a slight possibility that some people are relying on the data being float32.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  float32 instead of float64 when decoding int16 with scale_factor netcdf var using xarray  343659822
412492776 https://github.com/pydata/xarray/issues/2304#issuecomment-412492776 https://api.github.com/repos/pydata/xarray/issues/2304 MDEyOklzc3VlQ29tbWVudDQxMjQ5Mjc3Ng== DevDaoud 971382 2018-08-13T11:51:15Z 2018-08-13T11:51:15Z NONE

Any updates about this ?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  float32 instead of float64 when decoding int16 with scale_factor netcdf var using xarray  343659822
410807622 https://github.com/pydata/xarray/issues/2304#issuecomment-410807622 https://api.github.com/repos/pydata/xarray/issues/2304 MDEyOklzc3VlQ29tbWVudDQxMDgwNzYyMg== shoyer 1217238 2018-08-06T18:33:06Z 2018-08-06T18:33:06Z MEMBER

Please let us know if converting to float64 explicitly and rounding again does not solve this issue for you.


{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  float32 instead of float64 when decoding int16 with scale_factor netcdf var using xarray  343659822
410787443 https://github.com/pydata/xarray/issues/2304#issuecomment-410787443 https://api.github.com/repos/pydata/xarray/issues/2304 MDEyOklzc3VlQ29tbWVudDQxMDc4NzQ0Mw== shoyer 1217238 2018-08-06T17:31:22Z 2018-08-06T17:31:22Z MEMBER

Both multiplying by 0.01 and float32 -> float64 are approximately equivalently expensive. The cost is dominated by the memory copy.


{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  float32 instead of float64 when decoding int16 with scale_factor netcdf var using xarray  343659822
410782982 https://github.com/pydata/xarray/issues/2304#issuecomment-410782982 https://api.github.com/repos/pydata/xarray/issues/2304 MDEyOklzc3VlQ29tbWVudDQxMDc4Mjk4Mg== dopplershift 221526 2018-08-06T17:17:38Z 2018-08-06T17:17:38Z CONTRIBUTOR

Ah, ok, not scaling per-se (i.e. * 0.01), but a second round of value conversion.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  float32 instead of float64 when decoding int16 with scale_factor netcdf var using xarray  343659822
410781556 https://github.com/pydata/xarray/issues/2304#issuecomment-410781556 https://api.github.com/repos/pydata/xarray/issues/2304 MDEyOklzc3VlQ29tbWVudDQxMDc4MTU1Ng== shoyer 1217238 2018-08-06T17:13:27Z 2018-08-06T17:13:27Z MEMBER

I'm not following why the data are scaled twice.

We automatically scale the data from int16->float32 upon reading it in xarray (if decode_cf=True). There's no way to turn that off and still get automatic scaling, so the best you can do is layer on int16->float32->float64, when you might prefer to only do int16->float64.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  float32 instead of float64 when decoding int16 with scale_factor netcdf var using xarray  343659822
410779271 https://github.com/pydata/xarray/issues/2304#issuecomment-410779271 https://api.github.com/repos/pydata/xarray/issues/2304 MDEyOklzc3VlQ29tbWVudDQxMDc3OTI3MQ== dopplershift 221526 2018-08-06T17:06:22Z 2018-08-06T17:06:22Z CONTRIBUTOR

I'm not following why the data are scaled twice.

Your point about the rounding being different is well-taken, though.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  float32 instead of float64 when decoding int16 with scale_factor netcdf var using xarray  343659822
410777201 https://github.com/pydata/xarray/issues/2304#issuecomment-410777201 https://api.github.com/repos/pydata/xarray/issues/2304 MDEyOklzc3VlQ29tbWVudDQxMDc3NzIwMQ== shoyer 1217238 2018-08-06T17:00:01Z 2018-08-06T17:00:01Z MEMBER

But since it's a downstream calculation issue, and does not impact the actual precision of what's being read from the file, what's wrong with saying "Use data.astype(np.float64)". It's completely identical to doing it internally to xarray.

It's almost but not quite identical. The difference is that the data gets scaled twice. This adds twice the overhead for scaling the values (which to be fair is usually negligible compared to IO).

Also, to get exactly equivalent numerics for further computation you would need to round again, e.g., data.astype(np.float64).round(np.ceil(-np.log10(data.encoding['scale_factor']))). This starts to get a little messy :).

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  float32 instead of float64 when decoding int16 with scale_factor netcdf var using xarray  343659822
410774955 https://github.com/pydata/xarray/issues/2304#issuecomment-410774955 https://api.github.com/repos/pydata/xarray/issues/2304 MDEyOklzc3VlQ29tbWVudDQxMDc3NDk1NQ== dopplershift 221526 2018-08-06T16:52:42Z 2018-08-06T16:52:53Z CONTRIBUTOR

@shoyer But since it's a downstream calculation issue, and does not impact the actual precision of what's being read from the file, what's wrong with saying "Use data.astype(np.float64)". It's completely identical to doing it internally to xarray.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  float32 instead of float64 when decoding int16 with scale_factor netcdf var using xarray  343659822
410773312 https://github.com/pydata/xarray/issues/2304#issuecomment-410773312 https://api.github.com/repos/pydata/xarray/issues/2304 MDEyOklzc3VlQ29tbWVudDQxMDc3MzMxMg== shoyer 1217238 2018-08-06T16:47:11Z 2018-08-06T16:47:22Z MEMBER

A float32 values has 24 bits of precision in the significand, which is more than enough to store the 16-bits in in the original data; the exponent (8 bits) will more or less take care of the * 0.01:

Right. The actual raw data is being stored as the integer 2194 (along with the scale factor of 0.01). Both 21.939998626708984 (as float32) and 21.940000000000001 (as float64) are floating point approximations of the exact decimal number 21.94.

I would be happy to add options for whether to default to float32 or float64 precision. There are clearly tradeoffs here:

  • float32 uses half the memory
  • float64 has more precision for downstream computation

I don't think we can make a statement about which is better in general. The best we can do is make an educated guess about which will be more useful / less surprising for most and/or new users, and pick that as the default.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  float32 instead of float64 when decoding int16 with scale_factor netcdf var using xarray  343659822
410769706 https://github.com/pydata/xarray/issues/2304#issuecomment-410769706 https://api.github.com/repos/pydata/xarray/issues/2304 MDEyOklzc3VlQ29tbWVudDQxMDc2OTcwNg== dopplershift 221526 2018-08-06T16:34:44Z 2018-08-06T16:36:16Z CONTRIBUTOR

A float32 value has 24 bits of precision in the significand, which is more than enough to store the 16 bits in the original data; the exponent (8 bits) will more or less take care of the * 0.01:

```python
>>> import numpy as np
>>> np.float32(2194 * 0.01)
21.94
```

What you're seeing is an artifact of printing out the values. I have no idea why something is printing out a float (only 7 decimal digits) out to 17 digits. Even float64 only has 16 digits (which is overkill for this application).

The difference in subtracting the 32- and 64-bit values above is in the 8th decimal place, which is beyond the actual precision of the data; what you've just demonstrated is the difference in precision between 32-bit and 64-bit values, but it has nothing whatsoever to do with the data.

If you're really worried about precision round-off for things like std. dev, you should probably calculate it using the raw integer values and scale afterwards. (I don't actually think this is necessary, though.)
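A minimal sketch of that last suggestion, with made-up packed integers and the 0.01 scale factor from the example:

```python
import numpy as np

packed = np.array([2194, 2704, 2500], dtype=np.int16)  # made-up raw integers
scale_factor = 0.01

# Standard deviation scales linearly, so compute it on the raw integers
# (promoted to float64 to avoid overflow) and apply the scale once at the end.
std = packed.astype(np.float64).std() * scale_factor
print(std)
```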

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  float32 instead of float64 when decoding int16 with scale_factor netcdf var using xarray  343659822
410680371 https://github.com/pydata/xarray/issues/2304#issuecomment-410680371 https://api.github.com/repos/pydata/xarray/issues/2304 MDEyOklzc3VlQ29tbWVudDQxMDY4MDM3MQ== fmaussion 10050469 2018-08-06T11:41:38Z 2018-08-06T11:41:38Z MEMBER

As mentioned in the original issue the modification is straightforward. Any ideas if this could be integrated to xarray anytime soon ?

Some people might prefer float32, so it is not as straightforward as it seems. It might be possible to add an option for this, but I didn't look into the details.

You'll have a float64 in the end but you won't get your precision back

Note that this is a false sense of precision, because in the example above the compression used is lossy, i.e. precision was lost at compression and the actual precision is now 0.01:

```
short agc_40hz(time, meas_ind) ;
    agc_40hz:_FillValue = 32767s ;
    agc_40hz:units = "dB" ;
    agc_40hz:scale_factor = 0.01 ;
```

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  float32 instead of float64 when decoding int16 with scale_factor netcdf var using xarray  343659822
410678021 https://github.com/pydata/xarray/issues/2304#issuecomment-410678021 https://api.github.com/repos/pydata/xarray/issues/2304 MDEyOklzc3VlQ29tbWVudDQxMDY3ODAyMQ== DevDaoud 971382 2018-08-06T11:31:00Z 2018-08-06T11:31:00Z NONE

As mentioned in the original issue the modification is straightforward. Any ideas if this could be integrated to xarray anytime soon ?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  float32 instead of float64 when decoding int16 with scale_factor netcdf var using xarray  343659822
410675562 https://github.com/pydata/xarray/issues/2304#issuecomment-410675562 https://api.github.com/repos/pydata/xarray/issues/2304 MDEyOklzc3VlQ29tbWVudDQxMDY3NTU2Mg== Thomas-Z 1492047 2018-08-06T11:19:30Z 2018-08-06T11:19:30Z CONTRIBUTOR

You're right when you say

Note that it's very easy to later convert from float32 to float64, e.g., by writing ds.astype(np.float64).

You'll have a float64 in the end, but you won't get your precision back, and it might be a problem in some cases.
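A small sketch of that point, reusing the values from the test.nc example above: once the scaling has happened in float32, casting the result to float64 keeps the float32 rounding error.

```python
import numpy as np

x32 = np.int16(2194) * np.float32(0.01)   # scaled in float32, as during decoding
print(np.float64(x32))                    # 21.939998626708984 -- the float32 error is baked in
print(np.float64(2194) * 0.01)            # 21.94
```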

I understand the benefits of using float32 on the memory side but it is kind of a problem for us each time we have variables using scale factors.

I'm surprised this issue (if considered as one) does not bother more people.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  float32 instead of float64 when decoding int16 with scale_factor netcdf var using xarray  343659822
407092265 https://github.com/pydata/xarray/issues/2304#issuecomment-407092265 https://api.github.com/repos/pydata/xarray/issues/2304 MDEyOklzc3VlQ29tbWVudDQwNzA5MjI2NQ== DevDaoud 971382 2018-07-23T15:10:13Z 2018-07-23T15:10:13Z NONE

Thank you for your quick answer. In our case we might evaluate standard deviations or sums of squares over long lists of values, and the accumulation of those small float32 differences could create considerable discrepancies.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  float32 instead of float64 when decoding int16 with scale_factor netcdf var using xarray  343659822
407087615 https://github.com/pydata/xarray/issues/2304#issuecomment-407087615 https://api.github.com/repos/pydata/xarray/issues/2304 MDEyOklzc3VlQ29tbWVudDQwNzA4NzYxNQ== shoyer 1217238 2018-07-23T14:57:20Z 2018-07-23T14:57:20Z MEMBER

To clarify: why is it a problem for you to get floating point values like 21.939998626708984 instead of 21.940000000000001? Is it a loss of precision in some downstream calculation? Both numbers are accurate well within the precision indicated by the netCDF file (0.01).

Note that it's very easy to later convert from float32 to float64, e.g., by writing ds.astype(np.float64).

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  float32 instead of float64 when decoding int16 with scale_factor netcdf var using xarray  343659822

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);