issue_comments

3 rows where issue = 343659822 (float32 instead of float64 when decoding int16 with scale_factor netcdf var using xarray) and user = 1492047 (Thomas-Z), sorted by updated_at descending

Comment 410792506 — Thomas-Z (CONTRIBUTOR) — created 2018-08-06T17:47:23Z, updated 2019-01-09T15:18:36Z
https://github.com/pydata/xarray/issues/2304#issuecomment-410792506

To explain the full context and why this became something of a problem for us:

We're experimenting with the Parquet format (via pyarrow), and we first did something like: netCDF file -> netCDF4 -> pandas -> pyarrow -> pandas (when read later on).

We're now looking at xarray and the huge ease of access it offers to netCDF-like data, and we tried something similar: netCDF file -> xarray -> pandas -> pyarrow -> pandas (when read later on).

Our problem appears when we read and compare the data stored with these two approaches. The difference between the two was, sometimes, larger than expected/acceptable (1e-6 for float32, if I'm not mistaken). We're not constraining any types, letting the system and modules decide how to encode things, and in the end we get significantly different values.

There might be something wrong in our process, but it originates here with this float32/float64 choice, so we thought it might be a problem.
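To make the magnitude of the discrepancy concrete, here is a minimal, self-contained sketch (the packed value 2194 and scale factor 0.01 are illustrative, chosen to match the test file posted later in this thread) of decoding the same packed int16 in float32 versus float64:

```python
import numpy as np

raw = np.int16(2194)   # packed value as stored in the netCDF variable
scale = 0.01           # scale_factor attribute

dec32 = np.float32(raw) * np.float32(scale)  # float32 decoding path
dec64 = np.float64(raw) * np.float64(scale)  # float64 decoding path

print(dec32, dec64)  # ~21.939999 vs ~21.94
print(abs(float(dec32) - float(dec64)) > 1e-6)  # True: the gap exceeds 1e-6
```

The absolute error introduced by the float32 path is on the order of a float32 ulp at this magnitude (~1.9e-6), which is exactly the scale of discrepancy described above.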

Thanks for taking the time to look into this.

Comment 411385081 — Thomas-Z (CONTRIBUTOR) — created 2018-08-08T12:18:02Z, updated 2018-08-22T07:14:58Z
https://github.com/pydata/xarray/issues/2304#issuecomment-411385081

So, here is a more complete example showing the problem. NetCDF file used in the example: test.nc.zip

````python
from netCDF4 import Dataset
import xarray as xr
import numpy as np
import pandas as pd

d = Dataset("test.nc")
v = d.variables['var']

print(v)
# <class 'netCDF4._netCDF4.Variable'>
# int16 var(idx)
#     _FillValue: 32767
#     scale_factor: 0.01
# unlimited dimensions:
# current shape = (2,)
# filling on

df_nc = pd.DataFrame(data={'var': v[:]})
print(df_nc)
#      var
# 0  21.94
# 1  27.04

ds = xr.open_dataset("test.nc")
df_xr = ds['var'].to_dataframe()

# Comparing both dataframes with float32 precision (1e-6)
mask = np.isclose(df_nc['var'], df_xr['var'], rtol=0, atol=1e-6)
print(mask)
# [False  True]

print(df_xr)
#            var
# idx
# 0    21.939999
# 1    27.039999

# Changing the type and rounding the xarray dataframe
ndecimals = int(np.ceil(-np.log10(ds['var'].encoding['scale_factor'])))
df_xr2 = df_xr.astype(np.float64).round(ndecimals)
mask = np.isclose(df_nc['var'], df_xr2['var'], rtol=0, atol=1e-6)
print(mask)
# [ True  True]

print(df_xr2)
#        var
# idx
# 0    21.94
# 1    27.04
````

As you can see, the problem appears early in the process (not related to the way data are stored in parquet later on) and yes, rounding values does solve it.
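The rounding recipe above can be distilled into a small numpy-only helper. This is a sketch with illustrative values (the same packed int16 values as the test file); `round_to_scale` is a hypothetical name, and the number of decimals is derived from scale_factor exactly as in the snippet above:

```python
import numpy as np

def round_to_scale(values, scale_factor):
    """Round float32-decoded values to the precision implied by scale_factor."""
    decimals = int(np.ceil(-np.log10(scale_factor)))
    return np.round(np.asarray(values, dtype=np.float64), decimals)

raw = np.array([2194, 2704], dtype=np.int16)
dec32 = raw.astype(np.float32) * np.float32(0.01)  # float32 decoding, ~21.939999
dec64 = raw.astype(np.float64) * np.float64(0.01)  # float64 reference, ~21.94

print(np.allclose(dec32, dec64, rtol=0, atol=1e-6))                        # False
print(np.allclose(round_to_scale(dec32, 0.01), dec64, rtol=0, atol=1e-6))  # True
```

Note this only recovers agreement because scale_factor bounds the meaningful precision of the packed data; it does not recover information genuinely lost to float32 rounding.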

Comment 410675562 — Thomas-Z (CONTRIBUTOR) — created 2018-08-06T11:19:30Z, updated 2018-08-06T11:19:30Z
https://github.com/pydata/xarray/issues/2304#issuecomment-410675562

You're right when you say:

> Note that it's very easy to later convert from float32 to float64, e.g., by writing ds.astype(np.float64).

You'll have a float64 in the end, but you won't get your precision back, and that might be a problem in some cases.

I understand the memory benefits of using float32, but it is something of a problem for us every time we have variables using scale factors.

I'm surprised this issue (if it is considered one) does not bother more people.
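A quick numpy sketch of that point (illustrative values again): once a value has been rounded to float32, casting it back to float64 preserves the float32 approximation, not the original value.

```python
import numpy as np

dec32 = np.float32(2194) * np.float32(0.01)  # decoded at float32: ~21.939999
back = dec32.astype(np.float64)              # upcast after the fact

print(back == 21.94)             # False: still ~21.93999862670898
print(abs(back - 21.94) > 1e-6)  # True: the lost digits do not come back
```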

