issue_comments

13 rows where issue = 942738904 sorted by updated_at descending

user 5

  • ohsqueezy 5
  • kmuehlbauer 3
  • shoyer 2
  • max-sixty 2
  • keewis 1

author_association 2

  • MEMBER 8
  • NONE 5

issue 1

  • Decoding netCDF is giving incorrect values for a large file · 13 ✖
id html_url issue_url node_id user created_at updated_at ▲ author_association body reactions performed_via_github_app issue
1492937244 https://github.com/pydata/xarray/issues/5597#issuecomment-1492937244 https://api.github.com/repos/pydata/xarray/issues/5597 IC_kwDOAMm_X85Y_Goc kmuehlbauer 5821660 2023-04-01T11:03:02Z 2023-04-01T11:03:02Z MEMBER

> To fix this, I think logic in _choose_float_dtype should be updated to look at encoding['dtype'] (if available) instead of dtype, in order to understand how the data was originally stored.

#7654 aims to address this.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Decoding netCDF is giving incorrect values for a large file 942738904
879561954 https://github.com/pydata/xarray/issues/5597#issuecomment-879561954 https://api.github.com/repos/pydata/xarray/issues/5597 MDEyOklzc3VlQ29tbWVudDg3OTU2MTk1NA== shoyer 1217238 2021-07-14T03:43:37Z 2021-07-14T03:44:00Z MEMBER

Thanks for sharing the subset netCDF file, that is very helpful for debugging indeed!

The weird thing is that the dtype picking logic seems to have a special case which, per the code comment, suggests we want to be using float64 here: https://github.com/pydata/xarray/blob/eea76733770be03e78a0834803291659136bca31/xarray/coding/variables.py#L231-L238

But in fact, the dtype picking logic doesn't do that, because the dtype is already converted into float32, first. The culprit seems to be this line in CFMaskCoder, which promotes the dtype to float32 to fit a fill-value of NaN: https://github.com/pydata/xarray/blob/eea76733770be03e78a0834803291659136bca31/xarray/coding/variables.py#L202

To fix this, I think logic in _choose_float_dtype should be updated to look at encoding['dtype'] (if available) instead of dtype, in order to understand how the data was originally stored.
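
The proposed change could look roughly like the sketch below. This is only an illustration, not xarray's actual code: the function name `choose_float_dtype_sketch` and its signature are hypothetical, and the size thresholds are assumptions based on the code comment linked above (small integers with an offset should decode to float64).

```python
import numpy as np


def choose_float_dtype_sketch(current_dtype, encoding, has_offset):
    """Hypothetical helper: pick a float dtype for decoding packed data.

    Prefers the on-disk dtype recorded in encoding["dtype"] (if available)
    over the in-memory dtype, which may already have been promoted.
    """
    stored = np.dtype(encoding.get("dtype", current_dtype))
    # Small floats keep float32 precision.
    if np.issubdtype(stored, np.floating) and stored.itemsize <= 4:
        return np.float32
    # Small integers without an offset fit in float32 (assumed threshold).
    if np.issubdtype(stored, np.integer) and stored.itemsize <= 2 and not has_offset:
        return np.float32
    # Everything else, including int16 with an add_offset, uses float64.
    return np.float64


# The in-memory dtype is float32 after masking, but the file stored int16.
encoding = {"dtype": np.dtype("int16"), "add_offset": 20500023.17537729}
chosen = choose_float_dtype_sketch(np.dtype("float32"), encoding, has_offset=True)
without_encoding = choose_float_dtype_sketch(np.dtype("float32"), {}, has_offset=True)
print(chosen, without_encoding)
```

With the stored dtype available the sketch selects float64; when only the already promoted float32 dtype is visible it falls back to float32, which is the behavior described in this comment.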

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Decoding netCDF is giving incorrect values for a large file 942738904
879554360 https://github.com/pydata/xarray/issues/5597#issuecomment-879554360 https://api.github.com/repos/pydata/xarray/issues/5597 MDEyOklzc3VlQ29tbWVudDg3OTU1NDM2MA== ohsqueezy 1373406 2021-07-14T03:19:53Z 2021-07-14T03:19:53Z NONE

That explains it to me! Not sure if it's still useful but I exported the subset as a netCDF file.
```python
In [59]: packed_vals = xarray.open_dataset("packed_solar_data_subset.nc", mask_and_scale=False).ssrd.values

In [60]: packed_vals[0] * numpy.float32(e["scale_factor"]) + numpy.float32(e["add_offset"])
Out[60]: 2.0

In [61]: packed_vals[0] * numpy.float64(e["scale_factor"]) + numpy.float64(e["add_offset"])
Out[61]: 0.0
```
Hm actually I think converting the packed vals to 64 bit and then decoding does what I'm looking for
```python
In [62]: xarray.decode_cf(xarray.open_dataset("packed_solar_data_subset.nc", mask_and_scale=False).astype(numpy.float64)).ssrd.values
Out[62]: array([ 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 25651.61906215, 354743.1221522 , 1091757.933255 , 2170377.23235622, 3482363.69999847, 4704882.32554591, 5689654.23783437, 6297785.304381 , 6534906.36839455, 6543665.4578304 , 6543665.4578304 ])
```

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Decoding netCDF is giving incorrect values for a large file 942738904
879361320 https://github.com/pydata/xarray/issues/5597#issuecomment-879361320 https://api.github.com/repos/pydata/xarray/issues/5597 MDEyOklzc3VlQ29tbWVudDg3OTM2MTMyMA== shoyer 1217238 2021-07-13T19:58:39Z 2021-07-13T19:58:39Z MEMBER

This may just be the expected floating point error from using float32:
```
In [5]: import numpy as np

In [6]: -32766 * np.float32(625.6492454183389) + np.float32(20500023.17537729)
Out[6]: 1.2984619140625
```

If you use full float64, then the data does decode to 0.0:
```
In [7]: -32766 * np.float64(625.6492454183389) + np.float64(20500023.17537729)
Out[7]: 0.0
```

So the question then is why this ends up being decoded using float32 instead of float64, and if that logic should be adjusted or made customizable: https://github.com/pydata/xarray/blob/eea76733770be03e78a0834803291659136bca31/xarray/coding/variables.py#L225
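
A quick way to see how large that error can get is to look at the gap between adjacent representable values near the offset. The sketch below uses only `np.spacing` and the `add_offset` reported in this thread. float32 has a 24-bit significand, so for magnitudes between 2**24 and 2**25 neighbouring float32 values are 2 apart.

```python
import numpy as np

add_offset = 20500023.17537729  # add_offset reported for ssrd in this thread

# Gap between adjacent representable values near add_offset, per dtype.
gap32 = np.spacing(np.float32(add_offset))
gap64 = np.spacing(np.float64(add_offset))
# Error introduced just by storing the offset as float32.
offset_error = float(np.float32(add_offset)) - add_offset
print(gap32, gap64, offset_error)
```

Any float32 result of this magnitude is rounded to a multiple of 2, so a decoded value that should be 0 can come out as 2.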

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Decoding netCDF is giving incorrect values for a large file 942738904
879282134 https://github.com/pydata/xarray/issues/5597#issuecomment-879282134 https://api.github.com/repos/pydata/xarray/issues/5597 MDEyOklzc3VlQ29tbWVudDg3OTI4MjEzNA== ohsqueezy 1373406 2021-07-13T17:49:09Z 2021-07-13T17:49:09Z NONE

sure, no prob
```python
$ xarray.open_dataset("BIG_FILE_packed.nc").ssrd.encoding
{'source': 'BIG_FILE_packed.nc', 'original_shape': (743, 1801, 3600), 'dtype': dtype('int16'), 'missing_value': -32767, '_FillValue': -32767, 'scale_factor': 625.6492454183389, 'add_offset': 20500023.17537729}
```

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Decoding netCDF is giving incorrect values for a large file 942738904
879199563 https://github.com/pydata/xarray/issues/5597#issuecomment-879199563 https://api.github.com/repos/pydata/xarray/issues/5597 MDEyOklzc3VlQ29tbWVudDg3OTE5OTU2Mw== keewis 14808389 2021-07-13T15:45:05Z 2021-07-13T15:45:05Z MEMBER

can you post the value of xarray.open_dataset("BIG_FILE_packed.nc").ssrd.encoding?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Decoding netCDF is giving incorrect values for a large file 942738904
878920991 https://github.com/pydata/xarray/issues/5597#issuecomment-878920991 https://api.github.com/repos/pydata/xarray/issues/5597 MDEyOklzc3VlQ29tbWVudDg3ODkyMDk5MQ== ohsqueezy 1373406 2021-07-13T09:16:03Z 2021-07-13T09:16:03Z NONE

h5netcdf seems to be a separate issue for me, as it gives me the error `OSError: Unable to open file (file signature not found)`. I looked into it once though, and I think I might be able to fix that. I'll also see if I can build a small netCDF that has reproducible behavior!

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Decoding netCDF is giving incorrect values for a large file 942738904
878914016 https://github.com/pydata/xarray/issues/5597#issuecomment-878914016 https://api.github.com/repos/pydata/xarray/issues/5597 MDEyOklzc3VlQ29tbWVudDg3ODkxNDAxNg== kmuehlbauer 5821660 2021-07-13T09:07:14Z 2021-07-13T09:07:14Z MEMBER

@ohsqueezy You might also try engine="h5netcdf" (h5py/h5netcdf packages needed). And would it be possible to create a small subset of that file via netCDF4 to share?

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Decoding netCDF is giving incorrect values for a large file 942738904
878910801 https://github.com/pydata/xarray/issues/5597#issuecomment-878910801 https://api.github.com/repos/pydata/xarray/issues/5597 MDEyOklzc3VlQ29tbWVudDg3ODkxMDgwMQ== ohsqueezy 1373406 2021-07-13T09:03:04Z 2021-07-13T09:03:04Z NONE

Thanks for your help!

I checked using the netCDF4 module, and the data is returned correctly
```python
$ d = netCDF4.Dataset("BIG_FILE_packed.nc")
$ d["ssrd"][d["time"][:] < d["time"][24], d["latitude"][:] == 44.8, d["longitude"][:] == 287.1]

masked_array( data=[[[ 0. ]], [[ 0. ]], [[ 0. ]], [[ 0. ]], [[ 0. ]], [[ 0. ]], [[ 0. ]], [[ 0. ]], [[ 0. ]], [[ 0. ]], [[ 0. ]], [[ 0. ]], [[ 25651.61906215]], [[ 354743.1221522 ]], [[1091757.933255 ]], [[2170377.23235622]], [[3482363.69999847]], [[4704882.32554591]], [[5689654.23783437]], [[6297785.304381 ]], [[6534906.36839455]], [[6543665.4578304 ]], [[6543665.4578304 ]], [[6543665.4578304 ]]], mask=False, fill_value=1e+20)
```
I tried with `scipy` as the engine, and it still returns the 2 values
```python
$ xarray.open_dataset("BIG_FILE_packed.nc", engine="scipy").ssrd.sel(latitude=44.8, longitude=287.1, method="nearest").values[:23]
array([2.000000e+00, 2.000000e+00, 2.000000e+00, 2.000000e+00, 2.000000e+00, 2.000000e+00, 2.000000e+00, 2.000000e+00, 2.000000e+00, 2.000000e+00, 2.000000e+00, 2.000000e+00, 2.565200e+04, 3.547440e+05, 1.091760e+06, 2.170378e+06, 3.482364e+06, 4.704884e+06, 5.689655e+06, 6.297786e+06, 6.534908e+06, 6.543667e+06, 6.543667e+06], dtype=float32)
```
I should mention that in another large packed dataset from this API, I have gotten the same error but with a very small decimal value in place of the zero instead of 2.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Decoding netCDF is giving incorrect values for a large file 942738904
878833159 https://github.com/pydata/xarray/issues/5597#issuecomment-878833159 https://api.github.com/repos/pydata/xarray/issues/5597 MDEyOklzc3VlQ29tbWVudDg3ODgzMzE1OQ== max-sixty 5635139 2021-07-13T07:03:43Z 2021-07-13T07:03:43Z MEMBER

Thanks. Does passing different values to engine= make any difference? I suspect the issue is at that layer.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Decoding netCDF is giving incorrect values for a large file 942738904
878830308 https://github.com/pydata/xarray/issues/5597#issuecomment-878830308 https://api.github.com/repos/pydata/xarray/issues/5597 MDEyOklzc3VlQ29tbWVudDg3ODgzMDMwOA== kmuehlbauer 5821660 2021-07-13T06:58:24Z 2021-07-13T06:58:24Z MEMBER

@ohsqueezy Does this issue also show up, if just plain netCDF4 is used to open the files?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Decoding netCDF is giving incorrect values for a large file 942738904
878824626 https://github.com/pydata/xarray/issues/5597#issuecomment-878824626 https://api.github.com/repos/pydata/xarray/issues/5597 MDEyOklzc3VlQ29tbWVudDg3ODgyNDYyNg== ohsqueezy 1373406 2021-07-13T06:46:55Z 2021-07-13T06:46:55Z NONE

That example is actually a different file than the original. I unpacked the original file externally using ncpdq -U BIG_FILE_packed.nc BIG_FILE_unpacked.nc before opening it with xarray, so the decoding step is skipped and there aren't any 2 values generated. The data is correct using that method, so it's a possible workaround, but unpacking externally makes each file 4x larger.

In all the examples, the data is the same time and location, so they should be the same values outside of whatever is lost from compressing to int16 and decompressing, and the output arrays are from selecting a single day (24 hours) at a single location from the dataset returned by open_dataset in the ipython interpreter.

So actually there are three files I've tested with, all of which should have the same data (assuming the issue isn't with how the files are built, which could be the case): BIG_FILE_packed.nc, BIG_FILE_unpacked.nc, and SMALL_FILE_packed.nc. The only one that displays the issue is the first one.
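
The 4x figure is consistent with simple size arithmetic, sketched below. It assumes the unpacked variable is stored as float64 and ignores file-level compression and the other variables in the file; the shape is the `original_shape` reported for `ssrd` in this thread.

```python
import numpy as np

# Shape taken from the 'original_shape' entry of the encoding shown in this thread.
shape = (743, 1801, 3600)
n_values = int(np.prod(shape, dtype=np.int64))

packed_bytes = n_values * np.dtype("int16").itemsize
unpacked_bytes = n_values * np.dtype("float64").itemsize  # assumed unpacked dtype
ratio = unpacked_bytes / packed_bytes
print(f"packed: {packed_bytes / 1e9:.1f} GB, unpacked: {unpacked_bytes / 1e9:.1f} GB, ratio: {ratio:.0f}x")
```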

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Decoding netCDF is giving incorrect values for a large file 942738904
878792388 https://github.com/pydata/xarray/issues/5597#issuecomment-878792388 https://api.github.com/repos/pydata/xarray/issues/5597 MDEyOklzc3VlQ29tbWVudDg3ODc5MjM4OA== max-sixty 5635139 2021-07-13T05:32:27Z 2021-07-13T05:32:27Z MEMBER

A small question to clarify:

When the netCDF is unpacked using the nco command line tool, the correct values are unpacked.
```python
$ xarray.open_dataset("BIG_FILE_unpacked.nc").ssrd.isel(time=slice(0, 23)).sel(latitude=44.8, longitude=287.1, method="nearest").values

array([ 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 25651.61906215, 354743.1221522 , 1091757.933255 , 2170377.23235622, 3482363.69999847, 4704882.32554591, 5689654.23783437, 6297785.304381 , 6534906.36839455, 6543665.4578304 , 6543665.4578304 ])
```

Is that the output of the .open_dataset command? Is that the same code that generates the 2s? Or is the command that generates the zero different?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Decoding netCDF is giving incorrect values for a large file 942738904

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);