issue_comments

12 rows where issue = 1681353195 sorted by updated_at descending

id html_url issue_url node_id user created_at updated_at ▲ author_association body reactions performed_via_github_app issue
1525705799 https://github.com/pydata/xarray/issues/7782#issuecomment-1525705799 https://api.github.com/repos/pydata/xarray/issues/7782 IC_kwDOAMm_X85a8GxH kmuehlbauer 5821660 2023-04-27T13:33:50Z 2023-04-27T13:33:50Z MEMBER

As we can see from the above output, in netCDF4-python scaling is adapting the dtype to unsigned, not masking. This is also reflected in the docs unidata.github.io/netcdf4-python/#Variable.

Do we know why this is so?

TL;DR: a NETCDF3-era detail for signalling unsigned integers, still used in recent formats.

  • more discussion of the details over at https://github.com/Unidata/netcdf4-python/issues/656
  • from the NetCDF Users Guide on packed data:

A conventional way to indicate whether a byte, short, or int variable is meant to be interpreted as unsigned, even for the netCDF-3 classic model that has no external unsigned integer type, is by providing the special variable attribute _Unsigned with value "true". However, most existing data for which packed values are intended to be interpreted as unsigned are stored without this attribute, so readers must be aware of packing assumptions in this case. In the enhanced netCDF-4 data model, packed integers may be declared to be of the appropriate unsigned type.
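
As a small editorial illustration (not from the original comment): with _Unsigned set to "true", the packed signed bytes are simply reinterpreted as unsigned, which is why the raw int8 value -41 appears as 215 in the netCDF4 output further down this page.

```python
import numpy as np

# Packed bytes as stored on disk; netCDF-3 has no unsigned integer type.
packed = np.array([-41, -41], dtype="int8")

# _Unsigned = "true" means: reinterpret the very same bytes as unsigned.
print(packed.view("uint8"))  # [215 215]
```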

My suggestion would be to nudge the user by issuing warnings and to link to new, to-be-added documentation on the topic. This would be in line with the CF-coding conformance checks which were discussed yesterday in the dev meeting.

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  xr.open_dataset() reading ubyte variables as float32 from DAP server 1681353195
1523618985 https://github.com/pydata/xarray/issues/7782#issuecomment-1523618985 https://api.github.com/repos/pydata/xarray/issues/7782 IC_kwDOAMm_X85a0JSp dcherian 2448579 2023-04-26T15:29:14Z 2023-04-26T15:29:14Z MEMBER

Thanks for the in-depth investigation!

As we can see from the above output, in netCDF4-python scaling is adapting the dtype to unsigned, not masking. This is also reflected in the docs unidata.github.io/netcdf4-python/#Variable.

Do we know why this is so?

If Xarray is trying to align with netCDF4-python, it should separate mask and scale as netCDF4-python does. It already does this by using different coders, but it doesn't separate them API-wise.

:+1:

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  xr.open_dataset() reading ubyte variables as float32 from DAP server 1681353195
1522997083 https://github.com/pydata/xarray/issues/7782#issuecomment-1522997083 https://api.github.com/repos/pydata/xarray/issues/7782 IC_kwDOAMm_X85axxdb kmuehlbauer 5821660 2023-04-26T08:28:39Z 2023-04-26T08:28:39Z MEMBER

This is how netCDF4-python handles this data with different parameters:

```python
import netCDF4 as nc

with nc.Dataset("http://dap.ceda.ac.uk/thredds/dodsC/neodc/esacci/snow/data/scfv/MODIS/v2.0/2010/01/20100101-ESACCI-L3C_SNOW-SCFV-MODIS_TERRA-fv2.0.nc") as ds_dap:
    v = ds_dap["scfv"]
    print(v)

    print("\n- default")
    print(f"variable dtype: {v.dtype}")
    print(f"first 2 elements: {v[0, 0, :2].dtype} {v[0, 0, :2]}")
    print(f"last 2 elements: {v[0, 0, -2:].dtype} {v[0, 0, -2:]}")

    print("\n- maskandscale False")
    ds_dap.set_auto_maskandscale(False)
    v = ds_dap["scfv"]
    print(f"variable dtype: {v.dtype}")
    print(f"first 2 elements: {v[0, 0, :2].dtype} {v[0, 0, :2]}")
    print(f"last 2 elements: {v[0, 0, -2:].dtype} {v[0, 0, -2:]}")

    print("\n- mask/scale False")
    ds_dap.set_auto_mask(False)
    ds_dap.set_auto_scale(False)
    v = ds_dap["scfv"]
    print(f"variable dtype: {v.dtype}")
    print(f"first 2 elements: {v[0, 0, :2].dtype} {v[0, 0, :2]}")
    print(f"last 2 elements: {v[0, 0, -2:].dtype} {v[0, 0, -2:]}")

    print("\n- mask True / scale False")
    ds_dap.set_auto_mask(True)
    ds_dap.set_auto_scale(False)
    v = ds_dap["scfv"]
    print(f"variable dtype: {v.dtype}")
    print(f"first 2 elements: {v[0, 0, :2].dtype} {v[0, 0, :2]}")
    print(f"last 2 elements: {v[0, 0, -2:].dtype} {v[0, 0, -2:]}")

    print("\n- mask False / scale True")
    ds_dap.set_auto_mask(False)
    ds_dap.set_auto_scale(True)
    v = ds_dap["scfv"]
    print(f"variable dtype: {v.dtype}")
    print(f"first 2 elements: {v[0, 0, :2].dtype} {v[0, 0, :2]}")
    print(f"last 2 elements: {v[0, 0, -2:].dtype} {v[0, 0, -2:]}")

    print("\n- mask True / scale True")
    ds_dap.set_auto_mask(True)
    ds_dap.set_auto_scale(True)
    v = ds_dap["scfv"]
    print(f"variable dtype: {v.dtype}")
    print(f"first 2 elements: {v[0, 0, :2].dtype} {v[0, 0, :2]}")
    print(f"last 2 elements: {v[0, 0, -2:].dtype} {v[0, 0, -2:]}")

    print("\n- maskandscale True")
    ds_dap.set_auto_mask(False)
    ds_dap.set_auto_scale(False)
    ds_dap.set_auto_maskandscale(True)
    v = ds_dap["scfv"]
    print(f"variable dtype: {v.dtype}")
    print(f"first 2 elements: {v[0, 0, :2].dtype} {v[0, 0, :2]}")
    print(f"last 2 elements: {v[0, 0, -2:].dtype} {v[0, 0, -2:]}")
```

Output:

```
<class 'netCDF4._netCDF4.Variable'>
int8 scfv(time, lat, lon)
    _Unsigned: true
    _FillValue: -1
    standard_name: snow_area_fraction_viewable_from_above
    long_name: Snow Cover Fraction Viewable
    units: percent
    valid_range: [ 0 -2]
    actual_range: [ 0 100]
    flag_values: [-51 -50 -46 -41 -4 -3 -2]
    flag_meanings: Cloud Polar_Night_or_Night Water Permanent_Snow_and_Ice Classification_failed Input_Data_Error No_Satellite_Acquisition
    missing_value: -1
    ancillary_variables: scfv_unc
    grid_mapping: spatial_ref
    _ChunkSizes: [ 1 1385 2770]
unlimited dimensions: time
current shape = (1, 18000, 36000)
filling off

- default
variable dtype: int8
first 2 elements: uint8 [215 215]
last 2 elements: uint8 [215 215]

- maskandscale False
variable dtype: int8
first 2 elements: int8 [-41 -41]
last 2 elements: int8 [-41 -41]

- mask/scale False
variable dtype: int8
first 2 elements: int8 [-41 -41]
last 2 elements: int8 [-41 -41]

- mask True / scale False
variable dtype: int8
first 2 elements: int8 [-- --]
last 2 elements: int8 [-- --]

- mask False / scale True
variable dtype: int8
first 2 elements: uint8 [215 215]
last 2 elements: uint8 [215 215]

- mask True / scale True
variable dtype: int8
first 2 elements: uint8 [215 215]
last 2 elements: uint8 [215 215]

- maskandscale True
variable dtype: int8
first 2 elements: uint8 [215 215]
last 2 elements: uint8 [215 215]
```

First, the dataset was created with filling off (read more about that in the netCDF file format specs: https://docs.unidata.ucar.edu/netcdf-c/current/file_format_specifications.html). This should not be a problem for the analysis, but it tells us that all data points should have been written at some point.

As we can see from the above output, in netCDF4-python scaling is adapting the dtype to unsigned, not masking. This is also reflected in the docs https://unidata.github.io/netcdf4-python/#Variable.

If Xarray is trying to align with netCDF4-python, it should separate mask and scale as netCDF4-python does. It already does this by using different coders, but it doesn't separate them API-wise.

We would need a similar approach here for Xarray with additional kwargs scale and mask in addition to mask_and_scale. We cannot just move the UnsignedCoder out of mask_and_scale and apply it unconditionally.
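
As a purely editorial sketch of that proposal (mask= and scale= are not existing xarray keywords; they only mirror the suggestion above and netCDF4-python's set_auto_mask/set_auto_scale):

```python
import xarray as xr

url = "http://dap.ceda.ac.uk/thredds/dodsC/neodc/esacci/snow/data/scfv/MODIS/v2.0/2010/01/20100101-ESACCI-L3C_SNOW-SCFV-MODIS_TERRA-fv2.0.nc"

# Today, masking and scaling (including _Unsigned handling) can only be toggled together:
ds = xr.open_dataset(url, mask_and_scale=False)

# The proposal would add separate switches, roughly like (hypothetical, not implemented):
#   xr.open_dataset(url, mask=False, scale=True)
```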

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  xr.open_dataset() reading ubyte variables as float32 from DAP server 1681353195
1520804745 https://github.com/pydata/xarray/issues/7782#issuecomment-1520804745 https://api.github.com/repos/pydata/xarray/issues/7782 IC_kwDOAMm_X85apaOJ kmuehlbauer 5821660 2023-04-24T20:47:43Z 2023-04-24T20:47:43Z MEMBER

@dcherian The main issue here is that we have two different CF mechanisms being applied, _Unsigned and _FillValue/missing_value.

For netCDF4-python the values would just be masked and the dtype would be preserved. For xarray the data will be cast to float32 because of the _FillValue/missing_value.
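
As an editorial illustration of this difference in plain numpy (not xarray's actual coder code), using the variable's _FillValue of -1:

```python
import numpy as np

raw = np.array([-41, -1, 100], dtype="int8")   # -1 is the _FillValue

# netCDF4-python style: mask the fill values, keep the integer dtype.
masked = np.ma.masked_equal(raw, -1)
print(masked.dtype, masked)        # int8 [-41 -- 100]

# xarray's mask_and_scale: replace fills with NaN, which forces a float dtype.
as_float = raw.astype("float32")
as_float[raw == -1] = np.nan
print(as_float.dtype, as_float)    # float32 [-41.  nan 100.]
```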

I agree, moving the Unsigned Coder out of mask_and_scale should help in that particular case.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  xr.open_dataset() reading ubyte variables as float32 from DAP server 1681353195
1520550980 https://github.com/pydata/xarray/issues/7782#issuecomment-1520550980 https://api.github.com/repos/pydata/xarray/issues/7782 IC_kwDOAMm_X85aocRE dcherian 2448579 2023-04-24T17:18:37Z 2023-04-24T19:55:11Z MEMBER

We would want to check the different attributes and apply the coders only as needed.

The current approach seems OK, no? It seems like the bug is that UnsignedMaskCoder should be outside of mask_and_scale.

We would want to check the different attributes and apply the coders only as needed.

EDIT: I mean that each coder checks whether it is applicable, so we already do that

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  xr.open_dataset() reading ubyte variables as float32 from DAP server 1681353195
1520514792 https://github.com/pydata/xarray/issues/7782#issuecomment-1520514792 https://api.github.com/repos/pydata/xarray/issues/7782 IC_kwDOAMm_X85aoTbo kmuehlbauer 5821660 2023-04-24T16:52:30Z 2023-04-24T16:52:30Z MEMBER

@dcherian Yes, that would work.

We would want to check the different attributes and apply the coders only as needed. That might need some refactoring. I've been wrapping my head around this for several weeks now.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  xr.open_dataset() reading ubyte variables as float32 from DAP server 1681353195
1520434316 https://github.com/pydata/xarray/issues/7782#issuecomment-1520434316 https://api.github.com/repos/pydata/xarray/issues/7782 IC_kwDOAMm_X85an_yM dcherian 2448579 2023-04-24T15:55:48Z 2023-04-24T15:55:48Z MEMBER

mask_and_scale=False will also deactivate the Unsigned decoding.

Do these two have to be linked? I wonder if we can handle the filling later: https://github.com/pydata/xarray/blob/2657787f76fffe4395288702403a68212e69234b/xarray/coding/variables.py#L397-L407

It seems like this code is setting fill values to the right type for CFMaskCoder, which is the next step.

https://github.com/pydata/xarray/blob/2657787f76fffe4395288702403a68212e69234b/xarray/conventions.py#L266-L272

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  xr.open_dataset() reading ubyte variables as float32 from DAP server 1681353195
1520409398 https://github.com/pydata/xarray/issues/7782#issuecomment-1520409398 https://api.github.com/repos/pydata/xarray/issues/7782 IC_kwDOAMm_X85an5s2 Articoking 90768774 2023-04-24T15:39:50Z 2023-04-24T15:39:50Z CONTRIBUTOR

Your suggestion worked perfectly, thank you very much! Avoiding astype() reduced processing time massively.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  xr.open_dataset() reading ubyte variables as float32 from DAP server 1681353195
1520363622 https://github.com/pydata/xarray/issues/7782#issuecomment-1520363622 https://api.github.com/repos/pydata/xarray/issues/7782 IC_kwDOAMm_X85anuhm kmuehlbauer 5821660 2023-04-24T15:10:24Z 2023-04-24T15:11:00Z MEMBER

Then you are somewhat deadlocked. mask_and_scale=False will also deactivate the Unsigned decoding.

You might be able to achieve what you want by using decode_cf=False (which completely deactivates CF decoding). Then you would have to remove the _FillValue attribute as well as the missing_value attribute from the variables. Finally, you can run xr.decode_cf(ds) to correctly decode your data.

I'll add a code example tomorrow if no one beats me to it.
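
Since the promised example is not in this comment, here is a minimal editorial sketch of the workaround described above (untested; url is the DAP endpoint from the issue):

```python
import xarray as xr

url = "http://dap.ceda.ac.uk/thredds/dodsC/neodc/esacci/snow/data/scfv/MODIS/v2.0/2010/01/20100101-ESACCI-L3C_SNOW-SCFV-MODIS_TERRA-fv2.0.nc"

# 1. Open without any CF decoding.
ds = xr.open_dataset(url, decode_cf=False)

# 2. Drop the attributes that would otherwise trigger the cast to float32.
for var in ds.variables.values():
    var.attrs.pop("_FillValue", None)
    var.attrs.pop("missing_value", None)

# 3. Decode the remaining CF attributes (including _Unsigned).
ds = xr.decode_cf(ds)
```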

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  xr.open_dataset() reading ubyte variables as float32 from DAP server 1681353195
1520341470 https://github.com/pydata/xarray/issues/7782#issuecomment-1520341470 https://api.github.com/repos/pydata/xarray/issues/7782 IC_kwDOAMm_X85anpHe Articoking 90768774 2023-04-24T14:58:36Z 2023-04-24T14:58:36Z CONTRIBUTOR

Thank you for your quick reply. Adding the mask_and_scale=False kwarg solves the issue of conversion to float, but the resulting array is of dtype int8 instead of uint8. Is there any way of making open_dataset() directly interpret the values as unsigned?

It would save me quite a lot of processing time since using DataArray.astype(np.uint8) takes a while to run.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  xr.open_dataset() reading ubyte variables as float32 from DAP server 1681353195
1520277594 https://github.com/pydata/xarray/issues/7782#issuecomment-1520277594 https://api.github.com/repos/pydata/xarray/issues/7782 IC_kwDOAMm_X85anZha kmuehlbauer 5821660 2023-04-24T14:31:00Z 2023-04-24T14:31:00Z MEMBER

@Articoking

As both variables have a _FillValue attached, xarray converts these values to NaN, effectively casting to float32 in this case.

You might inspect the .encoding property of the respective variables to get information about the source dtype.

You can deactivate the automatic conversion by adding kwarg mask_and_scale=False.
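
For illustration (editorial addition, not part of the original comment; url is the DAP endpoint from the issue):

```python
import xarray as xr

url = "http://dap.ceda.ac.uk/thredds/dodsC/neodc/esacci/snow/data/scfv/MODIS/v2.0/2010/01/20100101-ESACCI-L3C_SNOW-SCFV-MODIS_TERRA-fv2.0.nc"

# Keep the raw packed dtype by turning off mask/scale decoding.
ds = xr.open_dataset(url, mask_and_scale=False)
print(ds["scfv"].dtype)  # int8: raw packed values, no cast to float32
```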

There is more information in the docs https://docs.xarray.dev/en/stable/user-guide/io.html

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  xr.open_dataset() reading ubyte variables as float32 from DAP server 1681353195
1520222850 https://github.com/pydata/xarray/issues/7782#issuecomment-1520222850 https://api.github.com/repos/pydata/xarray/issues/7782 IC_kwDOAMm_X85anMKC welcome[bot] 30606887 2023-04-24T14:04:15Z 2023-04-24T14:04:15Z NONE

Thanks for opening your first issue here at xarray! Be sure to follow the issue template! If you have an idea for a solution, we would really welcome a Pull Request with proposed changes. See the Contributing Guide for more. It may take us a while to respond here, but we really value your contribution. Contributors like you help make xarray better. Thank you!

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  xr.open_dataset() reading ubyte variables as float32 from DAP server 1681353195

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);