issue_comments


4 rows where issue = 1655569401 sorted by updated_at descending

id html_url issue_url node_id user created_at updated_at ▲ author_association body reactions performed_via_github_app issue
1498647087 https://github.com/pydata/xarray/issues/7723#issuecomment-1498647087 https://api.github.com/repos/pydata/xarray/issues/7723 IC_kwDOAMm_X85ZU4ov kmuehlbauer 5821660 2023-04-06T08:00:09Z 2023-04-06T08:00:09Z MEMBER

> > I'm still convinced this could be fixed for floating point data.
>
> Generally it's worse if we obey some default fill values but not others, because it becomes quite confusing to a user.

I think this depends on which side you look at it from :-) My point here is that we do not have to submissively obey default fill values, but can just use them when decoding. This only needs to happen if no _FillValue is attached to the variable. By doing this, we ensure that these missing values are mapped to np.nan (as users expect).

Subsequently, we can just apply the xarray standard np.nan when writing out. We need to document that an exact roundtrip isn't possible in that case (it isn't currently possible in this example either).

Consider this example:

```python
import netCDF4 as nc
import numpy as np
import xarray as xr

dtype = "f4"

# Write two variables: one declares _FillValue, the other relies on the default.
with nc.Dataset("test-fillvalues-01.nc", mode="w") as ds:
    x = ds.createDimension("x", 10)
    test_fillval_fillon = ds.createVariable(
        "test_fillval_fillon", dtype, ("x",), fill_value=nc.default_fillvals[dtype]
    )
    test_fillval_fillon[:5] = np.array(
        [0.0, nc.default_fillvals[dtype], np.nan, 1.0, 8.0], dtype=dtype
    )
    test_nofillval_fillon = ds.createVariable(
        "test_nofillval_fillon", dtype, ("x",), fill_value=None
    )
    test_nofillval_fillon[:5] = np.array(
        [0.0, nc.default_fillvals[dtype], np.nan, 1.0, 8.0], dtype=dtype
    )

with nc.Dataset("test-fillvalues-01.nc") as ds:
    print("\n read with netCDF4-python")
    print("---------------------------")
    print(ds["test_fillval_fillon"])
    print(ds["test_fillval_fillon"][:])
    print(ds["test_nofillval_fillon"])
    print(ds["test_nofillval_fillon"][:])

with xr.open_dataset("test-fillvalues-01.nc").load() as ds:
    print("\n read with xarray")
    print("---------------------------")
    print(ds["test_fillval_fillon"])
    print(ds["test_fillval_fillon"][:])
    print(ds["test_nofillval_fillon"])
    print(ds["test_nofillval_fillon"][:])
```

```
read with netCDF4-python
---------------------------
<class 'netCDF4._netCDF4.Variable'>
float32 test_fillval_fillon(x)
    _FillValue: 9.96921e+36
unlimited dimensions:
current shape = (10,)
filling on
[0.0 -- nan 1.0 8.0 -- -- -- -- --]
<class 'netCDF4._netCDF4.Variable'>
float32 test_nofillval_fillon(x)
unlimited dimensions:
current shape = (10,)
filling on, default _FillValue of 9.969209968386869e+36 used
[0.0 -- nan 1.0 8.0 -- -- -- -- --]

read with xarray-python
---------------------------
<xarray.DataArray 'test_fillval_fillon' (x: 10)>
array([ 0., nan, nan, 1., 8., nan, nan, nan, nan, nan], dtype=float32)
Dimensions without coordinates: x
<xarray.DataArray 'test_nofillval_fillon' (x: 10)>
array([0.00000e+00, 9.96921e+36, nan, 1.00000e+00, 8.00000e+00, 9.96921e+36, 9.96921e+36, 9.96921e+36, 9.96921e+36, 9.96921e+36], dtype=float32)
Dimensions without coordinates: x
```

The only difference between these two variables is that the first declares its _FillValue, while the second uses the default _FillValue. So if xarray obeys (by CF standard) the first, it should also obey the second.

This might just work if, in these cases, the default fillvalue is used for decoding to np.nan, and we declare that np.nan will be the new _FillValue. Does that make sense?
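To make the proposal concrete, here is a minimal sketch of the decoding step using plain NumPy. The helper `decode_default_fill` and the `DEFAULT_FLOAT_FILL` table are hypothetical names for illustration, not xarray API; the fill value is the one reported by netCDF4-python above.

```python
import numpy as np

# netCDF default fill value for float types (as printed by netCDF4-python above).
# Hypothetical lookup table, keyed by "<kind><itemsize>".
DEFAULT_FLOAT_FILL = {"f4": 9.969209968386869e+36, "f8": 9.969209968386869e+36}


def decode_default_fill(data, attrs):
    """Hypothetical helper: map default fill values to NaN for float data.

    Only acts when the variable declares neither _FillValue nor missing_value.
    Returns the decoded array and the fill value to declare on write (or None).
    """
    if "_FillValue" in attrs or "missing_value" in attrs:
        return data, None  # declared fill values are handled elsewhere
    key = f"{data.dtype.kind}{data.dtype.itemsize}"
    if key not in DEFAULT_FLOAT_FILL:
        return data, None  # integer data is left untouched in this sketch
    fill = np.array(DEFAULT_FLOAT_FILL[key], dtype=data.dtype)
    decoded = data.copy()
    decoded[data == fill] = np.nan
    return decoded, np.nan  # NaN becomes the declared _FillValue on write


raw = np.array([0.0, 9.969209968386869e+36, np.nan, 1.0, 8.0], dtype="f4")
decoded, new_fill = decode_default_fill(raw, {})
print(decoded)
```

Restricting the sketch to floating point dtypes keeps integer data from being promoted to float, which is the round-tripping concern raised elsewhere in this thread.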

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  default fill_value not masked when read from file 1655569401
1498468954 https://github.com/pydata/xarray/issues/7723#issuecomment-1498468954 https://api.github.com/repos/pydata/xarray/issues/7723 IC_kwDOAMm_X85ZUNJa dcherian 2448579 2023-04-06T04:15:06Z 2023-04-06T04:15:06Z MEMBER

> Would be a good idea to document this behaviour.

+1

> Maybe yet another keyword switch, use_default_fillvalues?

Adding mask_default_netcdf_fill_values: bool is probably a good idea.

> I'm still convinced this could be fixed for floating point data.

Generally it's worse if we obey some default fill values but not others, because it becomes quite confusing to a user.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  default fill_value not masked when read from file 1655569401
1498464352 https://github.com/pydata/xarray/issues/7723#issuecomment-1498464352 https://api.github.com/repos/pydata/xarray/issues/7723 IC_kwDOAMm_X85ZUMBg kmuehlbauer 5821660 2023-04-06T04:09:11Z 2023-04-06T04:09:11Z MEMBER

@dcherian Great, a duplicate. :-( Sorry I must have overlooked that one.

It's somewhat counter-intuitive to get differing results when using netcdf4-python and xarray. Would be a good idea to document this behaviour.

It looks like it might at least be resolved for floating point source data:

Let's take the above simple example. We have np.nan written to the file, but the netcdf representation on disk uses a default (undeclared by attribute) _FillValue for unwritten parts.

For the netcdf4-python user the np.nan will not be masked, but the unfilled parts will be masked.

For xarray the default fillvalue won't be masked, appearing as valid data, which it is not. On subsequent writes np.nan will be introduced as the new fillvalue (by attribute), effectively changing the meaning of the default fillvalues.
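The difference between the two readers can be imitated with plain NumPy, without touching a file. This is only a sketch under the assumption that the raw on-disk values are as in the example above; the variable names are made up for illustration.

```python
import numpy as np

# netCDF default fill value for float32, as reported by netCDF4-python.
NC_FILL_FLOAT = np.float32(9.969209968386869e+36)

# Raw values as stored on disk: a written NaN plus one unwritten (default-filled) slot.
raw = np.array([0.0, NC_FILL_FLOAT, np.nan, 1.0, 8.0], dtype="f4")

# netCDF4-python view: default fill values are masked, NaN is left alone.
as_netcdf4 = np.ma.masked_equal(raw, NC_FILL_FLOAT)

# xarray view without a _FillValue attribute: the raw values pass through unchanged.
as_xarray_today = raw

# The transformation suggested here: masked entries become NaN on read.
as_proposed = as_netcdf4.filled(np.nan)

print(as_netcdf4.mask)
print(as_proposed)
```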

Wouldn't it make sense then to transform these default fill values to np.nan on read too, instead of giving them a seemingly meaningful value? Maybe yet another keyword switch, use_default_fillvalues?

There should be at least a warning on read, in these situations, that there are undefined values in the dataset which were never written and which will not be masked.
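Such a warning could look roughly like the following. This is a sketch only: `warn_on_unmasked_default_fill` is a hypothetical function, not something xarray provides, and it only covers float data.

```python
import warnings

import numpy as np

# netCDF default fill value for float32, as reported by netCDF4-python.
NC_FILL_FLOAT = np.float32(9.969209968386869e+36)


def warn_on_unmasked_default_fill(name, data, attrs):
    """Hypothetical check: warn if undeclared default fill values are present.

    Returns True if a warning was issued, False otherwise.
    """
    if "_FillValue" in attrs:
        return False  # declared fill values get masked by the normal decoding
    n = int(np.count_nonzero(data == NC_FILL_FLOAT))
    if n:
        warnings.warn(
            f"Variable {name!r} has no _FillValue attribute but contains {n} "
            "value(s) equal to the netCDF default fill value; these were "
            "probably never written and will not be masked.",
            UserWarning,
            stacklevel=2,
        )
    return n > 0
```

Calling it with the unwritten variable from the example, e.g. `warn_on_unmasked_default_fill("test_nofillval_fillon", values, {})`, would emit a `UserWarning`, while a variable with a declared `_FillValue` stays silent.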

If the dataset contains unwritten parts and a default fillvalue is used, in turn meaning the data creator did this on purpose (by not setting a _FillValue), it can mean several things:

  • The creator's data does not actually have missing values which need declaring, but it means that their data will get masked for default fillvalue entries (maybe they don't know about this, but that might be unlikely).
  • The creator doesn't care at all, with the same conclusion as above.
  • The creator purposefully uses the default fillvalue as the missing value, as a means of saving disk space. But this could also be done by just defining that as the _FillValue attribute at creation time, if I'm not mistaken.

I'm still convinced this could be fixed for floating point data.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  default fill_value not masked when read from file 1655569401
1498403174 https://github.com/pydata/xarray/issues/7723#issuecomment-1498403174 https://api.github.com/repos/pydata/xarray/issues/7723 IC_kwDOAMm_X85ZT9Fm dcherian 2448579 2023-04-06T02:24:34Z 2023-04-06T02:24:34Z MEMBER

See https://github.com/pydata/xarray/pull/5680#issuecomment-895508489

To follow up, from a practical perspective, there are two problems with assuming that there are always "truly missing values" (case 2):

  • It makes it impossible to represent the full range of values in a data type, e.g., 255 for uint8 now means "missing".
  • Due to unfortunately limited options for representing missing data in NumPy, Xarray represents truly missing values in its data model with "NaN". This is more or less OK for floating point data, but means that integer data gets converted into floats. For example, uint8 would now get automatically converted into float32.

Both of these issues are problematic for faithful "round tripping" of Xarray data into netCDF and back. For this reason, Xarray needs an unambiguous way to know if a netCDF variable could contain semantically missing values. So far, we've used the presence of missing_value and _FillValue attributes for that.
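Both problems show up in a small NumPy sketch. It assumes the netCDF default fill value for unsigned bytes (255, as mentioned above) is always treated as missing; the variable names are illustrative only.

```python
import numpy as np

# uint8 data in which 255 is a genuine measurement, not a gap.
raw = np.array([0, 17, 255], dtype="uint8")

# Assumption for this sketch: the default fill value 255 always means "missing".
default_fill = 255

# NaN cannot be stored in an integer array, so the data must be promoted to float first.
decoded = raw.astype("float32")
decoded[raw == default_fill] = np.nan

print(decoded.dtype)  # no longer uint8
print(decoded)        # the genuine 255 is now indistinguishable from a gap
```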

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  default fill_value not masked when read from file 1655569401


CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);