home / github / issues

Menu
  • GraphQL API
  • Search all tables

issues: 267542085

This data as json

id node_id number title user state locked assignee milestone comments created_at updated_at closed_at author_association active_lock_reason draft pull_request body reactions performed_via_github_app state_reason repo type
267542085 MDU6SXNzdWUyNjc1NDIwODU= 1647 Representing missing values in string arrays on disk 1217238 closed 0     3 2017-10-23T05:01:10Z 2024-02-06T13:03:40Z 2024-02-06T13:03:40Z MEMBER      

This came up as part of my clean-up of serializing unicode strings in https://github.com/pydata/xarray/pull/1648.

There are two ways to represent strings in netCDF files.

  • As character arrays (NC_CHAR), supported by both netCDF3 and netCDF4
  • As variable length unicode strings (NC_STRING), only supported by netCDF4/HDF5.

Currently, by default (if no _FillValue is set) we replace missing values (NaN) with an empty string when writing data to disk.

For character arrays, we could use the normal _FillValue mechanism to set a fill value and decode when data is read back from disk. In fact, this already currently works for dtype=bytes (though it isn't documented): ``` In [10]: ds = xr.Dataset({'foo': ('x', np.array([b'bar', np.nan], dtype=object), {}, {'_FillValue': b''})})

In [11]: ds Out[11]: <xarray.Dataset> Dimensions: (x: 2) Dimensions without coordinates: x Data variables: foo (x) object b'bar' nan

In [12]: ds.to_netcdf('foobar.nc')

In [13]: xr.open_dataset('foobar.nc').load() Out[13]: <xarray.Dataset> Dimensions: (x: 2) Dimensions without coordinates: x Data variables: foo (x) object b'bar' nan ```

For variable length strings, it currently isn't possible to set a fill-value. So there's no good way to indicate missing values, though this may change if the future depending on the resolution of the netCDF-python issue.

It would obviously be nice to always automatically round-trip missing values, both for strings and bytes. I see two possible ways to do this: 1. Require setting an explicit _FillValue when a string contains missing values, by raising an error if this isn't done. We need an explicit choice because there aren't any extra unused characters left over, at least for character arrays. (NetCDF explicitly allows arbitrary bytes to be stored in NC_CHAR, even though this maps to an HDF5 fixed-width string with ASCII encoding.) For variable length strings, we could potentially set a non-character unicode symbol like U+FFFF, but again that isn't supported yet. 2. Treat empty strings as equivalent to a missing value (NaN). This has the advantage of not requiring an explicit choice of _FillValue, so we don't need to wait for any netCDF4 issues to be resolved. However, this does mean that empty strings would not round-trip. Still, given the relative prevalence of missing values vs empty strings in xarray/pandas, it's probably the lesser evil to not preserve empty string.

The default option is to adopt neither of these, and keep the current behavior where missing values are written as empty strings and not decoded at all.

Any opinions? I am leaning towards option (2).

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1647/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed 13221727 issue

Links from other tables

  • 2 rows from issues_id in issues_labels
  • 2 rows from issue in issue_comments
Powered by Datasette · Queries took 1.546ms · About: xarray-datasette