issues: 267542085
id | node_id | number | title | user | state | locked | assignee | milestone | comments | created_at | updated_at | closed_at | author_association | active_lock_reason | draft | pull_request | body | reactions | performed_via_github_app | state_reason | repo | type |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
267542085 | MDU6SXNzdWUyNjc1NDIwODU= | 1647 | Representing missing values in string arrays on disk | 1217238 | closed | 0 | 3 | 2017-10-23T05:01:10Z | 2024-02-06T13:03:40Z | 2024-02-06T13:03:40Z | MEMBER | This came up as part of my clean-up of serializing unicode strings in https://github.com/pydata/xarray/pull/1648. There are two ways to represent strings in netCDF files.
Currently, by default (if no For character arrays, we could use the normal In [11]: ds Out[11]: <xarray.Dataset> Dimensions: (x: 2) Dimensions without coordinates: x Data variables: foo (x) object b'bar' nan In [12]: ds.to_netcdf('foobar.nc') In [13]: xr.open_dataset('foobar.nc').load() Out[13]: <xarray.Dataset> Dimensions: (x: 2) Dimensions without coordinates: x Data variables: foo (x) object b'bar' nan For variable-length strings, it currently isn't possible to set a fill-value, so there's no good way to indicate missing values, though this may change in the future depending on the resolution of the netCDF-python issue. It would obviously be nice to always automatically round-trip missing values, both for strings and bytes. I see two possible ways to do this:
1. Require setting an explicit The default option is to adopt neither of these, and keep the current behavior where missing values are written as empty strings and not decoded at all. Any opinions? I am leaning towards option (2). |
{ "url": "https://api.github.com/repos/pydata/xarray/issues/1647/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
completed | 13221727 | issue |
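The fill-value idea discussed in the issue body can be illustrated without xarray or netCDF at all. The following is a minimal plain-Python sketch, assuming missing entries are `None` or NaN and that a sentinel byte string stands in for them on disk. The names `FILL_VALUE`, `is_missing`, `encode`, and `decode` are hypothetical and are not part of xarray's or netCDF4-python's API.

```python
import math

# Hypothetical sentinel: any byte string that cannot occur in the real data.
FILL_VALUE = b"\x00MISSING"


def is_missing(value):
    """Treat None and float NaN as missing, as in the object array shown in the issue."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def encode(values, fill_value=FILL_VALUE):
    """Replace missing entries with the sentinel before writing."""
    encoded = []
    for v in values:
        if is_missing(v):
            encoded.append(fill_value)
        elif v == fill_value:
            # A real value equal to the sentinel would be lost on decode.
            raise ValueError("data collides with the fill value")
        else:
            encoded.append(v)
    return encoded


def decode(values, fill_value=FILL_VALUE):
    """Map the sentinel back to NaN after reading."""
    return [float("nan") if v == fill_value else v for v in values]


original = [b"bar", float("nan")]
stored = encode(original)    # what would be written to disk
restored = decode(stored)    # what a reader would get back
```

The sketch makes the trade-off visible: the round trip only works if writer and reader agree on the sentinel, and the sentinel must never appear as genuine data, which is why `encode` rejects a collision instead of silently writing it.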