issues: 311578894
| id | node_id | number | title | user | state | locked | comments | created_at | updated_at | author_association | repo | type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 311578894 | MDU6SXNzdWUzMTE1Nzg4OTQ= | 2040 | to_netcdf() to automatically switch to fixed-length strings for compressed variables | 6213168 | open | 0 | 2 | 2018-04-05T11:50:16Z | 2019-01-13T01:42:03Z | MEMBER | 13221727 | issue |

body:

When you have fixed-length numpy arrays of unicode characters (<U...) in a dataset and you invoke to_netcdf() without any particular encoding, they are automatically stored as variable-length strings, unless you explicitly specify a fixed-length dtype (e.g. S1) in the variable's encoding.

Is this in order to save disk space in case string sizes vary wildly? I can see the point in that case. However, this approach is disastrous if variables are compressed, as any compression algorithm would reduce the zero-padding at the end of fixed-length strings to a negligible size anyway.

My test data: a dataset with ~50 variables, of which half are strings of 10~100 English characters and the other half are floats, all on a single dimension with 12k points.

Test 1:

Test 2:

Test 3:

Proposal: for string variables, if no dtype is explicitly defined, to_netcdf() should dynamically assign it to S1 if compression is enabled and str if disabled.
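For context, a minimal sketch of the behaviour described in the body and of the explicit-encoding workaround, using the public to_netcdf() encoding API (the dataset, variable names, and file names are hypothetical stand-ins for the test data described above):

```python
import numpy as np
import xarray as xr

# A toy version of the test data: fixed-length unicode strings
# plus floats, all on a single dimension.
ds = xr.Dataset(
    {
        "label": ("points", np.array(["foo", "barbaz", "qux"], dtype="<U100")),
        "value": ("points", np.array([1.0, 2.0, 3.0])),
    }
)

# Default behaviour: the <U100 variable is written as variable-length
# strings, which zlib cannot shrink effectively.
ds.to_netcdf("vlen.nc", encoding={"value": {"zlib": True, "complevel": 4}})

# Workaround today: explicitly request fixed-length char arrays ("S1"),
# so the zero-padding at the end of each string compresses away.
ds.to_netcdf(
    "fixed.nc",
    encoding={
        "label": {"dtype": "S1", "zlib": True, "complevel": 4},
        "value": {"zlib": True, "complevel": 4},
    },
)
```

A sketch of the default-selection rule the proposal implies follows the reactions data below.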
reactions:

{
"url": "https://api.github.com/repos/pydata/xarray/issues/2040/reactions",
"total_count": 0,
"+1": 0,
"-1": 0,
"laugh": 0,
"hooray": 0,
"confused": 0,
"heart": 0,
"rocket": 0,
"eyes": 0
}
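The proposal amounts to a small default-selection rule for string variables. A rough sketch, as a hypothetical helper rather than actual xarray internals:

```python
def default_string_dtype(var_encoding: dict) -> str:
    # Proposed default when the user has not set an explicit dtype:
    # fixed-length chars when compression is on (the zero-padding
    # compresses away), variable-length strings otherwise.
    if var_encoding.get("zlib"):
        return "S1"
    return "str"

assert default_string_dtype({"zlib": True, "complevel": 4}) == "S1"
assert default_string_dtype({}) == "str"
```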