issues: 311578894

id: 311578894
node_id: MDU6SXNzdWUzMTE1Nzg4OTQ=
number: 2040
title: to_netcdf() to automatically switch to fixed-length strings for compressed variables
user: 6213168
state: open
locked: 0
assignee:
milestone:
comments: 2
created_at: 2018-04-05T11:50:16Z
updated_at: 2019-01-13T01:42:03Z
closed_at:
author_association: MEMBER

When you have fixed-length numpy arrays of unicode characters (<U...) in a dataset, and you invoke to_netcdf() without any particular encoding, they are automatically stored as variable-length strings, unless you explicitly specify {'dtype': 'S1'}.

Is this in order to save disk space in case strings vary wildly in size? I can see the point in that case. However, this approach is disastrous if variables are compressed, since any compression algorithm would shrink the zero-padding at the end of fixed-length strings to a negligible size, so variable-length storage saves nothing.
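To illustrate the point about padding, here is a minimal sketch using only the Python standard library. It is not the netCDF/HDF5 write path: it simply runs zlib over synthetic, randomly generated ASCII strings (an assumption standing in for real data) laid out as NUL-padded fixed-width records, and compares that with the same strings concatenated without padding.

```python
import random
import zlib

random.seed(0)  # deterministic synthetic data

WIDTH = 100  # fixed record width, comparable to a <U100 / S100 array
ALPHABET = "abcdefghijklmnopqrstuvwxyz "

# 12,000 synthetic ASCII strings of 10 to 100 characters each
strings = [
    "".join(random.choice(ALPHABET) for _ in range(random.randint(10, WIDTH)))
    for _ in range(12_000)
]

# Fixed-length layout: each string is right-padded with NUL bytes to WIDTH
fixed = b"".join(s.encode("ascii").ljust(WIDTH, b"\0") for s in strings)
# The same characters with no padding at all, as a reference point
packed = b"".join(s.encode("ascii") for s in strings)

fixed_z = len(zlib.compress(fixed))
packed_z = len(zlib.compress(packed))

padding_raw = len(fixed) - len(packed)  # bytes of padding before compression
padding_cost = fixed_z - packed_z       # extra compressed bytes due to padding

print(f"fixed-length, raw:        {len(fixed):>9,d} bytes")
print(f"fixed-length, compressed: {fixed_z:>9,d} bytes")
print(f"unpadded, compressed:     {packed_z:>9,d} bytes")
print(f"padding: {padding_raw:,d} raw bytes -> {padding_cost:,d} compressed bytes")
```

Random text compresses worse than real English, so the absolute sizes here say nothing about the 45MB/5MB figures below; the sketch only shows how cheap runs of padding are once a compressor sees them.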

My test data: a dataset with ~50 variables, of which half are strings of 10 to 100 English characters and the other half are floats, all on a single dimension with 12k points.

Test 1:

    ds.to_netcdf('uncompressed.nc')

Result: 45MB

Test 2:

    encoding = {k: {'gzip': True, 'shuffle': True} for k in ds.variables}
    ds.to_netcdf('bad-compression.nc', encoding=encoding)

Result: 42MB

Test 3:

    encoding = {}
    for k, v in ds.variables.items():
        encoding[k] = {'gzip': True, 'shuffle': True}
        if v.dtype.kind == 'U':
            encoding[k]['dtype'] = 'S1'
    ds.to_netcdf('good-compression.nc', encoding=encoding)

Result: 5MB

Proposal

For string variables with no explicitly specified dtype, to_netcdf() should choose the dtype dynamically: S1 if compression is enabled, str if it is disabled.
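Until something like this exists in to_netcdf(), the proposed rule can be approximated on the caller's side. The sketch below is a hypothetical helper (build_encoding, its parameters, and the stand-in variables are all assumptions, not xarray API). It only needs objects exposing .dtype.kind, so it should also accept ds.variables. The compression flags are passed through untouched; the example uses zlib, the key that xarray's netCDF4 backend documents, whereas the tests above write gzip, so substitute whatever your backend accepts.

```python
from types import SimpleNamespace


def build_encoding(variables, compression=None, overrides=None):
    """Build a per-variable encoding dict following the proposed rule.

    variables:   mapping of name -> object exposing .dtype.kind
    compression: dict of compression flags, or None when compression is off
    overrides:   optional mapping of name -> explicit user encoding
    """
    overrides = overrides or {}
    encoding = {}
    for name, var in variables.items():
        enc = dict(compression or {})
        enc.update(overrides.get(name, {}))
        # Only choose a dtype when the caller has not set one explicitly
        if var.dtype.kind == "U" and "dtype" not in enc:
            enc["dtype"] = "S1" if compression else str
        encoding[name] = enc
    return encoding


# Stand-ins for dataset variables: one unicode string, one float
variables = {
    "name": SimpleNamespace(dtype=SimpleNamespace(kind="U")),
    "value": SimpleNamespace(dtype=SimpleNamespace(kind="f")),
}

compressed = build_encoding(variables, compression={"zlib": True, "shuffle": True})
uncompressed = build_encoding(variables)
explicit = build_encoding(
    variables,
    compression={"zlib": True, "shuffle": True},
    overrides={"name": {"dtype": str}},
)

print(compressed)
print(uncompressed)
print(explicit)
```

With a real dataset the result would presumably be passed as ds.to_netcdf(path, encoding=build_encoding(ds.variables, compression=...)).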

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2040/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
repo: 13221727
type: issue
