
issue_comments


2 rows where issue = 311578894 sorted by updated_at descending

Row 1

  • id: 379418732
  • html_url: https://github.com/pydata/xarray/issues/2040#issuecomment-379418732
  • issue_url: https://api.github.com/repos/pydata/xarray/issues/2040
  • node_id: MDEyOklzc3VlQ29tbWVudDM3OTQxODczMg==
  • user: shoyer (1217238)
  • created_at: 2018-04-07T00:32:46Z
  • updated_at: 2018-04-07T00:32:46Z
  • author_association: MEMBER
  • issue: to_netcdf() to automatically switch to fixed-length strings for compressed variables (311578894)

body:

One potential option would be to choose the default behavior based on the string data type:

  • Fixed-width unicode arrays (np.unicode_) get written as fixed-width strings with a stored encoding.
  • Object arrays full of Python strings (np.object_) get written as variable-width strings.
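A minimal sketch of how that dispatch could look (the helper function and its return values are hypothetical, not existing xarray API):

import numpy as np

def choose_string_storage(arr):
    # Hypothetical helper: pick a storage strategy from the array's dtype.
    if arr.dtype.kind == "U":
        # Fixed-width unicode array (np.unicode_): write as fixed-width
        # bytes with a stored encoding such as UTF-8.
        return "fixed-width"
    if arr.dtype.kind == "O" and all(isinstance(x, str) for x in arr.flat):
        # Object array of Python strings (np.object_): write as
        # variable-width strings.
        return "variable-width"
    raise TypeError(f"not a string array: {arr.dtype!r}")

print(choose_string_storage(np.array(["foo", "bar"])))                # fixed-width
print(choose_string_storage(np.array(["foo", "bar"], dtype=object)))  # variable-width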

Note that fixed-width unicode in NumPy (a fixed number of unicode characters) does not have the same memory layout as fixed-width strings in HDF5 (a fixed length in bytes), but maybe it's close enough.
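The mismatch is easy to demonstrate: NumPy fixes the character count per element, while the UTF-8 byte length varies with the characters used:

import numpy as np

arr = np.array(["abc", "αβγ"])        # both elements are 3 characters
print(arr.dtype)                       # <U3: NumPy reserves 4 bytes per character (UCS-4)
print(len("abc".encode("utf-8")))      # 3 bytes
print(len("αβγ".encode("utf-8")))      # 6 bytes: same character count, twice the bytes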

The main reason why we don't do any special handling for object arrays currently in xarray is that our conventions encoding/decoding system has no way of marking variable-length string arrays. We should probably handle this by making a custom dtype, like h5py does, that marks variable-length strings using dtype metadata: http://docs.h5py.org/en/latest/special.html#variable-length-strings
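For illustration, NumPy's dtype metadata can carry such a marker; the "vlen" key below mirrors how h5py tags its special string dtypes, but this sketch is an assumption, not xarray behavior:

import numpy as np

# An object dtype tagged as "variable-length str" via dtype metadata,
# in the style of h5py's special dtypes.
vlen_str = np.dtype("O", metadata={"vlen": str})

def is_vlen_str(dtype):
    # Recover the marker from the metadata, if any was attached.
    return bool(dtype.metadata) and dtype.metadata.get("vlen") is str

print(is_vlen_str(vlen_str))       # True
print(is_vlen_str(np.dtype("O")))  # False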

reactions:

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Row 2

  • id: 379294800
  • html_url: https://github.com/pydata/xarray/issues/2040#issuecomment-379294800
  • issue_url: https://api.github.com/repos/pydata/xarray/issues/2040
  • node_id: MDEyOklzc3VlQ29tbWVudDM3OTI5NDgwMA==
  • user: shoyer (1217238)
  • created_at: 2018-04-06T15:47:24Z
  • updated_at: 2018-04-06T15:47:24Z
  • author_association: MEMBER
  • issue: to_netcdf() to automatically switch to fixed-length strings for compressed variables (311578894)

body:

The main reason for preferring variable-length strings was that netCDF4-python always properly decoded them as unicode strings, even on Python 3. Basically, it was required to properly round-trip strings to a netCDF file on Python 3.

However, this is no longer the case, now that we specify an encoding when writing fixed-length strings (https://github.com/pydata/xarray/pull/1648). So we could potentially revisit the default behavior.
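A sketch of opting into fixed-width storage with today's xarray (assuming the dtype="S1" encoding that xarray documents for writing strings as char arrays):

import numpy as np
import xarray as xr

ds = xr.Dataset({"name": ("x", np.array(["foo", "bar"]))})
# Request a fixed-width char array rather than variable-length strings;
# the stored encoding lets the values round-trip as unicode.
ds.to_netcdf("strings.nc", encoding={"name": {"dtype": "S1"}})

print(xr.open_dataset("strings.nc")["name"].values)  # ['foo' 'bar']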

I'll admit I'm also a little surprised by how large the storage overhead turns out to be for variable-length datatypes. The HDF5 docs claim it's 32 bytes per element, which would be about 10 MB or so for your dataset. And apparently it interacts poorly with compression, too.
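Back-of-the-envelope check of that estimate (the element count is inferred from the two stated figures; it is not reported in the issue):

overhead_per_element = 32                      # bytes per vlen element, per the HDF5 docs
total_overhead = 10 * 1024 ** 2                # ~10 MB, the estimate above
print(total_overhead // overhead_per_element)  # ~330,000 string elements implied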

reactions:

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}

Table schema

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);