
issue_comments


3 rows where author_association = "MEMBER", issue = 230566456 and user = 1217238 sorted by updated_at descending

id html_url issue_url node_id user created_at updated_at ▲ author_association body reactions performed_via_github_app issue
303606683 https://github.com/pydata/xarray/pull/1421#issuecomment-303606683 https://api.github.com/repos/pydata/xarray/issues/1421 MDEyOklzc3VlQ29tbWVudDMwMzYwNjY4Mw== shoyer 1217238 2017-05-24T03:18:16Z 2017-05-24T03:18:16Z MEMBER

How about something like the following:

In `encode_cf_variable`, create a new variable with pickle-encoded data (if appropriate). This looks something like:

```python
def encode_cf_variable(var, allow_pickle=False):
    ...
    if var.dtype == object:
        if allow_pickle:
            var = maybe_encode_pickle(var)
        else:
            raise TypeError
    return var


def maybe_encode_pickle(var):
    if var.dtype == object:
        attrs = var.attrs.copy()
        safe_setitem(attrs, '_FileFormat', 'python-pickle')
        protocol = var.encoding.pop('pickle_protocol', 2)
        data = utils.encode_pickle(var.values, protocol=protocol)
        var = Variable(var.dims, data, attrs, var.encoding)
    return var
```

This reuses the `encoding` parameter for setting the pickle protocol version, which is already what we use for similar variable-specific encoding details.
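The `utils.encode_pickle` helper doesn't exist yet; a minimal sketch of what it might do, assuming each element of a 1D object array gets pickled individually into a uint8 byte array (a form that maps naturally onto a variable-length integer type):

```python
import pickle

import numpy as np


def encode_pickle(values, protocol=2):
    # Hypothetical stand-in for utils.encode_pickle: pickle each element
    # of a 1D object array into a uint8 array. The resulting object array
    # of variable-length uint8 buffers is what the backend would store
    # as a netCDF/HDF5 vlen variable.
    out = np.empty(len(values), dtype=object)
    for i, obj in enumerate(values):
        out[i] = np.frombuffer(pickle.dumps(obj, protocol=protocol),
                               dtype=np.uint8)
    return out
```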

In the netCDF backends, add a check for variables with dtype == object and a _FileFormat attribute. If this is the case, call a create_vlen_int8_dtype method to create an appropriate dtype using backend-specific methods (the base-class implementation should raise an error), and proceed with writing the data in the usual way.

For decoding, reverse the process. Convert custom vlen dtypes to dtype=object in the appropriate backend-specific array wrapper type, but don't decode the data. If allow_pickle=True and var.attrs['_FileFormat'] == 'python-pickle', then decode_cf_variable should do the unpickling, moving _FileFormat from attrs to encoding.
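The decode side could be a mirror image of the encode helper; a sketch (the helper name is hypothetical), operating on an object array whose elements are the uint8 buffers written at encode time:

```python
import pickle

import numpy as np


def decode_pickle(values):
    # Hypothetical counterpart to the encode step: unpickle each element
    # of an object array whose elements are uint8 byte arrays.
    # Note pickle.loads reads the protocol version from the stream
    # itself, so nothing beyond the buffers is needed here.
    out = np.empty(len(values), dtype=object)
    for i, buf in enumerate(values):
        out[i] = pickle.loads(np.asarray(buf, dtype=np.uint8).tobytes())
    return out
```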

For bonus points, generalize handling of vlen types with encoding. Something like encoding={'dtype': {'vlen': np.int8}} could indicate that a special vlen with np.int8 data should be created to encode this variable's data. (This is inspired by h5py's API for special types.)
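A sketch of how a backend might interpret such an encoding dict; the function name and the returned tuple convention are illustrative, not an existing API:

```python
import numpy as np


def interpret_dtype_encoding(encoding):
    # Illustrative only: translate encoding={'dtype': {'vlen': np.int8}}
    # into a (kind, base dtype) pair for a backend to act on, mirroring
    # how h5py wraps a base dtype in vlen metadata for special types.
    dtype_spec = encoding.get('dtype')
    if isinstance(dtype_spec, dict) and 'vlen' in dtype_spec:
        return 'vlen', np.dtype(dtype_spec['vlen'])
    return 'plain', np.dtype(dtype_spec)
```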

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Adding arbitrary object serialization 230566456
303583082 https://github.com/pydata/xarray/pull/1421#issuecomment-303583082 https://api.github.com/repos/pydata/xarray/issues/1421 MDEyOklzc3VlQ29tbWVudDMwMzU4MzA4Mg== shoyer 1217238 2017-05-24T00:52:20Z 2017-05-24T00:52:20Z MEMBER

> Pickle works out the protocol automatically (no protocol keyword for load or loads), so we wouldn't really need to save the protocol as an attribute, although it would be a way to work out which variables to unpickle once saved, if we went this route.

I think we do want some sort of marker attribute, but I agree that it doesn't need to include the pickle version.
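That behavior is easy to check: `pickle.loads` has no protocol argument because the version is recorded in the pickle stream itself:

```python
import pickle

# dumps() takes a protocol, but loads() does not: the protocol number
# is embedded in the stream and detected automatically on load.
blob_v2 = pickle.dumps([1, 2, 3], protocol=2)
blob_hi = pickle.dumps([1, 2, 3], protocol=pickle.HIGHEST_PROTOCOL)
assert pickle.loads(blob_v2) == pickle.loads(blob_hi) == [1, 2, 3]
```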

Maybe the attribute _FileFormat = 'python-pickle' would make sense? This would have the advantage of being obvious to anyone inspecting the netCDF file with standard tools (not xarray).

It seems a shame not to use np.void though, so perhaps it makes sense to add the opaque types to netCDF4-python and forget the np.uint8 trick.

I think netCDF actually maps np.int8 -> NC_BYTE, so that's at least some justification for this choice: http://www.unidata.ucar.edu/software/netcdf/docs/data_type.html

Certainly handling opaque types in netCDF4-python would be nice, though I don't think it should be a blocker for this. I suspect the reason this isn't done is that NumPy maps bytes -> np.string_ even on Python 3. Thus np.void is used far less often than it should be. Also, the repr for np.void has been pretty poor, though that's being worked on currently in https://github.com/numpy/numpy/pull/8981.
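The bytes mapping is easy to demonstrate: NumPy infers a fixed-width bytes dtype (kind 'S', historically np.string_) from Python bytes, never np.void:

```python
import numpy as np

# Even on Python 3, NumPy maps bytes objects to the fixed-width
# bytes dtype ('S'), padded to the longest element, not to np.void.
arr = np.array([b'abc', b'de'])
# arr.dtype is dtype('S3')
```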

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Adding arbitrary object serialization 230566456
303541907 https://github.com/pydata/xarray/pull/1421#issuecomment-303541907 https://api.github.com/repos/pydata/xarray/issues/1421 MDEyOklzc3VlQ29tbWVudDMwMzU0MTkwNw== shoyer 1217238 2017-05-23T21:46:39Z 2017-05-23T21:48:18Z MEMBER

Thanks for giving this a shot!

> I added allow_object kwarg (rather than allow_pickle, no reason to firmly attach pickle to the api, could use something else for other backends).

I'm having a hard time imagining any other format for serializing arbitrary Python objects. pickle is pretty standard, though we might switch the to_netcdf argument to pickle_protocol to allow indicating the pickle version (which would default to None, meaning don't pickle).

One additional reason for favoring allow_pickle is that it's the argument name used by np.save and np.load.

> NetCDF4DataStore handles this independently from the cf_encoder/decoder. The dtype support made it hard to decouple, plus I think object serialization is a backend dependent issue.

Yes, this is a little tricky. The current design is not great here. Ideally, though, we would still keep all of the encoding/decoding logic separate from the datastores. I need to think about this a little more.


One other concern is how to represent this data on disk in netCDF/HDF5 variables. Ideally, we would have a format that could work -- at least in principle -- with h5netcdf/h5py as well as netCDF4-python.

Annoyingly, these libraries currently have incompatible dtype support:

  • netCDF4-python supports variable-length types with custom names. It does not support HDF5's opaque type, though the netCDF-C library does, so adding it to netCDF4-python would be relatively straightforward.
  • h5py supports variable-length types, but not their name field. It maps np.void to HDF5's opaque type, which would be a pretty sensible storage type if netCDF4-python supported it.

So if we want something that works with both, we'll need to add some additional metadata in the form of an attribute to indicate how to do the decoding. Maybe something like _PickleProtocol, which would store the version of the pickle protocol used to write the data?


I have some inline comments I'll add below.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Adding arbitrary object serialization 230566456

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);
Powered by Datasette · Queries took 3759.045ms · About: xarray-datasette