
issue_comments: 347323043


- html_url: https://github.com/pydata/xarray/pull/1528#issuecomment-347323043
- issue_url: https://api.github.com/repos/pydata/xarray/issues/1528
- id: 347323043
- node_id: MDEyOklzc3VlQ29tbWVudDM0NzMyMzA0Mw==
- user: 1197350
- created_at: 2017-11-27T20:48:35Z
- updated_at: 2017-11-27T20:53:28Z
- author_association: MEMBER

After a few more tweaks, this is now quite close to passing all of the `CFEncodedDataTest` tests.

The remaining issues all relate to the encoding of strings. Basically, zarr's handling of strings (http://zarr.readthedocs.io/en/latest/tutorial.html?highlight=strings#string-arrays) differs considerably from netCDF's. Because `ZarrStore` is a subclass of `WritableCFDataStore`, all of the dataset variables get passed through `encode_cf_variable` before writing. This breaks things that already work quite naturally.

Consider the following direct creation of a variable-length string array in zarr:

```python
import numpy as np
import zarr

values = np.array([b'ab', b'cdef', np.nan], dtype=object)
zgs = zarr.open_group()
zgs.create('x', shape=values.shape, dtype=values.dtype)
zgs.x[:] = values
zgs.x
```
```
Array(/x, (3,), object, chunks=(3,), order=C)
  nbytes: 24; nbytes_stored: 350; ratio: 0.1; initialized: 1/1
  compressor: Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)
  store: DictStore
```

It seems we can encode variable-length strings as objects just fine. (`np.testing.assert_array_equal(values, zgs.x[:])` fails only because of the `nan` value; the array itself round-trips just fine.)
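For completeness, the NaN-tolerant comparison can be done element-wise; this is a minimal sketch, and the helper `object_arrays_equal` is hypothetical, not part of numpy or xarray:

```python
import numpy as np

values = np.array([b'ab', b'cdef', np.nan], dtype=object)
roundtripped = values.copy()  # stands in for zgs.x[:]

def object_arrays_equal(a, b):
    """Compare object arrays element-wise, treating NaN fill values as equal."""
    if a.shape != b.shape:
        return False
    for x, y in zip(a.ravel(), b.ravel()):
        x_nan = isinstance(x, float) and np.isnan(x)
        y_nan = isinstance(y, float) and np.isnan(y)
        if x_nan and y_nan:
            continue  # NaN == NaN for our purposes
        if x != y:
            return False
    return True

print(object_arrays_equal(values, roundtripped))  # True
```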

However, after passing through xarray's CF encoding, this no longer works:

```python
import xarray as xr

encoding = {'_FillValue': b'X', 'dtype': 'S1'}
original = xr.Dataset({'x': ('t', values, {}, encoding)})
zarr_dict_store = {}
original.to_zarr(store=zarr_dict_store)
zs = zarr.open_group(store=zarr_dict_store)
print(zs.x)
print(zs.x[:])
```
```
Array(/x, (3, 4), |S1, chunks=(3, 4), order=C)
  nbytes: 12; nbytes_stored: 428; ratio: 0.0; initialized: 1/1
  compressor: Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)
  store: dict
array([[b'a', b'b', b'', b''],
       [b'c', b'd', b'e', b'f'],
       [b'X', b'', b'', b'']],
      dtype='|S1')
```
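For reference, the `S1` char-array layout can be undone by joining characters along the last axis, which is roughly what xarray's char-to-string decoding does on read. A minimal numpy-only sketch, hardcoding the output array from above:

```python
import numpy as np

# Char array as produced by the CF 'S1' encoding (fill value b'X').
chars = np.array([[b'a', b'b', b'', b''],
                  [b'c', b'd', b'e', b'f'],
                  [b'X', b'', b'', b'']], dtype='|S1')

# Joining along the last axis recovers the byte strings; note the
# _FillValue row decodes to b'X' rather than back to NaN.
decoded = np.array([b''.join(row) for row in chars], dtype=object)
print(decoded)  # [b'ab' b'cdef' b'X']
```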

Here is everything that happens in `encode_cf_variable`:

```python
var = maybe_encode_datetime(var, name=name)
var = maybe_encode_timedelta(var, name=name)
var, needs_copy = maybe_encode_offset_and_scale(var, needs_copy, name=name)
var, needs_copy = maybe_encode_fill_value(var, needs_copy, name=name)
var = maybe_encode_nonstring_dtype(var, name=name)
var = maybe_default_fill_value(var)
var = maybe_encode_bools(var)
var = ensure_dtype_not_object(var, name=name)
var = maybe_encode_string_dtype(var, name=name)
```

The challenge now is to figure out which parts of this we need to bypass for zarr and how to implement that bypassing.
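One possible shape for that bypassing is to make the encoding pipeline a per-store hook that subclasses can override. This is only a sketch: the `encoders` / `encode_variable` names are assumptions, not xarray's actual API, and the `maybe_encode_*` functions here are trivial stand-ins for the real ones:

```python
# Stand-ins for xarray's real encoders (assumptions, for illustration only).
def maybe_encode_datetime(var):
    return var  # real version converts datetime64 to numeric + units

def maybe_encode_string_dtype(var):
    return var + ['string-encoded']  # real version packs strings to 'S1' chars

class WritableCFDataStore:
    # Full CF pipeline, expressed as an ordered list of encoder callables.
    encoders = [maybe_encode_datetime, maybe_encode_string_dtype]

    def encode_variable(self, var):
        for encode in self.encoders:
            var = encode(var)
        return var

class ZarrStore(WritableCFDataStore):
    # zarr stores variable-length strings natively, so drop that step.
    encoders = [maybe_encode_datetime]

print(ZarrStore().encode_variable(['x']))  # ['x'] — no string encoding applied
```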

Overall, I find the conventions module to be a bit unwieldy. There is a lot of stuff in there, not all of which is related to CF conventions. It would be useful to separate the actual conventions from the encoding / decoding needed for different backends.

At this point, I would appreciate some input from an encoding expert before I go refactoring stuff.

edit: The actual tests that fail are `CFEncodedDataTest.test_roundtrip_bytes_with_fill_value` and `CFEncodedDataTest.test_roundtrip_string_encoded_characters`. One option to move forward would be to simply skip those tests for zarr. I am eager to get this out into the wild to see how it plays with real datasets.
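Skipping could be as simple as decorating the zarr variants of those two tests; a sketch assuming unittest-style test classes (the class name `ZarrCFEncodedTests` is hypothetical):

```python
import unittest

class ZarrCFEncodedTests(unittest.TestCase):
    # Skip only in the zarr-backed subclass; the generic CF tests still run.
    @unittest.skip("zarr stores variable-length strings natively")
    def test_roundtrip_bytes_with_fill_value(self):
        ...

    @unittest.skip("zarr stores variable-length strings natively")
    def test_roundtrip_string_encoded_characters(self):
        ...
```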
