issue_comments: 502481584

html_url: https://github.com/pydata/xarray/pull/2706#issuecomment-502481584
issue_url: https://api.github.com/repos/pydata/xarray/issues/2706
user: 9658781 (CONTRIBUTOR)
created_at: 2019-06-16T20:05:04Z · updated_at: 2019-06-16T20:23:54Z

Hey there everyone, sorry for not working on this for so long on my side. I just picked it up again and realised that, because of the way the encoding works, the datatypes and maximum string lengths in the first dataset have to be representative of all subsequent ones. Otherwise, the following cuts off every character after the second:

```python
import xarray as xr

ds0 = xr.Dataset({'temperature': (['time'], ['ab', 'cd', 'ef'])}, coords={'time': [0, 1, 2]})
ds1 = xr.Dataset({'temperature': (['time'], ['abc', 'def', 'ghijk'])}, coords={'time': [0, 1, 2]})
ds0.to_zarr('temp')
ds1.to_zarr('temp', mode='a', append_dim='time')
```
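For context, here is a minimal numpy-only sketch of the underlying effect (not the actual zarr code path): the first write fixes the array's string dtype to the maximum length seen so far, and anything cast to that width later is silently truncated:

```python
import numpy as np

# the first dataset's strings are at most 2 characters, so the
# stored array ends up with a fixed-width 2-character dtype
fixed = np.dtype('<U2')

# longer strings cast to that dtype are silently truncated
appended = np.array(['abc', 'def', 'ghijk'], dtype=fixed)
print(appended.tolist())  # → ['ab', 'de', 'gh']
```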

It is solvable by explicitly setting the dtype before writing:

```python
import numpy as np
import xarray as xr

ds0 = xr.Dataset({'temperature': (['time'], ['ab', 'cd', 'ef'])}, coords={'time': [0, 1, 2]})
# 'S5' is wide enough for the longest string across both datasets ('ghijk')
ds0['temperature'] = ds0.temperature.astype(np.dtype('S5'))
ds1 = xr.Dataset({'temperature': (['time'], ['abc', 'def', 'ghijk'])}, coords={'time': [0, 1, 2]})
ds0.to_zarr('temp')
ds1.to_zarr('temp', mode='a', append_dim='time')
```
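If all datasets are available up front, the required width can be computed instead of hard-coding 'S5' (a hypothetical helper for illustration, not part of xarray):

```python
import numpy as np

def max_string_dtype(*value_lists):
    """Hypothetical helper: the smallest fixed-width bytes dtype
    that can hold every string across all the given lists."""
    longest = max(len(s) for values in value_lists for s in values)
    return np.dtype(f'S{longest}')

dtype = max_string_dtype(['ab', 'cd', 'ef'], ['abc', 'def', 'ghijk'])
print(dtype.itemsize)  # → 5
```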

However, it gets worse with non-ASCII characters: they are encoded in zarr.py l:218, but the check in conventions.py l:86 then fails for the next incoming chunk. So I think we actually have to resolve the TODO in zarr.py l:215 before this can be merged. Otherwise, the following leads to multiple issues:

```python
import xarray as xr

ds0 = xr.Dataset({'temperature': (['time'], ['ab', 'cd', 'ef'])}, coords={'time': [0, 1, 2]})
ds1 = xr.Dataset({'temperature': (['time'], ['üý', 'ãä', 'õö'])}, coords={'time': [0, 1, 2]})
ds0.to_zarr('temp')
ds1.to_zarr('temp', mode='a', append_dim='time')
xr.open_zarr('temp').temperature.values
```

The only way I found to work around this is to explicitly encode the data to UTF-8 beforehand:

```python
import numpy as np
import xarray as xr
from xarray.coding.strings import encode_string_array
from xarray.coding.variables import safe_setitem, unpack_for_encoding
from xarray.core.variable import Variable

def encode_utf8(var, string_max_length):
    dims, data, attrs, encoding = unpack_for_encoding(var)
    safe_setitem(attrs, '_Encoding', 'utf-8')
    data = encode_string_array(data, 'utf-8')
    # reserve two bytes per character for the UTF-8 encoded form
    data = data.astype(np.dtype(f"S{string_max_length * 2}"))
    return Variable(dims, data, attrs, encoding)

ds0 = xr.Dataset({'temperature': (['time'], ['ab', 'cd', 'ef'])}, coords={'time': [0, 1, 2]})
ds0['temperature'] = encode_utf8(ds0.temperature, 2)
ds1 = xr.Dataset({'temperature': (['time'], ['üý', 'ãä', 'õö'])}, coords={'time': [0, 1, 2]})
ds1['temperature'] = encode_utf8(ds1.temperature, 2)
ds0.to_zarr('temp')
ds1.to_zarr('temp', mode='a', append_dim='time')
xr.open_zarr('temp').temperature.values
```
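One caveat on the `string_max_length * 2` factor above: it assumes every character needs at most two bytes in UTF-8, but UTF-8 can use up to four bytes per character. A safer bound (again a hypothetical helper, just a sketch) is to measure the actual encoded byte length:

```python
def utf8_byte_length(values):
    """Hypothetical helper: longest UTF-8 encoded byte length across
    the strings, so a fixed-width 'S…' dtype is guaranteed to fit."""
    return max(len(s.encode('utf-8')) for s in values)

utf8_byte_length(['üý', 'ãä', 'õö'])  # each character encodes to 2 bytes → 4
```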

Even though this is doable when the string lengths are known in advance, we should definitely mention it in the documentation, or better, fix the encoding itself. What do you think?

Cheers,

Jendrik
