issue_comments: 502481584

html_url: https://github.com/pydata/xarray/pull/2706#issuecomment-502481584
issue_url: https://api.github.com/repos/pydata/xarray/issues/2706
user: 9658781 (CONTRIBUTOR)
created_at: 2019-06-16T20:05:04Z · updated_at: 2019-06-16T20:23:54Z

Hey there everyone, sorry for not working on this for so long on my side. I just picked it up again and realised that, because of the way the encoding works, the datatypes and maximum string lengths in the first dataset have to be representative of all subsequent ones. Otherwise, the following cuts off every character after the second:

```python
import xarray as xr

ds0 = xr.Dataset({'temperature': (['time'], ['ab', 'cd', 'ef'])}, coords={'time': [0, 1, 2]})
ds1 = xr.Dataset({'temperature': (['time'], ['abc', 'def', 'ghijk'])}, coords={'time': [0, 1, 2]})
ds0.to_zarr('temp')
ds1.to_zarr('temp', mode='a', append_dim='time')
```
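For context, here is a minimal numpy-only sketch of the underlying effect (not the actual zarr code path): the first write fixes the array's string dtype to the maximum length seen so far, and anything cast to that width later is silently truncated:

```python
import numpy as np

# the first dataset's strings are at most 2 characters, so the
# stored array ends up with a fixed-width 2-character dtype
fixed = np.dtype('<U2')

# longer strings cast to that dtype are silently truncated
appended = np.array(['abc', 'def', 'ghijk'], dtype=fixed)
print(appended.tolist())  # → ['ab', 'de', 'gh']
```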

It is solvable by explicitly setting the dtype before writing:

```python
import numpy as np
import xarray as xr

ds0 = xr.Dataset({'temperature': (['time'], ['ab', 'cd', 'ef'])}, coords={'time': [0, 1, 2]})
# 'S5' is wide enough for the longest string across both datasets ('ghijk')
ds0['temperature'] = ds0.temperature.astype(np.dtype('S5'))
ds1 = xr.Dataset({'temperature': (['time'], ['abc', 'def', 'ghijk'])}, coords={'time': [0, 1, 2]})
ds0.to_zarr('temp')
ds1.to_zarr('temp', mode='a', append_dim='time')
```
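If all datasets are available up front, the required width can be computed instead of hard-coding 'S5' (a hypothetical helper for illustration, not part of xarray):

```python
import numpy as np

def max_string_dtype(*value_lists):
    """Hypothetical helper: the smallest fixed-width bytes dtype
    that can hold every string across all the given lists."""
    longest = max(len(s) for values in value_lists for s in values)
    return np.dtype(f'S{longest}')

dtype = max_string_dtype(['ab', 'cd', 'ef'], ['abc', 'def', 'ghijk'])
print(dtype.itemsize)  # → 5
```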

However, it gets worse with non-ASCII characters: they are encoded in zarr.py l:218, but the check in conventions.py l:86 then fails for the next incoming chunk. So I think we actually have to resolve the TODO in zarr.py l:215 before this can be merged. Otherwise, the following leads to multiple issues:

```python
import xarray as xr

ds0 = xr.Dataset({'temperature': (['time'], ['ab', 'cd', 'ef'])}, coords={'time': [0, 1, 2]})
ds1 = xr.Dataset({'temperature': (['time'], ['üý', 'ãä', 'õö'])}, coords={'time': [0, 1, 2]})
ds0.to_zarr('temp')
ds1.to_zarr('temp', mode='a', append_dim='time')
xr.open_zarr('temp').temperature.values
```

The only way I found to work around this is to explicitly encode the data to UTF-8 beforehand:

```python
import numpy as np
import xarray as xr
from xarray.coding.strings import encode_string_array
from xarray.coding.variables import safe_setitem, unpack_for_encoding
from xarray.core.variable import Variable

def encode_utf8(var, string_max_length):
    dims, data, attrs, encoding = unpack_for_encoding(var)
    safe_setitem(attrs, '_Encoding', 'utf-8')
    data = encode_string_array(data, 'utf-8')
    # reserve two bytes per character for the UTF-8 encoded form
    data = data.astype(np.dtype(f"S{string_max_length * 2}"))
    return Variable(dims, data, attrs, encoding)

ds0 = xr.Dataset({'temperature': (['time'], ['ab', 'cd', 'ef'])}, coords={'time': [0, 1, 2]})
ds0['temperature'] = encode_utf8(ds0.temperature, 2)
ds1 = xr.Dataset({'temperature': (['time'], ['üý', 'ãä', 'õö'])}, coords={'time': [0, 1, 2]})
ds1['temperature'] = encode_utf8(ds1.temperature, 2)
ds0.to_zarr('temp')
ds1.to_zarr('temp', mode='a', append_dim='time')
xr.open_zarr('temp').temperature.values
```
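One caveat on the `string_max_length * 2` factor above: it assumes every character needs at most two bytes in UTF-8, but UTF-8 can use up to four bytes per character. A safer bound (again a hypothetical helper, just a sketch) is to measure the actual encoded byte length:

```python
def utf8_byte_length(values):
    """Hypothetical helper: longest UTF-8 encoded byte length across
    the strings, so a fixed-width 'S…' dtype is guaranteed to fit."""
    return max(len(s.encode('utf-8')) for s in values)

utf8_byte_length(['üý', 'ãä', 'õö'])  # each character encodes to 2 bytes → 4
```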

Even though this is doable when the string lengths are known in advance, we should definitely mention it in the documentation, or better, fix the encoding itself. What do you think?

Cheers,

Jendrik
