issue_comments: 347363503

html_url	issue_url	id	node_id	user	created_at	updated_at	author_association	body	reactions	performed_via_github_app	issue
https://github.com/pydata/xarray/pull/1528#issuecomment-347363503	https://api.github.com/repos/pydata/xarray/issues/1528	347363503	MDEyOklzc3VlQ29tbWVudDM0NzM2MzUwMw==	703554	2017-11-27T23:27:41Z	2017-11-27T23:27:41Z	CONTRIBUTOR	For variable length strings (or any array with an object dtype) zarr needs a filter that can encode and pack the strings into a single buffer, except in the special case where the data are being stored in-memory (as in your first example). The filter has to be specified manually, some examples here: http://zarr.readthedocs.io/en/master/tutorial.html#string-arrays. There are two codecs currently in numcodecs that can do this, one is Pickle, the other is MsgPack. I haven't done any benchmarking of data size or encoding speed, but MsgPack may be preferable because it's more portable. There was some discussion a while back about creating a codec that handles variable-length strings by encoding via UTF8 then concatenating encoded bytes and lengths or offsets, IIRC similar to Arrow, and maybe even creating a special "text" dtype that inserts this filter automatically so you don't have to add it manually. But there hasn't been a strong motivation so far. On Mon, Nov 27, 2017 at 10:32 PM, Stephan Hoyer notifications@github.com wrote: Overall, I find the conventions module to be a bit unwieldy. There is a lot of stuff in there, not all of which is related to CF conventions. It would be useful to separate the actual conventions from the encoding / decoding needed for different backends. Agreed! I wonder why zarr doesn't have a UTF-8 variable length string type (alimanfoo/zarr#206, https://github.com/alimanfoo/zarr/issues/206) -- that would feel like the obvious first choice for encoding this data. That said, xarray should be able to use fixed-length bytes just fine, doing UTF-8 encoding/decoding on the fly.
-- Alistair Miles Head of Epidemiological Informatics Centre for Genomics and Global Health http://cggh.org Big Data Institute Building Old Road Campus Roosevelt Drive Oxford OX3 7LF United Kingdom Phone: +44 (0)1865 743596 Email: alimanfoo@googlemail.com Web: http://alimanfoo.github.io/ (http://purl.org/net/aliman) Twitter: https://twitter.com/alimanfoo	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }		253136694
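
The comment describes a possible codec that would encode each string as UTF-8 and then concatenate the encoded bytes together with lengths or offsets. A minimal sketch of that idea follows, using only the Python standard library. The function names `encode_strings` and `decode_strings` and the buffer layout (a little-endian count, then offsets, then data) are assumptions made for illustration; this is not the zarr or numcodecs API, and not necessarily the layout any real codec uses.

```python
import struct


def encode_strings(values):
    """Pack a list of str into one bytes buffer.

    Hypothetical layout: uint32 count, (count + 1) uint32 offsets,
    then the concatenated UTF-8 bytes of every string.
    """
    encoded = [v.encode("utf-8") for v in values]
    # offsets[i] is where string i starts in the data block;
    # the final entry marks the end of the last string.
    offsets = [0]
    for item in encoded:
        offsets.append(offsets[-1] + len(item))
    header = struct.pack("<I", len(encoded))
    offset_block = struct.pack("<%dI" % len(offsets), *offsets)
    return header + offset_block + b"".join(encoded)


def decode_strings(buf):
    """Invert encode_strings: recover the list of str from the buffer."""
    (count,) = struct.unpack_from("<I", buf, 0)
    offsets = struct.unpack_from("<%dI" % (count + 1), buf, 4)
    data_start = 4 + 4 * (count + 1)
    return [
        buf[data_start + offsets[i]:data_start + offsets[i + 1]].decode("utf-8")
        for i in range(count)
    ]
```

Storing offsets rather than per-string lengths lets a reader slice out any single string without scanning the ones before it, which is the property the comment alludes to when comparing the idea to Arrow.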