home / github / issue_comments

Menu
  • Search all tables
  • GraphQL API

issue_comments: 347363503

This data as json

html_url issue_url id node_id user created_at updated_at author_association body reactions performed_via_github_app issue
https://github.com/pydata/xarray/pull/1528#issuecomment-347363503 https://api.github.com/repos/pydata/xarray/issues/1528 347363503 MDEyOklzc3VlQ29tbWVudDM0NzM2MzUwMw== 703554 2017-11-27T23:27:41Z 2017-11-27T23:27:41Z CONTRIBUTOR

For variable length strings (or any array with an object dtype) zarr needs a filter that can encode and pack the strings into a single buffer, except in the special case where the data are being stored in-memory (as in your first example). The filter has to be specified manually, some examples here: http://zarr.readthedocs.io/en/master/tutorial.html#string-arrays. There are two codecs currently in numcodecs that can do this, one is Pickle, the other is MsgPack. I haven't done any benchmarking of data size or encoding speed, but MsgPack may be preferable because it's more portable.

There was some discussion a while back about creating a codec that handles variable-length strings by encoding via UTF8 then concatenating encoded bytes and lengths or offsets, IIRC similar to Arrow, and maybe even creating a special "text" dtype that inserts this filter automatically so you don't have to add it manually. But there hasn't been a strong motivation so far.

On Mon, Nov 27, 2017 at 10:32 PM, Stephan Hoyer notifications@github.com wrote:

Overall, I find the conventions module to be a bit unwieldy. There is a lot of stuff in there, not all of which is related to CF conventions. It would be useful to separate the actual conventions from the encoding / decoding needed for different backends.

Agreed!

I wonder why zarr doesn't have a UTF-8 variable length string type ( alimanfoo/zarr#206 https://github.com/alimanfoo/zarr/issues/206) -- that would feel like the obvious first choice for encoding this data.

That said, xarary should be able to use first-length bytes just fine, doing UTF-8 encoding/decoding on the fly.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/pydata/xarray/pull/1528#issuecomment-347351224, or mute the thread https://github.com/notifications/unsubscribe-auth/AAq8QkLTQUuspLhiXYR2_WMW8Hg9LFziks5s6ziTgaJpZM4PDrlp .

-- Alistair Miles Head of Epidemiological Informatics Centre for Genomics and Global Health http://cggh.org Big Data Institute Building Old Road Campus Roosevelt Drive Oxford OX3 7LF United Kingdom Phone: +44 (0)1865 743596 Email: alimanfoo@googlemail.com Web: http://a http://purl.org/net/alimanlimanfoo.github.io/ Twitter: https://twitter.com/alimanfoo

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  253136694
Powered by Datasette · Queries took 77.885ms · About: xarray-datasette