issue_comments: 286176727

html_url issue_url id node_id user created_at updated_at author_association body reactions performed_via_github_app issue
https://github.com/pydata/xarray/issues/1077#issuecomment-286176727 https://api.github.com/repos/pydata/xarray/issues/1077 286176727 MDEyOklzc3VlQ29tbWVudDI4NjE3NjcyNw== 1217238 2017-03-13T17:14:37Z 2017-03-13T17:14:37Z MEMBER

Let's recap the options, which I'll illustrate for the second level of my MultiIndex from above (https://github.com/pydata/xarray/issues/1077#issuecomment-258323743):

  1. "categories and codes": e.g., ['a', 'b'] and [0, 1, 0, 1, 0, 1]. Highest speed, low memory requirements, faithful round-trip to xarray/pandas, less obvious representation.
  2. "categories and values": e.g., ['a', 'b'] and ['a', 'b', 'a', 'b', 'a', 'b']. Moderate speed (the codes need to be recreated), high memory requirements, faithful round-trip to xarray/pandas, more obvious representation (categories can be safely ignored).
  3. "raw values": e.g., ['a', 'b', 'a', 'b', 'a', 'b']. Moderate speed (only slightly slower than 2), high memory requirements (slightly better than 2), does not support completely faithful roundtrip, most obvious representation.
  4. "category codes and values": e.g., [0, 1] and ['a', 'b', 'a', 'b', 'a', 'b']. Moderate speed, high memory requirements, also does not support faithful roundtrip (it's possible for some levels to not be represented in the MultiIndex values), more obvious representation (like 2).
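
To make the trade-offs above concrete, here is a minimal pure-Python sketch of the "categories and codes" encoding and of why dropping the categories loses information. The helper names `encode` and `decode` are hypothetical (not part of xarray or pandas), and sorting the categories is an assumption made for illustration; pandas does not require sorted levels.

```python
def encode(values):
    """Split raw level values into (categories, codes).

    Assumption for illustration: categories are the sorted unique values.
    """
    categories = sorted(set(values))
    position = {cat: i for i, cat in enumerate(categories)}
    codes = [position[v] for v in values]
    return categories, codes


def decode(categories, codes):
    """Rebuild the raw values from categories and integer codes."""
    return [categories[c] for c in codes]


# Option 1: store categories and codes; the raw values are recoverable.
values = ['a', 'b', 'a', 'b', 'a', 'b']
categories, codes = encode(values)
roundtrip = decode(categories, codes)

# Edge case: a level that exists in the categories but is never used.
stored_categories = ['a', 'b', 'c']
stored_codes = [0, 1, 0, 1, 0, 1]
stored_values = decode(stored_categories, stored_codes)

# Rebuilding categories from the values alone silently drops 'c',
# which is why formats that omit the categories cannot roundtrip faithfully.
recovered_categories, _ = encode(stored_values)
```

In this sketch `recovered_categories` differs from `stored_categories`, which is the kind of rare mismatch that makes the options without stored categories lossy.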

3 uses only slightly less memory than 2 and can be easily achieved with reset_index(), so I don't see a reason to support it for writing (read support would be fine).

4 looks like a faithful roundtrip, but actually isn't in some rare edge cases. That seems like a recipe for disaster, so it should be ruled out.

This leaves 1 and 2. Both are reasonably performant and roundtrip xarray objects with complete fidelity, so I would be happy with either of them. In principle we could even support both, with an argument to switch between the modes (one would need to be the default).

My inclination is to start with only supporting 1, because it has a potentially large advantage from a speed/memory perspective, and it's easy to achieve the "raw values" representation with .reset_index() (and convert back with .set_index()). If we do this, the documentation for writing netCDF files should definitely include a suggestion to consider using .reset_index() when distributing files not intended strictly for use by xarray users.
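
As a rough sketch of the .reset_index() / .set_index() roundtrip mentioned above: the dimension name `x`, the level names `level_1` and `level_2`, and the data are all made up for illustration and are not the MultiIndex from the earlier comment. Exact behavior may vary somewhat between xarray versions.

```python
import numpy as np
import xarray as xr

# Build a small array with a MultiIndex on "x" by stacking two dimensions.
# All names and values here are hypothetical.
grid = xr.DataArray(
    np.arange(6).reshape(3, 2),
    coords={"level_1": [1, 2, 3], "level_2": ["a", "b"]},
    dims=["level_1", "level_2"],
)
stacked = grid.stack(x=["level_1", "level_2"])

# "Raw values" representation: each level becomes a plain coordinate along x.
flat = stacked.reset_index("x")
level_2_values = list(flat["level_2"].values)

# Convert back to a MultiIndex after reading such a file.
restored = flat.set_index(x=["level_1", "level_2"])
```

Here `flat` carries the levels as ordinary per-element coordinates, which is the form other netCDF tools can read without knowing about the MultiIndex.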

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  187069161