
issue_comments: 258323743


html_url issue_url id node_id user created_at updated_at author_association body reactions performed_via_github_app issue
https://github.com/pydata/xarray/issues/1077#issuecomment-258323743 https://api.github.com/repos/pydata/xarray/issues/1077 258323743 MDEyOklzc3VlQ29tbWVudDI1ODMyMzc0Mw== 1217238 2016-11-04T01:38:41Z 2016-11-04T01:38:56Z MEMBER

This is a good question -- I don't think we've figured it out yet. Maybe you have ideas?

The main question (to me) is whether we should store raw values for each level in a MultiIndex (closer to what you see), or category encoded values (closer to the MultiIndex implementation).

To be more concrete, here is what these look like for an example MultiIndex:

```
In [1]: index = pd.MultiIndex.from_product([[1, 2, 3], ['a', 'b']], names=['numbers', 'letters'])

In [2]: index
Out[2]:
MultiIndex(levels=[[1, 2, 3], ['a', 'b']],
           labels=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]],
           names=['numbers', 'letters'])

In [3]: index.values
Out[3]: array([(1, 'a'), (1, 'b'), (2, 'a'), (2, 'b'), (3, 'a'), (3, 'b')], dtype=object)

# categorical encoded values
In [4]: index.levels, index.labels
Out[4]:
(FrozenList([[1, 2, 3], ['a', 'b']]),
 FrozenList([[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]]))

# raw values
In [5]: index.get_level_values(0), index.get_level_values(1)
Out[5]:
(Int64Index([1, 1, 2, 2, 3, 3], dtype='int64', name='numbers'),
 Index(['a', 'b', 'a', 'b', 'a', 'b'], dtype='object', name='letters'))
```

Advantages of storing raw values:

- It's easier to work with MultiIndex levels without xarray, or with older versions of xarray (no need to combine levels and labels).
- Avoiding the overhead of saving integer codes can save memory if levels have dtypes with small fixed sizes (e.g., float, int or datetime) or mostly distinct values.

Advantages of storing category encoded values:

- It's cheaper to construct the MultiIndex, because we have already factorized each level.
- It can result in significant memory savings if levels are mostly duplicated (e.g., a tensor product) or have large itemsize (e.g., long strings).
- We can restore the exact same MultiIndex, instead of refactorizing it. This manifests itself in a few edge cases that could make for a frustrating user experience (changed dimension order after stacking: https://github.com/pydata/xarray/issues/980).
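The refactorization point above can be reproduced with pandas alone. The following is a minimal sketch, assuming a pandas version where the integer codes are exposed as `codes` (0.24 or later; the session above shows them under the older name `labels`). The unsorted example index is made up for illustration. Rebuilding from raw values with `from_arrays` keeps the same entries but re-sorts the levels:

```python
import pandas as pd

# A MultiIndex whose first level is deliberately stored unsorted: ['b', 'a'].
index = pd.MultiIndex(
    levels=[['b', 'a'], [1, 2]],
    codes=[[0, 0, 1, 1], [0, 1, 0, 1]],
    names=['letters', 'numbers'],
)

# Keep only the raw values of each level, then rebuild by refactorizing them.
raw = [index.get_level_values(i) for i in range(index.nlevels)]
rebuilt = pd.MultiIndex.from_arrays(raw, names=index.names)

# The entries are the same tuples in the same order...
same_entries = list(rebuilt) == list(index)

# ...but refactorizing sorted the first level, so the stored level order
# (and with it the integer codes) no longer matches the original.
original_level = list(index.levels[0])
rebuilt_level = list(rebuilt.levels[0])
```

Here `original_level` is `['b', 'a']` while `rebuilt_level` is `['a', 'b']`, which is the kind of silent change that can surface later as a different ordering after unstacking.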

Perhaps the best approach would be a hybrid: store raw values, as well as de-duplicated integer codes specifying the order of values in each MultiIndex level. This will be a little slower than just storing the raw values, but has the correctness guarantee provided by storing category encoded values.
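One possible shape for the hybrid idea, as a rough sketch rather than a proposed xarray API: `encode_multiindex` and `decode_multiindex` are hypothetical helper names, and for simplicity the level order is stored as the level values themselves rather than as integer codes. It assumes every level has a unique, non-`None` name and pandas 0.24 or later (for the `codes` argument).

```python
import numpy as np
import pandas as pd


def encode_multiindex(index):
    """Split a MultiIndex into raw per-level values plus each level's order."""
    # Raw values: one full-length array per level, readable without any decoding.
    raw = {name: np.asarray(index.get_level_values(name)) for name in index.names}
    # Level order: the unique values of each level, in the order pandas stores them.
    level_order = {
        name: np.asarray(level) for name, level in zip(index.names, index.levels)
    }
    return raw, level_order


def decode_multiindex(raw, level_order):
    """Rebuild the MultiIndex, reusing the stored level order instead of re-sorting."""
    names = list(raw)
    levels = [pd.Index(level_order[name]) for name in names]
    # Look up each raw value's position in its stored level to recover the codes.
    codes = [level.get_indexer(raw[name]) for name, level in zip(names, levels)]
    return pd.MultiIndex(levels=levels, codes=codes, names=names)


# Round trip on an index whose first level is stored unsorted.
index = pd.MultiIndex(
    levels=[['b', 'a'], [1, 2]],
    codes=[[0, 0, 1, 1], [0, 1, 0, 1]],
    names=['letters', 'numbers'],
)
raw, level_order = encode_multiindex(index)
restored = decode_multiindex(raw, level_order)
```

The extra `get_indexer` lookup per level is where the "little slower" cost shows up; in exchange, the restored index has the same levels and codes as the original.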

Either way, we will need to store an attribute or two with metadata for how to restore the levels (e.g., 'multiindex_levels: numbers letters').

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  187069161