html_url,issue_url,id,node_id,user,created_at,updated_at,author_association,body,reactions,performed_via_github_app,issue https://github.com/pydata/xarray/issues/1077#issuecomment-1270514913,https://api.github.com/repos/pydata/xarray/issues/1077,1270514913,IC_kwDOAMm_X85LuoTh,2448579,2022-10-06T18:31:51Z,2022-10-06T18:31:51Z,MEMBER,Thanks @lucianopaz I fixed some errors when I added it to [cf-xarray](https://cf-xarray.readthedocs.io/en/latest/coding.html) It would be good to see if that version works for you.,"{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,187069161 https://github.com/pydata/xarray/issues/1077#issuecomment-1101505074,https://api.github.com/repos/pydata/xarray/issues/1077,1101505074,IC_kwDOAMm_X85Bp6Iy,2448579,2022-04-18T15:36:19Z,2022-04-18T15:36:19Z,MEMBER,"I added the ""compression by gathering"" scheme to cf-xarray. 1. https://cf-xarray.readthedocs.io/en/latest/generated/cf_xarray.encode_multi_index_as_compress.html 1. https://cf-xarray.readthedocs.io/en/latest/generated/cf_xarray.decode_compress_to_multi_index.html","{""total_count"": 2, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 2, ""rocket"": 0, ""eyes"": 0}",,187069161 https://github.com/pydata/xarray/issues/1077#issuecomment-645416425,https://api.github.com/repos/pydata/xarray/issues/1077,645416425,MDEyOklzc3VlQ29tbWVudDY0NTQxNjQyNQ==,2448579,2020-06-17T14:40:19Z,2020-06-17T14:40:19Z,MEMBER,"@shoyer I now understand your earlier comment. I agree that it should work with both sparse and MultiIndex but as such there's no way to decide whether this should be decoded to a sparse array or a MultiIndexed dense array. 
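To make the ambiguity concrete, here is a minimal sketch (plain numpy/pandas; the variable names are illustrative, not part of any proposal) decoding one compressed representation both ways:

```python
import numpy as np
import pandas as pd

# compressed form: 1D indices into a full (lat=2, lon=2) grid, plus data
comp_idx = np.array([0, 3, 2, 1])
data = np.array([1.0, 2.0, 3.0, 4.0])
lat = np.array(['a', 'b'], dtype=object)
lon = np.array([1, 2])
shape = (lat.size, lon.size)
ii, jj = np.unravel_index(comp_idx, shape)

# reading 1: a dense 1D array along a MultiIndexed dimension
dense = pd.Series(data, index=pd.MultiIndex.from_arrays(
    [lat[ii], lon[jj]], names=('lat', 'lon')))

# reading 2: the very same triplets as COO coordinates of a sparse 2D array
coo_coords = np.stack([ii, jj])  # shape (2, nnz)
coo_data = data
```

Both readings are lossless, and nothing in the stored attributes says which one the writer intended.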
Following your comment in https://github.com/pydata/xarray/issues/3213#issuecomment-521533999 > Fortunately, there does seems to be a CF convention that would be a good fit for for sparse data in COO format, namely the indexed ragged array representation (example, note the instance_dimension attribute). That's probably the right thing to use for sparse arrays in xarray. How about using this ""compression by gathering"" idea for MultiIndexed dense arrays and ""indexed ragged arrays"" for sparse arrays? I do not know the internals of `sparse` or the details of the CF conventions to have a strong opinion on which representation to prefer for `sparse.COO` arrays. PS: CF convention for ""indexed ragged arrays"" is here: http://cfconventions.org/cf-conventions/cf-conventions.html#_indexed_ragged_array_representation ","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,187069161 https://github.com/pydata/xarray/issues/1077#issuecomment-645142014,https://api.github.com/repos/pydata/xarray/issues/1077,645142014,MDEyOklzc3VlQ29tbWVudDY0NTE0MjAxNA==,1217238,2020-06-17T04:28:56Z,2020-06-17T04:28:56Z,MEMBER,"It still isn't clear to me why this is a better representation for a MultiIndex than a sparse array. I guess it could work fine for either, but we would need to pick a convention.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,187069161 https://github.com/pydata/xarray/issues/1077#issuecomment-645139667,https://api.github.com/repos/pydata/xarray/issues/1077,645139667,MDEyOklzc3VlQ29tbWVudDY0NTEzOTY2Nw==,6815844,2020-06-17T04:21:40Z,2020-06-17T04:21:40Z,MEMBER,"@dcherian. Now I understood. Your working examples were really nice for me to understand the idea. Thank you for this clarification. I think the use of this convention is the best idea to save MultiIndex in netCDF. 
Maybe we can start implementing this? ","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,187069161 https://github.com/pydata/xarray/issues/1077#issuecomment-644803374,https://api.github.com/repos/pydata/xarray/issues/1077,644803374,MDEyOklzc3VlQ29tbWVudDY0NDgwMzM3NA==,2448579,2020-06-16T14:31:23Z,2020-06-16T14:31:23Z,MEMBER,"I may be missing something but @fujiisoup's concern is addressed by the scheme in the CF conventions. > In your encoded, how can we tell the MultiIndex is [('a', 1), ('b', 1), ('a', 2), ('b', 2)] or [('a', 1), ('a', 2), ('b', 1), ('b', 2)]? The information about ordering is stored as 1D indexes of an ND array; constructed using `np.ravel_multi_index` in the `encode_multiindex` function: `encoded[idxname] = np.ravel_multi_index(ds.indexes[idxname].codes, shape)` For example, see the dimension coordinate `landpoint` in the encoded form ``` >>> ds3 Dimensions: (landpoint: 4) Coordinates: * landpoint (landpoint) MultiIndex - lat (landpoint) object 'a' 'b' 'b' 'a' - lon (landpoint) int64 1 2 1 2 Data variables: landsoilt (landpoint) float64 -0.2699 -1.228 0.4632 0.2287 ``` ``` >>> encode_multiindex(ds3, ""landpoint"") Dimensions: (landpoint: 4, lat: 2, lon: 2) Coordinates: * lat (lat) object 'a' 'b' * lon (lon) int64 1 2 * landpoint (landpoint) int64 0 3 2 1 Data variables: landsoilt (landpoint) float64 -0.2699 -1.228 0.4632 0.2287 ``` Here is a cleaned up version of the code for easy testing ``` python import numpy as np import pandas as pd import xarray as xr def encode_multiindex(ds, idxname): encoded = ds.reset_index(idxname) coords = dict(zip(ds.indexes[idxname].names, ds.indexes[idxname].levels)) for coord in coords: encoded[coord] = coords[coord].values shape = [encoded.sizes[coord] for coord in coords] encoded[idxname] = np.ravel_multi_index(ds.indexes[idxname].codes, shape) encoded[idxname].attrs[""compress""] = "" "".join(ds.indexes[idxname].names) return 
encoded def decode_to_multiindex(encoded, idxname): names = encoded[idxname].attrs[""compress""].split("" "") shape = [encoded.sizes[dim] for dim in names] indices = np.unravel_index(encoded.landpoint.values, shape) arrays = [encoded[dim].values[index] for dim, index in zip(names, indices)] mindex = pd.MultiIndex.from_arrays(arrays) decoded = xr.Dataset({}, {idxname: mindex}) for varname in encoded.data_vars: if idxname in encoded[varname].dims: decoded[varname] = (idxname, encoded[varname].values) return decoded ds1 = xr.Dataset( {""landsoilt"": (""landpoint"", np.random.randn(4))}, { ""landpoint"": pd.MultiIndex.from_product( [[""a"", ""b""], [1, 2]], names=(""lat"", ""lon"") ) }, ) ds2 = xr.Dataset( {""landsoilt"": (""landpoint"", np.random.randn(4))}, { ""landpoint"": pd.MultiIndex.from_arrays( [[""a"", ""b"", ""c"", ""d""], [1, 2, 4, 10]], names=(""lat"", ""lon"") ) }, ) ds3 = xr.Dataset( {""landsoilt"": (""landpoint"", np.random.randn(4))}, { ""landpoint"": pd.MultiIndex.from_arrays( [[""a"", ""b"", ""b"", ""a""], [1, 2, 1, 2]], names=(""lat"", ""lon"") ) }, ) idxname = ""landpoint"" for dataset in [ds1, ds2, ds3]: xr.testing.assert_identical( decode_to_multiindex(encode_multiindex(dataset, idxname), idxname), dataset ) ```","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,187069161 https://github.com/pydata/xarray/issues/1077#issuecomment-644451622,https://api.github.com/repos/pydata/xarray/issues/1077,644451622,MDEyOklzc3VlQ29tbWVudDY0NDQ1MTYyMg==,1217238,2020-06-16T00:00:40Z,2020-06-16T00:00:40Z,MEMBER,"I agree with @fujiisoup. I think this ""compression-by-gathering"" representation makes more sense for sparse arrays than for a MultiIndex, per se. That said, MultiIndex and sparse arrays are basically two sides of the same idea. 
In the long term, it might make sense to support only one of the two.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,187069161 https://github.com/pydata/xarray/issues/1077#issuecomment-644447471,https://api.github.com/repos/pydata/xarray/issues/1077,644447471,MDEyOklzc3VlQ29tbWVudDY0NDQ0NzQ3MQ==,6815844,2020-06-15T23:45:27Z,2020-06-15T23:45:27Z,MEMBER,"@dcherian I think the problem is how to serialize `MultiIndex` objects rather than the array itself. In your `encoded`, how can we tell the MultiIndex is `[('a', 1), ('b', 1), ('a', 2), ('b', 2)]` or `[('a', 1), ('a', 2), ('b', 1), ('b', 2)]`? Maybe we need to store similar objects to `landpoint` for level variables, such as `latpoint` and `lonpoint`. I think just using `reset_index` is simpler and easier to restore.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,187069161 https://github.com/pydata/xarray/issues/1077#issuecomment-644442679,https://api.github.com/repos/pydata/xarray/issues/1077,644442679,MDEyOklzc3VlQ29tbWVudDY0NDQ0MjY3OQ==,2448579,2020-06-15T23:29:11Z,2020-06-15T23:38:30Z,MEMBER,"This seems to be possible following http://cfconventions.org/Data/cf-conventions/cf-conventions-1.8/cf-conventions.html#compression-by-gathering Here is a quick proof of concept: ``` python import numpy as np import pandas as pd import xarray as xr # example 1 ds = xr.Dataset( {""landsoilt"": (""landpoint"", np.random.randn(4))}, { ""landpoint"": pd.MultiIndex.from_product( [[""a"", ""b""], [1, 2]], names=(""lat"", ""lon"") ) }, ) # example 2 # ds = xr.Dataset( # {""landsoilt"": (""landpoint"", np.random.randn(4))}, # { # ""landpoint"": pd.MultiIndex.from_arrays( # [[""a"", ""b"", ""c"", ""d""], [1, 2, 4, 10]], names=(""lat"", ""lon"") # ) # }, # ) # encode step # detect using isinstance(index, pd.MultiIndex) idxname = ""landpoint"" encoded = 
ds.reset_index(idxname) coords = dict(zip(ds.indexes[idxname].names, ds.indexes[idxname].levels)) for coord in coords: encoded[coord] = coords[coord].values shape = [encoded.sizes[coord] for coord in coords] encoded[idxname] = np.ravel_multi_index(ds.indexes[idxname].codes, shape) encoded[idxname].attrs[""compress""] = "" "".join(ds.indexes[idxname].names) # decode step # detect using ""compress"" in var.attrs idxname = ""landpoint"" names = encoded[idxname].attrs[""compress""].split("" "") shape = [encoded.sizes[dim] for dim in names] indices = np.unravel_index(encoded.landpoint.values, shape) arrays = [encoded[dim].values[index] for dim, index in zip(names, indices)] mindex = pd.MultiIndex.from_arrays(arrays) decoded = xr.Dataset({}, {idxname: mindex}) decoded[""landsoilt""] = (idxname, encoded[""landsoilt""].values) xr.testing.assert_identical(decoded, ds) ``` `encoded` can be serialized using our existing code: ``` Dimensions: (landpoint: 4, lat: 2, lon: 2) Coordinates: * lat (lat) object 'a' 'b' * lon (lon) int64 1 2 * landpoint (landpoint) int64 0 1 2 3 Data variables: landsoilt (landpoint) float64 -1.668 -1.003 1.084 1.963 ```","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,187069161 https://github.com/pydata/xarray/issues/1077#issuecomment-478058340,https://api.github.com/repos/pydata/xarray/issues/1077,478058340,MDEyOklzc3VlQ29tbWVudDQ3ODA1ODM0MA==,1217238,2019-03-29T16:15:22Z,2019-03-29T16:15:22Z,MEMBER,"Once we finish https://github.com/pydata/xarray/issues/1603, that may change our perspective here a little bit (and could indirectly solve this problem).","{""total_count"": 1, ""+1"": 1, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,187069161 
https://github.com/pydata/xarray/issues/1077#issuecomment-286176727,https://api.github.com/repos/pydata/xarray/issues/1077,286176727,MDEyOklzc3VlQ29tbWVudDI4NjE3NjcyNw==,1217238,2017-03-13T17:14:37Z,2017-03-13T17:14:37Z,MEMBER,"Let's recap the options, which I'll illustrate for the second level of my MultiIndex from above (https://github.com/pydata/xarray/issues/1077#issuecomment-258323743): 1. ""categories and codes"": e.g., `['a', 'b']` and `[0, 1, 0, 1, 0, 1]`. Highest speed, low memory requirements, faithful round-trip to xarray/pandas, less obvious representation. 2. ""categories and values"": e.g., `['a', 'b']` and `['a', 'b', 'a', 'b', 'a', 'b']`. Moderate speed (need to recreate codes), high memory requirements, faithful round-trip to xarray/pandas, more obvious representation (categories can be safely ignored). 3. ""raw values"": e.g., `['a', 'b', 'a', 'b', 'a', 'b']`. Moderate speed (only slightly slower than 2), high memory requirements (slightly better than 2), does *not* support completely faithful roundtrip, most obvious representation. 4. ""category codes and values"": e.g., `[0, 1]` and `['a', 'b', 'a', 'b', 'a', 'b']`. Moderate speed, high memory requirements, also does not support faithful roundtrip (it's possible for some levels to not be represented in the `MultiIndex` values), more obvious representation (like 2). 3 uses only slightly less memory than 2 and can be easily achieved with `reset_index()`, so I don't see a reason to support it for writing (read support would be fine). 4 looks like a faithful roundtrip, but actually isn't in some rare edge cases. That seems like a recipe for disaster, so it should be ruled out. This leaves 1 and 2. Both are reasonably performant and roundtrip xarray objects with complete fidelity, so I would be happy with either of them. In principle we could even support both, with an argument to switch between the modes (one would need to be the default). 
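The practical difference between options 1 and 2 can be sketched with `pd.factorize` (a sketch for illustration only, not the proposed xarray implementation):

```python
import numpy as np
import pandas as pd

values = np.array(['a', 'b', 'a', 'b', 'a', 'b'], dtype=object)

# option 1 stores these two arrays: compact, but needs a decode step
codes, categories = pd.factorize(values)

# option 2 stores the categories plus the expanded values: redundant,
# but readers that ignore the categories still see plain values
expanded = categories[codes]
```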
My inclination is to start with only supporting 1, because it has a potentially large advantage from a speed/memory perspective, and it's easy to achieve the ""raw values"" representation with `.reset_index()` (and convert back with `.set_index()`). If we do this, the documentation for writing netCDF files should definitely include a suggestion to consider using `.reset_index()` when distributing files not intended strictly for use by xarray users.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,187069161 https://github.com/pydata/xarray/issues/1077#issuecomment-260686932,https://api.github.com/repos/pydata/xarray/issues/1077,260686932,MDEyOklzc3VlQ29tbWVudDI2MDY4NjkzMg==,1217238,2016-11-15T16:16:47Z,2016-11-15T16:16:47Z,MEMBER,"`DatasetNode` feels a little too complex to me and disjoint from the rest of the package. I don't know when I would recommend using a `DatasetNode` to store data. Also, as written I don't see any aspects that _need_ to live in core xarray -- it seems that it can mostly be done with the external interface. So I would suggest the separate package. ","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,187069161 https://github.com/pydata/xarray/issues/1077#issuecomment-260645000,https://api.github.com/repos/pydata/xarray/issues/1077,260645000,MDEyOklzc3VlQ29tbWVudDI2MDY0NTAwMA==,4160723,2016-11-15T13:46:38Z,2016-11-15T14:33:08Z,MEMBER,"Yes I'm actually not very happy with the `.dataset` attribute for accessing the underlying dataset object. On the other hand, similarly to `h5py` and `netCDF4`, I find it nice to have dict-like access to other nodes of the tree, e.g., `dsnode['../othernode/childnode']`. 
I guess this might co-exist with dict-like access to dataset variables if we ensure that there is no conflict between the names of the child nodes and the names of the dataset variables. Or maybe we can still access a child node that has the same name as a variable by writing `dsnode['./name']` instead of `dsnode['name']`. Conflicts would remain for attribute-style access anyway... @shoyer do you think that a PR for such a `DatasetNode` class has any chance of being merged at some point here? ","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,187069161 https://github.com/pydata/xarray/issues/1077#issuecomment-260162320,https://api.github.com/repos/pydata/xarray/issues/1077,260162320,MDEyOklzc3VlQ29tbWVudDI2MDE2MjMyMA==,4160723,2016-11-13T02:24:56Z,2016-11-13T02:24:56Z,MEMBER,"I've started writing a `DatasetNode` class (WIP): https://gist.github.com/benbovy/92e7c76220af1aaa4b3a0b65374e233a Currently, this is a minimal class that just implements an ""immutable"" tree of datasets (it only allows adding child nodes so that we can build a tree). ","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,187069161 https://github.com/pydata/xarray/issues/1077#issuecomment-260156237,https://api.github.com/repos/pydata/xarray/issues/1077,260156237,MDEyOklzc3VlQ29tbWVudDI2MDE1NjIzNw==,1217238,2016-11-12T23:44:03Z,2016-11-12T23:44:03Z,MEMBER,"Maybe? A minimal class for managing groups in an open file could potentially have synergy with our backends system. Something more than that is probably out of scope. On Sat, Nov 12, 2016 at 1:00 PM tippetts notifications@github.com wrote: > Here's a new, related question: @shoyer https://github.com/shoyer , do > you have any interest in adding a class to xarray that contains a > hierarchical tree of Datasets, analogous to the groups in a netCDF or HDF5 > file? 
Then opening or saving such an object would be an easy but powerful > one-liner. > > Or is that something you would rather leave to someone else's module? ","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,187069161 https://github.com/pydata/xarray/issues/1077#issuecomment-259412395,https://api.github.com/repos/pydata/xarray/issues/1077,259412395,MDEyOklzc3VlQ29tbWVudDI1OTQxMjM5NQ==,4160723,2016-11-09T13:18:09Z,2016-11-09T14:31:50Z,MEMBER,"> unless we want options for controlling how the MultiIndex is stored. Yes that's what I mean, something like `categories_codes`, `raw_values` and/or `hybrid` options, though I don't know if using `encoding` is appropriate here. Trying to summarize the potential use cases mentioned above: 1. If we're sure that we'll only use xarray (current or newer version) to load back the files, then the `categories_codes` option is the way to go. 2. If we want to write files that are portable across many other tools than just xarray, then we could use `reset_index` to manually switch the multi-index back into separate coordinates before writing the file. 3. If we want both 1 _and_ 2, then it would be convenient to have something in xarray that automatically resets / refactorizes the multi-index at writing / loading (this would be the hybrid option). Note that point 3 is just for more convenience, I wouldn't mind too much having to manually reset / refactorize the multi-index in that case. We indeed don't need options if point 3 is not important. 
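Use case 2 is already expressible with today's API; a minimal sketch (only the reshaping is shown, the actual `to_netcdf` call is omitted):

```python
import numpy as np
import xarray as xr

# a dataset with a MultiIndexed 'landpoint' dimension, built via stack
ds = xr.Dataset(
    {'landsoilt': (('lat', 'lon'), np.random.randn(2, 2))},
    coords={'lat': ['a', 'b'], 'lon': [1, 2]},
).stack(landpoint=('lat', 'lon'))

# use case 2: drop the MultiIndex before writing, keeping the levels
# as plain 1D coordinates that any netCDF tool can read
portable = ds.reset_index('landpoint')

# use cases 1/3: rebuild the MultiIndex after reading the file back
restored = portable.set_index(landpoint=['lat', 'lon'])
```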
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,187069161 https://github.com/pydata/xarray/issues/1077#issuecomment-258563550,https://api.github.com/repos/pydata/xarray/issues/1077,258563550,MDEyOklzc3VlQ29tbWVudDI1ODU2MzU1MA==,1217238,2016-11-04T22:31:17Z,2016-11-04T22:31:17Z,MEMBER,"`encodings` is only in xarray's data model. Everything there gets converted into some detail of how the data is stored in a netcdf file. So I don't think we need to use it here, unless we want options for controlling how the MultiIndex is stored. ","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,187069161 https://github.com/pydata/xarray/issues/1077#issuecomment-258476068,https://api.github.com/repos/pydata/xarray/issues/1077,258476068,MDEyOklzc3VlQ29tbWVudDI1ODQ3NjA2OA==,4160723,2016-11-04T16:14:59Z,2016-11-04T16:14:59Z,MEMBER,"I have the exact same applications than yours @tippetts, but I also would like to write netCDF files that are compatible with other tools than just xarray. With the category encoded values as the default behavior, my concern is that xarray users may be unaware that they generate netCDF files which have limited compatibility with 3rd-party tools, unless a clear warning is given in the documentation. > One consideration in favor of this is that it will soon be very easy to switch a MultiIndex back into separate coordinate variables, which could be our recommendation for how to save netCDF files for maximum portability. This should be fine, but maybe it would be nice to allow handling this automatically (at read and write) by using a specific `encoding` attribute? I haven't got much into xarray's IO and serialization logic, so I don't know if it is the right approach. This would be convenient for loading back the generated netCDF files with both xarray and 3rd-party tools, though. 
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,187069161 https://github.com/pydata/xarray/issues/1077#issuecomment-258460719,https://api.github.com/repos/pydata/xarray/issues/1077,258460719,MDEyOklzc3VlQ29tbWVudDI1ODQ2MDcxOQ==,1217238,2016-11-04T15:22:12Z,2016-11-04T15:22:12Z,MEMBER,"> Personally I'd vote for the category encoded values. If I make files with a newer xarray, I'll be reading them later with the same (or newer) xarray and I'd definitely want the exact MultiIndex back. Point taken -- let's see what others think! One consideration in favor of this is that it will soon be very easy to switch a MultiIndex back into separate coordinate variables, which could be our recommendation for how to save netCDF files for maximum portability. > The one thing I'm wondering is, what happens in an application like this if you select on one index (say, all data rows with region_name='FOOBAR-1') from the HDF5 file before doing anything else? Would it hard to make the MultiIndex/NetCDF reader smart enough not to reconstruct the whole MultiIndex before picking out the relevant rows? We could do this, but note that we are contemplating switching xarray to always load indexes into memory eagerly, which would negate that advantage. See this PR and mailing list discussion: https://github.com/pydata/xarray/pull/1024#issuecomment-256114879 https://groups.google.com/forum/#!topic/xarray/dK2RHUls1nQ > Nuts and bolts questions: So each of index.levels would be easy to store as its own little DataArray, yeah? Then would each of the index.labels be in its own DataArray, or would you want them all in the same 2D DataArray? pandas stores levels separately, automatically putting each of them in the smallest possible dtype (`int8`, `int16`, `int32` or `int64`). So we also probably want to store them in separate 1D variables. 
> And then would the actual data in the original DataArray just have a generic integer index as a placeholder, to be replaced by the MultiIndex? Just a note: for interacting with backends, we use `Variable` objects instead of DataArrays: http://xarray.pydata.org/en/stable/internals.html#variable-objects This means that we don't need the generic integer placeholder index (which will also be going away shortly in general, see https://github.com/pydata/xarray/pull/1017). ","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,187069161 https://github.com/pydata/xarray/issues/1077#issuecomment-258323743,https://api.github.com/repos/pydata/xarray/issues/1077,258323743,MDEyOklzc3VlQ29tbWVudDI1ODMyMzc0Mw==,1217238,2016-11-04T01:38:41Z,2016-11-04T01:38:56Z,MEMBER,"This is a good question -- I don't think we've figured it out yet. Maybe you have ideas? The main question (to me) is whether we should store **raw values** for each level in a MultiIndex (closer to what you see), or **category encoded values** (closer to the MultiIndex implementation). 
To be more concrete, here is what these look like for an example MultiIndex: ``` In [1]: index = pd.MultiIndex.from_product([[1, 2, 3], ['a', 'b']], names=['numbers', 'letters']) In [2]: index Out[2]: MultiIndex(levels=[[1, 2, 3], ['a', 'b']], labels=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]], names=['numbers', 'letters']) In [3]: index.values Out[3]: array([(1, 'a'), (1, 'b'), (2, 'a'), (2, 'b'), (3, 'a'), (3, 'b')], dtype=object) # categorical encoded values In [4]: index.levels, index.labels Out[4]: (FrozenList([[1, 2, 3], ['a', 'b']]), FrozenList([[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]])) # raw values In [5]: index.get_level_values(0), index.get_level_values(1) Out[5]: (Int64Index([1, 1, 2, 2, 3, 3], dtype='int64', name='numbers'), Index(['a', 'b', 'a', 'b', 'a', 'b'], dtype='object', name='letters')) ``` Advantages of storing raw values: - It's easier to work with MultiIndex levels without xarray, or with older versions of xarray (no need to combine levels and labels). - Avoiding the overhead of saving integer codes can save memory if levels have dtypes with small fixed sizes (e.g., float, int or datetime) or mostly distinct values. Advantages of storing category encoded values: - It's cheaper to construct the MultiIndex, because we have already factorized each level. - It can result in significant memory savings if levels are mostly duplicated (e.g., a tensor product) or have large itemsize (e.g., long strings). - We can restore the exact same MultiIndex, instead of refactorizing it. This manifests itself in a few edge cases that could make for a frustrating user experience (changed dimension order after stacking: https://github.com/pydata/xarray/issues/980). Perhaps the best approach would be a hybrid: store raw values, as well as de-duplicated integer codes specifying the order of values in each MultiIndex `level`. This will be a little slower than just storing the raw values, but has the correctness guarantee provided by storing category encoded values. 
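That hybrid could be sketched roughly as follows, using the modern pandas `codes` attribute (called `labels` at the time of this comment); the stored layout and names are illustrative, not a concrete file format:

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[1, 2, 3], ['a', 'b']],
                                   names=['numbers', 'letters'])

# per level, store the raw values (obvious to outside readers) plus the
# categories and integer codes (exact reconstruction for xarray readers)
stored = {
    name: {
        'raw': np.asarray(index.get_level_values(i)),
        'levels': np.asarray(index.levels[i]),
        'codes': np.asarray(index.codes[i]),
    }
    for i, name in enumerate(index.names)
}

# an xarray reader rebuilds the identical MultiIndex from levels + codes
rebuilt = pd.MultiIndex(
    levels=[stored[n]['levels'] for n in index.names],
    codes=[stored[n]['codes'] for n in index.names],
    names=list(index.names),
)
```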
Either way, we will need to store an attribute or two with metadata for how to restore the levels (e.g., `'multiindex_levels: numbers letters'`). ","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,187069161