html_url,issue_url,id,node_id,user,created_at,updated_at,author_association,body,reactions,performed_via_github_app,issue
https://github.com/pydata/xarray/issues/1077#issuecomment-1270514913,https://api.github.com/repos/pydata/xarray/issues/1077,1270514913,IC_kwDOAMm_X85LuoTh,2448579,2022-10-06T18:31:51Z,2022-10-06T18:31:51Z,MEMBER,Thanks @lucianopaz. I fixed some errors when I added it to [cf-xarray](https://cf-xarray.readthedocs.io/en/latest/coding.html). It would be good to see if that version works for you.,"{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,187069161
https://github.com/pydata/xarray/issues/1077#issuecomment-1101505074,https://api.github.com/repos/pydata/xarray/issues/1077,1101505074,IC_kwDOAMm_X85Bp6Iy,2448579,2022-04-18T15:36:19Z,2022-04-18T15:36:19Z,MEMBER,"I added the ""compression by gathering"" scheme to cf-xarray.

1. https://cf-xarray.readthedocs.io/en/latest/generated/cf_xarray.encode_multi_index_as_compress.html
1. https://cf-xarray.readthedocs.io/en/latest/generated/cf_xarray.decode_compress_to_multi_index.html","{""total_count"": 2, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 2, ""rocket"": 0, ""eyes"": 0}",,187069161
https://github.com/pydata/xarray/issues/1077#issuecomment-645416425,https://api.github.com/repos/pydata/xarray/issues/1077,645416425,MDEyOklzc3VlQ29tbWVudDY0NTQxNjQyNQ==,2448579,2020-06-17T14:40:19Z,2020-06-17T14:40:19Z,MEMBER,"@shoyer I now understand your earlier comment. I agree that it should work with both sparse and MultiIndex, but as it stands there is no way to decide whether this should be decoded to a sparse array or a MultiIndexed dense array.

Following your comment in https://github.com/pydata/xarray/issues/3213#issuecomment-521533999
> Fortunately, there does seem to be a CF convention that would be a good fit for sparse data in COO format, namely the indexed ragged array representation (example, note the instance_dimension attribute). That's probably the right thing to use for sparse arrays in xarray.

How about using this ""compression by gathering"" idea for MultiIndexed dense arrays and ""indexed ragged arrays"" for sparse arrays? I do not know the internals of `sparse` or the details of the CF conventions well enough to have a strong opinion on which representation to prefer for `sparse.COO` arrays.

PS: The CF convention for ""indexed ragged arrays"" is here: http://cfconventions.org/cf-conventions/cf-conventions.html#_indexed_ragged_array_representation","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,187069161
https://github.com/pydata/xarray/issues/1077#issuecomment-644803374,https://api.github.com/repos/pydata/xarray/issues/1077,644803374,MDEyOklzc3VlQ29tbWVudDY0NDgwMzM3NA==,2448579,2020-06-16T14:31:23Z,2020-06-16T14:31:23Z,MEMBER,"I may be missing something but @fujiisoup's concern is addressed by the scheme in the CF conventions.
> In your encoded, how can we tell the MultiIndex is [('a', 1), ('b', 1), ('a', 2), ('b', 2)] or [('a', 1), ('a', 2), ('b', 1), ('b', 2)]?

The ordering information is stored as 1D (raveled) indexes into an ND array, constructed using `np.ravel_multi_index` in the `encode_multiindex` function:

`encoded[idxname] = np.ravel_multi_index(ds.indexes[idxname].codes, shape)`

For example, see the dimension coordinate `landpoint` in the encoded form:

```
>>> ds3
Dimensions:    (landpoint: 4)
Coordinates:
  * landpoint  (landpoint) MultiIndex
  - lat        (landpoint) object 'a' 'b' 'b' 'a'
  - lon        (landpoint) int64 1 2 1 2
Data variables:
    landsoilt  (landpoint) float64 -0.2699 -1.228 0.4632 0.2287
```

```
>>> encode_multiindex(ds3, ""landpoint"")
Dimensions:    (landpoint: 4, lat: 2, lon: 2)
Coordinates:
  * lat        (lat) object 'a' 'b'
  * lon        (lon) int64 1 2
  * landpoint  (landpoint) int64 0 3 2 1
Data variables:
    landsoilt  (landpoint) float64 -0.2699 -1.228 0.4632 0.2287
```

Here is a cleaned-up version of the code for easy testing:

``` python
import numpy as np
import pandas as pd
import xarray as xr


def encode_multiindex(ds, idxname):
    # Drop the MultiIndex, then store each level's unique values as a 1D coordinate.
    encoded = ds.reset_index(idxname)
    coords = dict(zip(ds.indexes[idxname].names, ds.indexes[idxname].levels))
    for coord in coords:
        encoded[coord] = coords[coord].values
    # Flatten the per-level integer codes into a single 1D index.
    shape = [encoded.sizes[coord] for coord in coords]
    encoded[idxname] = np.ravel_multi_index(ds.indexes[idxname].codes, shape)
    encoded[idxname].attrs[""compress""] = "" "".join(ds.indexes[idxname].names)
    return encoded


def decode_to_multiindex(encoded, idxname):
    names = encoded[idxname].attrs[""compress""].split("" "")
    shape = [encoded.sizes[dim] for dim in names]
    # Use idxname rather than a hard-coded variable name.
    indices = np.unravel_index(encoded[idxname].values, shape)
    arrays = [encoded[dim].values[index] for dim, index in zip(names, indices)]
    # Pass the level names explicitly so they survive the round trip.
    mindex = pd.MultiIndex.from_arrays(arrays, names=names)

    decoded = xr.Dataset({}, {idxname: mindex})
    for varname in encoded.data_vars:
        if idxname in encoded[varname].dims:
            decoded[varname] = (idxname, encoded[varname].values)
    return decoded


ds1 = xr.Dataset(
    {""landsoilt"": (""landpoint"", np.random.randn(4))},
    {
        ""landpoint"": pd.MultiIndex.from_product(
            [[""a"", ""b""], [1, 2]], names=(""lat"", ""lon"")
        )
    },
)

ds2 = xr.Dataset(
    {""landsoilt"": (""landpoint"", np.random.randn(4))},
    {
        ""landpoint"": pd.MultiIndex.from_arrays(
            [[""a"", ""b"", ""c"", ""d""], [1, 2, 4, 10]], names=(""lat"", ""lon"")
        )
    },
)

ds3 = xr.Dataset(
    {""landsoilt"": (""landpoint"", np.random.randn(4))},
    {
        ""landpoint"": pd.MultiIndex.from_arrays(
            [[""a"", ""b"", ""b"", ""a""], [1, 2, 1, 2]], names=(""lat"", ""lon"")
        )
    },
)

idxname = ""landpoint""
for dataset in [ds1, ds2, ds3]:
    xr.testing.assert_identical(
        decode_to_multiindex(encode_multiindex(dataset, idxname), idxname), dataset
    )
```","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,187069161
https://github.com/pydata/xarray/issues/1077#issuecomment-644442679,https://api.github.com/repos/pydata/xarray/issues/1077,644442679,MDEyOklzc3VlQ29tbWVudDY0NDQ0MjY3OQ==,2448579,2020-06-15T23:29:11Z,2020-06-15T23:38:30Z,MEMBER,"This seems to be possible following http://cfconventions.org/Data/cf-conventions/cf-conventions-1.8/cf-conventions.html#compression-by-gathering

Here is a quick proof of concept:

``` python
import numpy as np
import pandas as pd
import xarray as xr

# example 1
ds = xr.Dataset(
    {""landsoilt"": (""landpoint"", np.random.randn(4))},
    {
        ""landpoint"": pd.MultiIndex.from_product(
            [[""a"", ""b""], [1, 2]], names=(""lat"", ""lon"")
        )
    },
)

# example 2
# ds = xr.Dataset(
#     {""landsoilt"": (""landpoint"", np.random.randn(4))},
#     {
#         ""landpoint"": pd.MultiIndex.from_arrays(
#             [[""a"", ""b"", ""c"", ""d""], [1, 2, 4, 10]], names=(""lat"", ""lon"")
#         )
#     },
# )

# encode step
# detect using isinstance(index, pd.MultiIndex)
idxname = ""landpoint""
encoded = ds.reset_index(idxname)
coords = dict(zip(ds.indexes[idxname].names, ds.indexes[idxname].levels))
for coord in coords:
    encoded[coord] = coords[coord].values
shape = [encoded.sizes[coord] for coord in coords]
encoded[idxname] = np.ravel_multi_index(ds.indexes[idxname].codes, shape)
encoded[idxname].attrs[""compress""] = "" "".join(ds.indexes[idxname].names)

# decode step
# detect using ""compress"" in var.attrs
idxname = ""landpoint""
names = encoded[idxname].attrs[""compress""].split("" "")
shape = [encoded.sizes[dim] for dim in names]
indices = np.unravel_index(encoded[idxname].values, shape)
arrays = [encoded[dim].values[index] for dim, index in zip(names, indices)]
# pass the level names explicitly so they survive the round trip
mindex = pd.MultiIndex.from_arrays(arrays, names=names)

decoded = xr.Dataset({}, {idxname: mindex})
decoded[""landsoilt""] = (idxname, encoded[""landsoilt""].values)

xr.testing.assert_identical(decoded, ds)
```

`encoded` can be serialized using our existing code:

```
Dimensions:    (landpoint: 4, lat: 2, lon: 2)
Coordinates:
  * lat        (lat) object 'a' 'b'
  * lon        (lon) int64 1 2
  * landpoint  (landpoint) int64 0 1 2 3
Data variables:
    landsoilt  (landpoint) float64 -1.668 -1.003 1.084 1.963
```","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,187069161