html_url,issue_url,id,node_id,user,created_at,updated_at,author_association,body,reactions,performed_via_github_app,issue
https://github.com/pydata/xarray/issues/1077#issuecomment-1270514913,https://api.github.com/repos/pydata/xarray/issues/1077,1270514913,IC_kwDOAMm_X85LuoTh,2448579,2022-10-06T18:31:51Z,2022-10-06T18:31:51Z,MEMBER,Thanks @lucianopaz! I fixed some errors when I added it to [cf-xarray](https://cf-xarray.readthedocs.io/en/latest/coding.html). It would be good to see if that version works for you.,"{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,187069161
https://github.com/pydata/xarray/issues/1077#issuecomment-1101505074,https://api.github.com/repos/pydata/xarray/issues/1077,1101505074,IC_kwDOAMm_X85Bp6Iy,2448579,2022-04-18T15:36:19Z,2022-04-18T15:36:19Z,MEMBER,"I added the ""compression by gathering"" scheme to cf-xarray.
1. https://cf-xarray.readthedocs.io/en/latest/generated/cf_xarray.encode_multi_index_as_compress.html
1. https://cf-xarray.readthedocs.io/en/latest/generated/cf_xarray.decode_compress_to_multi_index.html","{""total_count"": 2, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 2, ""rocket"": 0, ""eyes"": 0}",,187069161
https://github.com/pydata/xarray/issues/1077#issuecomment-645416425,https://api.github.com/repos/pydata/xarray/issues/1077,645416425,MDEyOklzc3VlQ29tbWVudDY0NTQxNjQyNQ==,2448579,2020-06-17T14:40:19Z,2020-06-17T14:40:19Z,MEMBER,"@shoyer I now understand your earlier comment.
I agree that it should work with both sparse and MultiIndex, but as it stands there is no way to decide whether the encoded data should be decoded to a sparse array or to a MultiIndexed dense array.
Following your comment in https://github.com/pydata/xarray/issues/3213#issuecomment-521533999
> Fortunately, there does seem to be a CF convention that would be a good fit for sparse data in COO format, namely the indexed ragged array representation (example, note the instance_dimension attribute). That's probably the right thing to use for sparse arrays in xarray.
How about using this ""compression by gathering"" idea for MultiIndexed dense arrays and ""indexed ragged arrays"" for sparse arrays? I do not know the internals of `sparse` or the details of the CF conventions well enough to have a strong opinion on which representation to prefer for `sparse.COO` arrays.
PS: CF convention for ""indexed ragged arrays"" is here: http://cfconventions.org/cf-conventions/cf-conventions.html#_indexed_ragged_array_representation
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,187069161
https://github.com/pydata/xarray/issues/1077#issuecomment-644803374,https://api.github.com/repos/pydata/xarray/issues/1077,644803374,MDEyOklzc3VlQ29tbWVudDY0NDgwMzM3NA==,2448579,2020-06-16T14:31:23Z,2020-06-16T14:31:23Z,MEMBER,"I may be missing something but @fujiisoup's concern is addressed by the scheme in the CF conventions.
> In your encoded, how can we tell the MultiIndex is [('a', 1), ('b', 1), ('a', 2), ('b', 2)] or [('a', 1), ('a', 2), ('b', 1), ('b', 2)]?
The ordering information is stored as 1D (flat) indexes into an ND array, constructed using `np.ravel_multi_index` in the `encode_multiindex` function:
`encoded[idxname] = np.ravel_multi_index(ds.indexes[idxname].codes, shape)`
For example, see the dimension coordinate `landpoint` in the encoded form
```
>>> ds3
Dimensions:    (landpoint: 4)
Coordinates:
  * landpoint  (landpoint) MultiIndex
  - lat        (landpoint) object 'a' 'b' 'b' 'a'
  - lon        (landpoint) int64 1 2 1 2
Data variables:
    landsoilt  (landpoint) float64 -0.2699 -1.228 0.4632 0.2287
```
```
>>> encode_multiindex(ds3, ""landpoint"")
Dimensions:    (landpoint: 4, lat: 2, lon: 2)
Coordinates:
  * lat        (lat) object 'a' 'b'
  * lon        (lon) int64 1 2
  * landpoint  (landpoint) int64 0 3 2 1
Data variables:
    landsoilt  (landpoint) float64 -0.2699 -1.228 0.4632 0.2287
```
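To make the ordering argument concrete, here is a minimal sketch using only NumPy and pandas (no xarray). The helper name `flat_index` is made up for this illustration and is not part of any library. It encodes the two orderings from the quoted question and shows that they produce different flat indexes, so the order can be recovered on decoding.

```python
import numpy as np
import pandas as pd


def flat_index(mindex):
    # Shape of the full grid spanned by the levels, e.g. (2, 2) for lat x lon.
    shape = [len(level) for level in mindex.levels]
    # One integer per point: its row-major position in that grid.
    return np.ravel_multi_index(tuple(mindex.codes), shape)


names = ['lat', 'lon']
order_a = pd.MultiIndex.from_tuples(
    [('a', 1), ('b', 1), ('a', 2), ('b', 2)], names=names
)
order_b = pd.MultiIndex.from_tuples(
    [('a', 1), ('a', 2), ('b', 1), ('b', 2)], names=names
)

print(flat_index(order_a).tolist())  # [0, 2, 1, 3]
print(flat_index(order_b).tolist())  # [0, 1, 2, 3]
```

The same calculation applied to `ds3` gives `0 3 2 1`, which matches the `landpoint` values in the encoded form above.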
Here is a cleaned-up version of the code for easy testing:
``` python
import numpy as np
import pandas as pd
import xarray as xr
def encode_multiindex(ds, idxname):
    # Drop the MultiIndex; its levels become plain coordinates.
    encoded = ds.reset_index(idxname)
    coords = dict(zip(ds.indexes[idxname].names, ds.indexes[idxname].levels))
    # Store the unique values of each level as a dimension coordinate.
    for coord in coords:
        encoded[coord] = coords[coord].values
    shape = [encoded.sizes[coord] for coord in coords]
    # Flat (1D) position of every point in the full ND grid of levels.
    encoded[idxname] = np.ravel_multi_index(ds.indexes[idxname].codes, shape)
    encoded[idxname].attrs[""compress""] = "" "".join(ds.indexes[idxname].names)
    return encoded
def decode_to_multiindex(encoded, idxname):
    names = encoded[idxname].attrs[""compress""].split("" "")
    shape = [encoded.sizes[dim] for dim in names]
    # Look the variable up by `idxname` instead of hard-coding `landpoint`.
    indices = np.unravel_index(encoded[idxname].values, shape)
    arrays = [encoded[dim].values[index] for dim, index in zip(names, indices)]
    # Pass the level names explicitly so they survive the round trip.
    mindex = pd.MultiIndex.from_arrays(arrays, names=names)
    decoded = xr.Dataset({}, {idxname: mindex})
    for varname in encoded.data_vars:
        if idxname in encoded[varname].dims:
            decoded[varname] = (idxname, encoded[varname].values)
    return decoded
ds1 = xr.Dataset(
    {""landsoilt"": (""landpoint"", np.random.randn(4))},
    {
        ""landpoint"": pd.MultiIndex.from_product(
            [[""a"", ""b""], [1, 2]], names=(""lat"", ""lon"")
        )
    },
)
ds2 = xr.Dataset(
    {""landsoilt"": (""landpoint"", np.random.randn(4))},
    {
        ""landpoint"": pd.MultiIndex.from_arrays(
            [[""a"", ""b"", ""c"", ""d""], [1, 2, 4, 10]], names=(""lat"", ""lon"")
        )
    },
)
ds3 = xr.Dataset(
    {""landsoilt"": (""landpoint"", np.random.randn(4))},
    {
        ""landpoint"": pd.MultiIndex.from_arrays(
            [[""a"", ""b"", ""b"", ""a""], [1, 2, 1, 2]], names=(""lat"", ""lon"")
        )
    },
)
idxname = ""landpoint""
for dataset in [ds1, ds2, ds3]:
    xr.testing.assert_identical(
        decode_to_multiindex(encode_multiindex(dataset, idxname), idxname), dataset
    )
```","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,187069161
https://github.com/pydata/xarray/issues/1077#issuecomment-644442679,https://api.github.com/repos/pydata/xarray/issues/1077,644442679,MDEyOklzc3VlQ29tbWVudDY0NDQ0MjY3OQ==,2448579,2020-06-15T23:29:11Z,2020-06-15T23:38:30Z,MEMBER,"This seems to be possible following http://cfconventions.org/Data/cf-conventions/cf-conventions-1.8/cf-conventions.html#compression-by-gathering
Here is a quick proof of concept:
``` python
import numpy as np
import pandas as pd
import xarray as xr
# example 1
ds = xr.Dataset(
    {""landsoilt"": (""landpoint"", np.random.randn(4))},
    {
        ""landpoint"": pd.MultiIndex.from_product(
            [[""a"", ""b""], [1, 2]], names=(""lat"", ""lon"")
        )
    },
)
# example 2
# ds = xr.Dataset(
#     {""landsoilt"": (""landpoint"", np.random.randn(4))},
#     {
#         ""landpoint"": pd.MultiIndex.from_arrays(
#             [[""a"", ""b"", ""c"", ""d""], [1, 2, 4, 10]], names=(""lat"", ""lon"")
#         )
#     },
# )
# encode step
# detect using isinstance(index, pd.MultiIndex)
idxname = ""landpoint""
encoded = ds.reset_index(idxname)
coords = dict(zip(ds.indexes[idxname].names, ds.indexes[idxname].levels))
for coord in coords:
    encoded[coord] = coords[coord].values
shape = [encoded.sizes[coord] for coord in coords]
encoded[idxname] = np.ravel_multi_index(ds.indexes[idxname].codes, shape)
encoded[idxname].attrs[""compress""] = "" "".join(ds.indexes[idxname].names)
# decode step
# detect using ""compress"" in var.attrs
idxname = ""landpoint""
names = encoded[idxname].attrs[""compress""].split("" "")
shape = [encoded.sizes[dim] for dim in names]
indices = np.unravel_index(encoded[idxname].values, shape)
arrays = [encoded[dim].values[index] for dim, index in zip(names, indices)]
# pass the level names explicitly so they survive the round trip
mindex = pd.MultiIndex.from_arrays(arrays, names=names)
decoded = xr.Dataset({}, {idxname: mindex})
decoded[""landsoilt""] = (idxname, encoded[""landsoilt""].values)
xr.testing.assert_identical(decoded, ds)
```
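As a rough illustration of what the gathered index holds when the MultiIndex is not a full product (example 2 above), here is a sketch using only NumPy and pandas. The variable names `gathered`, `positions` and `roundtrip` are chosen just for this snippet. Only 4 of the 16 points of the 4 x 4 lat/lon grid exist, and the flat index records which ones.

```python
import numpy as np
import pandas as pd

mindex = pd.MultiIndex.from_arrays(
    [['a', 'b', 'c', 'd'], [1, 2, 4, 10]], names=['lat', 'lon']
)
shape = [len(level) for level in mindex.levels]  # full grid is 4 x 4

# Encode: keep only the flat positions of the 4 points that exist.
gathered = np.ravel_multi_index(tuple(mindex.codes), shape)

# Decode: recover per-level positions, then look up the level values.
positions = np.unravel_index(gathered, shape)
arrays = [level.values[pos] for level, pos in zip(mindex.levels, positions)]
roundtrip = pd.MultiIndex.from_arrays(arrays, names=mindex.names)

print(gathered.tolist())         # [0, 5, 10, 15]
print(roundtrip.equals(mindex))  # True
```
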
`encoded` can be serialized using our existing code:
```
Dimensions:    (landpoint: 4, lat: 2, lon: 2)
Coordinates:
  * lat        (lat) object 'a' 'b'
  * lon        (lon) int64 1 2
  * landpoint  (landpoint) int64 0 1 2 3
Data variables:
    landsoilt  (landpoint) float64 -1.668 -1.003 1.084 1.963
```","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,187069161