issue_comments

30 rows where issue = 187069161 sorted by updated_at descending

user 10

  • shoyer 9
  • dcherian 5
  • tippetts 5
  • benbovy 4
  • fujiisoup 2
  • kefirbandi 1
  • mullenkamp 1
  • lucianopaz 1
  • seth-p 1
  • stale[bot] 1

author_association 3

  • MEMBER 20
  • NONE 8
  • CONTRIBUTOR 2

issue 1

  • MultiIndex serialization to NetCDF · 30 ✖
id html_url issue_url node_id user created_at updated_at ▲ author_association body reactions performed_via_github_app issue
1270514913 https://github.com/pydata/xarray/issues/1077#issuecomment-1270514913 https://api.github.com/repos/pydata/xarray/issues/1077 IC_kwDOAMm_X85LuoTh dcherian 2448579 2022-10-06T18:31:51Z 2022-10-06T18:31:51Z MEMBER

Thanks @lucianopaz. I fixed some errors when I added it to cf-xarray. It would be good to see if that version works for you.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  MultiIndex serialization to NetCDF 187069161
1268380486 https://github.com/pydata/xarray/issues/1077#issuecomment-1268380486 https://api.github.com/repos/pydata/xarray/issues/1077 IC_kwDOAMm_X85LmfNG lucianopaz 5230109 2022-10-05T12:38:05Z 2022-10-05T12:38:05Z NONE

Hi everyone, first of all, thanks for your amazing work! I came across this issue today because I have a dataset with multiple variables and multiple MultiIndex dimensions, some of which aren't used by some variables. I had to slightly adapt the workaround posted by @dcherian to get things to work. I'll post it here in case someone else finds the patch useful. I'm not sure if it would be a viable fix for the issue though; let me know if it is and I'll open a PR.

```python
import numpy as np
import pandas as pd
import xarray as xr


def encode_multiindex(ds, idxname):
    encoded = ds.reset_index(idxname)
    coords = dict(zip(ds.indexes[idxname].names, ds.indexes[idxname].levels))
    for coord in coords:
        encoded[coord] = coords[coord].values
    shape = [encoded.sizes[coord] for coord in coords]
    encoded[idxname] = np.ravel_multi_index(ds.indexes[idxname].codes, shape)
    encoded[idxname].attrs["compress"] = " ".join(ds.indexes[idxname].names)
    return encoded


def decode_to_multiindex(encoded, idxname):
    names = encoded[idxname].attrs["compress"].split(" ")
    shape = [encoded.sizes[dim] for dim in names]
    indices = np.unravel_index(encoded[idxname].values, shape)
    arrays = np.array([encoded[dim].values[index] for dim, index in zip(names, indices)])
    mindex = pd.MultiIndex.from_arrays(arrays, names=names)

    # keep coordinates of dimensions that are not part of the MultiIndex
    decoded = xr.Dataset(
        {},
        dict(
            **{idxname: mindex},
            **{dim: encoded.coords[dim] for dim in encoded.dims if dim not in [idxname] + names}
        )
    )
    for varname in encoded.data_vars:
        if idxname in encoded[varname].dims:
            decoded[varname] = (encoded[varname].dims, encoded[varname].values)
        else:
            # variables that do not use the MultiIndex dimension are copied as-is
            decoded[varname] = encoded[varname]
    return decoded
```

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  MultiIndex serialization to NetCDF 187069161
1101505074 https://github.com/pydata/xarray/issues/1077#issuecomment-1101505074 https://api.github.com/repos/pydata/xarray/issues/1077 IC_kwDOAMm_X85Bp6Iy dcherian 2448579 2022-04-18T15:36:19Z 2022-04-18T15:36:19Z MEMBER

I added the "compression by gathering" scheme to cf-xarray:

1. https://cf-xarray.readthedocs.io/en/latest/generated/cf_xarray.encode_multi_index_as_compress.html
2. https://cf-xarray.readthedocs.io/en/latest/generated/cf_xarray.decode_compress_to_multi_index.html

{
    "total_count": 2,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 2,
    "rocket": 0,
    "eyes": 0
}
  MultiIndex serialization to NetCDF 187069161
1101089866 https://github.com/pydata/xarray/issues/1077#issuecomment-1101089866 https://api.github.com/repos/pydata/xarray/issues/1077 IC_kwDOAMm_X85BoUxK stale[bot] 26384082 2022-04-18T04:43:45Z 2022-04-18T04:43:45Z NONE

In order to maintain a list of currently relevant issues, we mark issues as stale after a period of inactivity

If this issue remains relevant, please comment here or remove the stale label; otherwise it will be marked as closed automatically

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  MultiIndex serialization to NetCDF 187069161
645416425 https://github.com/pydata/xarray/issues/1077#issuecomment-645416425 https://api.github.com/repos/pydata/xarray/issues/1077 MDEyOklzc3VlQ29tbWVudDY0NTQxNjQyNQ== dcherian 2448579 2020-06-17T14:40:19Z 2020-06-17T14:40:19Z MEMBER

@shoyer I now understand your earlier comment.

I agree that it should work with both sparse and MultiIndex but as such there's no way to decide whether this should be decoded to a sparse array or a MultiIndexed dense array.

Following your comment in https://github.com/pydata/xarray/issues/3213#issuecomment-521533999

> Fortunately, there does seem to be a CF convention that would be a good fit for sparse data in COO format, namely the indexed ragged array representation (example, note the instance_dimension attribute). That's probably the right thing to use for sparse arrays in xarray.

How about using this "compression by gathering" idea for MultiIndexed dense arrays and "indexed ragged arrays" for sparse arrays? I do not know the internals of sparse or the details of the CF conventions well enough to have a strong opinion on which representation to prefer for sparse.COO arrays.

PS: CF convention for "indexed ragged arrays" is here: http://cfconventions.org/cf-conventions/cf-conventions.html#_indexed_ragged_array_representation

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  MultiIndex serialization to NetCDF 187069161
645142014 https://github.com/pydata/xarray/issues/1077#issuecomment-645142014 https://api.github.com/repos/pydata/xarray/issues/1077 MDEyOklzc3VlQ29tbWVudDY0NTE0MjAxNA== shoyer 1217238 2020-06-17T04:28:56Z 2020-06-17T04:28:56Z MEMBER

It still isn't clear to me why this is a better representation for a MultiIndex than a sparse array.

I guess it could work fine for either, but we would need to pick a convention.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  MultiIndex serialization to NetCDF 187069161
645139667 https://github.com/pydata/xarray/issues/1077#issuecomment-645139667 https://api.github.com/repos/pydata/xarray/issues/1077 MDEyOklzc3VlQ29tbWVudDY0NTEzOTY2Nw== fujiisoup 6815844 2020-06-17T04:21:40Z 2020-06-17T04:21:40Z MEMBER

@dcherian, now I understand. Your working examples really helped me grasp the idea. Thank you for the clarification.

I think the use of this convention is the best idea to save MultiIndex in netCDF. Maybe we can start implementing this?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  MultiIndex serialization to NetCDF 187069161
644803374 https://github.com/pydata/xarray/issues/1077#issuecomment-644803374 https://api.github.com/repos/pydata/xarray/issues/1077 MDEyOklzc3VlQ29tbWVudDY0NDgwMzM3NA== dcherian 2448579 2020-06-16T14:31:23Z 2020-06-16T14:31:23Z MEMBER

I may be missing something but @fujiisoup's concern is addressed by the scheme in the CF conventions.

> In your encoded, how can we tell the MultiIndex is [('a', 1), ('b', 1), ('a', 2), ('b', 2)] or [('a', 1), ('a', 2), ('b', 1), ('b', 2)]?

The information about ordering is stored as 1D indexes of an ND array; constructed using np.ravel_multi_index in the encode_multiindex function:

encoded[idxname] = np.ravel_multi_index(ds.indexes[idxname].codes, shape)

For example, see the dimension coordinate landpoint in the encoded form:

```
ds3
<xarray.Dataset>
Dimensions:    (landpoint: 4)
Coordinates:
  * landpoint  (landpoint) MultiIndex
  - lat        (landpoint) object 'a' 'b' 'b' 'a'
  - lon        (landpoint) int64 1 2 1 2
Data variables:
    landsoilt  (landpoint) float64 -0.2699 -1.228 0.4632 0.2287

encode_multiindex(ds3, "landpoint")
<xarray.Dataset>
Dimensions:    (landpoint: 4, lat: 2, lon: 2)
Coordinates:
  * lat        (lat) object 'a' 'b'
  * lon        (lon) int64 1 2
  * landpoint  (landpoint) int64 0 3 2 1
Data variables:
    landsoilt  (landpoint) float64 -0.2699 -1.228 0.4632 0.2287
```
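To make the ordering point concrete, here is a small NumPy-only sketch. The variable names (`levels`, `codes_1`, `codes_2`) are illustrative and not taken from xarray. It encodes the two orderings from the question quoted above and shows that they produce different flat indices, and that `np.unravel_index` recovers the per-level codes:

```python
import numpy as np

# Level values, in the same role as the `lat` and `lon` coordinates above.
levels = {"lat": np.array(["a", "b"]), "lon": np.array([1, 2])}
shape = tuple(len(values) for values in levels.values())  # (2, 2)

# Ordering 1: [('a', 1), ('b', 1), ('a', 2), ('b', 2)]
codes_1 = (np.array([0, 1, 0, 1]), np.array([0, 0, 1, 1]))
# Ordering 2: [('a', 1), ('a', 2), ('b', 1), ('b', 2)]
codes_2 = (np.array([0, 0, 1, 1]), np.array([0, 1, 0, 1]))

# Each (lat code, lon code) pair becomes one index into the flattened 2x2 grid.
flat_1 = np.ravel_multi_index(codes_1, shape)
flat_2 = np.ravel_multi_index(codes_2, shape)

# Decoding: flat index -> per-level codes -> level values.
lat_codes, lon_codes = np.unravel_index(flat_1, shape)
decoded_1 = list(
    zip(levels["lat"][lat_codes].tolist(), levels["lon"][lon_codes].tolist())
)
```

Because the flat index is stored per row, the row order of the original MultiIndex is part of the encoded data rather than something that has to be inferred from the level values.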

Here is a cleaned up version of the code for easy testing:

```python
import numpy as np
import pandas as pd
import xarray as xr


def encode_multiindex(ds, idxname):
    encoded = ds.reset_index(idxname)
    coords = dict(zip(ds.indexes[idxname].names, ds.indexes[idxname].levels))
    for coord in coords:
        encoded[coord] = coords[coord].values
    shape = [encoded.sizes[coord] for coord in coords]
    # flat index into the N-D grid spanned by the levels; this stores the row order
    encoded[idxname] = np.ravel_multi_index(ds.indexes[idxname].codes, shape)
    encoded[idxname].attrs["compress"] = " ".join(ds.indexes[idxname].names)
    return encoded


def decode_to_multiindex(encoded, idxname):
    names = encoded[idxname].attrs["compress"].split(" ")
    shape = [encoded.sizes[dim] for dim in names]
    indices = np.unravel_index(encoded[idxname].values, shape)
    arrays = [encoded[dim].values[index] for dim, index in zip(names, indices)]
    # restore the level names recorded in the "compress" attribute
    mindex = pd.MultiIndex.from_arrays(arrays, names=names)

    decoded = xr.Dataset({}, {idxname: mindex})
    for varname in encoded.data_vars:
        if idxname in encoded[varname].dims:
            decoded[varname] = (idxname, encoded[varname].values)
    return decoded


ds1 = xr.Dataset(
    {"landsoilt": ("landpoint", np.random.randn(4))},
    {
        "landpoint": pd.MultiIndex.from_product(
            [["a", "b"], [1, 2]], names=("lat", "lon")
        )
    },
)

ds2 = xr.Dataset(
    {"landsoilt": ("landpoint", np.random.randn(4))},
    {
        "landpoint": pd.MultiIndex.from_arrays(
            [["a", "b", "c", "d"], [1, 2, 4, 10]], names=("lat", "lon")
        )
    },
)

ds3 = xr.Dataset(
    {"landsoilt": ("landpoint", np.random.randn(4))},
    {
        "landpoint": pd.MultiIndex.from_arrays(
            [["a", "b", "b", "a"], [1, 2, 1, 2]], names=("lat", "lon")
        )
    },
)

idxname = "landpoint"
for dataset in [ds1, ds2, ds3]:
    xr.testing.assert_identical(
        decode_to_multiindex(encode_multiindex(dataset, idxname), idxname), dataset
    )
```

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  MultiIndex serialization to NetCDF 187069161
644451622 https://github.com/pydata/xarray/issues/1077#issuecomment-644451622 https://api.github.com/repos/pydata/xarray/issues/1077 MDEyOklzc3VlQ29tbWVudDY0NDQ1MTYyMg== shoyer 1217238 2020-06-16T00:00:40Z 2020-06-16T00:00:40Z MEMBER

I agree with @fujiisoup. I think this "compression-by-gathering" representation makes more sense for sparse arrays than for a MultiIndex, per se.

That said, MultiIndex and sparse arrays are basically two sides of the same idea. In the long term, it might make sense to try to support only one of the two.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  MultiIndex serialization to NetCDF 187069161
644447471 https://github.com/pydata/xarray/issues/1077#issuecomment-644447471 https://api.github.com/repos/pydata/xarray/issues/1077 MDEyOklzc3VlQ29tbWVudDY0NDQ0NzQ3MQ== fujiisoup 6815844 2020-06-15T23:45:27Z 2020-06-15T23:45:27Z MEMBER

@dcherian I think the problem is how to serialize MultiIndex objects rather than the array itself. In your encoded, how can we tell the MultiIndex is [('a', 1), ('b', 1), ('a', 2), ('b', 2)] or [('a', 1), ('a', 2), ('b', 1), ('b', 2)]? Maybe we need to store similar objects to landpoint for level variables, such as latpoint and lonpoint.

I think just using reset_index is simpler and easier to restore.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  MultiIndex serialization to NetCDF 187069161
644442679 https://github.com/pydata/xarray/issues/1077#issuecomment-644442679 https://api.github.com/repos/pydata/xarray/issues/1077 MDEyOklzc3VlQ29tbWVudDY0NDQ0MjY3OQ== dcherian 2448579 2020-06-15T23:29:11Z 2020-06-15T23:38:30Z MEMBER

This seems to be possible following http://cfconventions.org/Data/cf-conventions/cf-conventions-1.8/cf-conventions.html#compression-by-gathering

Here is a quick proof of concept:

```python
import numpy as np
import pandas as pd
import xarray as xr

# example 1
ds = xr.Dataset(
    {"landsoilt": ("landpoint", np.random.randn(4))},
    {
        "landpoint": pd.MultiIndex.from_product(
            [["a", "b"], [1, 2]], names=("lat", "lon")
        )
    },
)

# example 2
# ds = xr.Dataset(
#     {"landsoilt": ("landpoint", np.random.randn(4))},
#     {
#         "landpoint": pd.MultiIndex.from_arrays(
#             [["a", "b", "c", "d"], [1, 2, 4, 10]], names=("lat", "lon")
#         )
#     },
# )

# encode step
# detect using isinstance(index, pd.MultiIndex)
idxname = "landpoint"
encoded = ds.reset_index(idxname)
coords = dict(zip(ds.indexes[idxname].names, ds.indexes[idxname].levels))
for coord in coords:
    encoded[coord] = coords[coord].values
shape = [encoded.sizes[coord] for coord in coords]
encoded[idxname] = np.ravel_multi_index(ds.indexes[idxname].codes, shape)
encoded[idxname].attrs["compress"] = " ".join(ds.indexes[idxname].names)

# decode step
# detect using "compress" in var.attrs
idxname = "landpoint"
names = encoded[idxname].attrs["compress"].split(" ")
shape = [encoded.sizes[dim] for dim in names]
indices = np.unravel_index(encoded.landpoint.values, shape)
arrays = [encoded[dim].values[index] for dim, index in zip(names, indices)]
mindex = pd.MultiIndex.from_arrays(arrays)

decoded = xr.Dataset({}, {idxname: mindex})
decoded["landsoilt"] = (idxname, encoded["landsoilt"].values)

xr.testing.assert_identical(decoded, ds)
```

encoded can be serialized using our existing code:

    <xarray.Dataset>
    Dimensions:    (landpoint: 4, lat: 2, lon: 2)
    Coordinates:
      * lat        (lat) object 'a' 'b'
      * lon        (lon) int64 1 2
      * landpoint  (landpoint) int64 0 1 2 3
    Data variables:
        landsoilt  (landpoint) float64 -1.668 -1.003 1.084 1.963

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  MultiIndex serialization to NetCDF 187069161
478058340 https://github.com/pydata/xarray/issues/1077#issuecomment-478058340 https://api.github.com/repos/pydata/xarray/issues/1077 MDEyOklzc3VlQ29tbWVudDQ3ODA1ODM0MA== shoyer 1217238 2019-03-29T16:15:22Z 2019-03-29T16:15:22Z MEMBER

Once we finish https://github.com/pydata/xarray/issues/1603, that may change our perspective here a little bit (and could indirectly solve this problem).

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  MultiIndex serialization to NetCDF 187069161
478056621 https://github.com/pydata/xarray/issues/1077#issuecomment-478056621 https://api.github.com/repos/pydata/xarray/issues/1077 MDEyOklzc3VlQ29tbWVudDQ3ODA1NjYyMQ== kefirbandi 1277781 2019-03-29T16:10:24Z 2019-03-29T16:10:24Z CONTRIBUTOR

I now came across this issue, which still seems to be open. Are the statements made earlier still valid? Are there any concrete plans maybe to fix this in the near future?

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  MultiIndex serialization to NetCDF 187069161
436015893 https://github.com/pydata/xarray/issues/1077#issuecomment-436015893 https://api.github.com/repos/pydata/xarray/issues/1077 MDEyOklzc3VlQ29tbWVudDQzNjAxNTg5Mw== seth-p 7441788 2018-11-05T20:03:48Z 2018-11-05T20:03:48Z CONTRIBUTOR

This code isn't particularly pretty, and I'm not sure if it handles all cases, but it enables serialization of MultiIndex indices by calling ds.mi.encode_multiindices() before serializing and ds.mi.decode_multiindices() after deserializing.

```python
import pandas as pd
import xarray as xr


@xr.register_dataset_accessor('mi')
class MiscDatasetAccessor():
    def __init__(self, xarray_obj):
        self._obj = xarray_obj

    def encode_multiindices(self):
        result = self._obj
        for name, index in list(self._obj.indexes.items()):
            if isinstance(index, pd.MultiIndex):
                temp_name = '__' + name
                # one coordinate per level, holding that level's unique values
                new_coords = {'{}__{}'.format(temp_name, level_name): level_values.rename(None)
                              for level_name, level_values in zip(index.names, index.levels)}
                # 2-D array of integer codes, shape (n_levels, n_rows);
                # `labels` is the pre-0.24 pandas name for `codes`
                new_coords[temp_name] = xr.DataArray(index.labels,
                                                     dims=('{}__names__'.format(temp_name),
                                                           '{}__num__'.format(temp_name)),
                                                     coords={'{}__names__'.format(temp_name): index.names,
                                                             '{}__num__'.format(temp_name): list(range(len(index)))},
                                                     attrs={'__is_multiindex': 1})
                result = result.drop(name).assign_coords(**new_coords)
        return result

    def decode_multiindices(self):
        result = self._obj
        for temp_name, da in list(self._obj.coords.items()):
            if temp_name.startswith('__') and da.attrs.get('__is_multiindex', False):
                name = temp_name[2:]
                level_names = da.coords['{}__names__'.format(temp_name)].values
                levels = [result.coords['{}__{}'.format(temp_name, level_name)].values for level_name in level_names]
                labels = da.values
                result = result.assign_coords(**{name: pd.MultiIndex(levels=levels, labels=labels, names=level_names)})
                # remove the helper coordinates created by encode_multiindices
                result = result.drop(['{}__{}'.format(temp_name, level_name) for level_name in level_names] +
                                     list(da.dims) + [temp_name])
        return result
```

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  MultiIndex serialization to NetCDF 187069161
286176727 https://github.com/pydata/xarray/issues/1077#issuecomment-286176727 https://api.github.com/repos/pydata/xarray/issues/1077 MDEyOklzc3VlQ29tbWVudDI4NjE3NjcyNw== shoyer 1217238 2017-03-13T17:14:37Z 2017-03-13T17:14:37Z MEMBER

Let's recap the options, which I'll illustrate for the second level of my MultiIndex from above (https://github.com/pydata/xarray/issues/1077#issuecomment-258323743):

  1. "categories and codes": e.g., ['a', 'b'] and [0, 1, 0, 1, 0, 1]. Highest speed, low memory requirements, faithful round-trip to xarray/pandas, less obvious representation.
  2. "categories and values": e.g., ['a', 'b'] and ['a', 'b', 'a', 'b', 'a', 'b']. Moderate speed (need to recreate codes), high memory requirements, faithful round-trip to xarray/pandas, more obvious representation (categories can be safely ignored).
  3. "raw values": e.g., ['a', 'b', 'a', 'b', 'a', 'b']. Moderate speed (only slightly slower than 2), high memory requirements (slightly better than 2), does not support completely faithful roundtrip, most obvious representation.
  4. "category codes and values": e.g., [0, 1] and ['a', 'b', 'a', 'b', 'a', 'b']. Moderate speed, high memory requirements, also does not support faithful roundtrip (it's possible for some levels to not be represented in the MultiIndex values), more obvious representation (like 2).

3 uses only slightly less memory than 2 and can be easily achieved with reset_index(), so I don't see a reason to support it for writing (read support would be fine).

4 looks like a faithful roundtrip, but actually isn't in some rare edge cases. That seems like a recipe for disaster, so it should be ruled out.

This leaves 1 and 2. Both are reasonably performant and roundtrip xarray objects with complete fidelity, so I would be happy with either of them. In principle we could even support both, with an argument to switch between the modes (one would need to be the default).

My inclination is to start with only supporting 1, because it has a potentially large advantage from a speed/memory perspective, and it's easy to achieve the "raw values" representation with .reset_index() (and convert back with .set_index()). If we do this, the documentation for writing netCDF files should definitely include a suggestion to consider using .reset_index() when distributing files not intended strictly for use by xarray users.
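As a rough illustration of the representations compared in this recap, the following pandas sketch builds a three-by-two MultiIndex and extracts "categories and codes" (option 1) and "raw values" (option 3) for its second level. The level names `numbers` and `letters` are borrowed from elsewhere in this thread; the sketch only shows what pandas exposes and is not a proposal for the on-disk layout.

```python
import numpy as np
import pandas as pd

mindex = pd.MultiIndex.from_product(
    [[1, 2, 3], ["a", "b"]], names=("numbers", "letters")
)

# Option 1, "categories and codes", for the second level.
categories = mindex.levels[1].tolist()
codes = np.asarray(mindex.codes[1]).tolist()

# Option 3, "raw values", for the same level.
raw_values = mindex.get_level_values("letters").tolist()

# Levels plus codes are enough to rebuild the index exactly.
rebuilt = pd.MultiIndex(levels=mindex.levels, codes=mindex.codes, names=mindex.names)

# Edge case behind the "not completely faithful" remarks: a sliced MultiIndex
# keeps levels that no longer occur in its values, and raw values alone
# cannot record them.
subset = mindex[:2]
stored_levels = subset.levels[0].tolist()
used_values = subset.get_level_values("numbers").unique().tolist()
```

Here `stored_levels` still lists all three numbers while `used_values` contains only the first one, which is the information a raw-values file would lose.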

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  MultiIndex serialization to NetCDF 187069161
285972485 https://github.com/pydata/xarray/issues/1077#issuecomment-285972485 https://api.github.com/repos/pydata/xarray/issues/1077 MDEyOklzc3VlQ29tbWVudDI4NTk3MjQ4NQ== mullenkamp 2656596 2017-03-12T20:12:42Z 2017-03-12T20:12:42Z NONE

I would love to have this functionality as well. Unfortunately, I'm not knowledgeable enough to help decide on the internal structure for MultiIndexes though.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  MultiIndex serialization to NetCDF 187069161
261106183 https://github.com/pydata/xarray/issues/1077#issuecomment-261106183 https://api.github.com/repos/pydata/xarray/issues/1077 MDEyOklzc3VlQ29tbWVudDI2MTEwNjE4Mw== tippetts 17055041 2016-11-16T23:27:05Z 2016-11-16T23:27:05Z NONE

Yes, I suppose it doesn't really need to live in core xarray, unless you did want to allow a Dataset to contain other Datasets.

@benbovy , do you plan to put your DatasetNode code into some other package?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  MultiIndex serialization to NetCDF 187069161
260686932 https://github.com/pydata/xarray/issues/1077#issuecomment-260686932 https://api.github.com/repos/pydata/xarray/issues/1077 MDEyOklzc3VlQ29tbWVudDI2MDY4NjkzMg== shoyer 1217238 2016-11-15T16:16:47Z 2016-11-15T16:16:47Z MEMBER

DatasetNode feels a little too complex to me and disjoint from the rest of the package. I don't know when I would recommend using a DatasetNode to store data.

Also, as written I don't see any aspects that need to live in core xarray -- it seems that it can mostly be done with the external interface. So I would suggest the separate package.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  MultiIndex serialization to NetCDF 187069161
260645000 https://github.com/pydata/xarray/issues/1077#issuecomment-260645000 https://api.github.com/repos/pydata/xarray/issues/1077 MDEyOklzc3VlQ29tbWVudDI2MDY0NTAwMA== benbovy 4160723 2016-11-15T13:46:38Z 2016-11-15T14:33:08Z MEMBER

Yes, I'm actually not very happy with the .dataset attribute for accessing the underlying dataset object. On the other hand, similarly to h5py and netCDF4, I find it nice to have dict-like access to other nodes of the tree, e.g., dsnode['../othernode/childnode']. I guess this might co-exist with dict-like access to dataset variables if we ensure that there is no conflict between the names of the child nodes and the names of the dataset variables. Or maybe we can still access a child node that has the same name as a variable by writing dsnode['./name'] instead of dsnode['name']. Conflicts would remain for attribute-style access anyway...
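The path-style access described here could look roughly like the sketch below. It is a hypothetical, pure-Python `Node` class written only for illustration: it is not the DatasetNode code from the gist, and `payload` merely stands in for an xarray.Dataset.

```python
class Node:
    """Hypothetical tree node; `payload` stands in for an xarray.Dataset."""

    def __init__(self, name, payload=None, parent=None):
        self.name = name
        self.payload = payload
        self.parent = parent
        self.children = {}

    def add_child(self, name, payload=None):
        # Only adding children is supported, so the tree can be built but not rearranged.
        child = Node(name, payload, parent=self)
        self.children[name] = child
        return child

    def __getitem__(self, path):
        # Resolve a POSIX-like relative path: "." stays put, ".." moves to the parent.
        node = self
        for part in path.split("/"):
            if part in ("", "."):
                continue
            if part == "..":
                if node.parent is None:
                    raise KeyError(path)
                node = node.parent
            else:
                node = node.children[part]
        return node


root = Node("root")
first = root.add_child("first")
other = root.add_child("othernode")
child = other.add_child("childnode", payload={"x": 1})

same_child = first["../othernode/childnode"]
```

In this sketch every key is treated as a node path, so the name clash with dataset variables discussed above does not arise; a real class would have to decide how `dsnode['name']` is resolved when both a child node and a variable carry that name.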

@shoyer do you think that a PR for such a DatasetNode class has any chance of being merged at some point here?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  MultiIndex serialization to NetCDF 187069161
260167167 https://github.com/pydata/xarray/issues/1077#issuecomment-260167167 https://api.github.com/repos/pydata/xarray/issues/1077 MDEyOklzc3VlQ29tbWVudDI2MDE2NzE2Nw== tippetts 17055041 2016-11-13T04:56:43Z 2016-11-13T04:56:43Z NONE

Would it be too simplistic to think that xarray.Dataset (or a subclass of it) could be made to contain other Datasets? That would extend the conceptual map of xarray.Dataset <==> HDF5 group. The contained Datasets would probably also want to have a reference to their parent Dataset, for walking back up the tree.

I think that is similar to what you've done, @benbovy, but with inheritance rather than composition. I understand that is an often disfavored design pattern, but would it make sense in this case and keep the overall xarray interface simple?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  MultiIndex serialization to NetCDF 187069161
260162320 https://github.com/pydata/xarray/issues/1077#issuecomment-260162320 https://api.github.com/repos/pydata/xarray/issues/1077 MDEyOklzc3VlQ29tbWVudDI2MDE2MjMyMA== benbovy 4160723 2016-11-13T02:24:56Z 2016-11-13T02:24:56Z MEMBER

I've started writing a DatasetNode class (WIP): https://gist.github.com/benbovy/92e7c76220af1aaa4b3a0b65374e233a

Currently, this is a minimal class that just implements an "immutable" tree of datasets (it only allows adding child nodes so that we can build a tree).

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  MultiIndex serialization to NetCDF 187069161
260156237 https://github.com/pydata/xarray/issues/1077#issuecomment-260156237 https://api.github.com/repos/pydata/xarray/issues/1077 MDEyOklzc3VlQ29tbWVudDI2MDE1NjIzNw== shoyer 1217238 2016-11-12T23:44:03Z 2016-11-12T23:44:03Z MEMBER

Maybe? A minimal class for managing groups in an open file could potentially have synergy with our backends system. Something more than that is probably out of scope.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  MultiIndex serialization to NetCDF 187069161
260148070 https://github.com/pydata/xarray/issues/1077#issuecomment-260148070 https://api.github.com/repos/pydata/xarray/issues/1077 MDEyOklzc3VlQ29tbWVudDI2MDE0ODA3MA== tippetts 17055041 2016-11-12T21:00:31Z 2016-11-12T21:00:31Z NONE

Here's a new, related question: @shoyer , do you have any interest in adding a class to xarray that contains a hierarchical tree of Datasets, analogous to the groups in a netCDF or HDF5 file? Then opening or saving such an object would be an easy but powerful one-liner.

Or is that something you would rather leave to someone else's module?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  MultiIndex serialization to NetCDF 187069161
259412395 https://github.com/pydata/xarray/issues/1077#issuecomment-259412395 https://api.github.com/repos/pydata/xarray/issues/1077 MDEyOklzc3VlQ29tbWVudDI1OTQxMjM5NQ== benbovy 4160723 2016-11-09T13:18:09Z 2016-11-09T14:31:50Z MEMBER

> unless we want options for controlling how the MultiIndex is stored.

Yes that's what I mean, something like categories_codes, raw_values and/or hybrid options, though I don't know if using encoding is appropriate here.

Trying to summarize the potential use cases mentioned above:

1. If we're sure that we'll only use xarray (current or newer version) to load back the files, then the categories_codes option is the way to go.
2. If we want to write files that are portable across many other tools than just xarray, then we could use reset_index to manually switch the multi-index back into separate coordinates before writing the file.
3. If we want both 1 and 2, then it would be convenient to have something in xarray that automatically resets / refactorizes the multi-index at writing / loading (this would be the hybrid option).

Note that point 3 is just for more convenience, I wouldn't mind too much having to manually reset / refactorize the multi-index in that case. We indeed don't need options if point 3 is not important.
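The manual route from point 2 can be sketched with xarray's existing `set_index` and `reset_index` methods. The dataset below is made up for illustration (a dimension `sample` with level coordinates `numbers` and `letters`), and no file is written; the point is only that the flattened form contains plain 1D coordinates and that the MultiIndex can be rebuilt from them afterwards.

```python
import numpy as np
import xarray as xr

# Plain 1D coordinates along a shared dimension: readable by any netCDF tool.
flat = xr.Dataset(
    {"data": ("sample", np.arange(6.0))},
    coords={
        "numbers": ("sample", [1, 1, 2, 2, 3, 3]),
        "letters": ("sample", ["a", "b", "a", "b", "a", "b"]),
    },
)

# Combine the two coordinates into a MultiIndex on `sample`.
indexed = flat.set_index(sample=["numbers", "letters"])

# Before writing a portable file: turn the MultiIndex back into plain coordinates.
portable = indexed.reset_index("sample")

# After reading it back: rebuild the MultiIndex by hand.
restored = portable.set_index(sample=["numbers", "letters"])
```

The last step requires the reader to know which coordinates formed the MultiIndex and in which order, which is the gap the hybrid option in point 3 would close.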

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  MultiIndex serialization to NetCDF 187069161
258563550 https://github.com/pydata/xarray/issues/1077#issuecomment-258563550 https://api.github.com/repos/pydata/xarray/issues/1077 MDEyOklzc3VlQ29tbWVudDI1ODU2MzU1MA== shoyer 1217238 2016-11-04T22:31:17Z 2016-11-04T22:31:17Z MEMBER

encodings is only in xarray's data model. Everything there gets converted into some detail of how the data is stored in a netcdf file. So I don't think we need to use it here, unless we want options for controlling how the MultiIndex is stored.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  MultiIndex serialization to NetCDF 187069161
258560862 https://github.com/pydata/xarray/issues/1077#issuecomment-258560862 https://api.github.com/repos/pydata/xarray/issues/1077 MDEyOklzc3VlQ29tbWVudDI1ODU2MDg2Mg== tippetts 17055041 2016-11-04T22:14:05Z 2016-11-04T22:14:05Z NONE

So if I'm properly understanding and synthesizing your ( @benbovy and @shoyer ) comments: We want the hybrid format for maximum compatibility, with the MultiIndex split into separate 1D raw value coordinates. Using the example above, these would be [1, 1, 2, 2, 3, 3] and ['a', 'b', 'a', 'b', 'a', 'b']. The information about which coordinates are in a MultiIndex (and their order) gets saved in an attribute on the data in the file, like data.attrs['multiindex_levels'] = 'numbers letters'. So 3rd-party tools (or older xarray) will have the non-MultiIndex coords to use, but newer xarray will see the 'multiindex_levels' and automatically reconstruct the MultiIndex when the file is read.

@shoyer , I see what you mean about Variable or future DataArrays not needing a placeholder index. Would that still be backwards-compatible with older xarrays if a saved DataArray has one dim that is a MultiIndex and other dims that are not?

@benbovy , what does the encoding attribute do? It seems to me that, for a DataArray that's already created or loaded, xarray knows about its MultiIndexes and could do the right thing while writing to the backend file without being told to. Are you referring to the metadata in the file (like 'multiindex_levels') that ensures proper interpretation and automatic reconstruction when reading?
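A minimal sketch of the hybrid idea summarized in this comment follows, assuming the attribute name `multiindex_levels` suggested above. The helper functions `to_hybrid` and `from_hybrid` are hypothetical names introduced here for illustration and do not exist in xarray.

```python
import numpy as np
import xarray as xr

ATTR = "multiindex_levels"  # attribute name suggested in the comment above


def to_hybrid(ds, dim):
    """Flatten the MultiIndex on `dim` and record its level names (hypothetical helper)."""
    level_names = list(ds.indexes[dim].names)
    flat = ds.reset_index(dim)
    for name in flat.data_vars:
        if dim in flat[name].dims:
            flat[name].attrs[ATTR] = " ".join(level_names)
    return flat


def from_hybrid(flat, dim):
    """Rebuild the MultiIndex on `dim` if a variable carries the hint (hypothetical helper)."""
    for name in flat.data_vars:
        hint = flat[name].attrs.get(ATTR)
        if hint is not None and dim in flat[name].dims:
            return flat.set_index({dim: hint.split(" ")})
    return flat  # no hint found: leave the plain coordinates alone


indexed = xr.Dataset(
    {"data": ("sample", np.arange(6.0))},
    coords={
        "numbers": ("sample", [1, 1, 2, 2, 3, 3]),
        "letters": ("sample", ["a", "b", "a", "b", "a", "b"]),
    },
).set_index(sample=["numbers", "letters"])

hybrid = to_hybrid(indexed, "sample")
restored = from_hybrid(hybrid, "sample")
```

A tool that does not know about the attribute would simply see two ordinary coordinates, while a reader that does know it can restore the MultiIndex without being told the level names.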

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  MultiIndex serialization to NetCDF 187069161
258476068 https://github.com/pydata/xarray/issues/1077#issuecomment-258476068 https://api.github.com/repos/pydata/xarray/issues/1077 MDEyOklzc3VlQ29tbWVudDI1ODQ3NjA2OA== benbovy 4160723 2016-11-04T16:14:59Z 2016-11-04T16:14:59Z MEMBER

I have exactly the same applications as yours @tippetts, but I would also like to write netCDF files that are compatible with tools other than xarray. With the category encoded values as the default behavior, my concern is that xarray users may be unaware that they generate netCDF files which have limited compatibility with 3rd-party tools, unless a clear warning is given in the documentation.

> One consideration in favor of this is that it will soon be very easy to switch a MultiIndex back into separate coordinate variables, which could be our recommendation for how to save netCDF files for maximum portability.

This should be fine, but maybe it would be nice to allow handling this automatically (at read and write) by using a specific encoding attribute? I haven't looked much into xarray's IO and serialization logic, so I don't know if it is the right approach. This would be convenient for loading back the generated netCDF files with both xarray and 3rd-party tools, though.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  MultiIndex serialization to NetCDF 187069161
258460719 https://github.com/pydata/xarray/issues/1077#issuecomment-258460719 https://api.github.com/repos/pydata/xarray/issues/1077 MDEyOklzc3VlQ29tbWVudDI1ODQ2MDcxOQ== shoyer 1217238 2016-11-04T15:22:12Z 2016-11-04T15:22:12Z MEMBER

> Personally I'd vote for the category encoded values. If I make files with a newer xarray, I'll be reading them later with the same (or newer) xarray and I'd definitely want the exact MultiIndex back.

Point taken -- let's see what others think!

One consideration in favor of this is that it will soon be very easy to switch a MultiIndex back into separate coordinate variables, which could be our recommendation for how to save netCDF files for maximum portability.

> The one thing I'm wondering is, what happens in an application like this if you select on one index (say, all data rows with region_name='FOOBAR-1') from the HDF5 file before doing anything else? Would it be hard to make the MultiIndex/NetCDF reader smart enough not to reconstruct the whole MultiIndex before picking out the relevant rows?

We could do this, but note that we are contemplating switching xarray to always load indexes into memory eagerly, which would negate that advantage. See this PR and mailing list discussion: https://github.com/pydata/xarray/pull/1024#issuecomment-256114879 https://groups.google.com/forum/#!topic/xarray/dK2RHUls1nQ

> Nuts and bolts questions: So each of index.levels would be easy to store as its own little DataArray, yeah? Then would each of the index.labels be in its own DataArray, or would you want them all in the same 2D DataArray?

pandas stores levels separately, automatically putting each of them in the smallest possible dtype (int8, int16, int32 or int64). So we also probably want to store them in separate 1D variables.
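For reference, a small sketch that inspects the per-level integer arrays and their dtypes. It assumes a recent pandas, where the attribute is named `codes` (older releases, as in this thread, call it `labels`):

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[1, 2, 3], ["a", "b"]], names=["numbers", "letters"])

# One integer array per level; each would map naturally to its own 1D variable.
for name, level, codes in zip(index.names, index.levels, index.codes):
    codes = np.asarray(codes)
    print(name, list(level), codes.dtype, codes.tolist())
```

With only a handful of distinct values per level, the printed dtypes should be a small integer type rather than int64.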

> And then would the actual data in the original DataArray just have a generic integer index as a placeholder, to be replaced by the MultiIndex?

Just a note: for interacting with backends, we use Variable objects instead of DataArrays: http://xarray.pydata.org/en/stable/internals.html#variable-objects

This means that we don't need the generic integer placeholder index (which will also be going away shortly in general, see https://github.com/pydata/xarray/pull/1017).

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  MultiIndex serialization to NetCDF 187069161
258351232 https://github.com/pydata/xarray/issues/1077#issuecomment-258351232 https://api.github.com/repos/pydata/xarray/issues/1077 MDEyOklzc3VlQ29tbWVudDI1ODM1MTIzMg== tippetts 17055041 2016-11-04T05:59:37Z 2016-11-04T05:59:37Z NONE

Personally I'd vote for the category encoded values. If I make files with a newer xarray, I'll be reading them later with the same (or newer) xarray and I'd definitely want the exact MultiIndex back.

I don't want to be too self-centered in my perspective in all of this. But my applications are definitely in the large-scale scientific computing area that seems to be the community norm for xarray, so I would guess many others would have a similar situation.

I generate data that are associated with nodes or elements in a mesh. The mesh is naturally split into named regions. Sometimes I need to operate on the entire dataset (including all regions) and sometimes I want to select one or more regions. So I make a MultiIndex where the first index is the region name strings, and the second index is the node (or element) number inside the region (i.e. starts over counting from 1 for each region).

So the full index is 1e5 to 1e7 long, of which there are only maybe a few hundred unique values in the string column. I would think that would greatly benefit from the category-encoded storage. And fast and reliable reconstruction of the MultiIndex is a big plus. Does this seem like a common user scenario?
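A rough sketch of that size argument, using NumPy only and a scaled-down index. The region names and counts are made up for illustration, and in-memory NumPy string sizes are only a proxy for what a netCDF file would actually use:

```python
import numpy as np

n_regions = 200
nodes_per_region = 500  # 100,000 rows here; the real case would be larger

region_names = np.array([f"REGION-{i:04d}" for i in range(n_regions)])
raw = np.repeat(region_names, nodes_per_region)  # one string per row

# Factorize: unique level values plus one small integer code per row.
levels, codes = np.unique(raw, return_inverse=True)
codes = codes.astype(np.min_scalar_type(len(levels) - 1))

raw_bytes = raw.nbytes
encoded_bytes = levels.nbytes + codes.nbytes
print(f"raw: {raw_bytes} bytes, category encoded: {encoded_bytes} bytes")
```

The saving comes almost entirely from replacing a fixed-width string per row with a one-byte code, so it grows with the string length and the number of rows.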

The one thing I'm wondering is, what happens in an application like this if you select on one index (say, all data rows with region_name='FOOBAR-1') from the HDF5 file before doing anything else? Would it be hard to make the MultiIndex/NetCDF reader smart enough not to reconstruct the whole MultiIndex before picking out the relevant rows? And, related question for us to think about, how would we make this all play nicely with dask?
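To make the selection question concrete, here is a sketch of picking rows for one region from category-encoded arrays without building a MultiIndex at all. The arrays are small in-memory stand-ins for what a file might hold, and all the names are hypothetical:

```python
import numpy as np

# Stand-ins for stored variables: level values and one code per row.
region_levels = np.array(["FOOBAR-0", "FOOBAR-1", "FOOBAR-2"])
region_codes = np.array([0, 0, 1, 1, 1, 2], dtype=np.int8)
node_numbers = np.array([1, 2, 1, 2, 3, 1])
data = np.array([10.0, 11.0, 20.0, 21.0, 22.0, 30.0])

# Look up the integer code for the wanted region, then mask rows by code.
(matches,) = np.nonzero(region_levels == "FOOBAR-1")
wanted_code = matches[0]
mask = region_codes == wanted_code

selected_nodes = node_numbers[mask]
selected_data = data[mask]
```

Only the region level and its codes are touched to compute the mask; whether a real reader could do this lazily depends on how the backend loads indexes.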

Sorry for the long post. I've been very impressed and happy working with xarray, and I'm just eager to get the last bit of features I need so I can really start pushing my colleagues into using it. :)

Nuts and bolts questions: So each of index.levels would be easy to store as its own little DataArray, yeah? Then would each of the index.labels be in its own DataArray, or would you want them all in the same 2D DataArray? And then would the actual data in the original DataArray just have a generic integer index as a placeholder, to be replaced by the MultiIndex?

For these dummy DataArrays and the multiindex_levels metadata attr, how do you feel about using a single leading underscore in the name? If I were to low-level grunge around in the file for some reason, that would indicate to me that they are private-by-convention implementation details.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  MultiIndex serialization to NetCDF 187069161
258323743 https://github.com/pydata/xarray/issues/1077#issuecomment-258323743 https://api.github.com/repos/pydata/xarray/issues/1077 MDEyOklzc3VlQ29tbWVudDI1ODMyMzc0Mw== shoyer 1217238 2016-11-04T01:38:41Z 2016-11-04T01:38:56Z MEMBER

This is a good question -- I don't think we've figured it out yet. Maybe you have ideas?

The main question (to me) is whether we should store raw values for each level in a MultiIndex (closer to what you see), or category encoded values (closer to the MultiIndex implementation).

To be more concrete, here is what these look like for an example MultiIndex:

```
In [1]: index = pd.MultiIndex.from_product([[1, 2, 3], ['a', 'b']], names=['numbers', 'letters'])

In [2]: index
Out[2]:
MultiIndex(levels=[[1, 2, 3], ['a', 'b']],
           labels=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]],
           names=['numbers', 'letters'])

In [3]: index.values
Out[3]: array([(1, 'a'), (1, 'b'), (2, 'a'), (2, 'b'), (3, 'a'), (3, 'b')], dtype=object)

# categorical encoded values
In [4]: index.levels, index.labels
Out[4]:
(FrozenList([[1, 2, 3], ['a', 'b']]),
 FrozenList([[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]]))

# raw values
In [5]: index.get_level_values(0), index.get_level_values(1)
Out[5]:
(Int64Index([1, 1, 2, 2, 3, 3], dtype='int64', name='numbers'),
 Index(['a', 'b', 'a', 'b', 'a', 'b'], dtype='object', name='letters'))
```

Advantages of storing raw values:

- It's easier to work with MultiIndex levels without xarray, or with older versions of xarray (no need to combine levels and labels).
- Avoiding the overhead of saving integer codes can save memory if levels have dtypes with small fixed sizes (e.g., float, int or datetime) or mostly distinct values.

Advantages of storing category encoded values:

- It's cheaper to construct the MultiIndex, because we have already factorized each level.
- It can result in significant memory savings if levels are mostly duplicated (e.g., a tensor product) or have large itemsize (e.g., long strings).
- We can restore the exact same MultiIndex, instead of refactorizing it. This manifests itself in a few edge cases that could make for a frustrating user experience (changed dimension order after stacking: https://github.com/pydata/xarray/issues/980).

Perhaps the best approach would be a hybrid: store raw values, as well as de-duplicated integer codes specifying the order of values in each MultiIndex level. This will be a little slower than just storing the raw values, but has the correctness guarantee provided by storing category encoded values.
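A sketch of why the extra ordering information matters, using pandas only. It assumes a recent pandas (where `labels` is spelled `codes`), and the `level_order` dict is a hypothetical stand-in for whatever would be stored alongside the raw values:

```python
import pandas as pd

# A MultiIndex whose first level is deliberately not in sorted order.
original = pd.MultiIndex(
    levels=[["b", "a"], [1, 2]],
    codes=[[0, 0, 1, 1], [0, 1, 0, 1]],
    names=["letters", "numbers"],
)
names = list(original.names)

# Raw values only: rebuilding refactorizes, so the level order is lost.
raw = {name: original.get_level_values(name) for name in names}
refactorized = pd.MultiIndex.from_arrays([raw[n] for n in names], names=names)

# Hybrid: also keep each level's de-duplicated values in their original order,
# then recompute integer codes against that stored order.
level_order = {name: list(level) for name, level in zip(names, original.levels)}
codes = [pd.Index(level_order[n]).get_indexer(raw[n]) for n in names]
restored = pd.MultiIndex(
    levels=[level_order[n] for n in names],
    codes=codes,
    names=names,
)

print(list(refactorized.levels[0]), list(restored.levels[0]))
```

Both rebuilt indexes hold the same row values as the original; only the hybrid one also keeps the original level ordering, at the cost of a lookup per level on read.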

Either way, we will need to store an attribute or two with metadata for how to restore the levels (e.g., 'multiindex_levels: numbers letters').

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  MultiIndex serialization to NetCDF 187069161

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);