issue_comments
20 rows where author_association = "MEMBER" and issue = 187069161 sorted by updated_at descending
id | html_url | issue_url | node_id | user | created_at | updated_at ▲ | author_association | body | reactions | performed_via_github_app | issue |
---|---|---|---|---|---|---|---|---|---|---|---|
1270514913 | https://github.com/pydata/xarray/issues/1077#issuecomment-1270514913 | https://api.github.com/repos/pydata/xarray/issues/1077 | IC_kwDOAMm_X85LuoTh | dcherian 2448579 | 2022-10-06T18:31:51Z | 2022-10-06T18:31:51Z | MEMBER | Thanks @lucianopaz. I fixed some errors when I added it to cf-xarray. It would be good to see if that version works for you. |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
MultiIndex serialization to NetCDF 187069161 | |
1101505074 | https://github.com/pydata/xarray/issues/1077#issuecomment-1101505074 | https://api.github.com/repos/pydata/xarray/issues/1077 | IC_kwDOAMm_X85Bp6Iy | dcherian 2448579 | 2022-04-18T15:36:19Z | 2022-04-18T15:36:19Z | MEMBER | I added the "compression by gathering" scheme to cf-xarray. 1. https://cf-xarray.readthedocs.io/en/latest/generated/cf_xarray.encode_multi_index_as_compress.html 1. https://cf-xarray.readthedocs.io/en/latest/generated/cf_xarray.decode_compress_to_multi_index.html |
{ "total_count": 2, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 2, "rocket": 0, "eyes": 0 } |
MultiIndex serialization to NetCDF 187069161 | |
645416425 | https://github.com/pydata/xarray/issues/1077#issuecomment-645416425 | https://api.github.com/repos/pydata/xarray/issues/1077 | MDEyOklzc3VlQ29tbWVudDY0NTQxNjQyNQ== | dcherian 2448579 | 2020-06-17T14:40:19Z | 2020-06-17T14:40:19Z | MEMBER | @shoyer I now understand your earlier comment. I agree that it should work with both sparse and MultiIndex but as such there's no way to decide whether this should be decoded to a sparse array or a MultiIndexed dense array. Following your comment in https://github.com/pydata/xarray/issues/3213#issuecomment-521533999
How about using this "compression by gathering" idea for MultiIndexed dense arrays and "indexed ragged arrays" for sparse arrays? I do not know the internals of PS: CF convention for "indexed ragged arrays" is here: http://cfconventions.org/cf-conventions/cf-conventions.html#_indexed_ragged_array_representation |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
MultiIndex serialization to NetCDF 187069161 | |
645142014 | https://github.com/pydata/xarray/issues/1077#issuecomment-645142014 | https://api.github.com/repos/pydata/xarray/issues/1077 | MDEyOklzc3VlQ29tbWVudDY0NTE0MjAxNA== | shoyer 1217238 | 2020-06-17T04:28:56Z | 2020-06-17T04:28:56Z | MEMBER | It still isn't clear to me why this is a better representation for a MultiIndex than a sparse array. I guess it could work fine for either, but we would need to pick a convention. |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
MultiIndex serialization to NetCDF 187069161 | |
645139667 | https://github.com/pydata/xarray/issues/1077#issuecomment-645139667 | https://api.github.com/repos/pydata/xarray/issues/1077 | MDEyOklzc3VlQ29tbWVudDY0NTEzOTY2Nw== | fujiisoup 6815844 | 2020-06-17T04:21:40Z | 2020-06-17T04:21:40Z | MEMBER | @dcherian, now I understand. Your working examples really helped me grasp the idea. Thank you for the clarification. I think using this convention is the best way to save a MultiIndex in netCDF. Maybe we can start implementing this? |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
MultiIndex serialization to NetCDF 187069161 | |
644803374 | https://github.com/pydata/xarray/issues/1077#issuecomment-644803374 | https://api.github.com/repos/pydata/xarray/issues/1077 | MDEyOklzc3VlQ29tbWVudDY0NDgwMzM3NA== | dcherian 2448579 | 2020-06-16T14:31:23Z | 2020-06-16T14:31:23Z | MEMBER | I may be missing something but @fujiisoup's concern is addressed by the scheme in the CF conventions.
The information about ordering is stored as 1D indexes of an ND array; constructed using
For example, see the dimension coordinate
Here is a cleaned up version of the code for easy testing
``` python
import numpy as np
import pandas as pd
import xarray as xr


def encode_multiindex(ds, idxname):
    # Drop the MultiIndex so its levels can be stored as ordinary coordinates.
    encoded = ds.reset_index(idxname)
    coords = dict(zip(ds.indexes[idxname].names, ds.indexes[idxname].levels))
    for coord in coords:
        encoded[coord] = coords[coord].values
    # Flatten the per-level integer codes into one index per point.
    shape = [encoded.sizes[coord] for coord in coords]
    encoded[idxname] = np.ravel_multi_index(ds.indexes[idxname].codes, shape)
    # CF "compression by gathering": name the compressed dimensions.
    encoded[idxname].attrs["compress"] = " ".join(ds.indexes[idxname].names)
    return encoded


def decode_to_multiindex(encoded, idxname):
    names = encoded[idxname].attrs["compress"].split(" ")
    shape = [encoded.sizes[dim] for dim in names]
    # Use idxname here rather than the hard-coded `encoded.landpoint`.
    indices = np.unravel_index(encoded[idxname].values, shape)
    arrays = [encoded[dim].values[index] for dim, index in zip(names, indices)]
    # The function is cut off after the from_arrays call in this export.
    # `names=names` and the lines below are an assumed completion, modeled on
    # the decode step of the earlier proof of concept.
    mindex = pd.MultiIndex.from_arrays(arrays, names=names)
    decoded = xr.Dataset({}, {idxname: mindex})
    for varname in encoded.data_vars:
        decoded[varname] = (encoded[varname].dims, encoded[varname].values)
    return decoded


ds1 = xr.Dataset(
    {"landsoilt": ("landpoint", np.random.randn(4))},
    {
        "landpoint": pd.MultiIndex.from_product(
            [["a", "b"], [1, 2]], names=("lat", "lon")
        )
    },
)

ds2 = xr.Dataset(
    {"landsoilt": ("landpoint", np.random.randn(4))},
    {
        "landpoint": pd.MultiIndex.from_arrays(
            [["a", "b", "c", "d"], [1, 2, 4, 10]], names=("lat", "lon")
        )
    },
)

ds3 = xr.Dataset(
    {"landsoilt": ("landpoint", np.random.randn(4))},
    {
        "landpoint": pd.MultiIndex.from_arrays(
            [["a", "b", "b", "a"], [1, 2, 1, 2]], names=("lat", "lon")
        )
    },
)

idxname = "landpoint"
for dataset in [ds1, ds2, ds3]:
    xr.testing.assert_identical(
        decode_to_multiindex(encode_multiindex(dataset, idxname), idxname), dataset
    )
``` |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
MultiIndex serialization to NetCDF 187069161 | |
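The index arithmetic behind the "compress" variable can be sketched without any libraries. This is only an illustration, assuming row-major (C) ordering as used by default in `np.ravel_multi_index` and `np.unravel_index`; the helper names `ravel_codes` and `unravel_indices` are hypothetical and not part of NumPy or xarray. The sample codes correspond to `ds3` above.

```python
def ravel_codes(codes, shape):
    """Flatten per-level integer codes into one row-major index per point."""
    flat = []
    for point in zip(*codes):
        index = 0
        for code, size in zip(point, shape):
            if not 0 <= code < size:
                raise ValueError(f"code {code} out of range for size {size}")
            index = index * size + code
        flat.append(index)
    return flat


def unravel_indices(flat, shape):
    """Invert ravel_codes: recover one list of codes per level."""
    codes = [[] for _ in shape]
    for index in flat:
        point = []
        # Peel off the fastest-varying (last) level first.
        for size in reversed(shape):
            index, code = divmod(index, size)
            point.append(code)
        for level, code in enumerate(reversed(point)):
            codes[level].append(code)
    return codes


# Levels and codes of ds3: the points (a, 1), (b, 2), (b, 1), (a, 2).
levels = [["a", "b"], [1, 2]]          # lat, lon
codes = [[0, 1, 1, 0], [0, 1, 0, 1]]
shape = [len(level) for level in levels]

landpoint = ravel_codes(codes, shape)          # the stored 1D index
restored = unravel_indices(landpoint, shape)   # what decoding recovers
labels = [
    tuple(level[code] for level, code in zip(levels, point))
    for point in zip(*restored)
]
print(landpoint)
print(labels)
```

Because the flat index encodes position in the full `lat` x `lon` grid, the order of the points along `landpoint` survives the round trip, which is the ordering information discussed in this comment.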
644451622 | https://github.com/pydata/xarray/issues/1077#issuecomment-644451622 | https://api.github.com/repos/pydata/xarray/issues/1077 | MDEyOklzc3VlQ29tbWVudDY0NDQ1MTYyMg== | shoyer 1217238 | 2020-06-16T00:00:40Z | 2020-06-16T00:00:40Z | MEMBER | I agree with @fujiisoup. I think this "compression-by-gathering" representation makes more sense for sparse arrays than for a MultiIndex, per se. That said, MultiIndex and sparse arrays are basically two sides of the same idea. In the long term, it might make sense to try to support only one of the two. |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
MultiIndex serialization to NetCDF 187069161 | |
644447471 | https://github.com/pydata/xarray/issues/1077#issuecomment-644447471 | https://api.github.com/repos/pydata/xarray/issues/1077 | MDEyOklzc3VlQ29tbWVudDY0NDQ0NzQ3MQ== | fujiisoup 6815844 | 2020-06-15T23:45:27Z | 2020-06-15T23:45:27Z | MEMBER | @dcherian
I think the problem is how to serialize I think just using |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
MultiIndex serialization to NetCDF 187069161 | |
644442679 | https://github.com/pydata/xarray/issues/1077#issuecomment-644442679 | https://api.github.com/repos/pydata/xarray/issues/1077 | MDEyOklzc3VlQ29tbWVudDY0NDQ0MjY3OQ== | dcherian 2448579 | 2020-06-15T23:29:11Z | 2020-06-15T23:38:30Z | MEMBER | This seems to be possible following http://cfconventions.org/Data/cf-conventions/cf-conventions-1.8/cf-conventions.html#compression-by-gathering Here is a quick proof of concept:
``` python
import numpy as np
import pandas as pd
import xarray as xr

# example 1
ds = xr.Dataset(
    {"landsoilt": ("landpoint", np.random.randn(4))},
    {
        "landpoint": pd.MultiIndex.from_product(
            [["a", "b"], [1, 2]], names=("lat", "lon")
        )
    },
)

# example 2 (overwrites example 1; comment one of them out to test the other)
ds = xr.Dataset(
    {"landsoilt": ("landpoint", np.random.randn(4))},
    {
        "landpoint": pd.MultiIndex.from_arrays(
            [["a", "b", "c", "d"], [1, 2, 4, 10]], names=("lat", "lon")
        )
    },
)

# encode step
# detect using isinstance(index, pd.MultiIndex)
idxname = "landpoint"
encoded = ds.reset_index(idxname)
coords = dict(zip(ds.indexes[idxname].names, ds.indexes[idxname].levels))
for coord in coords:
    encoded[coord] = coords[coord].values
shape = [encoded.sizes[coord] for coord in coords]
encoded[idxname] = np.ravel_multi_index(ds.indexes[idxname].codes, shape)
encoded[idxname].attrs["compress"] = " ".join(ds.indexes[idxname].names)

# decode step
# detect using "compress" in var.attrs
idxname = "landpoint"
# The construction of `mindex` is missing from this export. The five lines
# below are an assumption, adapted from decode_to_multiindex in the
# cleaned-up version of this code posted in the same thread.
names = encoded[idxname].attrs["compress"].split(" ")
shape = [encoded.sizes[dim] for dim in names]
indices = np.unravel_index(encoded[idxname].values, shape)
arrays = [encoded[dim].values[index] for dim, index in zip(names, indices)]
mindex = pd.MultiIndex.from_arrays(arrays, names=names)

decoded = xr.Dataset({}, {idxname: mindex})
decoded["landsoilt"] = (idxname, encoded["landsoilt"].values)

xr.testing.assert_identical(decoded, ds)
```
|
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
MultiIndex serialization to NetCDF 187069161 | |
478058340 | https://github.com/pydata/xarray/issues/1077#issuecomment-478058340 | https://api.github.com/repos/pydata/xarray/issues/1077 | MDEyOklzc3VlQ29tbWVudDQ3ODA1ODM0MA== | shoyer 1217238 | 2019-03-29T16:15:22Z | 2019-03-29T16:15:22Z | MEMBER | Once we finish https://github.com/pydata/xarray/issues/1603, that may change our perspective here a little bit (and could indirectly solve this problem). |
{ "total_count": 1, "+1": 1, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
MultiIndex serialization to NetCDF 187069161 | |
286176727 | https://github.com/pydata/xarray/issues/1077#issuecomment-286176727 | https://api.github.com/repos/pydata/xarray/issues/1077 | MDEyOklzc3VlQ29tbWVudDI4NjE3NjcyNw== | shoyer 1217238 | 2017-03-13T17:14:37Z | 2017-03-13T17:14:37Z | MEMBER | Let's recap the options, which I'll illustrate for the second level of my MultiIndex from above (https://github.com/pydata/xarray/issues/1077#issuecomment-258323743):
3 uses only slightly less memory than 2 and can be easily achieved with 4 looks like a faithful roundtrip, but actually isn't in some rare edge cases. That seems like a recipe for disaster, so it should be ruled out. This leaves 1 and 2. Both are reasonably performant and roundtrip xarray objects with complete fidelity, so I would be happy with either of them. In principle we could even support both, with an argument to switch between the modes (one would need to be the default). My inclination is to start with only supporting 1, because it has a potentially large advantage from a speed/memory perspective, and it's easy to achieve the "raw values" representation with |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
MultiIndex serialization to NetCDF 187069161 | |
260686932 | https://github.com/pydata/xarray/issues/1077#issuecomment-260686932 | https://api.github.com/repos/pydata/xarray/issues/1077 | MDEyOklzc3VlQ29tbWVudDI2MDY4NjkzMg== | shoyer 1217238 | 2016-11-15T16:16:47Z | 2016-11-15T16:16:47Z | MEMBER |
Also, as written I don't see any aspects that need to live in core xarray -- it seems that it can mostly be done with the external interface. So I would suggest the separate package. |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
MultiIndex serialization to NetCDF 187069161 | |
260645000 | https://github.com/pydata/xarray/issues/1077#issuecomment-260645000 | https://api.github.com/repos/pydata/xarray/issues/1077 | MDEyOklzc3VlQ29tbWVudDI2MDY0NTAwMA== | benbovy 4160723 | 2016-11-15T13:46:38Z | 2016-11-15T14:33:08Z | MEMBER | Yes I'm actually not very happy with the @shoyer do you think that a PR for such a |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
MultiIndex serialization to NetCDF 187069161 | |
260162320 | https://github.com/pydata/xarray/issues/1077#issuecomment-260162320 | https://api.github.com/repos/pydata/xarray/issues/1077 | MDEyOklzc3VlQ29tbWVudDI2MDE2MjMyMA== | benbovy 4160723 | 2016-11-13T02:24:56Z | 2016-11-13T02:24:56Z | MEMBER | I've started writing a Currently, this is a minimal class that just implements an "immutable" tree of datasets (it only allows adding child nodes so that we can build a tree). |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
MultiIndex serialization to NetCDF 187069161 | |
260156237 | https://github.com/pydata/xarray/issues/1077#issuecomment-260156237 | https://api.github.com/repos/pydata/xarray/issues/1077 | MDEyOklzc3VlQ29tbWVudDI2MDE1NjIzNw== | shoyer 1217238 | 2016-11-12T23:44:03Z | 2016-11-12T23:44:03Z | MEMBER | Maybe? A minimal class for managing groups in an open file could potentially have synergy with our backends system. Something more than that is probably out of scope.
|
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
MultiIndex serialization to NetCDF 187069161 | |
259412395 | https://github.com/pydata/xarray/issues/1077#issuecomment-259412395 | https://api.github.com/repos/pydata/xarray/issues/1077 | MDEyOklzc3VlQ29tbWVudDI1OTQxMjM5NQ== | benbovy 4160723 | 2016-11-09T13:18:09Z | 2016-11-09T14:31:50Z | MEMBER |
Yes that's what I mean, something like Trying to summarize the potential use cases mentioned above:
1. If we're sure that we'll only use xarray (current or newer version) to load back the files, then the Note that point 3 is just for more convenience, I wouldn't mind too much having to manually reset / refactorize the multi-index in that case. We indeed don't need options if point 3 is not important. |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
MultiIndex serialization to NetCDF 187069161 | |
258563550 | https://github.com/pydata/xarray/issues/1077#issuecomment-258563550 | https://api.github.com/repos/pydata/xarray/issues/1077 | MDEyOklzc3VlQ29tbWVudDI1ODU2MzU1MA== | shoyer 1217238 | 2016-11-04T22:31:17Z | 2016-11-04T22:31:17Z | MEMBER |
|
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
MultiIndex serialization to NetCDF 187069161 | |
258476068 | https://github.com/pydata/xarray/issues/1077#issuecomment-258476068 | https://api.github.com/repos/pydata/xarray/issues/1077 | MDEyOklzc3VlQ29tbWVudDI1ODQ3NjA2OA== | benbovy 4160723 | 2016-11-04T16:14:59Z | 2016-11-04T16:14:59Z | MEMBER | I have exactly the same applications as you, @tippetts, but I would also like to write netCDF files that are compatible with tools other than xarray. With category encoded values as the default behavior, my concern is that xarray users may be unaware that they are generating netCDF files with limited compatibility with 3rd-party tools, unless a clear warning is given in the documentation.
This should be fine, but maybe it would be nice to allow handling this automatically (at read and write) by using a specific |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
MultiIndex serialization to NetCDF 187069161 | |
258460719 | https://github.com/pydata/xarray/issues/1077#issuecomment-258460719 | https://api.github.com/repos/pydata/xarray/issues/1077 | MDEyOklzc3VlQ29tbWVudDI1ODQ2MDcxOQ== | shoyer 1217238 | 2016-11-04T15:22:12Z | 2016-11-04T15:22:12Z | MEMBER |
Point taken -- let's see what others think! One consideration in favor of this is that it will soon be very easy to switch a MultiIndex back into separate coordinate variables, which could be our recommendation for how to save netCDF files for maximum portability.
We could do this, but note that we are contemplating switching xarray to always load indexes into memory eagerly, which would negate that advantage. See this PR and mailing list discussion: https://github.com/pydata/xarray/pull/1024#issuecomment-256114879 https://groups.google.com/forum/#!topic/xarray/dK2RHUls1nQ
pandas stores levels separately, automatically putting each of them in the smallest possible dtype (
Just a note: for interacting with backends, we use This means that we don't need the generic integer placeholder index (which will also be going away shortly in general, see https://github.com/pydata/xarray/pull/1017). |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
MultiIndex serialization to NetCDF 187069161 | |
258323743 | https://github.com/pydata/xarray/issues/1077#issuecomment-258323743 | https://api.github.com/repos/pydata/xarray/issues/1077 | MDEyOklzc3VlQ29tbWVudDI1ODMyMzc0Mw== | shoyer 1217238 | 2016-11-04T01:38:41Z | 2016-11-04T01:38:56Z | MEMBER | This is a good question -- I don't think we've figured it out yet. Maybe you have ideas? The main question (to me) is whether we should store raw values for each level in a MultiIndex (closer to what you see), or category encoded values (closer to the MultiIndex implementation). To be more concrete, here is what these look like for an example MultiIndex:
```
In [1]: index = pd.MultiIndex.from_product([[1, 2, 3], ['a', 'b']], names=['numbers', 'letters'])

In [2]: index
Out[2]:
MultiIndex(levels=[[1, 2, 3], ['a', 'b']],
           labels=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]],
           names=['numbers', 'letters'])

In [3]: index.values
Out[3]: array([(1, 'a'), (1, 'b'), (2, 'a'), (2, 'b'), (3, 'a'), (3, 'b')], dtype=object)

# categorical encoded values
In [4]: index.levels, index.labels
Out[4]:
(FrozenList([[1, 2, 3], ['a', 'b']]),
 FrozenList([[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]]))

# raw values
In [5]: index.get_level_values(0), index.get_level_values(1)
Out[5]:
(Int64Index([1, 1, 2, 2, 3, 3], dtype='int64', name='numbers'),
 Index(['a', 'b', 'a', 'b', 'a', 'b'], dtype='object', name='letters'))
```
Advantages of storing raw values:
- It's easier to work with MultiIndex levels without xarray, or with older versions of xarray (no need to combine levels and labels).
- Avoiding the overhead of saving integer codes can save memory if levels have dtypes with small fixed sizes (e.g., float, int or datetime) or mostly distinct values.

Advantages of storing category encoded values:
- It's cheaper to construct the MultiIndex, because we have already factorized each level.
- It can result in significant memory savings if levels are mostly duplicated (e.g., a tensor product) or have large itemsize (e.g., long strings).
- We can restore the exact same MultiIndex, instead of refactorizing it. This manifests itself in a few edge cases that could make for a frustrating user experience (changed dimension order after stacking: https://github.com/pydata/xarray/issues/980). Perhaps the best approach would be a hybrid: store raw values, as well as de-duplicated integer codes specifying the order of values in each MultiIndex Either way, we will need to store an attribute or two with metadata for how to restore the levels (e.g., |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
MultiIndex serialization to NetCDF 187069161 |
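The difference between the two storage options can be sketched in plain Python. This is a rough illustration only: `factorize` and `take` are hypothetical helpers standing in for what pandas does internally, and sorting the unique values into levels is an assumption of this sketch rather than a statement about pandas' behavior.

```python
def factorize(values):
    """Split raw values into sorted unique levels plus one integer code per value."""
    levels = sorted(set(values))
    position = {value: i for i, value in enumerate(levels)}
    codes = [position[value] for value in values]
    return levels, codes


def take(levels, codes):
    """Rebuild the raw values from levels and codes."""
    return [levels[code] for code in codes]


# Raw values of each level, matching Out[5] in the session above.
raw_numbers = [1, 1, 2, 2, 3, 3]
raw_letters = ["a", "b", "a", "b", "a", "b"]

# Category encoded form, matching Out[4].
number_levels, number_codes = factorize(raw_numbers)
letter_levels, letter_codes = factorize(raw_letters)
print(number_levels, number_codes)
print(letter_levels, letter_codes)

# Hypothetical index whose stored level order is not sorted.
stored_levels = ["b", "a"]
stored_codes = [0, 0, 1]
raw = take(stored_levels, stored_codes)

# Refactorizing the raw values gives the same labels but different levels/codes.
refactorized_levels, refactorized_codes = factorize(raw)
print(raw, refactorized_levels, refactorized_codes)
```

In this sketch, storing only raw values preserves the labels but not the original level order, which is the kind of edge case the third bullet refers to; storing levels and codes restores the index exactly.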
CREATE TABLE [issue_comments] ( [html_url] TEXT, [issue_url] TEXT, [id] INTEGER PRIMARY KEY, [node_id] TEXT, [user] INTEGER REFERENCES [users]([id]), [created_at] TEXT, [updated_at] TEXT, [author_association] TEXT, [body] TEXT, [reactions] TEXT, [performed_via_github_app] TEXT, [issue] INTEGER REFERENCES [issues]([id]) ); CREATE INDEX [idx_issue_comments_issue] ON [issue_comments] ([issue]); CREATE INDEX [idx_issue_comments_user] ON [issue_comments] ([user]);