issue_comments

9 rows where author_association = "MEMBER", issue = 187069161 (MultiIndex serialization to NetCDF) and user = 1217238 (shoyer), sorted by updated_at descending
shoyer (MEMBER) · 2020-06-17T04:28:56Z · https://github.com/pydata/xarray/issues/1077#issuecomment-645142014

It still isn't clear to me why this is a better representation for a MultiIndex than a sparse array.

I guess it could work fine for either, but we would need to pick a convention.

shoyer (MEMBER) · 2020-06-16T00:00:40Z · https://github.com/pydata/xarray/issues/1077#issuecomment-644451622

I agree with @fujiisoup. I think this "compression-by-gathering" representation makes more sense for sparse arrays than for a MultiIndex, per se.

That said, MultiIndex and sparse arrays are basically two sides of the same idea. In the long term, it might make sense to try to support only one of the two.
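For readers unfamiliar with the term, a minimal NumPy sketch of the "compression by gathering" idea (illustrative only; the variable names are mine, and this shows the CF convention's scheme rather than anything xarray implements):

```python
import numpy as np

# A dense 2D field with mostly-missing values.
dense = np.full((3, 4), np.nan)
dense[0, 1] = 10.0
dense[2, 3] = 20.0

# "Compress by gathering": store only the valid points, plus a 1D
# index into the flattened (row-major) product of the dimensions.
flat = dense.ravel()
compress_index = np.flatnonzero(~np.isnan(flat))
values = flat[compress_index]

# The same 1D index also identifies each point's position in every
# dimension, which is why it can double as a MultiIndex encoding.
rows, cols = np.unravel_index(compress_index, dense.shape)

# Decompression scatters the values back into a dense array.
restored = np.full(dense.size, np.nan)
restored[compress_index] = values
restored = restored.reshape(dense.shape)
```

The convention to pick would be whether the stored index/values pair is interpreted as a sparse array (missing entries implied) or as a stacked MultiIndex (only the listed entries exist).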

shoyer (MEMBER) · 2019-03-29T16:15:22Z · https://github.com/pydata/xarray/issues/1077#issuecomment-478058340

Once we finish https://github.com/pydata/xarray/issues/1603, that may change our perspective here a little bit (and could indirectly solve this problem).

Reactions: 👍 1
shoyer (MEMBER) · 2017-03-13T17:14:37Z · https://github.com/pydata/xarray/issues/1077#issuecomment-286176727

Let's recap the options, which I'll illustrate for the second level of my MultiIndex from above (https://github.com/pydata/xarray/issues/1077#issuecomment-258323743):

  1. "categories and codes": e.g., ['a', 'b'] and [0, 1, 0, 1, 0, 1]. Highest speed, low memory requirements, faithful round-trip to xarray/pandas, less obvious representation.
  2. "categories and values": e.g., ['a', 'b'] and ['a', 'b', 'a', 'b', 'a', 'b']. Moderate speed (need to recreate codes), high memory requirements, faithful round-trip to xarray/pandas, more obvious representation (categories can be safely ignored).
  3. "raw values": e.g., ['a', 'b', 'a', 'b', 'a', 'b']. Moderate speed (only slightly slower than 2), high memory requirements (slightly better than 2), does not support completely faithful roundtrip, most obvious representation.
  4. "category codes and values": e.g., [0, 1] and ['a', 'b', 'a', 'b', 'a', 'b']. Moderate speed, high memory requirements, also does not support faithful roundtrip (it's possible for some levels to not be represented in the MultiIndex values), more obvious representation (like 2).

3 uses only slightly less memory than 2 and can be easily achieved with reset_index(), so I don't see a reason to support it for writing (read support would be fine).

4 looks like a faithful roundtrip, but actually isn't in some rare edge cases. That seems like a recipe for disaster, so I don't think we should support it.

This leaves 1 and 2. Both are reasonably performant and roundtrip xarray objects with complete fidelity, so I would be happy with either of them. In principle we could even support both, with an argument to switch between the modes (one would need to be the default).

My inclination is to start with only supporting 1, because it has a potentially large advantage from a speed/memory perspective, and it's easy to achieve the "raw values" representation with .reset_index() (and convert back with .set_index()). If we do this, the documentation for writing netCDF files should definitely include a suggestion to consider using .reset_index() when distributing files not intended strictly for use by xarray users.
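The reset_index()/set_index() round-trip can be illustrated with the pandas equivalents, which xarray's methods mirror (example data is mine):

```python
import pandas as pd

index = pd.MultiIndex.from_product([[1, 2], ['a', 'b']],
                                   names=['numbers', 'letters'])
df = pd.DataFrame({'data': [10, 20, 30, 40]}, index=index)

# "Raw values" form: the levels become ordinary columns,
# safe to hand to readers that know nothing about MultiIndex.
flat = df.reset_index()

# Rebuild the MultiIndex from the columns.
roundtrip = flat.set_index(['numbers', 'letters'])
```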

shoyer (MEMBER) · 2016-11-15T16:16:47Z · https://github.com/pydata/xarray/issues/1077#issuecomment-260686932

DatasetNode feels a little too complex to me and disjoint from the rest of the package. I don't know when I would recommend using a DatasetNode to store data.

Also, as written I don't see any aspects that need to live in core xarray -- it seems that it can mostly be done with the external interface. So I would suggest the separate package.

shoyer (MEMBER) · 2016-11-12T23:44:03Z · https://github.com/pydata/xarray/issues/1077#issuecomment-260156237

Maybe? A minimal class for managing groups in an open file could potentially have synergy with our backends system. Something more than that is probably out of scope.

On Sat, Nov 12, 2016 at 1:00 PM, tippetts wrote:

> Here's a new, related question: @shoyer, do you have any interest in adding a class to xarray that contains a hierarchical tree of Datasets, analogous to the groups in a netCDF or HDF5 file? Then opening or saving such an object would be an easy but powerful one-liner.

> Or is that something you would rather leave to someone else's module?


shoyer (MEMBER) · 2016-11-04T22:31:17Z · https://github.com/pydata/xarray/issues/1077#issuecomment-258563550

encoding is only in xarray's data model. Everything there gets converted into some detail of how the data is stored in a netCDF file. So I don't think we need to use it here, unless we want options for controlling how the MultiIndex is stored.

shoyer (MEMBER) · 2016-11-04T15:22:12Z · https://github.com/pydata/xarray/issues/1077#issuecomment-258460719

> Personally I'd vote for the category encoded values. If I make files with a newer xarray, I'll be reading them later with the same (or newer) xarray and I'd definitely want the exact MultiIndex back.

Point taken -- let's see what others think!

One consideration in favor of this is that it will soon be very easy to switch a MultiIndex back into separate coordinate variables, which could be our recommendation for how to save netCDF files for maximum portability.

> The one thing I'm wondering is, what happens in an application like this if you select on one index (say, all data rows with region_name='FOOBAR-1') from the HDF5 file before doing anything else? Would it be hard to make the MultiIndex/NetCDF reader smart enough not to reconstruct the whole MultiIndex before picking out the relevant rows?

We could do this, but note that we are contemplating switching xarray to always load indexes into memory eagerly, which would negate that advantage. See this PR and mailing list discussion:
https://github.com/pydata/xarray/pull/1024#issuecomment-256114879
https://groups.google.com/forum/#!topic/xarray/dK2RHUls1nQ

> Nuts and bolts questions: So each of index.levels would be easy to store as its own little DataArray, yeah? Then would each of the index.labels be in its own DataArray, or would you want them all in the same 2D DataArray?

pandas stores labels separately for each level, automatically putting each of them in the smallest possible dtype (int8, int16, int32 or int64). So we also probably want to store them in separate 1D variables.
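A sketch of the smallest-dtype selection described above (my own helper for illustration, not pandas internals):

```python
import numpy as np

def smallest_codes_dtype(n_categories: int) -> np.dtype:
    """Pick the narrowest signed integer dtype that can hold
    codes for a level with n_categories distinct values."""
    for dtype in (np.int8, np.int16, np.int32, np.int64):
        if n_categories - 1 <= np.iinfo(dtype).max:
            return np.dtype(dtype)
    raise ValueError("too many categories")

# Codes for a two-category level fit in int8.
codes = np.asarray([0, 1, 0, 1, 0, 1])
codes = codes.astype(smallest_codes_dtype(2))
```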

> And then would the actual data in the original DataArray just have a generic integer index as a placeholder, to be replaced by the MultiIndex?

Just a note: for interacting with backends, we use Variable objects instead of DataArrays: http://xarray.pydata.org/en/stable/internals.html#variable-objects

This means that we don't need the generic integer placeholder index (which will also be going away shortly in general, see https://github.com/pydata/xarray/pull/1017).

shoyer (MEMBER) · 2016-11-04T01:38:41Z · https://github.com/pydata/xarray/issues/1077#issuecomment-258323743

This is a good question -- I don't think we've figured it out yet. Maybe you have ideas?

The main question (to me) is whether we should store raw values for each level in a MultiIndex (closer to what you see), or category encoded values (closer to the MultiIndex implementation).

To be more concrete, here is what these look like for an example MultiIndex:

```
In [1]: index = pd.MultiIndex.from_product([[1, 2, 3], ['a', 'b']],
   ...:                                    names=['numbers', 'letters'])

In [2]: index
Out[2]:
MultiIndex(levels=[[1, 2, 3], ['a', 'b']],
           labels=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]],
           names=['numbers', 'letters'])

In [3]: index.values
Out[3]:
array([(1, 'a'), (1, 'b'), (2, 'a'), (2, 'b'), (3, 'a'), (3, 'b')], dtype=object)

# categorical encoded values
In [4]: index.levels, index.labels
Out[4]:
(FrozenList([[1, 2, 3], ['a', 'b']]),
 FrozenList([[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]]))

# raw values
In [5]: index.get_level_values(0), index.get_level_values(1)
Out[5]:
(Int64Index([1, 1, 2, 2, 3, 3], dtype='int64', name='numbers'),
 Index(['a', 'b', 'a', 'b', 'a', 'b'], dtype='object', name='letters'))
```

Advantages of storing raw values:
  • It's easier to work with MultiIndex levels without xarray, or with older versions of xarray (no need to combine levels and labels).
  • Avoiding the overhead of saving integer codes can save memory if levels have dtypes with small fixed sizes (e.g., float, int or datetime) or mostly distinct values.

Advantages of storing category encoded values:
  • It's cheaper to construct the MultiIndex, because we have already factorized each level.
  • It can result in significant memory savings if levels are mostly duplicated (e.g., a tensor product) or have large itemsize (e.g., long strings).
  • We can restore the exact same MultiIndex, instead of refactorizing it. This matters in a few edge cases that could make for a frustrating user experience (changed dimension order after stacking: https://github.com/pydata/xarray/issues/980).
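As a sketch of why the category-encoded form is cheap to restore: the stored levels and codes can be handed straight to the pandas MultiIndex constructor, skipping any re-factorization (note that pandas has since renamed "labels" to "codes"):

```python
import pandas as pd

# Levels and codes as they would be read back from the file
# (same example MultiIndex as above).
levels = [[1, 2, 3], ['a', 'b']]
codes = [[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]]

# Direct construction restores exactly the original index,
# without re-factorizing each level.
index = pd.MultiIndex(levels=levels, codes=codes,
                      names=['numbers', 'letters'])
```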

Perhaps the best approach would be a hybrid: store raw values, as well as de-duplicated integer codes specifying the order of values in each MultiIndex level. This will be a little slower than just storing the raw values, but has the correctness guarantee provided by storing category encoded values.

Either way, we will need to store an attribute or two with metadata for how to restore the levels (e.g., 'multiindex_levels: numbers letters').

