issue_comments

15 rows where issue = 646716560 sorted by updated_at descending

user 3

  • shoyer 8
  • Li9htmare 5
  • fujiisoup 2

author_association 2

  • MEMBER 10
  • NONE 5

issue 1

  • to_xarray() result is incorrect when one of multi-index levels is not sorted 15
id html_url issue_url node_id user created_at updated_at ▲ author_association body reactions performed_via_github_app issue
652064154 https://github.com/pydata/xarray/issues/4186#issuecomment-652064154 https://api.github.com/repos/pydata/xarray/issues/4186 MDEyOklzc3VlQ29tbWVudDY1MjA2NDE1NA== Li9htmare 15720911 2020-06-30T21:48:33Z 2020-06-30T21:48:33Z NONE

The intention of the variables used in constructing the Dataset looks a lot clearer now. Many thanks Stephan!

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  to_xarray() result is incorrect when one of multi-index levels is not sorted 646716560
652032780 https://github.com/pydata/xarray/issues/4186#issuecomment-652032780 https://api.github.com/repos/pydata/xarray/issues/4186 MDEyOklzc3VlQ29tbWVudDY1MjAzMjc4MA== shoyer 1217238 2020-06-30T20:44:00Z 2020-06-30T20:44:00Z MEMBER

> My concern was that when another person works on this without the context that idx might differ from dataframe.index, new bugs could potentially be introduced.

Let me see if I can rewrite the helper functions to avoid passing around a DataFrame

This was a good suggestion. Done in https://github.com/pydata/xarray/pull/4184/commits/96b544b5a59894359a35680151af71c0226f0505

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  to_xarray() result is incorrect when one of multi-index levels is not sorted 646716560
652018527 https://github.com/pydata/xarray/issues/4186#issuecomment-652018527 https://api.github.com/repos/pydata/xarray/issues/4186 MDEyOklzc3VlQ29tbWVudDY1MjAxODUyNw== shoyer 1217238 2020-06-30T20:13:44Z 2020-06-30T20:13:44Z MEMBER

> My concern was that when another person works on this without the context that idx might differ from dataframe.index, new bugs could potentially be introduced.

Let me see if I can rewrite the helper functions to avoid passing around a DataFrame

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  to_xarray() result is incorrect when one of multi-index levels is not sorted 646716560
651984472 https://github.com/pydata/xarray/issues/4186#issuecomment-651984472 https://api.github.com/repos/pydata/xarray/issues/4186 MDEyOklzc3VlQ29tbWVudDY1MTk4NDQ3Mg== Li9htmare 15720911 2020-06-30T19:02:28Z 2020-06-30T19:02:28Z NONE

Sorry @shoyer, I didn't notice you had pushed new commits to #4184 and thought you meant to just remove the DataFrame.set_index. Your latest commits indeed give the correct result. My concern was that when another person works on this without the context that idx might differ from dataframe.index, new bugs could potentially be introduced. Though considering the limited scope in which we maintain both idx and dataframe, I guess it should be fine.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  to_xarray() result is incorrect when one of multi-index levels is not sorted 646716560
651905098 https://github.com/pydata/xarray/issues/4186#issuecomment-651905098 https://api.github.com/repos/pydata/xarray/issues/4186 MDEyOklzc3VlQ29tbWVudDY1MTkwNTA5OA== shoyer 1217238 2020-06-30T16:29:10Z 2020-06-30T16:44:02Z MEMBER

@Li9htmare I'm not sure I follow your example. #4184 does remove the use of DataFrame.set_index(), but it also removes any subsequent use of dataframe.index -- it always uses the separately processed index.

Is there something specific that you are worried about going wrong with your latest example? For what it's worth, here's what to_xarray() does with the current version of #4184:
```
In [4]: df.to_xarray()
Out[4]:
<xarray.Dataset>
Dimensions:  (lev1: 2, lev2: 1)
Coordinates:
  * lev1     (lev1) object 'b' 'a'
  * lev2     (lev2) object 'foo'
Data variables:
    C1       (lev1, lev2) int64 0 2
    C2       (lev1, lev2) int64 1 3

In [5]: df.to_xarray().indexes
Out[5]:
lev1: CategoricalIndex(['b', 'a'], categories=['b', 'a'], ordered=True, name='lev1', dtype='category')
lev2: Index(['foo'], dtype='object', name='lev2')
```

I think this is doing the right thing already?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  to_xarray() result is incorrect when one of multi-index levels is not sorted 646716560
651674763 https://github.com/pydata/xarray/issues/4186#issuecomment-651674763 https://api.github.com/repos/pydata/xarray/issues/4186 MDEyOklzc3VlQ29tbWVudDY1MTY3NDc2Mw== Li9htmare 15720911 2020-06-30T09:24:13Z 2020-06-30T09:24:13Z NONE

Hi @shoyer, without dataframe.set_index(), dataframe.index can differ from the idx returned by remove_unused_levels_categories, which will lead to other problems. One example is the following df:

    df = pd.DataFrame(
        {
            'lev1': pd.Series(
                ['b', 'a'], dtype=pd.CategoricalDtype(['c', 'b', 'a'], ordered=True)
            ),
            'lev2': 'foo',
            'C1': [0, 2],
            'C2': [1, 3],
        }
    ).set_index(['lev1', 'lev2'])

I agree it would be better if we could maintain the order from df to xr.Dataset, but I think we should never work with a copy of idx that differs from dataframe.index, as this will lead to hard-to-debug problems caused by the "surprising" things pandas does.
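
To make the role of the unused category concrete, here is a small model in plain Python (no pandas required). The helper `drop_unused_categories` is hypothetical and only imitates what dropping unused categories means for the example frame above: `'c'` is declared as a category but never occurs, and the remaining categories keep their declared order.

```python
def drop_unused_categories(categories, values):
    """Return the declared categories that actually occur in `values`,
    keeping the declared order (hypothetical helper, not a pandas API)."""
    used = set(values)
    return [c for c in categories if c in used]


declared = ["c", "b", "a"]  # categories of the ordered CategoricalDtype
observed = ["b", "a"]       # values actually present in level 'lev1'

trimmed = drop_unused_categories(declared, observed)
print(trimmed)  # ['b', 'a']
```

The trimmed list matches the `categories=['b', 'a']` seen in the `indexes` output quoted elsewhere in this thread.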

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  to_xarray() result is incorrect when one of multi-index levels is not sorted 646716560
651467248 https://github.com/pydata/xarray/issues/4186#issuecomment-651467248 https://api.github.com/repos/pydata/xarray/issues/4186 MDEyOklzc3VlQ29tbWVudDY1MTQ2NzI0OA== shoyer 1217238 2020-06-30T01:41:36Z 2020-06-30T01:41:36Z MEMBER

The sorting seems to be a separate matter, caused by dataframe.set_index() inside our remove_unused_levels_categories function. I think we can remove that, which will fix the sorting issue when removing unused levels. Then the result will be the desired:

    df.to_xarray()
    <xarray.Dataset>
    Dimensions:  (lev1: 2, lev2: 1)
    Coordinates:
      * lev1     (lev1) object 'b' 'a'
      * lev2     (lev2) object 'foo'
    Data variables:
        C1       (lev1, lev2) int64 0 2
        C2       (lev1, lev2) int64 1 3

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  to_xarray() result is incorrect when one of multi-index levels is not sorted 646716560
651458105 https://github.com/pydata/xarray/issues/4186#issuecomment-651458105 https://api.github.com/repos/pydata/xarray/issues/4186 MDEyOklzc3VlQ29tbWVudDY1MTQ1ODEwNQ== shoyer 1217238 2020-06-30T01:14:45Z 2020-06-30T01:14:45Z MEMBER

Actually, I realize now that this is basically the same issue as https://github.com/pydata/xarray/issues/2619

If I remove the use of remove_unused_levels_categories from from_dataframe, then I get the same behavior that we considered a bug in that issue:

    In [5]: ds.isel(xy=ds['x'] < 4).to_pandas().to_xarray()
    Out[5]:
    <xarray.DataArray (x: 8, y: 5)>
    array([[ 0.,  1.,  2.,  3.,  4.],
           [ 5.,  6.,  7.,  8.,  9.],
           [10., 11., 12., 13., 14.],
           [15., 16., 17., 18., 19.],
           [nan, nan, nan, nan, nan],
           [nan, nan, nan, nan, nan],
           [nan, nan, nan, nan, nan],
           [nan, nan, nan, nan, nan]])
    Coordinates:
      * x        (x) int64 0 1 2 3 4 5 6 7
      * y        (y) int64 0 1 2 3 4

So maybe it is more consistent to keep calling remove_unused_levels(), which somewhat surprisingly sorts MultiIndex levels.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  to_xarray() result is incorrect when one of multi-index levels is not sorted 646716560
651454795 https://github.com/pydata/xarray/issues/4186#issuecomment-651454795 https://api.github.com/repos/pydata/xarray/issues/4186 MDEyOklzc3VlQ29tbWVudDY1MTQ1NDc5NQ== fujiisoup 6815844 2020-06-30T01:06:34Z 2020-06-30T01:06:34Z MEMBER

I agree that it's better not to sort.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  to_xarray() result is incorrect when one of multi-index levels is not sorted 646716560
651453863 https://github.com/pydata/xarray/issues/4186#issuecomment-651453863 https://api.github.com/repos/pydata/xarray/issues/4186 MDEyOklzc3VlQ29tbWVudDY1MTQ1Mzg2Mw== shoyer 1217238 2020-06-30T01:03:40Z 2020-06-30T01:03:40Z MEMBER

I verified that #4184 fixes the tests added for #3953 even after removing the call to remove_unused_levels_categories().

The main question is what behavior we want to have: Should from_dataframe preserve index levels exactly, or should it sort them first?

I think it's better not to sort (but of course it's better to sort than to get the wrong order).
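
The two candidate behaviors can be sketched in plain Python. This is only a model of the choice being discussed, not xarray's implementation; `to_grid` and its `sort` flag are hypothetical names, and the sample values are taken from the outputs quoted in this thread. Under either policy each label stays paired with its own value; only the order differs.

```python
def to_grid(index_labels, values, sort):
    """Pair each label with its value, ordering labels either by first
    appearance (sort=False) or in sorted order (sort=True)."""
    lookup = dict(zip(index_labels, values))
    # dict.fromkeys keeps first-appearance order and drops duplicates
    labels = list(dict.fromkeys(index_labels))
    if sort:
        labels = sorted(labels)
    return labels, [lookup[label] for label in labels]


lev1 = ["b", "a"]  # values of index level 'lev1'
c1 = [0, 2]        # column C1

preserved = to_grid(lev1, c1, sort=False)
sorted_first = to_grid(lev1, c1, sort=True)
print(preserved)     # (['b', 'a'], [0, 2])
print(sorted_first)  # (['a', 'b'], [2, 0])
```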

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  to_xarray() result is incorrect when one of multi-index levels is not sorted 646716560
651438776 https://github.com/pydata/xarray/issues/4186#issuecomment-651438776 https://api.github.com/repos/pydata/xarray/issues/4186 MDEyOklzc3VlQ29tbWVudDY1MTQzODc3Ng== fujiisoup 6815844 2020-06-30T00:21:43Z 2020-06-30T00:21:43Z MEMBER

I think #3953 fixes the case where the multiindex has unused levels. I had no better idea than #3953, but if it works without #3953, that would be better ;)

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  to_xarray() result is incorrect when one of multi-index levels is not sorted 646716560
651428394 https://github.com/pydata/xarray/issues/4186#issuecomment-651428394 https://api.github.com/repos/pydata/xarray/issues/4186 MDEyOklzc3VlQ29tbWVudDY1MTQyODM5NA== shoyer 1217238 2020-06-29T23:51:49Z 2020-06-29T23:51:49Z MEMBER

Thanks for clarifying!

This raises an interesting question for #4184: do we want to keep @fujiisoup's fix from #3953 or not?

If we remove @fujiisoup's fix, then the output we see is:

    df.to_xarray()
    <xarray.Dataset>
    Dimensions:  (lev1: 2, lev2: 1)
    Coordinates:
      * lev1     (lev1) object 'b' 'a'
      * lev2     (lev2) object 'foo'
    Data variables:
        C1       (lev1, lev2) int64 0 2
        C2       (lev1, lev2) int64 1 3

This is also correct -- coordinates match up with values -- but the order of the result is different from what is currently on master.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  to_xarray() result is incorrect when one of multi-index levels is not sorted 646716560
651424721 https://github.com/pydata/xarray/issues/4186#issuecomment-651424721 https://api.github.com/repos/pydata/xarray/issues/4186 MDEyOklzc3VlQ29tbWVudDY1MTQyNDcyMQ== Li9htmare 15720911 2020-06-29T23:40:41Z 2020-06-29T23:41:45Z NONE

Hi @shoyer, sorry for confusing you; I should have run your code in the first place. Your code removes the problematic dataframe.reindex in Dataset._set_numpy_data_from_dataframe, but there is indeed another place causing the problem, which is actually already fixed (but not released yet) by https://github.com/pydata/xarray/pull/3953/files#diff-921db548d18a549f6381818ed08298c9L4607-L4608

Using pzhlobi's example df with xarray 0.15.1 (incorrect result):

    df.to_xarray()
    <xarray.Dataset>
    Dimensions:  (lev1: 2, lev2: 1)
    Coordinates:
      * lev1     (lev1) object 'b' 'a'
      * lev2     (lev2) object 'foo'
    Data variables:
        C1       (lev1, lev2) int64 2 0
        C2       (lev1, lev2) int64 3 1

Using the same df with both #3953 and #4184 (correct result):

    df.to_xarray()
    <xarray.Dataset>
    Dimensions:  (lev1: 2, lev2: 1)
    Coordinates:
      * lev1     (lev1) object 'a' 'b'
      * lev2     (lev2) object 'foo'
    Data variables:
        C1       (lev1, lev2) int64 2 0
        C2       (lev1, lev2) int64 3 1

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  to_xarray() result is incorrect when one of multi-index levels is not sorted 646716560
651402838 https://github.com/pydata/xarray/issues/4186#issuecomment-651402838 https://api.github.com/repos/pydata/xarray/issues/4186 MDEyOklzc3VlQ29tbWVudDY1MTQwMjgzOA== shoyer 1217238 2020-06-29T22:28:00Z 2020-06-29T22:28:00Z MEMBER

Hi @pzhlobi @Li9htmare -- thanks for raising this issue.

Could you kindly clarify for me exactly what behavior you think xarray should have? The results are indeed reordered currently, but as far as I can tell the pairing between coordinates and values remains consistent.

When I test this myself, I see the same behavior (documented in the first post) either with or without my changes from #4184.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  to_xarray() result is incorrect when one of multi-index levels is not sorted 646716560
650738680 https://github.com/pydata/xarray/issues/4186#issuecomment-650738680 https://api.github.com/repos/pydata/xarray/issues/4186 MDEyOklzc3VlQ29tbWVudDY1MDczODY4MA== Li9htmare 15720911 2020-06-28T11:37:20Z 2020-06-28T11:37:20Z NONE

It seems the problem here is that in Dataset.from_dataframe the dims and coords are created from df.index.levels, which is unsorted: https://github.com/pydata/xarray/blob/732750a06aef2025b206ba6ff765f5acc53bfa25/xarray/core/dataset.py#L4642-L4643

Then in Dataset._set_numpy_data_from_dataframe, the pd.MultiIndex.from_product and dataframe.reindex unintentionally sort the dataframe by index: https://github.com/pydata/xarray/blob/732750a06aef2025b206ba6ff765f5acc53bfa25/xarray/core/dataset.py#L4588-L4589

Besides the perf improvement it provides, #4184 also seems to have the nice side effect of fixing this issue.
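
A simplified model of this failure mode, in plain Python: the coordinate labels are taken in one order while the values are laid out in another, so labels and values no longer line up. The variable names are illustrative, the sample values come from the outputs quoted in this thread, and the actual pandas calls (`pd.MultiIndex.from_product`, `dataframe.reindex`) are deliberately not reproduced here.

```python
# Column C1 of the example frame, keyed by the (lev1, lev2) index tuple
rows = {("b", "foo"): 0, ("a", "foo"): 2}

coord_labels = ["b", "a"]           # labels in the order stored in the index levels
value_order = sorted(coord_labels)  # order of the rows after an implicit sort

values = [rows[(label, "foo")] for label in value_order]

# Mismatched: labels in stored order, values in sorted order
mismatched = list(zip(coord_labels, values))
# Consistent: labels and values share the same order
consistent = list(zip(value_order, values))

print(mismatched)  # [('b', 2), ('a', 0)]
print(consistent)  # [('a', 2), ('b', 0)]
```

The mismatched pairing reproduces the incorrect result reported for xarray 0.15.1, where `lev1` reads `'b' 'a'` but `C1` reads `2 0`.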

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  to_xarray() result is incorrect when one of multi-index levels is not sorted 646716560

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);
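
The row filter described at the top of this page (rows where issue = 646716560, sorted by updated_at descending) corresponds to a simple query against this schema. Below is a minimal sketch using Python's built-in `sqlite3` module with an in-memory database. As assumptions for illustration, the table is trimmed to a few columns, the foreign-key references are omitted, only three of the fifteen comments are loaded, and the row for issue 999 is made up for contrast.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE [issue_comments] (
        [id] INTEGER PRIMARY KEY,
        [user] INTEGER,
        [updated_at] TEXT,
        [author_association] TEXT,
        [issue] INTEGER
    )
    """
)
conn.execute(
    "CREATE INDEX [idx_issue_comments_issue] ON [issue_comments] ([issue])"
)

sample_rows = [
    (652064154, 15720911, "2020-06-30T21:48:33Z", "NONE", 646716560),
    (650738680, 15720911, "2020-06-28T11:37:20Z", "NONE", 646716560),
    (652032780, 1217238, "2020-06-30T20:44:00Z", "MEMBER", 646716560),
    (1, 1217238, "2020-07-01T00:00:00Z", "MEMBER", 999),  # made-up row
]
conn.executemany(
    "INSERT INTO [issue_comments] VALUES (?, ?, ?, ?, ?)", sample_rows
)

# ISO 8601 timestamps in the same format sort correctly as plain text
result = conn.execute(
    "SELECT [id] FROM [issue_comments] "
    "WHERE [issue] = ? ORDER BY [updated_at] DESC",
    (646716560,),
).fetchall()
ids = [row[0] for row in result]
print(ids)  # [652064154, 652032780, 650738680]
```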
Powered by Datasette · Queries took 11.076ms · About: xarray-datasette