home / github

Menu
  • GraphQL API
  • Search all tables

issue_comments

Table actions
  • GraphQL API for issue_comments

10 rows where issue = 1376109308 sorted by updated_at descending

✎ View and edit SQL

This data as json, CSV (advanced)

Suggested facets: reactions, created_at (date), updated_at (date)

user 8

  • shoyer 2
  • benbovy 2
  • dcherian 1
  • max-sixty 1
  • kmuehlbauer 1
  • aaronspring 1
  • SimonHeybrock 1
  • lamorton 1

author_association 3

  • MEMBER 7
  • NONE 2
  • CONTRIBUTOR 1

issue 1

  • Should Xarray stop doing automatic index-based alignment? · 10 ✖
id html_url issue_url node_id user created_at updated_at ▲ author_association body reactions performed_via_github_app issue
1326262197 https://github.com/pydata/xarray/issues/7045#issuecomment-1326262197 https://api.github.com/repos/pydata/xarray/issues/7045 IC_kwDOAMm_X85PDSe1 benbovy 4160723 2022-11-24T10:35:02Z 2022-11-24T10:35:02Z MEMBER

I find the analogy with relational databases quite meaningful!

Rectangular grids likely have been the primary use case in Xarray for a long time, but I wonder to which extent it is the case nowadays. Probably a good question to ask for the next user survey?

Interestingly, the 2021 user survey results (*) show that "interoperability with pandas" is not a critical feature while "label-based indexing, interpolation, groupby, reindexing, etc." is most important, although the description of the latter is rather broad. It would be interesting to compute the correlation between these two variables. The results also show that "more flexible indexing (selection, alignment)" is very useful or critical for 2/3 of the participants.

Not sure how to interpret those results within the context of this discussion, though.

(*) The 2022 user survey results doesn't show significant differences in general

suppose one could in principle have an array with coordinates such that none of the coordinates aligned with any particular axis, but it seems improbable.

Not that improbable for unstructured meshes, curvilinear grids, staggered grids, etc. Xarray is often chosen to handle them too (e.g., uxarray, xgcm).

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Should Xarray stop doing automatic index-based alignment? 1376109308
1324489293 https://github.com/pydata/xarray/issues/7045#issuecomment-1324489293 https://api.github.com/repos/pydata/xarray/issues/7045 IC_kwDOAMm_X85O8hpN lamorton 23484003 2022-11-23T03:05:50Z 2022-11-23T03:06:57Z NONE

IMO nearly all the complication and confusion emerge from the mixed concept of a dimension coordinate in the Xarray data model.

My take: the main confusion is from trying to support a relational-database-like data model (where inner/outer joins make sense because values are discrete/categorical) AND a multi-dimensional array model for physical sciences (where typically values are floating-point, exact alignment is required, and interpolation is used when alignment is inexact). As a physical sciences guy, I basically never use the database-like behavior, and it only serves to silence alignment errors so that the fallout happens downstream (NaNs from outer joins, empty arrays on inner joins), making it harder to debug. TIL I can just xarray.set_options(arithmetic_join='exact') and get what I wanted all along.

Why can't we use loc/sel with a non-dimension (non-index) coord?

What happens if I have Cartesian x/y dimensions plus r/theta cylindrical coordinates defined on the x / y, and I select some range in r? It's not slicing an array at that point, that's more like a relational database query. The thing you get back isn't an array anymore because not all i,j combinations are valid.

confusion emerge[s] from the mixed concept of a dimension coordinate

From my perspective, the dimensions are special coordinates that the arrays happen to be sampled in a rectangular grid on. It's not confusing to me, but maybe that's b/c of my perspective from physical sciences background/usecases. I suppose one could in principle have an array with coordinates such that none of the coordinates aligned with any particular axis, but it seems improbable.

What do you think of making the default FloatIndex use a reasonable (hard to define!) rtol for comparisons?

IMO this is asking for weird bugs. In my work I either expect exact alignment, or I want to interpolate. I never want to ignore a mismatch because it's basically just sweeping an error under the rug. In fact, I'd really just like to test that all the dimension coordinates are the same objects, although Python's semantics don't really work with that.

imagine cases where a coordinate is defined in separate units.

Getting this right would be really powerful.

{
    "total_count": 2,
    "+1": 2,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Should Xarray stop doing automatic index-based alignment? 1376109308
1251975597 https://github.com/pydata/xarray/issues/7045#issuecomment-1251975597 https://api.github.com/repos/pydata/xarray/issues/7045 IC_kwDOAMm_X85Kn6Gt benbovy 4160723 2022-09-20T07:51:45Z 2022-09-20T07:51:45Z MEMBER

So maybe the question here is whether such an ArrayIndex should be the default?

Another solution for more flexibility or a smooth transition may be to add a build option to the Index base class API, so that it would be possible for the current default PandasIndex or any custom index to easily (and explicitly) deactivate automatic alignment while keeping it around for label-based selection.

Indexes (including alignment behavior) feel like a massive complication of Xarray, both conceptually (which includes documentation and teaching efforts) as well as code.

I agree, although this is getting addressed slowly but surely. In Xarray internals, most of the indexes logic is now in the core.indexes module. For the public API #4366, #6849 and #6971 will ultimately make things better. Object reprs are important too (#6795). There is still a good amount of work in order to improve the documentation, some of it is discussed in #6975.

IMO nearly all the complication and confusion emerge from the mixed concept of a dimension coordinate in the Xarray data model. Once the concept of an index is clearly decoupled from the concept of a coordinate and both concepts are represented as 1st-class citizens, it will help users focusing on the parts of the API and/or documentation that are relevant to their needs. It will also help "selling" Xarray to users who don't need much of the index capabilities (this has been discussed several times, either as external feedback or between Xarray devs, e.g., proposal of a "xarray-lite" package). Finally it will make more affordable major changes such as the one proposed here by @shoyer.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Should Xarray stop doing automatic index-based alignment? 1376109308
1251836989 https://github.com/pydata/xarray/issues/7045#issuecomment-1251836989 https://api.github.com/repos/pydata/xarray/issues/7045 IC_kwDOAMm_X85KnYQ9 SimonHeybrock 12912489 2022-09-20T04:48:07Z 2022-09-20T06:13:32Z NONE

This suggestion looks roughly like what we are discussing in https://github.com/pydata/xarray/discussions/7041#discussioncomment-3662179, i.e., using a custom index that avoids this? So maybe the question here is whether such an ArrayIndex should be the default?

Aside from that, with my outside perspective (having used Xarray extremely little, looking at the docs and code occasionally, but developing a similar library that does not have indexes):

Indexes (including alignment behavior) feel like a massive complication of Xarray, both conceptually (which includes documentation and teaching efforts) as well as code. If all you require is the ArrayIndex behavior (i.e., exact coord comparison in operations) then the entire concept of indexes is just ballast, distraction in the documentation, and confusion. Example: Why can't we use loc/sel with a non-dimension (non-index) coord? --- without index we would just search the coord with no need to limit this to index-coords, and this is often fast enough?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Should Xarray stop doing automatic index-based alignment? 1376109308
1250007817 https://github.com/pydata/xarray/issues/7045#issuecomment-1250007817 https://api.github.com/repos/pydata/xarray/issues/7045 IC_kwDOAMm_X85KgZsJ kmuehlbauer 5821660 2022-09-17T05:55:58Z 2022-09-17T05:55:58Z MEMBER

I still find myself struggling to understand which of those options are needed for my use cases (inner, outer etc.). Default is working in many cases, but in other cases it is trial and error.

In that sense this proposal would make me have to really understand what's going on.

The suggestion of another mode by @max-sixty just made me think, if this automatic alignment machinery could be moved to another package. If that package is installed the current behaviour is preserved, if not then the new behaviour proposed by @shoyer comes into play.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Should Xarray stop doing automatic index-based alignment? 1376109308
1249929257 https://github.com/pydata/xarray/issues/7045#issuecomment-1249929257 https://api.github.com/repos/pydata/xarray/issues/7045 IC_kwDOAMm_X85KgGgp max-sixty 5635139 2022-09-16T23:14:26Z 2022-09-16T23:14:26Z MEMBER

I think I really empathize with the pain here. There's a very real explicitness vs "helpfulness" tradeoff, often depending on whether people are doing exploratory research vs hardened production (a bit like Ask vs Guess culture!).

But from the perspective of someone who works with lots of people who use Xarray for their daily research, I think this would be a big hurdle, even without considering the change costs.

One analogy is xarray vs. pandas for 2D data — among my colleagues xarray is known to be a smaller, more reliable API surface, while pandas is more fully featured but also a maze of surprising methods and behavior (df['a'] * df!). Forcing explicit alignment would strengthen that case. But it could take it too far — operations that you expect to just work would now need nannying.

"Make another mode" can seem like an easy decision — "who doesn't want another mode" — but it could make development more difficult, since we'd need calls to check which mode we're in & tests for those. It's not insurmountable though, and maybe it would only be required in a couple of methods, so testing those would be sufficient to ensure the resulting behavior would be correct?

(FWIW we don't use float indexes, so it could be fine to dispense with those)

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Should Xarray stop doing automatic index-based alignment? 1376109308
1249910951 https://github.com/pydata/xarray/issues/7045#issuecomment-1249910951 https://api.github.com/repos/pydata/xarray/issues/7045 IC_kwDOAMm_X85KgCCn shoyer 1217238 2022-09-16T22:26:36Z 2022-09-16T22:26:36Z MEMBER

As a concrete example, suppose we have two datasets: 1. Hourly predictions for 10 days 2. Daily observations for a month.

```python import numpy as np import pandas as pd import xarray

predictions = xarray.DataArray( np.random.RandomState(0).randn(24*10), {'time': pd.date_range('2022-01-01', '2022-01-11', freq='1h', closed='left')}, ) observations = xarray.DataArray( np.random.RandomState(1).randn(31), {'time': pd.date_range('2022-01-01', '2022-01-31', freq='24h')}, ) ```

Today, if you compare these datasets, they automatically align: ```

predictions - observations <xarray.DataArray (time: 10)> array([ 0.13970698, 2.88151104, -1.0857261 , 2.21236931, -0.85490761, 2.67796423, 0.63833301, 1.94923669, -0.35832191, 0.23234996]) Coordinates: * time (time) datetime64[ns] 2022-01-01 2022-01-02 ... 2022-01-10 ```

With this proposed change, you would get an error, e.g., something like: ```

predictions - observations ValueError: xarray objects are not aligned along dimension 'time':
array(['2022-01-01T00:00:00.000000000', '2022-01-02T00:00:00.000000000', '2022-01-03T00:00:00.000000000', '2022-01-04T00:00:00.000000000', '2022-01-05T00:00:00.000000000', '2022-01-06T00:00:00.000000000', '2022-01-07T00:00:00.000000000', '2022-01-08T00:00:00.000000000', '2022-01-09T00:00:00.000000000', '2022-01-10T00:00:00.000000000', '2022-01-11T00:00:00.000000000', '2022-01-12T00:00:00.000000000', '2022-01-13T00:00:00.000000000', '2022-01-14T00:00:00.000000000', '2022-01-15T00:00:00.000000000', '2022-01-16T00:00:00.000000000', '2022-01-17T00:00:00.000000000', '2022-01-18T00:00:00.000000000', '2022-01-19T00:00:00.000000000', '2022-01-20T00:00:00.000000000', '2022-01-21T00:00:00.000000000', '2022-01-22T00:00:00.000000000', '2022-01-23T00:00:00.000000000', '2022-01-24T00:00:00.000000000', '2022-01-25T00:00:00.000000000', '2022-01-26T00:00:00.000000000', '2022-01-27T00:00:00.000000000', '2022-01-28T00:00:00.000000000', '2022-01-29T00:00:00.000000000', '2022-01-30T00:00:00.000000000', '2022-01-31T00:00:00.000000000'], dtype='datetime64[ns]') vs array(['2022-01-01T00:00:00.000000000', '2022-01-01T01:00:00.000000000', '2022-01-01T02:00:00.000000000', ..., '2022-01-10T21:00:00.000000000', '2022-01-10T22:00:00.000000000', '2022-01-10T23:00:00.000000000'], dtype='datetime64[ns]') ```

Instead, you would need to manually align these objects, e.g., with xarray.align, reindex_like() or interp_like(), e.g., ```

predictions, observations = xarray.align(predictions, observations) or observations = observations.reindex_like(predictions) or predictions = predictions.interp_like(observations) ```

To (partially) simulate the effect of this change on a codebase today, you could write xarray.set_options(arithmetic_join='exact') -- but presmably it would also make sense to change Xarray's other alignment code (e.g., in concat and merge).

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Should Xarray stop doing automatic index-based alignment? 1376109308
1249758907 https://github.com/pydata/xarray/issues/7045#issuecomment-1249758907 https://api.github.com/repos/pydata/xarray/issues/7045 IC_kwDOAMm_X85Kfc67 aaronspring 12237157 2022-09-16T20:06:21Z 2022-09-16T20:06:21Z CONTRIBUTOR

@shoyer could you maybe provide a code example of the current index aligned behaviour and a future not index aligned behaviour?

I am a bit worried about transitioning previous code bases to such new xarray releases

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Should Xarray stop doing automatic index-based alignment? 1376109308
1249601076 https://github.com/pydata/xarray/issues/7045#issuecomment-1249601076 https://api.github.com/repos/pydata/xarray/issues/7045 IC_kwDOAMm_X85Ke2Y0 shoyer 1217238 2022-09-16T17:16:52Z 2022-09-16T17:18:38Z MEMBER

IMO we could first align (hah) these choices to be the same:

the exact mode of automatic alignment (outer vs inner vs left join) depends on the specific operation.

The problem is that user expectations are actually rather different for different options:

  • With data movement operations like xarray.merge, you expect to keep around all existing data -- so you want an outer join.
  • With inplace operations that modify an existing Dataset, e.g., by adding new variables, you don't expect the existing coordinates to change -- so you want a left join.
  • With computate based operations (like arithmatic), you don't have an expectation that all existing data is unmodified, so keeping around a bunch of NaN values felt very wasteful -- hence the inner join.

What do you think of making the default FloatIndex use a reasonable (hard to define!) rtol for comparisons?

This would definitely be a step forward! However, it's a tricky nut to crack. We would both need a heuristic for defining rtol (some fraction of coordinate spacing?) and a method for deciding what the resulting coordinates should be (use values from the first object?).

Even then, automatic alignment is often problematic, e.g., imagine cases where a coordinate is defined in separate units.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Should Xarray stop doing automatic index-based alignment? 1376109308
1249580349 https://github.com/pydata/xarray/issues/7045#issuecomment-1249580349 https://api.github.com/repos/pydata/xarray/issues/7045 IC_kwDOAMm_X85KexU9 dcherian 2448579 2022-09-16T16:51:55Z 2022-09-16T16:51:55Z MEMBER

I think I agree here but a lot of things are going to break.

IMO we could first align (hah) these choices to be the same:

the exact mode of automatic alignment (outer vs inner vs left join) depends on the specific operation.

so that they're all controlled by OPTIONS["arithmetic_join"] (rename to "default_join"?) and then change the default after a long period of warnings.

Automatic alignment is not useful for float indexes, because exact matches are rare. In practice, this makes it less useful in Xarray's usual domains than it for pandas.

What do you think of making the default FloatIndex use a reasonable (hard to define!) rtol for comparisons?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Should Xarray stop doing automatic index-based alignment? 1376109308

Advanced export

JSON shape: default, array, newline-delimited, object

CSV options:

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);
Powered by Datasette · Queries took 638.603ms · About: xarray-datasette