
issue_comments


5 rows where author_association = "MEMBER", issue = 314764258 and user = 1217238 sorted by updated_at descending

Columns: id · html_url · issue_url · node_id · user · created_at · updated_at · author_association · body · reactions · performed_via_github_app · issue
512036050 https://github.com/pydata/xarray/issues/2064#issuecomment-512036050 https://api.github.com/repos/pydata/xarray/issues/2064 MDEyOklzc3VlQ29tbWVudDUxMjAzNjA1MA== shoyer 1217238 2019-07-16T23:09:24Z 2019-07-16T23:09:24Z MEMBER

> UPDATE: @shoyer it could be that unit tests are failing because, as your final example shows, you get an error for data_vars='minimal' if any variables have different values across datasets when adding a new concatenation dimension. If this is the reason so many unit tests are failing, then the failures are a red herring and should probably be ignored/rewritten.

This seems very likely to me. The existing behavior of data_vars='minimal' is only useful in "existing dimensions mode".

Xarray's unit test suite is definitely a good "smoke test" for understanding the impact of changes to concat on our users. What it tells us is that we can't change the default value from "all" to "minimal" without breaking existing code. Instead, we need to change how "all" or "minimal" works, or switch to yet another mode for the new behavior.

The tests we should feel free to rewrite are the cases where we set data_vars="all" or data_vars="minimal" explicitly to verify the weird edge behaviors I noted in my earlier comments. There shouldn't be too many of these tests.
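
For illustration, here is a minimal sketch of that failure mode, using hypothetical data: with data_vars='minimal' and a brand-new concatenation dimension, any data variable whose values differ between datasets triggers an error.

import xarray as xr

ds1 = xr.Dataset({"a": ("x", [1.0, 2.0])})
ds2 = xr.Dataset({"a": ("x", [3.0, 4.0])})

# "time" is a new dimension and "a" does not have it; because the
# values of "a" differ between the two datasets, "minimal" cannot
# merge them and raises an error instead of concatenating.
xr.concat([ds1, ds2], dim="time", data_vars="minimal")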

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  concat_dim getting added to *all* variables of multifile datasets 314764258
512000102 https://github.com/pydata/xarray/issues/2064#issuecomment-512000102 https://api.github.com/repos/pydata/xarray/issues/2064 MDEyOklzc3VlQ29tbWVudDUxMjAwMDEwMg== shoyer 1217238 2019-07-16T21:44:52Z 2019-07-16T21:44:52Z MEMBER

> Specifically, what should the default behavior of concat() be when both datasets include a variable that does not include the concatenation dimension? Currently, the concat dimension is added, and the result is a "stacked" version of the variable. Others have argued that this variable should not be included in the concat() result by default, but this appears to break compatibility with Pandas concat().

Can you give a specific example of the behavior in question?
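
For reference, a minimal sketch of the behavior the quoted question refers to (hypothetical data; default data_vars='all'):

import xarray as xr

ds1 = xr.Dataset({"depth": 5.0, "temp": ("x", [1.0, 2.0])})
ds2 = xr.Dataset({"depth": 5.0, "temp": ("x", [3.0, 4.0])})

# "depth" does not have the new "time" dimension, but the default
# behavior adds it anyway, producing a "stacked" version:
combined = xr.concat([ds1, ds2], dim="time")
print(combined["depth"].dims)  # ('time',)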

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  concat_dim getting added to *all* variables of multifile datasets 314764258
511611430 https://github.com/pydata/xarray/issues/2064#issuecomment-511611430 https://api.github.com/repos/pydata/xarray/issues/2064 MDEyOklzc3VlQ29tbWVudDUxMTYxMTQzMA== shoyer 1217238 2019-07-15T23:54:47Z 2019-07-15T23:54:47Z MEMBER

The logic for determining which variables to concatenate is in the _calc_concat_over helper function: https://github.com/pydata/xarray/blob/539fb4a98d0961c281daa5474a8e492a0ae1d8a2/xarray/core/concat.py#L146

Only "different" is supposed to load variables into memory to determine which ones to concatenate.

Right now we also have "all" and "minimal" options:
  • "all" attempts to concatenate every variable that can be broadcast to a matching shape: https://github.com/pydata/xarray/blob/539fb4a98d0961c281daa5474a8e492a0ae1d8a2/xarray/core/concat.py#L188-L190
  • "minimal" only concatenates variables that already have the matching dimension.
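
At the call site, the three modes look like this (hypothetical datasets; "different" is the value-comparing option mentioned above):

import xarray as xr

datasets = [xr.Dataset({"a": ("x", [1.0, 2.0])}),
            xr.Dataset({"a": ("x", [1.0, 2.0])})]

xr.concat(datasets, dim="time", data_vars="all")        # broadcast everything, then concatenate
xr.concat(datasets, dim="time", data_vars="minimal")    # only variables that already have "time"
xr.concat(datasets, dim="time", data_vars="different")  # compare values, loading them into memory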

Recall that concat handles two types of concatenation: existing dimensions (corresponding to np.concatenate) and new dimensions (corresponding to np.stack). Currently this is all done together in one messy codebase, but logically it would be cleaner to separate these modes into two separate functions:
  • In "existing dimensions" mode:
      • "all" is currently broken, because it will also concatenate variables that don't have the dimension.
      • "minimal" does the right thing, concatenating only variables with the dimension.
  • In "new dimensions" mode:
      • "all" will add the dimension to all variables.
      • "minimal" raises an error if any variables have different values, so if your datasets have any data variables with differing values at all, it fails. This is pretty much useless.
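
A minimal sketch contrasting the two modes, with hypothetical data:

import xarray as xr

ds1 = xr.Dataset({"temp": ("x", [1.0, 2.0])})
ds2 = xr.Dataset({"temp": ("x", [3.0, 4.0])})

# "Existing dimensions" mode: "x" is already a dimension, so this
# behaves like np.concatenate and temp ends up with shape (4,).
existing = xr.concat([ds1, ds2], dim="x")

# "New dimensions" mode: "case" is new, so this behaves like
# np.stack and temp ends up with shape (2, 2).
stacked = xr.concat([ds1, ds2], dim="case")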

Here's my notebook testing this out: https://gist.github.com/shoyer/f44300eddda4f7c476c61f76d1df938b

So I'm thinking that we probably want to combine "all" and "minimal" into a single mode to use as the default, and remove the other behavior, which is either useless or broken. Maybe it would make sense to come up with a new name for this mode, and to make both "all" and "minimal" deprecated aliases for it? In the long term, this leaves only two "automatic" modes for xarray.concat, which should make things simpler for users trying to figure this out.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  concat_dim getting added to *all* variables of multifile datasets 314764258
381728814 https://github.com/pydata/xarray/issues/2064#issuecomment-381728814 https://api.github.com/repos/pydata/xarray/issues/2064 MDEyOklzc3VlQ29tbWVudDM4MTcyODgxNA== shoyer 1217238 2018-04-16T19:55:24Z 2018-04-16T19:55:24Z MEMBER

> I stand corrected. In 0.10.1, I also see the Time variable getting added to refBottomDepth when I open multiple files. So maybe this is not in fact a new problem, but an existing issue that happened to behave as I expected only when opening a single file in previous versions. Sorry for not noticing that sooner.

OK, in that case I think #2048 was still the right change/bug-fix, making multi-file and single-file behavior consistent.

But you certainly have exposed a real issue here.

> But this issue raises an important basic point: we might want different behavior for variables in which concat_dim is already a dimension vs. variables for which it is not.

Yes, we shouldn't implicitly add a new dimension to variables when the dimension already exists in the dataset. We only need the heuristics/comparisons when an entirely new dimension is being added.
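
To make the desired distinction concrete, here is a sketch of the scenario using the variable names from this issue (hypothetical data; the comments describe the proposed behavior, not necessarily what concat does today):

import xarray as xr

ds1 = xr.Dataset({"temp": ("Time", [1.0]), "refBottomDepth": 5.0},
                 coords={"Time": [0]})
ds2 = xr.Dataset({"temp": ("Time", [2.0]), "refBottomDepth": 5.0},
                 coords={"Time": [1]})

# "Time" already exists in both datasets, so concatenation should only
# touch variables that have it; "refBottomDepth" should stay scalar
# rather than implicitly gaining a "Time" dimension.
out = xr.concat([ds1, ds2], dim="Time")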

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  concat_dim getting added to *all* variables of multifile datasets 314764258
381707540 https://github.com/pydata/xarray/issues/2064#issuecomment-381707540 https://api.github.com/repos/pydata/xarray/issues/2064 MDEyOklzc3VlQ29tbWVudDM4MTcwNzU0MA== shoyer 1217238 2018-04-16T18:42:06Z 2018-04-16T18:42:06Z MEMBER

What happens if you open multiple files with open_mfdataset(), e.g., for both January and February? Does it result in a dataset with the right dimensions on each variable?
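
For example (hypothetical file names; in the xarray of this era the call was open_mfdataset(paths, concat_dim="Time"), while current versions also require combine="nested" when concat_dim is given):

import xarray as xr

# Open the January and February files together, concatenating along
# the "Time" record dimension.
ds = xr.open_mfdataset(["january.nc", "february.nc"],
                       combine="nested", concat_dim="Time")
print(ds)  # check which variables picked up the "Time" dimension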

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  concat_dim getting added to *all* variables of multifile datasets 314764258


CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);