issue_comments

14 rows where author_association = "MEMBER" and issue = 314764258 sorted by updated_at descending

user 3

  • shoyer 5
  • dcherian 5
  • rabernat 4

issue 1

  • concat_dim getting added to *all* variables of multifile datasets · 14 ✖

author_association 1

  • MEMBER · 14 ✖
id html_url issue_url node_id user created_at updated_at ▲ author_association body reactions performed_via_github_app issue
531818131 https://github.com/pydata/xarray/issues/2064#issuecomment-531818131 https://api.github.com/repos/pydata/xarray/issues/2064 MDEyOklzc3VlQ29tbWVudDUzMTgxODEzMQ== dcherian 2448579 2019-09-16T15:03:12Z 2019-09-16T15:03:12Z MEMBER

#3239 has been merged. Now minimal is more useful since you can specify compat="override" to skip compatibility checking.
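As a minimal sketch of what this combination looks like in practice (the dataset layout, the `make_ds` helper and the variable names are illustrative assumptions, not taken from the issue), the snippet below concatenates two datasets along an existing `time` dimension. The time-invariant variable differs by a tiny amount between the two inputs, so an equality check would reject it, whereas `compat="override"` skips the comparison and takes the variable from the first dataset.

```python
import numpy as np
import xarray as xr


def make_ds(times, depth_offset):
    """Build a small illustrative dataset (hypothetical layout)."""
    return xr.Dataset(
        {
            # Varies in time: should be concatenated.
            "temperature": (("time", "depth"), np.zeros((len(times), 3))),
            # Time-invariant: should NOT gain a time dimension.
            "ref_bottom_depth": (
                "depth",
                np.array([10.0, 20.0, 30.0]) + depth_offset,
            ),
        },
        coords={"time": times},
    )


ds_jan = make_ds([0, 1], depth_offset=0.0)
ds_feb = make_ds([2, 3], depth_offset=1e-9)  # tiny round-off style difference

combined = xr.concat(
    [ds_jan, ds_feb],
    dim="time",
    data_vars="minimal",  # only concatenate variables that already have "time"
    coords="minimal",
    compat="override",    # skip equality checks, take the first dataset's copy
)

print(combined["temperature"].shape)
print(combined["ref_bottom_depth"].dims)
```

Here `ref_bottom_depth` keeps its single `depth` dimension and its values come from `ds_jan`; no comparison between the two copies is performed.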

What's left is to change the defaults to implement @shoyer's comment:

So I'm thinking that we probably want to combine "all" and "minimal" into a single mode to use as the default, and remove the other behavior, which is either useless or broken. Maybe it would make sense to come up with a new name for this mode, and to make both "all" and "minimal" deprecated aliases for it? In the long term, this leaves only two "automatic" modes for xarray.concat, which should make things simpler for users trying to figure this out.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  concat_dim getting added to *all* variables of multifile datasets 314764258
524021001 https://github.com/pydata/xarray/issues/2064#issuecomment-524021001 https://api.github.com/repos/pydata/xarray/issues/2064 MDEyOklzc3VlQ29tbWVudDUyNDAyMTAwMQ== dcherian 2448579 2019-08-22T18:22:37Z 2019-08-22T18:22:37Z MEMBER

Thanks for your input @bonnland.

The pandas concat() function uses the option join = {'inner', 'outer', 'left', 'right'} in order to mimic logical database join operations. If there is a reason that xarray cannot do the same, it is not obvious to me. I think the pandas options have the advantage of logical simplicity and traditional usage within database systems.

We do have a join argument that takes these options plus 'override', which was added recently to skip expensive comparisons. It applies to "indexes", i.e. "dimension coordinates". An example: if you have two DataArrays, one on a coordinate x=[1, 2, 3] and the other on x=[2, 3, 4], join lets you control the x coordinate of the output. This is done by xr.align.

What's under discussion here is what to do about variables that are duplicated across datasets, and indeed how we can know that these variables are duplicated across datasets while concatenating other variables.
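The x=[1, 2, 3] and x=[2, 3, 4] case can be sketched as follows. The data values are made up for illustration; only `xr.align` and its `join` argument come from the comment above.

```python
import xarray as xr

a = xr.DataArray([10, 20, 30], dims="x", coords={"x": [1, 2, 3]})
b = xr.DataArray([200, 300, 400], dims="x", coords={"x": [2, 3, 4]})

# "inner": keep only the labels present in both objects.
inner_a, inner_b = xr.align(a, b, join="inner")

# "outer": keep the union of labels, filling gaps with NaN.
outer_a, outer_b = xr.align(a, b, join="outer")

# "left": use the labels of the first object.
left_a, left_b = xr.align(a, b, join="left")

# "override": skip the comparison and reuse the first object's index
# (only possible because both indexes have the same size).
over_a, over_b = xr.align(a, b, join="override")

for name, obj in [("inner", inner_a), ("outer", outer_a),
                  ("left", left_b), ("override", over_b)]:
    print(name, obj["x"].values.tolist())
```

Note that `join` only governs index (dimension coordinate) alignment; it says nothing about non-index variables that are duplicated across datasets.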

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  concat_dim getting added to *all* variables of multifile datasets 314764258
523960862 https://github.com/pydata/xarray/issues/2064#issuecomment-523960862 https://api.github.com/repos/pydata/xarray/issues/2064 MDEyOklzc3VlQ29tbWVudDUyMzk2MDg2Mg== dcherian 2448579 2019-08-22T15:42:10Z 2019-08-22T15:42:10Z MEMBER

I have a draft solution in #3239. It adds a new mode called "sensible" that acts like "all" when the concat dimension doesn't exist in the dataset and acts like "minimal" when the dimension is present. We can decide later whether adding a new mode is the right way to do this; the more fundamental problem is below.

The issue is dealing with variables that should not be concatenated in "minimal" mode (e.g. time-invariant non-dimension coordinates when concatenating in time). In this case, we want to skip the equality checks in _calc_concat_over. These checks are a common reason for poor open_mfdataset performance.
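The rule described for this mode can be sketched with a small helper. The function name `pick_data_vars_mode` is hypothetical and this is not the code from #3239; it only illustrates the decision "use 'minimal' if the dimension already exists, otherwise 'all'" on top of the existing `data_vars` options.

```python
import xarray as xr


def pick_data_vars_mode(datasets, dim):
    """Hypothetical helper: choose a data_vars mode for xr.concat."""
    dim_exists = any(dim in ds.dims for ds in datasets)
    # Existing dimension: behave like np.concatenate, touch only variables
    # that already have the dimension. New dimension: stack everything.
    return "minimal" if dim_exists else "all"


members = [
    xr.Dataset({"t": ("x", [1.0, 2.0])}),
    xr.Dataset({"t": ("x", [3.0, 4.0])}),
]

mode_new = pick_data_vars_mode(members, "run")  # "run" is a new dimension
mode_existing = pick_data_vars_mode(members, "x")  # "x" already exists

stacked = xr.concat(members, dim="run", data_vars=mode_new)
print(mode_new, mode_existing, stacked["t"].shape)
```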

I thought the clean way to do this would be to add the compat kwarg to concat and then add compat='override' since the current behaviour is effectively compat='equals'.

However, merge takes compat too, and concat and merge support different compat arguments at present. This makes it hard to thread compat down from combine or open_mfdataset without adding concat_compat and merge_compat, which is silly.

So do we want to support all the other compat modes in concat? Things like broadcast_equals or no_conflicts are funny because they're basically merge operations, and supporting them means concat acts like stack, concat and merge all at once. OTOH if you have a set of variables with the same name from different datasets and you want to pick one of those (i.e. no concatenation), then you're basically doing merge anyway. This would require some refactoring since concat assumes the first dataset is a template for the rest.

@shoyer What do you think?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  concat_dim getting added to *all* variables of multifile datasets 314764258
519149757 https://github.com/pydata/xarray/issues/2064#issuecomment-519149757 https://api.github.com/repos/pydata/xarray/issues/2064 MDEyOklzc3VlQ29tbWVudDUxOTE0OTc1Nw== dcherian 2448579 2019-08-07T15:32:16Z 2019-08-07T15:32:16Z MEMBER

Maybe it would make sense to come up with a new name for this mode, and to make both "all" and "minimal" deprecated aliases for it?

I'm in favour of this. What should we name this mode?

One comment on "existing dimensions" mode:

  • "minimal" does the right thing, concatenating only variables with the dimension.

For variables without the dimension, this will still raise a ValueError because compat can only be 'equals' or 'identical'. It seems to me like we need compat='override' and/or compat='tolerance', tolerance=... that would use numpy's approximate equality testing. This checking of non-dimensional coordinates is a common source of mfdataset issues. What do you think?
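To illustrate the difference between exact and approximate equality testing mentioned here, a rough NumPy-only sketch follows. The `variables_match` helper and its `tolerance` parameter are hypothetical; they are not an xarray API, and compat='tolerance' is only a proposal in this thread.

```python
import numpy as np


def variables_match(first, second, tolerance=None):
    """Hypothetical check: exact equality, or approximate if a tolerance is given."""
    if tolerance is None:
        # Roughly what an exact "equals"-style check amounts to.
        return bool(np.array_equal(first, second))
    # Approximate comparison with an absolute tolerance only.
    return bool(np.allclose(first, second, rtol=0.0, atol=tolerance))


depth_a = np.array([10.0, 20.0, 30.0])
depth_b = depth_a + 1e-9  # e.g. round-off differences between files

print(variables_match(depth_a, depth_b))                  # exact comparison
print(variables_match(depth_a, depth_b, tolerance=1e-6))  # approximate comparison
```

The exact comparison rejects the pair, which is the situation that raises a ValueError today; the approximate one accepts it.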

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  concat_dim getting added to *all* variables of multifile datasets 314764258
512036050 https://github.com/pydata/xarray/issues/2064#issuecomment-512036050 https://api.github.com/repos/pydata/xarray/issues/2064 MDEyOklzc3VlQ29tbWVudDUxMjAzNjA1MA== shoyer 1217238 2019-07-16T23:09:24Z 2019-07-16T23:09:24Z MEMBER

UPDATE: @shoyer it could be that unit tests are failing because, as your final example shows, you get an error for data_vars='minimal' if any variables have different values across datasets, when adding a new concatenation dimension. If this is the reason so many unit tests are failing, then the failures are a red herring and should probably be ignored/rewritten.

This seems very likely to me. The existing behavior of data_vars='minimal' is only useful in "existing dimensions mode".

Xarray's unit test suite is definitely a good "smoke test" for understanding the impact of changes to concat on our users. What it tells us is that we can't change the default value from "all" to "minimal" without breaking existing code. Instead, we need to change how "all" or "minimal" works, or switch to yet another mode for the new behavior.

The tests we should feel free to rewrite are cases where we set data_vars="all" or data_vars="minimal" explicitly for verifying the weird edge behaviors that I noted in my earlier comments. There shouldn't be too many of these tests.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  concat_dim getting added to *all* variables of multifile datasets 314764258
512000102 https://github.com/pydata/xarray/issues/2064#issuecomment-512000102 https://api.github.com/repos/pydata/xarray/issues/2064 MDEyOklzc3VlQ29tbWVudDUxMjAwMDEwMg== shoyer 1217238 2019-07-16T21:44:52Z 2019-07-16T21:44:52Z MEMBER

Specifically, what should the default behavior of concat() be, when both datasets include a variable that does not include the concatenation dimension? Currently, the concat dimension is added, and the result is a "stacked" version of the variable. Others have argued that this variable should not be included in the concat() result by default, but this appears to break compatibility with Pandas concat().

Can you give a specific example of the behavior in question?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  concat_dim getting added to *all* variables of multifile datasets 314764258
511611430 https://github.com/pydata/xarray/issues/2064#issuecomment-511611430 https://api.github.com/repos/pydata/xarray/issues/2064 MDEyOklzc3VlQ29tbWVudDUxMTYxMTQzMA== shoyer 1217238 2019-07-15T23:54:47Z 2019-07-15T23:54:47Z MEMBER

The logic for determining which variables to concatenate is in the _calc_concat_over helper function: https://github.com/pydata/xarray/blob/539fb4a98d0961c281daa5474a8e492a0ae1d8a2/xarray/core/concat.py#L146

Only "different" is supposed to load variables into memory to determine which ones to concatenate.

Right now we also have "all" and "minimal" options:

  • "all" attempts to concatenate every variable that can be broadcast to a matching shape: https://github.com/pydata/xarray/blob/539fb4a98d0961c281daa5474a8e492a0ae1d8a2/xarray/core/concat.py#L188-L190
  • "minimal" only concatenates variables that already have the matching dimension.

Recall that concat handles two types of concatenation: existing dimensions (corresponding to np.concatenate) and new dimensions (corresponding to np.stack). Currently, this is all done together in one messy codebase, but logically it would be cleaner to separate these modes into two separate functions:

  • In "existing dimensions" mode:
      • "all" is currently broken, because it will also concatenate variables that don't have the dimension.
      • "minimal" does the right thing, concatenating only variables with the dimension.
  • In "new dimensions" mode:
      • "all" will add the dimension to all variables.
      • "minimal" raises an error if any data variables have different values across your datasets. This is pretty much useless.
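The two kinds of concatenation, and the difference between "all" and "minimal" in "existing dimensions" mode, can be sketched like this. The dataset contents and variable names are illustrative assumptions; the linked notebook remains the reference for the behavior reported in this comment.

```python
import numpy as np
import xarray as xr

# NumPy analogy: existing dimension vs. new dimension.
a, b = np.array([1, 2]), np.array([3, 4])
joined = np.concatenate([a, b])  # extends the existing axis
stacked = np.stack([a, b])       # creates a new leading axis

# "Existing dimensions" mode in xarray: "time" is already a dimension.
ds1 = xr.Dataset(
    {"temperature": ("time", [1.0, 2.0]),
     "ref_bottom_depth": ("depth", [10.0, 20.0, 30.0])},
    coords={"time": [0, 1]},
)
ds2 = xr.Dataset(
    {"temperature": ("time", [3.0, 4.0]),
     "ref_bottom_depth": ("depth", [10.0, 20.0, 30.0])},
    coords={"time": [2, 3]},
)

all_mode = xr.concat([ds1, ds2], dim="time", data_vars="all")
minimal_mode = xr.concat([ds1, ds2], dim="time", data_vars="minimal")

# "all" also gives the time-invariant variable a "time" dimension;
# "minimal" leaves it alone.
print(all_mode["ref_bottom_depth"].dims)
print(minimal_mode["ref_bottom_depth"].dims)
```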

Here's my notebook testing this out: https://gist.github.com/shoyer/f44300eddda4f7c476c61f76d1df938b

So I'm thinking that we probably want to combine "all" and "minimal" into a single mode to use as the default, and remove the other behavior, which is either useless or broken. Maybe it would make sense to come up with a new name for this mode, and to make both "all" and "minimal" deprecated aliases for it? In the long term, this leaves only two "automatic" modes for xarray.concat, which should make things simpler for users trying to figure this out.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  concat_dim getting added to *all* variables of multifile datasets 314764258
511468454 https://github.com/pydata/xarray/issues/2064#issuecomment-511468454 https://api.github.com/repos/pydata/xarray/issues/2064 MDEyOklzc3VlQ29tbWVudDUxMTQ2ODQ1NA== dcherian 2448579 2019-07-15T16:15:51Z 2019-07-15T16:15:51Z MEMBER

@bonnland I don't think you want to change the default data_vars but instead update the heuristics as in this comment:

we shouldn't implicitly add a new dimension to variables in the case where the dimension already exists in the dataset. We only need the heuristics/comparisons when an entirely new dimension is being added.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  concat_dim getting added to *all* variables of multifile datasets 314764258
381975937 https://github.com/pydata/xarray/issues/2064#issuecomment-381975937 https://api.github.com/repos/pydata/xarray/issues/2064 MDEyOklzc3VlQ29tbWVudDM4MTk3NTkzNw== rabernat 1197350 2018-04-17T12:34:15Z 2018-04-17T12:34:15Z MEMBER

I'm glad!

FWIW, I think this is a relatively simple fix within xarray. @xylar, if you are game, we would love to see a PR from you. Could be a good opportunity to learn more about xarray internals.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  concat_dim getting added to *all* variables of multifile datasets 314764258
381728814 https://github.com/pydata/xarray/issues/2064#issuecomment-381728814 https://api.github.com/repos/pydata/xarray/issues/2064 MDEyOklzc3VlQ29tbWVudDM4MTcyODgxNA== shoyer 1217238 2018-04-16T19:55:24Z 2018-04-16T19:55:24Z MEMBER

I stand corrected. In 0.10.1, I also see the Time variable getting added to refBottomDepth when I open multiple files. So maybe this is not in fact a new problem but an existing issue that happened to behave as I expected only when opening a single file in previous versions. Sorry for not noticing that sooner.

OK, in that case I think #2048 was still the right change/bug-fix, making multi-file and single-file behavior consistent.

But you certainly have exposed a real issue here.

But this issue raises an important basic point: we might want different behavior for variables in which concat_dim is already a dimension vs. variables for which it is not.

Yes, we shouldn't implicitly add a new dimension to variables in the case where the dimension already exists in the dataset. We only need the heuristics/comparisons when an entirely new dimension is being added.

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  concat_dim getting added to *all* variables of multifile datasets 314764258
381725478 https://github.com/pydata/xarray/issues/2064#issuecomment-381725478 https://api.github.com/repos/pydata/xarray/issues/2064 MDEyOklzc3VlQ29tbWVudDM4MTcyNTQ3OA== rabernat 1197350 2018-04-16T19:44:00Z 2018-04-16T19:44:00Z MEMBER

But this issue raises an important basic point: we might want different behavior for variables in which concat_dim is already a dimension vs. variables for which it is not.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  concat_dim getting added to *all* variables of multifile datasets 314764258
381722944 https://github.com/pydata/xarray/issues/2064#issuecomment-381722944 https://api.github.com/repos/pydata/xarray/issues/2064 MDEyOklzc3VlQ29tbWVudDM4MTcyMjk0NA== rabernat 1197350 2018-04-16T19:35:12Z 2018-04-16T19:35:12Z MEMBER

so you're fooling xarray into not including the time dimension in your non-time variables by making them coordinates in the above example?

Exactly. They are coordinates. Those variables are usually related to grid geometry or constants, as I presume is refBottomDepth in your example.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  concat_dim getting added to *all* variables of multifile datasets 314764258
381717472 https://github.com/pydata/xarray/issues/2064#issuecomment-381717472 https://api.github.com/repos/pydata/xarray/issues/2064 MDEyOklzc3VlQ29tbWVudDM4MTcxNzQ3Mg== rabernat 1197350 2018-04-16T19:15:19Z 2018-04-16T19:15:19Z MEMBER

👍 This is a persistent problem for me as well.

I often find myself writing a preprocessor function like this:

    import xarray as xr

    def process_coords(ds, concat_dim='time', drop=True):
        # Data variables that do not have the concat dimension
        coord_vars = [v for v in ds.data_vars
                      if concat_dim not in ds[v].dims]
        if drop:
            # Remove them entirely
            return ds.drop(coord_vars)
        else:
            # Or turn them into coordinates
            return ds.set_coords(coord_vars)

    ds = xr.open_mfdataset('*.nc', preprocess=process_coords)

The reason to drop the coordinates is to avoid the comparison that happens when you concatenate coords.
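To see what such a preprocessor does without needing any files, here is a small in-memory sketch. The dataset is made up, and `drop_vars` is used as an assumption in place of the older `drop` spelling above.

```python
import numpy as np
import xarray as xr


def process_coords(ds, concat_dim="time", drop=True):
    # Data variables that do not depend on the concat dimension.
    coord_vars = [v for v in ds.data_vars if concat_dim not in ds[v].dims]
    if drop:
        return ds.drop_vars(coord_vars)
    return ds.set_coords(coord_vars)


ds = xr.Dataset(
    {
        "temperature": (("time", "depth"), np.zeros((2, 3))),
        "ref_bottom_depth": ("depth", [10.0, 20.0, 30.0]),
    },
    coords={"time": [0, 1]},
)

dropped = process_coords(ds)               # time-invariant variable removed
promoted = process_coords(ds, drop=False)  # time-invariant variable becomes a coordinate

print(list(dropped.data_vars))
print(list(promoted.coords))
```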

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  concat_dim getting added to *all* variables of multifile datasets 314764258
381707540 https://github.com/pydata/xarray/issues/2064#issuecomment-381707540 https://api.github.com/repos/pydata/xarray/issues/2064 MDEyOklzc3VlQ29tbWVudDM4MTcwNzU0MA== shoyer 1217238 2018-04-16T18:42:06Z 2018-04-16T18:42:06Z MEMBER

What happens if you open multiple files with open_mfdataset(), e.g., for both January and February? Does it result in a dataset with the right dimensions on each variable?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  concat_dim getting added to *all* variables of multifile datasets 314764258

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);