
issue_comments


5 rows where author_association = "MEMBER", issue = 314764258 ("concat_dim getting added to *all* variables of multifile datasets") and user = 2448579, sorted by updated_at descending

Comment 531818131 · dcherian · MEMBER · 2019-09-16T15:03:12Z
https://github.com/pydata/xarray/issues/2064#issuecomment-531818131

#3239 has been merged. Now "minimal" is more useful, since you can specify compat="override" to skip compatibility checking.
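
A minimal sketch of that combination (illustrative data; assumes an xarray version that includes #3239, which added compat="override"):

import xarray as xr

# Two illustrative "files" to combine along time.
ds1 = xr.Dataset({"temp": ("time", [1.0])}, coords={"time": [0], "lon": 10.0})
ds2 = xr.Dataset({"temp": ("time", [2.0])}, coords={"time": [1], "lon": 10.0})

# data_vars="minimal"/coords="minimal" concatenate only variables that
# already have the "time" dimension; compat="override" skips the equality
# checks and copies everything else from the first dataset.
combined = xr.concat(
    [ds1, ds2], dim="time", data_vars="minimal", coords="minimal", compat="override"
)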

What's left is to change defaults to implement @shoyer's comment:

> So I'm thinking that we probably want to combine "all" and "minimal" into a single mode to use as the default, and remove the other behavior, which is either useless or broken. Maybe it would make sense to come up with a new name for this mode, and to make both "all" and "minimal" deprecated aliases for it? In the long term, this leaves only two "automatic" modes for xarray.concat, which should make things simpler for users trying to figure this out.

Comment 524021001 · dcherian · MEMBER · 2019-08-22T18:22:37Z
https://github.com/pydata/xarray/issues/2064#issuecomment-524021001

Thanks for your input @bonnland.

> The pandas concat() function uses the option join = {'inner', 'outer', 'left', 'right'} in order to mimic logical database join operations. If there is a reason that xarray cannot do the same, it is not obvious to me. I think the pandas options have the advantage of logical simplicity and traditional usage within database systems.

We do have a join argument that takes these options plus 'override', which was added recently to skip expensive comparisons. It works on "indexes" or "dimension coordinates". An example: if you have two DataArrays, one on a coordinate x=[1, 2, 3] and the other on x=[2, 3, 4], join lets you control the x coordinate of the output. This is done by xr.align.
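
A short sketch of that example (values are made up):

import xarray as xr

a = xr.DataArray([10, 20, 30], dims="x", coords={"x": [1, 2, 3]})
b = xr.DataArray([40, 50, 60], dims="x", coords={"x": [2, 3, 4]})

# join controls the x coordinate of the output:
inner_a, inner_b = xr.align(a, b, join="inner")  # x = [2, 3]
outer_a, outer_b = xr.align(a, b, join="outer")  # x = [1, 2, 3, 4]; gaps become NaN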

What's under discussion here is what to do about variables that are duplicated across datasets, and indeed how we can even tell that they are duplicated when concatenating the other variables.

Comment 523960862 · dcherian · MEMBER · 2019-08-22T15:42:10Z
https://github.com/pydata/xarray/issues/2064#issuecomment-523960862

I have a draft solution in #3239. It adds a new mode called "sensible" that acts like "all" when the concat dimension doesn't exist in the dataset, and like "minimal" when it does. We can decide whether adding a new mode is the right approach, but the more fundamental problem is described below.

The issue is dealing with variables that should not be concatenated in "minimal" mode (e.g. time-invariant non-dimensional coords when concatenating in time). In this case, we want to skip the equality checks in _calc_concat_over. These checks are a common reason for poor open_mfdataset performance.
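
An illustrative sketch of the problem (dataset contents are made up):

import numpy as np
import xarray as xr

# Two "files" sharing a large time-invariant coordinate.
lon = np.linspace(0, 360, 100_000)
ds1 = xr.Dataset(
    {"temp": (("time", "x"), np.random.rand(1, lon.size))},
    coords={"time": [0], "lon": ("x", lon)},
)
ds2 = ds1.assign_coords(time=[1])

# With the defaults, concat compares "lon" across every dataset before
# deciding not to concatenate it; across hundreds of files these equality
# checks are a common cause of slow open_mfdataset calls.
combined = xr.concat([ds1, ds2], dim="time")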

I thought the clean way to do this would be to add the compat kwarg to concat and then add compat='override', since the current behaviour is effectively compat='equals'.

However, merge takes compat too, and concat and merge currently support different sets of compat values. That makes it complicated to thread compat down from combine or open_mfdataset without adding separate concat_compat and merge_compat arguments, which would be silly.

So do we want to support all the other compat modes in concat? Modes like broadcast_equals or no_conflicts are odd here because they're basically merge operations, so supporting them would make concat act like stack, concat, and merge all at once. OTOH, if you have a set of identically named variables from different datasets and you want to pick one of them (i.e. no concatenation), then you're basically doing a merge anyway. This would require some refactoring, since concat assumes the first dataset is a template for the rest.
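
For the "pick one of the duplicates" case, a hedged sketch of how merge already behaves (names are made up):

import xarray as xr

# Two datasets carry an identical variable under the same name; merge
# keeps a single copy instead of concatenating it.
a = xr.Dataset({"mask": ("x", [1, 0, 1])})
b = xr.Dataset({"mask": ("x", [1, 0, 1])})

merged = xr.merge([a, b], compat="equals")  # one "mask" variable, no new dimension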

@shoyer What do you think?

Comment 519149757 · dcherian · MEMBER · 2019-08-07T15:32:16Z
https://github.com/pydata/xarray/issues/2064#issuecomment-519149757

> Maybe it would make sense to come up with a new name for this mode, and to make both "all" and "minimal" deprecated aliases for it?

I'm in favour of this. What should we name this mode?

One comment on "existing dimensions" mode:

  • "minimal" does the right thing, concatenating only variables with the dimension.

For variables without the dimension, this will still raise a ValueError, because compat can only be 'equals' or 'identical'. It seems to me that we need compat='override' and/or a compat='tolerance' option (with a tolerance=... argument) that would use numpy's approximate equality testing. This checking of non-dimensional coordinates is a common source of open_mfdataset issues. What do you think?
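
compat='tolerance' does not exist in xarray; a hypothetical sketch of the approximate check it might perform:

import numpy as np
import xarray as xr

def approx_equal(a: xr.Variable, b: xr.Variable, tolerance: float = 1e-6) -> bool:
    # Hypothetical helper: treat two variables as "equal" if their values
    # agree within an absolute tolerance, rather than requiring the exact
    # equality that compat='equals' performs today.
    return (
        a.dims == b.dims
        and a.shape == b.shape
        and bool(np.allclose(a.values, b.values, atol=tolerance))
    )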

Comment 511468454 · dcherian · MEMBER · 2019-07-15T16:15:51Z
https://github.com/pydata/xarray/issues/2064#issuecomment-511468454

@bonnland I don't think you want to change the default data_vars, but rather to update the heuristics as in this comment:

> we shouldn't implicitly add a new dimension to variables in the case where the dimension already exists in the dataset. We only need the heuristics/comparisons when an entirely new dimension is being added.
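
A brief sketch of that heuristic (made-up data); the data_vars="minimal" line shows the behaviour the quote asks for:

import xarray as xr

# Two "files" split along an existing "time" dimension; "station_name"
# has no "time" dimension and should not gain one.
ds1 = xr.Dataset({"temp": ("time", [1.0]), "station_name": ((), "A")}, coords={"time": [0]})
ds2 = xr.Dataset({"temp": ("time", [2.0]), "station_name": ((), "A")}, coords={"time": [1]})

# The default data_vars="all" adds "time" to *every* data variable, which
# is the behaviour this issue is about; data_vars="minimal" concatenates
# only the variables that already have the dimension.
bad = xr.concat([ds1, ds2], dim="time")                        # station_name gains a time dim
good = xr.concat([ds1, ds2], dim="time", data_vars="minimal")  # station_name stays scalar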


Table schema

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);
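
For completeness, a hypothetical way to reproduce the filtered view above against the backing SQLite database ("github.db" is an assumed filename):

import sqlite3

# Query the issue_comments table with the same filters and ordering as
# the view at the top of this page.
conn = sqlite3.connect("github.db")
rows = conn.execute(
    """
    SELECT [id], [created_at], [body]
    FROM [issue_comments]
    WHERE [author_association] = 'MEMBER'
      AND [issue] = 314764258
      AND [user] = 2448579
    ORDER BY [updated_at] DESC
    """
).fetchall()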