
issue_comments

7 rows where issue = 314764258 (concat_dim getting added to *all* variables of multifile datasets) and user = 10638475 (bonnland, author_association NONE), sorted by updated_at descending

523968827 · bonnland · NONE · created 2019-08-22T16:01:40Z · updated 2019-08-22T16:08:58Z
https://github.com/pydata/xarray/issues/2064#issuecomment-523968827

I have tried to understand why the xarray developers decided to provide their own options for concatenation. I am not an experienced user of xarray, but I can't find any discussion of how the current options for concatenation were derived.

The pandas concat() function uses a join option ({'inner', 'outer'}; merge() and join() additionally offer 'left' and 'right') to mimic logical database join operations. If there is a reason that xarray cannot do the same, it is not obvious to me. I think the pandas options have the advantage of logical simplicity and traditional usage in database systems.

Perhaps the reason is that xarray is modeling collections of variables, rather than a single dataframe, as with pandas. But even then, it seems like the pandas rules can be applied on a per-variable basis.
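
For illustration, a minimal sketch of those pandas semantics (the frames and values here are invented for this example):

import pandas as pd

df1 = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=[0, 1])
df2 = pd.DataFrame({"b": [5, 6], "c": [7, 8]}, index=[1, 2])

# join='outer' (the default) keeps the union of columns, padding gaps with NaN
print(pd.concat([df1, df2], join="outer"))

# join='inner' keeps only the columns present in every frame
print(pd.concat([df1, df2], join="inner"))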

512005032 · bonnland · NONE · created 2019-07-16T22:01:59Z · updated 2019-07-16T22:50:39Z
https://github.com/pydata/xarray/issues/2064#issuecomment-512005032

> Can you give a specific example of the behavior in question?

Here is the most specific thing I can say: if I switch the default value of data_vars to 'minimal' for concat() and open_mfdataset(), I get a lot of failing unit tests (when running "pytest xarray -n 4"). I may be wrong about why they are failing. The unit tests have comments in them like "Check pandas compatibility"; see, for example, line 370 in test_duck_array_ops.py for an instruction that raises a ValueError exception. Many failures appear to be caused by a ValueError being raised, as in the final example in your notebook.

I hope this is specific enough; I realize that I'm not deeply comprehending what the unit tests are actually supposed to be testing.

UPDATE: @shoyer it could be that the unit tests are failing because, as your final example shows, data_vars='minimal' raises an error if any variables have different values across datasets when adding a new concatenation dimension. If that is why so many unit tests fail, then the failures are a red herring and those tests should probably be ignored/rewritten.
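
A small sketch of the failure mode described in the UPDATE (datasets invented; as noted above, concat with data_vars='minimal' raises when a variable lacking the concat dimension differs across datasets):

import xarray as xr

# "x" lacks the concatenation dimension "t" and differs between datasets
ds1 = xr.Dataset({"x": ("y", [1, 2])}, coords={"t": 0})
ds2 = xr.Dataset({"x": ("y", [3, 4])}, coords={"t": 1})

# With data_vars='minimal', "x" is not concatenated along "t"; since a
# single copy cannot represent both versions, concat raises an error.
try:
    xr.concat([ds1, ds2], dim="t", data_vars="minimal")
except ValueError as err:
    print(err)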

511987332 · bonnland · NONE · created 2019-07-16T21:06:58Z · updated 2019-07-16T21:06:58Z
https://github.com/pydata/xarray/issues/2064#issuecomment-511987332

@shoyer I'm sorry I didn't look at your examples more closely at first. I see now that your first example of using data_vars='minimal' already preserves one instance of the variable x, whereas I had suggested earlier that this variable was not being included in the concatenation.

So I am not clear on why so many unit tests fail when I switch the default value of data_vars to 'minimal'. The output from your examples seems compatible with pandas concat(), though I don't understand pandas very well yet.

I wonder if the failing unit tests are written correctly. I should add that I spent an entire day trying to understand the code in concat.py, stepping through it for several unit tests, and I found it quite difficult to follow.

511903346 · bonnland · NONE · created 2019-07-16T17:06:46Z · updated 2019-07-16T17:47:45Z
https://github.com/pydata/xarray/issues/2064#issuecomment-511903346

@shoyer Your explanation makes sense, but there are unit tests that expect the default concat() behavior to match the default behavior of pandas concat(), which performs an "outer" join between dataframes.

Therefore, from my limited understanding, the default behavior for xarray concat() should be to preserve all variables. If this default changes, it may break code that relies on the current behavior.

Can we get a perspective from the author of concat.py, @TomNicholas? Thanks.

Specifically, what should the default behavior of concat() be when both datasets include a variable that does not include the concatenation dimension? Currently, the concat dimension is added, and the result is a "stacked" version of the variable. Others have argued that such a variable should not be included in the concat() result by default, but that appears to break compatibility with pandas concat(). Another possibility is to include only the first instance of the variable in the result, discarding the other instances, so that no "stacking" dimension is needed; this would lose information, however, whenever the instances are not identical.
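
For concreteness, a sketch of the current default behavior with invented data; even though the two copies of "v" are identical, "v" still gains the new dimension:

import xarray as xr

# "v" does not include the concatenation dimension "t"
ds1 = xr.Dataset({"v": ("y", [1, 2])}, coords={"t": 0})
ds2 = xr.Dataset({"v": ("y", [1, 2])}, coords={"t": 1})

# data_vars='all' (the current default) stacks "v" along the new "t" dimension
result = xr.concat([ds1, ds2], dim="t")
print(result["v"].dims)   # ('t', 'y')
print(result["v"].shape)  # (2, 2)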

511583067 · bonnland · NONE · created 2019-07-15T21:48:15Z · updated 2019-07-15T21:50:50Z
https://github.com/pydata/xarray/issues/2064#issuecomment-511583067

@dcherian, I believe you are correct in principle, but there is a logical problem that is expensive to evaluate. The difficult case is when two datasets have a variable with the same name and that variable does not include the concatenation dimension. To align the datasets for concatenation, the two instances would need to be identical, and the resulting dataset would then keep just one (unchanged) copy of the variable, say from the first dataset. I think someone along the way decided this check was too expensive. This is from concat.py, lines 302-307:

# stack up each variable to fill-out the dataset (in order)
for k in datasets[0].variables:
    if k in concat_over:
        vars = ensure_common_dims([ds.variables[k] for ds in datasets])
        combined = concat_vars(vars, dim, positions)
        insert_result_variable(k, combined)

So I think some consensus needs to be reached about whether it is a good idea to load these variables into memory to check whether they are identical.

Or, another possibility is to leave "unique" variables alone: if a variable exists only once across all the datasets being concatenated, we do not add the concatenation dimension to it (see the sketch below). This might solve the original poster @xylar's issue when opening a single dataset.
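
A hypothetical sketch of how such "unique" variables could be identified (this is not existing xarray code; the datasets are invented):

from collections import Counter

import xarray as xr

datasets = [
    xr.Dataset({"shared": ("y", [1, 2]), "only_in_first": ("y", [9, 9])}),
    xr.Dataset({"shared": ("y", [3, 4])}),
]

# A variable is "unique" if it occurs in exactly one dataset; under the rule
# proposed above, it would be copied through unchanged instead of gaining
# the concatenation dimension.
counts = Counter(name for ds in datasets for name in ds.data_vars)
unique_vars = {name for name, n in counts.items() if n == 1}
print(unique_vars)  # {'only_in_first'}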

511210149 · bonnland · NONE · created 2019-07-14T15:03:29Z · updated 2019-07-14T21:36:03Z
https://github.com/pydata/xarray/issues/2064#issuecomment-511210149

So there are some unit tests that assert that the behavior of open_mfdataset() is identical to the behavior of concat(). This implies that if we change the default data_vars value from "all" to "minimal" for one function, we need to change it for both.

@shoyer I think you suggested in #2145 that the default behavior of concat() should change in the same way it will change for open_mfdataset(). So I am going to add this change to the pull request.

UPDATE: There is a problem with moving concat() away from data_vars='all': it breaks many unit tests that check for compatibility with pandas. What I've been told is that pandas concat() will include all unique variables from each dataframe, which is what data_vars='all' also does. By changing to data_vars='minimal', only data variables with the specified concatenation dimension are included. So it seems that, to stay compatible with pandas, we need to include all data variables but not add the concatenation dimension to data variables that do not already have it. The problem, however, is what to do when both datasets have a variable x without the concatenation dimension. What should the resulting concatenation look like? I think in this case the concat dimension should be added, to preserve information; it is just the unique variables that should be left alone.

Please, could someone confirm that I have understood the problem correctly? Thank you in advance.

511140703 · bonnland · NONE · created 2019-07-13T17:41:31Z · updated 2019-07-13T22:51:13Z
https://github.com/pydata/xarray/issues/2064#issuecomment-511140703

I'm a new developer at the SciPy 2019 Sprints, and I'm interested in making a pull request for this issue as a learning step.

@henrikca Would this be useful? Or would this conflict with your work?

I went ahead and made a pull request, because it looks like a potentially very small change.

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);
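
For reference, a Python sketch of the query this page renders (the database file name github.db is an assumption; adjust to your local copy):

import sqlite3

conn = sqlite3.connect("github.db")  # hypothetical local copy of the database
rows = conn.execute(
    """
    SELECT id, created_at, updated_at, body
    FROM issue_comments
    WHERE issue = 314764258 AND [user] = 10638475
    ORDER BY updated_at DESC
    """
).fetchall()
print(len(rows))  # 7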