home / github

Menu
  • GraphQL API
  • Search all tables

issue_comments

Table actions
  • GraphQL API for issue_comments

2 rows where issue = 397063221 and user = 35968931 sorted by updated_at descending

✎ View and edit SQL

This data as json, CSV (advanced)

user 1

  • TomNicholas · 2 ✖

issue 1

  • open_mfdataset in v.0.11.1 is very slow · 2 ✖

author_association 1

  • MEMBER 2
id html_url issue_url node_id user created_at updated_at ▲ author_association body reactions performed_via_github_app issue
454423937 https://github.com/pydata/xarray/issues/2662#issuecomment-454423937 https://api.github.com/repos/pydata/xarray/issues/2662 MDEyOklzc3VlQ29tbWVudDQ1NDQyMzkzNw== TomNicholas 35968931 2019-01-15T15:05:22Z 2019-01-15T15:05:22Z MEMBER

Yes thankyou @malmans2, this is very helpful!

I suspect the issue is that we're now using some different combination of merge/concat.

This was very puzzling because the code is supposed to split the datasets up according to their data variables, which means merge won't be used to concatenate and this should be fast, as before.

But I found the problem! In _auto_combine_1d I should have sorted the datasets before attempting to group them by data variable, i.e. I needed a line python sorted_datasets = sorted(datasets, key=lambda ds: tuple(sorted(ds)))

before

python grouped = itertools.groupby(sorted_datasets, key=lambda ds: tuple(sorted(ds)))

With this change then I get ```python

No longer slow if netCDFs are stored in several folders:

%timeit ds_2folders = xr.open_mfdataset('rep/.nc', concat_dim='T') 9.35 ms ± 433 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) ```

Without this pre-sorting, itertools.groupby isn't guaranteed to do the grouping properly (in contrast to itertoolz.groupby which is), and as a result wasn't necessarily grouping the datasets by their variables. Then it wouldn't have finished concatenating along the dimension 'T' before it tried to merge everything back together.

Whether or not groupby sorted properly depended on the order of datasets in the input to groupby, which eventually depended on the way they were loaded (as the example in this issue makes clear).

The reason this mistake got past the unit tests is that auto_combine still gives the correct result in every case! Merge will still combine these datasets, but it will load their values in first to check that's okay, which was why it was ~1000 times slower. None of the unit tests checked performance though, and the tests I wrote were all supposed to be very fast, so this slowdown wasn't noticeable in any of them.

{
    "total_count": 1,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 1,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset in v.0.11.1 is very slow 397063221
452462499 https://github.com/pydata/xarray/issues/2662#issuecomment-452462499 https://api.github.com/repos/pydata/xarray/issues/2662 MDEyOklzc3VlQ29tbWVudDQ1MjQ2MjQ5OQ== TomNicholas 35968931 2019-01-08T21:43:31Z 2019-01-08T21:43:31Z MEMBER

I'm not sure what might be causing this, but I wonder if you could help narrow it down a bit?

Can you for example see if it's making it past here? That would at least tell us if it is opening each of the datasets okay.

(Or even better: post some example datasets which will cause this problem?)

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset in v.0.11.1 is very slow 397063221

Advanced export

JSON shape: default, array, newline-delimited, object

CSV options:

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);
Powered by Datasette · Queries took 4398.99ms · About: xarray-datasette