home / github / issue_comments

Menu
  • Search all tables
  • GraphQL API

issue_comments: 454423937

This data as json

html_url issue_url id node_id user created_at updated_at author_association body reactions performed_via_github_app issue
https://github.com/pydata/xarray/issues/2662#issuecomment-454423937 https://api.github.com/repos/pydata/xarray/issues/2662 454423937 MDEyOklzc3VlQ29tbWVudDQ1NDQyMzkzNw== 35968931 2019-01-15T15:05:22Z 2019-01-15T15:05:22Z MEMBER

Yes thankyou @malmans2, this is very helpful!

I suspect the issue is that we're now using some different combination of merge/concat.

This was very puzzling because the code is supposed to split the datasets up according to their data variables, which means merge won't be used to concatenate and this should be fast, as before.

But I found the problem! In _auto_combine_1d I should have sorted the datasets before attempting to group them by data variable, i.e. I needed a line python sorted_datasets = sorted(datasets, key=lambda ds: tuple(sorted(ds)))

before

python grouped = itertools.groupby(sorted_datasets, key=lambda ds: tuple(sorted(ds)))

With this change then I get ```python

No longer slow if netCDFs are stored in several folders:

%timeit ds_2folders = xr.open_mfdataset('rep/.nc', concat_dim='T') 9.35 ms ± 433 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) ```

Without this pre-sorting, itertools.groupby isn't guaranteed to do the grouping properly (in contrast to itertoolz.groupby which is), and as a result wasn't necessarily grouping the datasets by their variables. Then it wouldn't have finished concatenating along the dimension 'T' before it tried to merge everything back together.

Whether or not groupby sorted properly depended on the order of datasets in the input to groupby, which eventually depended on the way they were loaded (as the example in this issue makes clear).

The reason this mistake got past the unit tests is that auto_combine still gives the correct result in every case! Merge will still combine these datasets, but it will load their values in first to check that's okay, which was why it was ~1000 times slower. None of the unit tests checked performance though, and the tests I wrote were all supposed to be very fast, so this slowdown wasn't noticeable in any of them.

{
    "total_count": 1,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 1,
    "rocket": 0,
    "eyes": 0
}
  397063221
Powered by Datasette · Queries took 0.761ms · About: xarray-datasette