issue_comments: 454423937

This data as json

html_url	issue_url	id	node_id	user	created_at	updated_at	author_association	body	reactions	performed_via_github_app	issue
https://github.com/pydata/xarray/issues/2662#issuecomment-454423937	https://api.github.com/repos/pydata/xarray/issues/2662	454423937	MDEyOklzc3VlQ29tbWVudDQ1NDQyMzkzNw==	35968931	2019-01-15T15:05:22Z	2019-01-15T15:05:22Z	MEMBER	Yes thankyou @malmans2, this is very helpful! I suspect the issue is that we're now using some different combination of merge/concat. This was very puzzling because the code is supposed to split the datasets up according to their data variables, which means merge won't be used to concatenate and this should be fast, as before. But I found the problem! In `_auto_combine_1d` I should have sorted the datasets before attempting to group them by data variable, i.e. I needed a line `python sorted_datasets = sorted(datasets, key=lambda ds: tuple(sorted(ds)))` before `python grouped = itertools.groupby(sorted_datasets, key=lambda ds: tuple(sorted(ds)))` With this change then I get ```python No longer slow if netCDFs are stored in several folders: %timeit ds_2folders = xr.open_mfdataset('rep/.nc', concat_dim='T') 9.35 ms ± 433 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) ``` Without this pre-sorting, `itertools.groupby` isn't guaranteed to do the grouping properly (in contrast to `itertoolz.groupby` which is), and as a result wasn't necessarily grouping the datasets by their variables. Then it wouldn't have finished concatenating along the dimension `'T'` before it tried to merge everything back together. Whether or not groupby sorted properly depended on the order of datasets in the input to groupby, which eventually depended on the way they were loaded (as the example in this issue makes clear). The reason this mistake got past the unit tests is that `auto_combine` still gives the correct result in every case! Merge will still combine these datasets, but it will load their values in first to check that's okay, which was why it was ~1000 times slower. None of the unit tests checked performance though, and the tests I wrote were all supposed to be very fast, so this slowdown wasn't noticeable in any of them.	{ "total_count": 1, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 1, "rocket": 0, "eyes": 0 }		397063221

issue_comments: 454423937

No longer slow if netCDFs are stored in several folders: