issue_comments


6 rows where issue = 397063221 sorted by updated_at descending


Issue: open_mfdataset in v.0.11.1 is very slow (397063221)
454450672 · dcherian (2448579) · MEMBER · 2019-01-15T16:14:12Z
https://github.com/pydata/xarray/issues/2662#issuecomment-454450672

We have airspeedvelocity performance tests. I don't know whether there's one for auto_combine, but maybe you can add one, @TomNicholas.
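(As an illustration, a rough sketch of what such a benchmark could look like in an asv suite; the class and method names are hypothetical, and `xr.auto_combine` is the public entry point in the xarray version under discussion.)

```python
# Hypothetical asv benchmark sketch for auto_combine; not part of the
# original comment. Method names follow asv's time_* convention.
import numpy as np
import xarray as xr


class AutoCombine:
    def setup(self):
        t = np.arange(100)
        data = np.random.randn(100, 10)
        # Four small in-memory datasets: variables A and B, each split
        # over two non-overlapping ranges of the 'T' coordinate.
        self.datasets = [
            xr.Dataset({var: (('T', 'X'), data)}, coords={'T': t + i * 100})
            for i in range(2)
            for var in ('A', 'B')
        ]

    def time_auto_combine(self):
        xr.auto_combine(self.datasets, concat_dim='T')
```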

Reactions: none
454439392 · malmans2 (22245117) · CONTRIBUTOR · 2019-01-15T15:45:03Z
https://github.com/pydata/xarray/issues/2662#issuecomment-454439392

I checked PR #2678 with the data that originated the issue and it fixes the problem!

Reactions: hooray 1
454423937 · TomNicholas (35968931) · MEMBER · 2019-01-15T15:05:22Z
https://github.com/pydata/xarray/issues/2662#issuecomment-454423937

Yes, thank you @malmans2, this is very helpful!

> I suspect the issue is that we're now using some different combination of merge/concat.

This was very puzzling, because the code is supposed to split the datasets up according to their data variables, so merge wouldn't be used to concatenate and combining should have been as fast as before.

But I found the problem! In `_auto_combine_1d` I should have sorted the datasets before attempting to group them by data variable, i.e. I needed the line

```python
sorted_datasets = sorted(datasets, key=lambda ds: tuple(sorted(ds)))
```

before

```python
grouped = itertools.groupby(sorted_datasets, key=lambda ds: tuple(sorted(ds)))
```

With this change I get:

```python
# No longer slow if netCDFs are stored in several folders:
%timeit ds_2folders = xr.open_mfdataset('rep*/*.nc', concat_dim='T')
# 9.35 ms ± 433 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

Without this pre-sorting, itertools.groupby only groups runs of consecutive elements with equal keys (in contrast to itertoolz.groupby, which groups regardless of order), so it wasn't necessarily grouping the datasets by their variables. As a result it wouldn't have finished concatenating along the dimension 'T' before it tried to merge everything back together.

Whether or not groupby sorted properly depended on the order of datasets in the input to groupby, which eventually depended on the way they were loaded (as the example in this issue makes clear).
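(A minimal, self-contained illustration of that itertools.groupby pitfall, not from the original thread:)

```python
# itertools.groupby only merges *adjacent* equal keys, so unsorted input
# produces fragmented groups; sorting first fixes this.
from itertools import groupby

keys = ['A', 'B', 'A', 'B']  # like datasets loaded in the order A0, B0, A1, B1

print([(k, len(list(g))) for k, g in groupby(keys)])
# [('A', 1), ('B', 1), ('A', 1), ('B', 1)]  -- four fragments

print([(k, len(list(g))) for k, g in groupby(sorted(keys))])
# [('A', 2), ('B', 2)]  -- the intended grouping
```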

The reason this mistake got past the unit tests is that auto_combine still gives the correct result in every case! Merge will still combine these datasets, but it loads their values first to check that doing so is safe, which is why it was ~1000 times slower. None of the unit tests checked performance, though, and the tests I wrote were all supposed to be very fast, so the slowdown wasn't noticeable in any of them.

Reactions: heart 1
454351420 · shoyer (1217238) · MEMBER · 2019-01-15T10:56:03Z
https://github.com/pydata/xarray/issues/2662#issuecomment-454351420

@malmans2 thanks for this reproducible test case!

From xarray's perspective, the difference is the order in which the arrays are concatenated/processed. This is determined by sorting the (globbed) file names:

```python
In [16]: sorted(glob.glob('rep*/*.nc'))
Out[16]: ['rep0/dsA0.nc', 'rep0/dsB0.nc', 'rep1/dsA1.nc', 'rep1/dsB1.nc']

In [17]: sorted(glob.glob('*.nc'))
Out[17]: ['dsA0.nc', 'dsA1.nc', 'dsB0.nc', 'dsB1.nc']
```

It appears that the slow case [A0, B0, A1, B1] now requires computing data with dask, whereas [A0, A1, B0, B1] does not.

I suspect the issue is that we're now using some different combination of merge/concat. In particular, it looks like the compute is being triggered from within merge. This sort of makes sense: if we're using merge instead of concat for joining along the dimension T, that is super slow, because merge goes through a path that checks arrays for conflicting values by loading data into memory (even though in this case no conflict is possible, because the original coordinates do not overlap).

We could (and should) optimize this path in merge to avoid eagerly loading data, but the immediate fix here is probably to make sure we're using concat instead of merge.
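(A rough sketch of the lazy-vs-eager difference described above, assuming dask is installed; the eager behaviour of merge reflects the xarray version discussed here, not a guarantee of current releases:)

```python
# Two dask-backed datasets holding the same variable 'A' over
# non-overlapping 'T' coordinates.
import numpy as np
import xarray as xr

dsA0 = xr.Dataset({'A': ('T', np.arange(5.0))},
                  coords={'T': np.arange(5)}).chunk({'T': 5})
dsA1 = xr.Dataset({'A': ('T', np.arange(5.0))},
                  coords={'T': np.arange(5, 10)}).chunk({'T': 5})

# concat joins along 'T' without reading any values -- still lazy:
lazy = xr.concat([dsA0, dsA1], dim='T')

# merge outer-joins on 'T' and then checks the shared variable 'A' for
# conflicting values, which loads the data -- the slow path above:
eager = xr.merge([dsA0, dsA1])
```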

Reactions: none
454086847 · malmans2 (22245117) · CONTRIBUTOR · 2019-01-14T17:20:03Z
https://github.com/pydata/xarray/issues/2662#issuecomment-454086847

I've created a little script to reproduce the problem. @TomNicholas it looks like the datasets are opened correctly. The problem arises when open_mfdataset calls _auto_combine. Indeed, _auto_combine was introduced in v0.11.1.

```python
import numpy as np
import xarray as xr
import os

Tsize = 100; T = np.arange(Tsize)
Xsize = 900; X = np.arange(Xsize)
Ysize = 800; Y = np.arange(Ysize)
data = np.random.randn(Tsize, Xsize, Ysize)

for i in range(2):
    # Create 2 datasets with different variables
    dsA = xr.Dataset({'A': xr.DataArray(data, coords={'T': T + i * Tsize}, dims=('T', 'X', 'Y'))})
    dsB = xr.Dataset({'B': xr.DataArray(data, coords={'T': T + i * Tsize}, dims=('T', 'X', 'Y'))})

    # Save datasets in one folder
    dsA.to_netcdf('dsA' + str(i) + '.nc')
    dsB.to_netcdf('dsB' + str(i) + '.nc')

    # Save datasets in two folders
    dirname = 'rep' + str(i)
    os.mkdir(dirname)
    dsA.to_netcdf(dirname + '/' + 'dsA' + str(i) + '.nc')
    dsB.to_netcdf(dirname + '/' + 'dsB' + str(i) + '.nc')
```

Fast if netCDFs are stored in one folder:

```python
%%time
ds_1folder = xr.open_mfdataset('*.nc', concat_dim='T')

# CPU times: user 49.9 ms, sys: 5.06 ms, total: 55 ms
# Wall time: 59.7 ms
```

Slow if netCDFs are stored in several folders:

```python
%%time
ds_2folders = xr.open_mfdataset('rep*/*.nc', concat_dim='T')

# CPU times: user 8.6 s, sys: 5.95 s, total: 14.6 s
# Wall time: 10.3 s
```

Fast if files containing different variables are opened separately, then merged:

```python
%%time
ds_A = xr.open_mfdataset('rep*/dsA*.nc', concat_dim='T')
ds_B = xr.open_mfdataset('rep*/dsB*.nc', concat_dim='T')
ds_merged = xr.merge([ds_A, ds_B])

# CPU times: user 33.8 ms, sys: 3.7 ms, total: 37.5 ms
# Wall time: 34.5 ms
```
Reactions: heart 1
452462499 · TomNicholas (35968931) · MEMBER · 2019-01-08T21:43:31Z
https://github.com/pydata/xarray/issues/2662#issuecomment-452462499

I'm not sure what might be causing this, but I wonder if you could help narrow it down a bit?

Can you, for example, see if it's making it past here? That would at least tell us whether it is opening each of the datasets okay.

(Or even better: post some example datasets which will cause this problem?)
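(A minimal sketch of how one might check the opening step in isolation; the glob pattern matches the reproduction script posted above, and the rest is an assumption:)

```python
# Open each file individually to confirm the datasets load fine on
# their own; if this is fast, the slowdown is in the combine step.
import glob

import xarray as xr

paths = sorted(glob.glob('rep*/*.nc'))
datasets = [xr.open_dataset(p) for p in paths]
print([list(ds.data_vars) for ds in datasets])
```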

Reactions: none


Table schema:

```sql
CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);
```
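(For reference, a sketch of the query behind this page, run against a local SQLite copy of the database; the filename 'github.db' is an assumption:)

```python
# Reconstructs the page's query ("rows where issue = 397063221 sorted by
# updated_at descending") using the issue_comments schema above.
import sqlite3

conn = sqlite3.connect('github.db')  # hypothetical local path
rows = conn.execute(
    "SELECT id, user, created_at FROM issue_comments "
    "WHERE issue = ? ORDER BY updated_at DESC",
    (397063221,),
).fetchall()
print(rows)  # expect 6 rows
```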