
Comment on [pydata/xarray#1385](https://github.com/pydata/xarray/issues/1385#issuecomment-561920115), posted 2019-12-05 (author_association: MEMBER)

In your Twitter thread you said:

> Do any of my xarray/dask folks know why open_mfdataset takes such a significant amount of time compared to looping over a list of files? Each file corresponds to a new time, just wanting to open multiple times at once...

The usual reason is that `open_mfdataset` performs coordinate compatibility checks when it concatenates the files. It's useful to actually read the code of `open_mfdataset` to see how it works.
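Before digging into the code, a quick way to confirm that this overhead is the problem on your own files is to time the two approaches side by side (a minimal sketch, assuming files named `*.nc` that each hold one time step):

```python
import time
from glob import glob

import xarray as xr

files = sorted(glob('*.nc'))

# Plain loop: each file is opened lazily; no cross-file checks happen.
t0 = time.perf_counter()
datasets = [xr.open_dataset(f, chunks={}) for f in files]
print('loop:          ', time.perf_counter() - t0)

# open_mfdataset: opens the files *and* aligns/concatenates them,
# which triggers the coordinate compatibility checks discussed below.
t0 = time.perf_counter()
ds = xr.open_mfdataset(files, combine='by_coords')
print('open_mfdataset:', time.perf_counter() - t0)
```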

First, all the files are opened individually: https://github.com/pydata/xarray/blob/577d3a75ea8bb25b99f9d31af8da14210cddff78/xarray/backends/api.py#L900-L903

You can recreate this step outside of xarray yourself by doing something like:

```python
from glob import glob

import xarray as xr

datasets = [xr.open_dataset(fname, chunks={}) for fname in glob('*.nc')]
```

Once each dataset is open, xarray calls out to one of its combine functions. This logic has gotten more complex over the years as different options have been introduced, but the gist is this: https://github.com/pydata/xarray/blob/577d3a75ea8bb25b99f9d31af8da14210cddff78/xarray/backends/api.py#L947-L952
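Those combine functions are public API, so you can mirror this step directly on the `datasets` list from above (a sketch; which function runs depends on the `combine` argument you pass):

```python
# Equivalent of combine='by_coords': xarray inspects the coordinate
# values to work out the order and dimension of concatenation.
combined = xr.combine_by_coords(datasets)

# Equivalent of the nested/manual path: you state the concat dimension
# yourself, so less inference is needed.
combined = xr.combine_nested(datasets, concat_dim='time')
```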

You can reproduce this step outside of xarray, e.g.:

```python
ds = xr.concat(datasets, dim='time')
```

At that point, various checks will kick in to be sure that the coordinates in the different datasets are compatible. Performing these checks requires the data to be read eagerly, which can be a source of slow performance.
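To see whether those checks are the bottleneck, you can relax them at the `concat` level (a sketch; `coords='minimal'` only concatenates coordinates that actually vary along `time`, and `compat='override'` skips value comparisons by taking the remaining variables from the first dataset):

```python
# Faster concat: trust the first dataset instead of comparing values.
ds_fast = xr.concat(
    datasets,
    dim='time',
    coords='minimal',
    compat='override',
)
```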

Without seeing more details about your files, it's hard to know exactly where the issue lies. A good place to start is to simply drop all coordinates from your data as a preprocessing step.

```python
def drop_all_coords(ds):
    # Demote all non-index coordinates to data variables and drop them,
    # so open_mfdataset has nothing to compare across files.
    return ds.reset_coords(drop=True)

xr.open_mfdataset('*.nc', combine='by_coords', preprocess=drop_all_coords)
```

If you observe a big speedup, this points at coordinate compatibility checks as the culprit. From there you can experiment with the various options for `open_mfdataset`, such as `coords='minimal'`, `compat='override'`, etc.
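For example (a sketch, again assuming files named `*.nc` concatenated along `time`), the combination below skips most of the eager equality checks by taking non-varying variables from the first file, and opens the files in parallel with dask:

```python
ds = xr.open_mfdataset(
    '*.nc',
    combine='by_coords',
    data_vars='minimal',  # only concatenate data variables that have a 'time' dim
    coords='minimal',     # likewise for coordinates
    compat='override',    # don't compare the remaining variables; take the first
    parallel=True,        # open/preprocess files in parallel via dask.delayed
)
```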

Once you post your file details, we can provide more concrete suggestions.
