issue_comments: 781407863

html_url: https://github.com/pydata/xarray/issues/1385#issuecomment-781407863
issue_url: https://api.github.com/repos/pydata/xarray/issues/1385
id: 781407863
node_id: MDEyOklzc3VlQ29tbWVudDc4MTQwNzg2Mw==
user: 53343824
created_at: 2021-02-18T15:06:13Z
updated_at: 2021-02-18T15:06:13Z
author_association: NONE

> setting `parallel=True` seg faults... I'm betting that is some quirk of my python environment, though.

> This is important! Otherwise that timing scales with the number of files. If you get that to work, then you can convert to a dask dataframe and keep things lazy.

Indeed @dcherian -- it took some experimentation to find an engine that supports parallel execution, and even then the results are mixed, which tells me further work is needed to isolate the issue.
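For context, here is a minimal sketch of that kind of experiment, assuming a set of netCDF files matched by a glob pattern; the paths, engine names, and combine settings are illustrative rather than the exact configuration used:

```python
import glob
import time

import xarray as xr

# Hypothetical file pattern; the real paths depend on the dataset layout.
paths = sorted(glob.glob("output/*.nc"))

# Try a couple of backend engines with parallel=True and compare open times.
for engine in ("netcdf4", "h5netcdf"):
    try:
        t0 = time.perf_counter()
        ds = xr.open_mfdataset(
            paths,
            engine=engine,
            parallel=True,        # open/decode each file in a dask.delayed task
            combine="by_coords",
        )
        print(f"{engine}: opened in {time.perf_counter() - t0:.2f} s")

        # Per the suggestion above, stay lazy by converting to a dask dataframe
        # rather than loading everything into memory.
        ddf = ds.to_dask_dataframe()
        print(f"{engine}: dask dataframe with {ddf.npartitions} partitions")
        ds.close()
    except Exception as exc:
        # A segfault cannot be caught here, but ordinary read errors can.
        print(f"{engine}: failed ({exc})")
```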

Along the lines of the suggestions here (thanks @jmccreight for pointing this out), we've introduced a very practical pre-processing step that rewrites the datasets so that the read is not striped across the file system. This effectively isolates the performance bottleneck to a place where it can be dealt with independently. Of course, such an asynchronous workflow is not possible in every situation, so we're still looking at improving the direct read performance.
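As an illustration of that rewrite step, a sketch along these lines reads each file once from the striped file system and writes a plain local copy; the directory names are made up, and the real workflow may rewrite the files quite differently:

```python
import pathlib

import xarray as xr

# Hypothetical locations: a striped parallel file system as the source and
# local scratch space as the destination for the rewritten copies.
src = pathlib.Path("/striped/run")
dst = pathlib.Path("/local/scratch/run")
dst.mkdir(parents=True, exist_ok=True)

for path in sorted(src.glob("*.nc")):
    with xr.open_dataset(path) as ds:
        ds.load()                      # one read from the striped file system
        ds.to_netcdf(dst / path.name)  # rewrite as a contiguous local copy
```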

Two notes as we keep working:

- The preprocessor: reading and re-manipulating an individual dataset is lightning fast. A small change to the individual files, made with a preprocessor, made the multi-file read massively faster (see the sketch below).
- The "more sophisticated example" referenced here has proven to be very useful.
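On the first note, a minimal sketch of passing a per-file adjustment through `open_mfdataset`'s `preprocess` argument; the specific adjustment here is a placeholder for whatever small change actually proved effective:

```python
import xarray as xr

def adjust(ds):
    # Hypothetical per-file tweak applied as each dataset is opened,
    # e.g. dropping an unneeded variable before concatenation.
    return ds.drop_vars("crs", errors="ignore")

ds = xr.open_mfdataset(
    "rewritten/*.nc",      # e.g. the locally rewritten copies from above
    preprocess=adjust,     # applied to each file's dataset before combining
    combine="by_coords",
    parallel=True,
)
```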
