issue_comments: 306838587

Comment by user 9655353 on pydata/xarray#1396, posted 2017-06-07T15:51:34Z (edited 2017-06-07T15:53:06Z):
https://github.com/pydata/xarray/issues/1396#issuecomment-306838587

We had similar performance issues with xarray+dask, which we solved by applying a chunking heuristic when opening a dataset; you can read about it in #1440 (a rough sketch of the idea follows below). In our case, though, the data really wouldn't fit in memory, which is clearly not the case in your gist. Anyway, I thought I'd play around with your gist and see whether chunking makes a difference.
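For illustration only (this is not the actual code from #1440, and the function name `heuristic_chunks` is made up), the idea amounts to something like:

```python
import xarray as xr

def heuristic_chunks(path, spatial_dims=('y', 'x')):
    # Hypothetical sketch, not the #1440 implementation: halve each
    # spatial dimension, keep every other dimension as a single chunk.
    with xr.open_dataset(path) as ds:
        sizes = dict(ds.dims)
    return {dim: (size // 2 if dim in spatial_dims else size)
            for dim, size in sizes.items()}
```

The resulting dict can then be passed as `chunks=` to `xr.open_mfdataset`.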

I couldn't use your example directly, as the data it generates is too large for the dev VM I'm working on. So I changed the generated file size to (12, 1000, 2000); the essence of your gist remained, though: the time-series extraction took ~25 seconds, versus ~800 ms using extract_point_xarray(). (A rough stand-in for the setup is sketched below.)
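The gist itself isn't reproduced here; as a stand-in (the file names and the variable name `data` are assumptions on my part), the setup is a handful of netCDF files, each holding a (time, y, x) = (12, 1000, 2000) array:

```python
import numpy as np
import xarray as xr

# Stand-in for the gist's setup (names assumed): several netCDF files,
# each with a float32 'data' variable of shape (12, 1000, 2000).
all_files = ['test_%02d.nc' % i for i in range(10)]
for path in all_files:
    ds = xr.Dataset({'data': (('time', 'y', 'x'),
                              np.random.rand(12, 1000, 2000).astype('float32'))})
    ds.to_netcdf(path)
```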

So, I thought I'd try our 'chunking heuristic' on the generated test datasets: simply split each dataset into 2x2 chunks along the spatial dimensions. So:

```python
ds = xr.open_mfdataset(all_files, decode_cf=False,
                       chunks={'time': 12, 'x': 1000, 'y': 500})
```
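With files of that shape this gives 2x2 = 4 spatial chunks per file, with time unchunked within a file; you can check the layout on the underlying dask array (variable name `data` as above):

```python
print(ds.data.chunks)
# Expected with ten (12, 1000, 2000) files, roughly:
# ((12,) * 10, (500, 500), (1000, 1000))
```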

To my surprise:

```python
from dask.diagnostics import ProgressBar  # import added for completeness

# time extracting a timeseries of a single point
y, x = 200, 300
with ProgressBar():
    %time ts = ds.data[:, y, x].load()
```

results in

```
[########################################] | 100% Completed | 0.7s
CPU times: user 124 ms, sys: 268 ms, total: 392 ms
Wall time: 826 ms
```

I'm not entirely sure what's happening here, as each file obviously fits in memory just fine; after all, the looping approach works well. Maybe loading the files one by one is fine, but a whole file as a single chunk turns out to be too large once dask tries to parallelize the whole thing. I really have no idea.
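For reference, by 'the looping approach' I mean a plain per-file read, roughly like this (a sketch with assumed names; no dask involved):

```python
import numpy as np
import xarray as xr

# Sketch of the plain loop (names assumed): read the single point from
# each file in turn, then concatenate the pieces along time.
y, x = 200, 300
pieces = []
for path in all_files:
    with xr.open_dataset(path) as f:
        pieces.append(f.data[:, y, x].values)
ts = np.concatenate(pieces)
```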

I'd be very intrigued to see whether you can get a similar result by doing a simple 2x2xtime chunking. By the way, `chunks={'x': 1000, 'y': 500, 'time': 1}` produces similar results with some overhead: extraction took ~1.5 seconds. (The full call is spelled out below.)
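Spelled out, that alternative is the same call with a different chunk dict:

```python
ds = xr.open_mfdataset(all_files, decode_cf=False,
                       chunks={'time': 1, 'x': 1000, 'y': 500})
```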

EDIT:

```python
import dask
import xarray as xr

print(xr.__version__)    # 0.9.5
print(dask.__version__)  # 0.14.1
```
