pull_requests: 39752514

id: 39752514
node_id: MDExOlB1bGxSZXF1ZXN0Mzk3NTI1MTQ=
number: 468
state: closed
locked: 0
title: Option for closing files with scipy backend
user: 1197350
created_at: 2015-07-11T21:24:24Z
updated_at: 2015-08-10T12:50:45Z
closed_at: 2015-08-09T00:04:12Z
merged_at:
merge_commit_sha: fe363c15d6c4f23d664d8729a54c9c2ce5a4e918
assignee:
milestone:
draft: 0
head: 200aeb006781528cf6d4ca2f118d7f9257bd191b
base: 200aeb006781528cf6d4ca2f118d7f9257bd191b
author_association: MEMBER
auto_merge:
repo: 13221727
url: https://github.com/pydata/xarray/pull/468
merged_by:

body:

This addresses issue #463, in which open_mfdataset failed when trying to open more files than my system's ulimit allows. I tried to find a solution in which the underlying netcdf file objects are kept closed by default and only reopened "when needed". I ended up subclassing scipy.io.netcdf_file and overriding the variables attribute with a property that first checks whether the file is open and reopens it if needed. That was the easy part.

The hard part was figuring out when to close them. The problem is that several different parts of the code (e.g. each individual variable, and also the datastore object itself) keep references to the netcdf_file object. In the end I used the debugger to find out when the variables were actually being read during initialization, and added calls to close() in various places. Closing the files at the end of initialization is relatively easy; it was much harder to make sure that the whole array of files is never open at the same time. I also had to disable mmap when this option is active.

This solution is messy and, moreover, extremely slow. There is a factor of ~100 performance penalty during initialization for reopening and closing the files all the time (but only a factor of ~10 for the actual calculation). I am sure this could be reduced if someone who understands the code better found some judicious points at which to call close() on the netcdf_file. The loss of mmap also sucks.

This option can be accessed with the close_files keyword, which I added to the api.

Timing for loading and doing a calculation with close_files=True:

```python
count_open_files()
%time mfds = xray.open_mfdataset(ddir + '/dt_global_allsat_msla_uv_2014101*.nc', engine='scipy', close_files=True)
count_open_files()
%time print float(mfds.variables['u'].mean())
count_open_files()
```

output:

```
3 open files
CPU times: user 11.1 s, sys: 17.5 s, total: 28.5 s
Wall time: 27.7 s
2 open files
0.0055650632367
CPU times: user 649 ms, sys: 974 ms, total: 1.62 s
Wall time: 633 ms
2 open files
```

Timing for loading and doing a calculation with close_files=False (the default, which should revert to the old behavior):

```python
count_open_files()
%time mfds = xray.open_mfdataset(ddir + '/dt_global_allsat_msla_uv_2014101*.nc', engine='scipy', close_files=False)
count_open_files()
%time print float(mfds.variables['u'].mean())
count_open_files()
```

output:

```
3 open files
CPU times: user 264 ms, sys: 85.3 ms, total: 349 ms
Wall time: 291 ms
22 open files
0.0055650632367
CPU times: user 174 ms, sys: 141 ms, total: 315 ms
Wall time: 56 ms
22 open files
```

This is not a very serious pull request, but I spent all day on it, so I thought I would share. Maybe you can see some obvious way to improve it...
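The lazy open/reopen pattern described in the body can be sketched roughly as follows. This is a minimal illustration written for this page, not the PR's actual patch: the class name ReopeningNetcdfFile is made up, and the real change lives inside xray's scipy backend, where it also has to coordinate with the lazy array wrappers and every other object holding a reference to the file.

```python
from scipy.io import netcdf_file


class ReopeningNetcdfFile(netcdf_file):
    """Sketch: a netcdf_file whose handle is reopened on demand.

    Hypothetical illustration of the approach described above, not
    the code from the PR itself.
    """

    def __init__(self, filename, mode='r', **kwargs):
        self._fname = filename
        self._mode = mode
        # mmap has to be disabled: a memory map cannot outlive a
        # closed file, which is why the PR loses mmap support.
        kwargs['mmap'] = False
        netcdf_file.__init__(self, filename, mode=mode, **kwargs)

    @property
    def variables(self):
        # Reopen the underlying file handle if it was closed earlier.
        if self.fp.closed:
            self.fp = open(self._fname, self._mode + 'b')
        return self._variables

    @variables.setter
    def variables(self, value):
        # netcdf_file.__init__ assigns to .variables, so a setter is
        # needed to route that assignment to the backing attribute.
        self._variables = value
```

Even with a wrapper like this, the hard part the body describes remains: choosing the points at which to call close() so that the whole list of files is never open at once.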

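The count_open_files() helper used in the timing snippets is not defined in the PR body. A plausible stand-in on Unix (an assumption, not the author's code) counts the process's open file descriptors with psutil; the per-process limit that issue #463 ran into can be read with the resource module:

```python
import resource

import psutil


def count_open_files():
    # Plausible stand-in for the helper used in the timings above:
    # count the regular files currently held open by this process.
    print('%d open files' % len(psutil.Process().open_files()))


# The per-process open-file limit ("ulimit -n") that issue #463 hit:
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print('soft limit: %d, hard limit: %d' % (soft, hard))
```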