pull_requests: 39752514

id: 39752514
node_id: MDExOlB1bGxSZXF1ZXN0Mzk3NTI1MTQ=
number: 468
state: closed
locked: 0
title: Option for closing files with scipy backend
user: 1197350
created_at: 2015-07-11T21:24:24Z
updated_at: 2015-08-10T12:50:45Z
closed_at: 2015-08-09T00:04:12Z
merged_at:
merge_commit_sha: fe363c15d6c4f23d664d8729a54c9c2ce5a4e918
assignee:
milestone:
draft: 0
head: 200aeb006781528cf6d4ca2f118d7f9257bd191b
base: 200aeb006781528cf6d4ca2f118d7f9257bd191b
author_association: MEMBER
auto_merge:
repo: 13221727
url: https://github.com/pydata/xarray/pull/468
merged_by:

body:

This addresses issue #463, in which open_mfdataset failed when trying to open more files than my system's ulimit allows. I tried to find a solution in which the underlying netcdf file objects are kept closed by default and only reopened "when needed". I ended up subclassing scipy.io.netcdf_file and overriding the variables attribute with a property that first checks whether the file is open and reopens it if needed. That was the easy part.

The hard part was figuring out when to close them. The problem is that several different parts of the code (e.g. each individual variable, and also the datastore object itself) keep references to the netcdf_file object. In the end I used the debugger to find out when the variables were actually being read during initialization, and added calls to close() in various places. Closing the files at the end of initialization is relatively easy; it was much harder to make sure that the whole array of files is never open at the same time. I also had to disable mmap when this option is active.

This solution is messy and, moreover, extremely slow. There is a factor of ~100 performance penalty during initialization for reopening and closing the files all the time (but only a factor of ~10 for the actual calculation). I am sure this could be reduced if someone who understands the code better found some judicious points at which to call close() on the netcdf_file. The loss of mmap also sucks.

This option can be accessed with the close_files keyword, which I added to the api.

Timing for loading and doing a calculation with close_files=True:

```python
count_open_files()
%time mfds = xray.open_mfdataset(ddir + '/dt_global_allsat_msla_uv_2014101*.nc', engine='scipy', close_files=True)
count_open_files()
%time print float(mfds.variables['u'].mean())
count_open_files()
```

output:

```
3 open files
CPU times: user 11.1 s, sys: 17.5 s, total: 28.5 s
Wall time: 27.7 s
2 open files
0.0055650632367
CPU times: user 649 ms, sys: 974 ms, total: 1.62 s
Wall time: 633 ms
2 open files
```

Timing for loading and doing a calculation with close_files=False (the default, which should revert to the old behavior):

```python
count_open_files()
%time mfds = xray.open_mfdataset(ddir + '/dt_global_allsat_msla_uv_2014101*.nc', engine='scipy', close_files=False)
count_open_files()
%time print float(mfds.variables['u'].mean())
count_open_files()
```

output:

```
3 open files
CPU times: user 264 ms, sys: 85.3 ms, total: 349 ms
Wall time: 291 ms
22 open files
0.0055650632367
CPU times: user 174 ms, sys: 141 ms, total: 315 ms
Wall time: 56 ms
22 open files
```

This is not a very serious pull request, but I spent all day on it, so I thought I would share. Maybe you can see some obvious way to improve it...
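The lazy open/reopen pattern described in the body can be sketched roughly as follows. This is a minimal illustration written for this page, not the PR's actual patch: the class name ReopeningNetcdfFile is made up, and the real change lives inside xray's scipy backend, where it also has to coordinate with the lazy array wrappers and every other object holding a reference to the file.

```python
from scipy.io import netcdf_file


class ReopeningNetcdfFile(netcdf_file):
    """Sketch: a netcdf_file whose handle is reopened on demand.

    Hypothetical illustration of the approach described above, not
    the code from the PR itself.
    """

    def __init__(self, filename, mode='r', **kwargs):
        self._fname = filename
        self._mode = mode
        # mmap has to be disabled: a memory map cannot outlive a
        # closed file, which is why the PR loses mmap support.
        kwargs['mmap'] = False
        netcdf_file.__init__(self, filename, mode=mode, **kwargs)

    @property
    def variables(self):
        # Reopen the underlying file handle if it was closed earlier.
        if self.fp.closed:
            self.fp = open(self._fname, self._mode + 'b')
        return self._variables

    @variables.setter
    def variables(self, value):
        # netcdf_file.__init__ assigns to .variables, so a setter is
        # needed to route that assignment to the backing attribute.
        self._variables = value
```

Even with a wrapper like this, the hard part the body describes remains: choosing the points at which to call close() so that the whole list of files is never open at once.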

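The count_open_files() helper used in the timing snippets is not defined in the PR body. A plausible stand-in on Unix (an assumption, not the author's code) counts the process's open file descriptors with psutil; the per-process limit that issue #463 ran into can be read with the resource module:

```python
import resource

import psutil


def count_open_files():
    # Plausible stand-in for the helper used in the timings above:
    # count the regular files currently held open by this process.
    print('%d open files' % len(psutil.Process().open_files()))


# The per-process open-file limit ("ulimit -n") that issue #463 hit:
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print('soft limit: %d, hard limit: %d' % (soft, hard))
```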