issue_comments


47 rows where issue = 94328498 sorted by updated_at descending


user 10

  • shoyer 15
  • pwolfram 8
  • mangecoeur 6
  • rabernat 5
  • cpaulik 4
  • ajoros 3
  • sebhahn 3
  • mrocklin 1
  • darothen 1
  • kmpaul 1

author_association 3

  • MEMBER 21
  • CONTRIBUTOR 15
  • NONE 11

issue 1

  • open_mfdataset too many files · 47
Columns: id, html_url, issue_url, node_id, user, created_at, updated_at (sort key, descending), author_association, body, reactions, performed_via_github_app, issue
347165242 https://github.com/pydata/xarray/issues/463#issuecomment-347165242 https://api.github.com/repos/pydata/xarray/issues/463 MDEyOklzc3VlQ29tbWVudDM0NzE2NTI0Mg== sebhahn 5929935 2017-11-27T12:17:17Z 2017-11-27T12:17:17Z NONE

Thanks, I'll test it!

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset too many files 94328498
347157526 https://github.com/pydata/xarray/issues/463#issuecomment-347157526 https://api.github.com/repos/pydata/xarray/issues/463 MDEyOklzc3VlQ29tbWVudDM0NzE1NzUyNg== shoyer 1217238 2017-11-27T11:40:35Z 2017-11-27T11:40:35Z MEMBER

Using autoclose=True should also fix this. On Mon, Nov 27, 2017 at 10:26 AM Sebastian Hahn notifications@github.com wrote:

Ok, I found my problem. I had to increase ulimit -n

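For later readers, a minimal sketch of that workaround; the glob pattern is a placeholder, and the `autoclose` keyword only exists in xarray versions that include the #1198 change:

``` python
import xarray as xr

# autoclose=True asks xarray to close each netCDF file after reading from it,
# keeping the number of simultaneously open file handles bounded.
ds = xr.open_mfdataset('data/*.nc', autoclose=True)
```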

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset too many files 94328498
347140117 https://github.com/pydata/xarray/issues/463#issuecomment-347140117 https://api.github.com/repos/pydata/xarray/issues/463 MDEyOklzc3VlQ29tbWVudDM0NzE0MDExNw== sebhahn 5929935 2017-11-27T10:26:56Z 2017-11-27T10:26:56Z NONE

Ok, I found my problem. I had to increase ulimit -n
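For reference, the same limit can also be inspected, and raised up to the hard limit, from inside Python via the standard resource module; this is a generic sketch for Unix-like systems, not something posted in the thread:

``` python
import resource

# Current soft/hard limits on open file descriptors (what `ulimit -n` reports).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(soft, hard)

# Raise the soft limit as far as the hard limit allows (4096 is an arbitrary target).
resource.setrlimit(resource.RLIMIT_NOFILE, (min(4096, hard), hard))
```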

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset too many files 94328498
347126256 https://github.com/pydata/xarray/issues/463#issuecomment-347126256 https://api.github.com/repos/pydata/xarray/issues/463 MDEyOklzc3VlQ29tbWVudDM0NzEyNjI1Ng== sebhahn 5929935 2017-11-27T09:33:29Z 2017-11-27T09:33:29Z NONE

@shoyer I just ran into this issue again (with 8000 files, each 50 kB). I'm using xarray 0.9.6 and working on some performance tests. Is there an upper limit on the number of files?

File "/home/shahn/.pyenv/versions/warp_conda/envs/pyraster_env/lib/python2.7/site-packages/xarray/backends/api.py", line 505, in open_mfdataset File "/home/shahn/.pyenv/versions/warp_conda/envs/pyraster_env/lib/python2.7/site-packages/xarray/backends/api.py", line 282, in open_dataset File "/home/shahn/.pyenv/versions/warp_conda/envs/pyraster_env/lib/python2.7/site-packages/xarray/backends/netCDF4_.py", line 210, in __init__ File "/home/shahn/.pyenv/versions/warp_conda/envs/pyraster_env/lib/python2.7/site-packages/xarray/backends/netCDF4_.py", line 185, in _open_netcdf4_group File "netCDF4/_netCDF4.pyx", line 1811, in netCDF4._netCDF4.Dataset.__init__ (netCDF4/_netCDF4.c:13231) IOError: Too many open files

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset too many files 94328498
288868053 https://github.com/pydata/xarray/issues/463#issuecomment-288868053 https://api.github.com/repos/pydata/xarray/issues/463 MDEyOklzc3VlQ29tbWVudDI4ODg2ODA1Mw== ajoros 2615433 2017-03-23T21:37:19Z 2017-03-23T21:37:19Z NONE

Yessir @pwolfram, we are in business!

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset too many files 94328498
288867744 https://github.com/pydata/xarray/issues/463#issuecomment-288867744 https://api.github.com/repos/pydata/xarray/issues/463 MDEyOklzc3VlQ29tbWVudDI4ODg2Nzc0NA== pwolfram 4295853 2017-03-23T21:36:07Z 2017-03-23T21:36:07Z CONTRIBUTOR

@ajoros should correct me if I'm wrong but it sounds like everything is working for his use case.

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset too many files 94328498
288835940 https://github.com/pydata/xarray/issues/463#issuecomment-288835940 https://api.github.com/repos/pydata/xarray/issues/463 MDEyOklzc3VlQ29tbWVudDI4ODgzNTk0MA== ajoros 2615433 2017-03-23T19:34:33Z 2017-03-23T19:34:33Z NONE

Thanks @pwolfram ... shot you a follow up email at your Gmail...

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset too many files 94328498
288832922 https://github.com/pydata/xarray/issues/463#issuecomment-288832922 https://api.github.com/repos/pydata/xarray/issues/463 MDEyOklzc3VlQ29tbWVudDI4ODgzMjkyMg== shoyer 1217238 2017-03-23T19:22:43Z 2017-03-23T19:22:43Z MEMBER

OK, I'm closing this issue as "Fixed" by #1198. Feel free to open new issue for any follow-up concerns.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset too many files 94328498
288832707 https://github.com/pydata/xarray/issues/463#issuecomment-288832707 https://api.github.com/repos/pydata/xarray/issues/463 MDEyOklzc3VlQ29tbWVudDI4ODgzMjcwNw== pwolfram 4295853 2017-03-23T19:21:57Z 2017-03-23T19:21:57Z CONTRIBUTOR

@ajoros, #1198 was just merged so the bleeding-edge version of xarray is the one to try!

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset too many files 94328498
288830741 https://github.com/pydata/xarray/issues/463#issuecomment-288830741 https://api.github.com/repos/pydata/xarray/issues/463 MDEyOklzc3VlQ29tbWVudDI4ODgzMDc0MQ== pwolfram 4295853 2017-03-23T19:14:23Z 2017-03-23T19:14:23Z CONTRIBUTOR

@ajoros, can you try something like pip -v install --force git+ssh://git@github.com/pwolfram/xarray@fix_too_many_open_files to see if #1198 fixes your problem with your dataset, noting that you need open_mfdataset(..., autoclose=True)?

@shoyer should correct me if I'm wrong but we are almost ready to merge the code in this PR and this would be a great "in the field" check if you could try it out soon.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset too many files 94328498
288829145 https://github.com/pydata/xarray/issues/463#issuecomment-288829145 https://api.github.com/repos/pydata/xarray/issues/463 MDEyOklzc3VlQ29tbWVudDI4ODgyOTE0NQ== ajoros 2615433 2017-03-23T19:08:37Z 2017-03-23T19:08:37Z NONE

Not sure this is good feedback at all but I just wanted to provide an additional problematic case, from my end, that is returning this "too many files" problem:

NOTE: I have the latest xarray package. I have about 365 1.7MB Netcdf files that I am trying to read using open_mfdataset() and it continuously gives me the "too many files" error and completely hangs jupyter notebooks to the point where I have to ctrl+C out of it. Note that each netcdf contains a Dataset that is 195x195x1. Obviously it's not a file-size issue as I'm not dealing with multiple gigs worth of data. Should I increase the OSX open max file limit, or will that not solve anything in my case?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset too many files 94328498
288414991 https://github.com/pydata/xarray/issues/463#issuecomment-288414991 https://api.github.com/repos/pydata/xarray/issues/463 MDEyOklzc3VlQ29tbWVudDI4ODQxNDk5MQ== pwolfram 4295853 2017-03-22T14:25:37Z 2017-03-22T14:25:37Z CONTRIBUTOR

We are very close on #1198 and will be merging soon. This would be a great time for everyone to ensure that #1198 resolves this issue before we merge.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset too many files 94328498
263723460 https://github.com/pydata/xarray/issues/463#issuecomment-263723460 https://api.github.com/repos/pydata/xarray/issues/463 MDEyOklzc3VlQ29tbWVudDI2MzcyMzQ2MA== pwolfram 4295853 2016-11-29T22:39:25Z 2016-11-29T23:30:59Z CONTRIBUTOR

I just realized I didn't say thank you to @shoyer et al for the advice and help. Please forgive my rudeness.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset too many files 94328498
263734251 https://github.com/pydata/xarray/issues/463#issuecomment-263734251 https://api.github.com/repos/pydata/xarray/issues/463 MDEyOklzc3VlQ29tbWVudDI2MzczNDI1MQ== shoyer 1217238 2016-11-29T23:30:02Z 2016-11-29T23:30:02Z MEMBER

if I understand correctly, the best approach as you see it is to build on opener via #1128, recognizing this will be essentially "upgraded" sometime in the future, right?

Yes, exactly. I plan to merge that PR very shortly, after a few fixes for the failing tests on Windows (less than an hour of work).

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset too many files 94328498
263721589 https://github.com/pydata/xarray/issues/463#issuecomment-263721589 https://api.github.com/repos/pydata/xarray/issues/463 MDEyOklzc3VlQ29tbWVudDI2MzcyMTU4OQ== pwolfram 4295853 2016-11-29T22:31:25Z 2016-11-29T22:31:25Z CONTRIBUTOR

@shoyer, if I understand correctly, the best approach as you see it is to build on opener via #1128, recognizing this will be essentially "upgraded" sometime in the future, right?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset too many files 94328498
263706346 https://github.com/pydata/xarray/issues/463#issuecomment-263706346 https://api.github.com/repos/pydata/xarray/issues/463 MDEyOklzc3VlQ29tbWVudDI2MzcwNjM0Ng== shoyer 1217238 2016-11-29T21:35:06Z 2016-11-29T21:35:06Z MEMBER

@pwolfram NcML is just an XML specification for how variables in a set of NetCDF files can be combined into a single virtual NetCDF file. This would be useful because it would allow building a version of open_mfdataset that doesn't need to inspect every single file. So this is definitely independent of the other options.

I suspect that even the LRU cache approach would build on opener from #1128. From a design perspective in the DataStore subclasses, I would guess that both the LRU cache and my latest suggestion should look pretty similar: the appropriate methods on DataStore and the data store Array subclasses will need to use something like a with self._ensure_open(): block to guard all access to underlying file objects.
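A toy illustration of that guard pattern, using made-up names (ReopeningStore, path, _ds) rather than xarray's actual DataStore internals:

``` python
import contextlib

import netCDF4


class ReopeningStore(object):
    """Toy store that reopens its underlying netCDF file on demand."""

    def __init__(self, path):
        self.path = path
        self._ds = None

    @contextlib.contextmanager
    def _ensure_open(self):
        # Reopen the file if it is not currently open, e.g. because it was
        # closed explicitly or evicted from a cache of open files.
        if self._ds is None:
            self._ds = netCDF4.Dataset(self.path, mode='r')
        yield self._ds

    def read(self, name, key):
        # Every access to the underlying file goes through the guard.
        with self._ensure_open() as ds:
            return ds.variables[name][key]
```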

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset too many files 94328498
263693540 https://github.com/pydata/xarray/issues/463#issuecomment-263693540 https://api.github.com/repos/pydata/xarray/issues/463 MDEyOklzc3VlQ29tbWVudDI2MzY5MzU0MA== pwolfram 4295853 2016-11-29T20:46:20Z 2016-11-29T20:47:30Z CONTRIBUTOR

@shoyer, you probably have the very best feel for what the most efficacious solution to this problem is, in terms of fixing the issue, performance, longer-term utility, etc. Is there any clear winner among the following, potentially non-exhaustive, options?

  1. LRU cache from #798
  2. Building on opener #1128
  3. New wrapper functionality as discussed above for NcML
  4. Use of PyReshaper (e.g., short term acknowledgement that change to xarray / dask may be somewhat out of scope for current design goals)

My current analysis:

I could see our team using PyReshaper because our data output format already has inertia, but this adds complexity to a workflow that intuitively should be handled inside xarray. However, I think we want to get around the file-number limitation eventually because it is an issue that multiple groups keep bringing up. PyReshaper is perhaps the simplest solution, but it is specific to our uses and not necessarily general. Towards a general solution, we would intuitively pay a fixed-cost performance penalty for the opener solution, but it may be the simplest and cleanest approach, at least for the short term. However, we may need the LRU cache eventually to bridge xarray / dask-distributed, so implementation of opener could be a deprecated effort in the long term. The NcML approach has the flavor of a solution along the lines of PyReshaper, although my limited experience with PyReshaper and NcML precludes a more rigorous analysis. We can follow up with @kmpaul on this point if it would be helpful moving forward.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset too many files 94328498
263652409 https://github.com/pydata/xarray/issues/463#issuecomment-263652409 https://api.github.com/repos/pydata/xarray/issues/463 MDEyOklzc3VlQ29tbWVudDI2MzY1MjQwOQ== shoyer 1217238 2016-11-29T18:17:17Z 2016-11-29T18:17:17Z MEMBER

@shoyer is it ever feasible to read the first NetCDF file in a sequence and assume that they are all the same except to increment a datetime dimension by increasing days?

Sure. This should probably be a different wrapper function than open_mfdataset, though, one that can make stronger assumptions. For example, one might make a wrapper function for handling NcML.

@kmpaul thanks for sharing! This is useful background.

There is at least one other option worth considering. Instead of using the open file LRU cache, a simpler option could be to add an optional argument to xarray backends (building on opener from https://github.com/pydata/xarray/pull/1128) that switches them to open/close files every time data is accessed.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset too many files 94328498
263647433 https://github.com/pydata/xarray/issues/463#issuecomment-263647433 https://api.github.com/repos/pydata/xarray/issues/463 MDEyOklzc3VlQ29tbWVudDI2MzY0NzQzMw== kmpaul 11411331 2016-11-29T17:59:20Z 2016-11-29T17:59:20Z CONTRIBUTOR

Sorry for the delay... I saw the reference and then needed to find some time to read back over the issues to get some context.

You are correct. The PyReshaper was designed to address this type of problem, though not exactly the issue with xarray and dask. It's a pretty common problem, and it's the reason that the CESM developers are moving to long-term archival of time-series files ONLY. (In other words, PyReshaper is being incorporated into the automated CESM run-processes.) ...Of course, one could argue that this step shouldn't be necessary with some clever I/O in the models themselves to write time-series directly.

The PyReshaper opens and closes each time-slice file explicitly before and after each read, respectively. And, if fully scaled (i.e., 1 MPI process per output file), you only ever have 2 files open at a time per process. In this particular operation, the overhead associated with open/close on the input files is negligible compared to the total R/W times.

So, anyway, the PyReshaper (https://github.com/NCAR/PyReshaper) can definitely help...though I consider it a stop-gap for the moment. I'm happy to help people figure out how to get it to work for your problems, if that's a path you want to consider.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset too many files 94328498
263467311 https://github.com/pydata/xarray/issues/463#issuecomment-263467311 https://api.github.com/repos/pydata/xarray/issues/463 MDEyOklzc3VlQ29tbWVudDI2MzQ2NzMxMQ== mrocklin 306380 2016-11-29T03:35:43Z 2016-11-29T03:35:43Z MEMBER

@shoyer is it ever feasible to read the first NetCDF file in a sequence and assume that they are all the same except to increment a datetime dimension by increasing days?

On Mon, Nov 28, 2016 at 7:19 PM, Stephan Hoyer notifications@github.com wrote:

if I understand correctly, incorporation of the LRU cache could help with this problem assuming time series were sliced into small chunks for access, correct? We would still run into problems, however, if there were say 10^6 files and we wanted to get a time-series spanning these files, right?

The LRU cache solution proposed in #798 https://github.com/pydata/xarray/issues/798 would work in either case. It just would have poor performance when accessing a small piece of each of 10^6 files, both to build the graph (because xarray needs to open each file to read the metadata) and to do the actual computation (again, because of the need to open so many files). If you only need a small amount of data from many files, you probably want to reshape your data to minimize the amount of necessary file access no matter what, whether you do that reshaping with PyReshaper or xarray/dask.array/dask-distributed.


{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset too many files 94328498
263437709 https://github.com/pydata/xarray/issues/463#issuecomment-263437709 https://api.github.com/repos/pydata/xarray/issues/463 MDEyOklzc3VlQ29tbWVudDI2MzQzNzcwOQ== shoyer 1217238 2016-11-29T00:19:53Z 2016-11-29T00:19:53Z MEMBER

if I understand correctly, incorporation of the LRU cache could help with this problem assuming time series were sliced into small chunks for access, correct? We would still run into problems, however, if there were say 10^6 files and we wanted to get a time-series spanning these files, right?

The LRU cache solution proposed in https://github.com/pydata/xarray/issues/798 would work in either case. It just would have poor performance when accessing a small piece of each of 10^6 files, both to build the graph (because xarray needs to open each file to read the metadata) and to do the actual computation (again, because of the need to open so many files). If you only need a small amount of data from many files, you probably want to reshape your data to minimize the amount of necessary file access no matter what, whether you do that reshaping with PyReshaper or xarray/dask.array/dask-distributed.
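A compact, hypothetical sketch of the open-file LRU cache idea being discussed in #798: cap the number of simultaneously open datasets and close the least recently used one when the cap is hit. The cap and function name here are made up for illustration:

``` python
from collections import OrderedDict

import netCDF4

MAX_OPEN_FILES = 128          # hypothetical cap on simultaneously open files
_open_files = OrderedDict()   # path -> open netCDF4.Dataset


def open_cached(path):
    """Return an open dataset, evicting the least recently used one if needed."""
    if path in _open_files:
        _open_files.move_to_end(path)
        return _open_files[path]
    if len(_open_files) >= MAX_OPEN_FILES:
        _, oldest = _open_files.popitem(last=False)
        oldest.close()
    _open_files[path] = netCDF4.Dataset(path, mode='r')
    return _open_files[path]
```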

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset too many files 94328498
263418422 https://github.com/pydata/xarray/issues/463#issuecomment-263418422 https://api.github.com/repos/pydata/xarray/issues/463 MDEyOklzc3VlQ29tbWVudDI2MzQxODQyMg== pwolfram 4295853 2016-11-28T22:42:55Z 2016-11-28T22:43:32Z CONTRIBUTOR

We (+ @milenaveneziani and @xylar) are running into this issue again. Ideally this should be resolved; after following up with everyone on strategy, I may have another look at this issue if it sounds straightforward to fix.

@shoyer and @mrocklin, if I understand correctly, incorporation of the LRU cache could help with this problem assuming time series were sliced into small chunks for access, correct? We would still run into problems, however, if there were say 10^6 files and we wanted to get a time-series spanning these files, right? If so, we may need a more robust solution than just the LRU cache. In the short term, PyReshaper may provide a temporary solution for us. cc @kmpaul to provide some perspective here too regarding use of https://github.com/NCAR/PyReshaper.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset too many files 94328498
224049602 https://github.com/pydata/xarray/issues/463#issuecomment-224049602 https://api.github.com/repos/pydata/xarray/issues/463 MDEyOklzc3VlQ29tbWVudDIyNDA0OTYwMg== darothen 4992424 2016-06-06T18:42:06Z 2016-06-06T18:42:06Z NONE

@mangecoeur, although it's not an xarray-based solution, I've found that by far the best solution to this problem is to transform your dataset from the "timeslice" format (which is convenient for models to write out - all the data at a given point in time, often in separate files for each time step) to "timeseries" format - a continuous format where you have all the data for a single variable in a single file (or a much smaller collection of files).

NCAR published a great utility for converting batches of NetCDF output from timeslice to timeseries format here; it's significantly faster than any shell-script/CDO/NCO solution I've ever encountered, and it parallelizes extremely easily.

Adding a simple post-processing step to convert my simulation output to timeseries format dramatically reduced my overall work time. Before, I had a separate handler which re-implemented open_mfdataset(), performed an intermediate reduction (usually extracting a variable), and then concatenated within xarray. This could get around the open file limit, but it wasn't fast. My pre-processed data is often still big - barely fitting within memory - but it's far easier to handle, and you can throw dask at it no problem to get huge speedups in analysis.
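A rough xarray-only sketch of that timeslice-to-timeseries conversion (the NCAR tool referenced above, PyReshaper, is the purpose-built and much faster route); the paths are placeholders, the output directory is assumed to exist, and with enough input files this step itself may need a raised file limit or autoclose=True:

``` python
import xarray as xr

# Combine many per-timestep files, then write one file per variable so that
# later analyses only ever need to open a handful of files.
with xr.open_mfdataset('slices/output_*.nc') as ds:
    for name in ds.data_vars:
        ds[name].to_dataset(name=name).to_netcdf('timeseries/%s.nc' % name)
```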

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset too many files 94328498
223918870 https://github.com/pydata/xarray/issues/463#issuecomment-223918870 https://api.github.com/repos/pydata/xarray/issues/463 MDEyOklzc3VlQ29tbWVudDIyMzkxODg3MA== mangecoeur 743508 2016-06-06T10:09:48Z 2016-06-06T10:09:48Z CONTRIBUTOR

So using a cleaner minimal example it does appear that the files are closed after the dataset is closed. However, they are all open during dataset loading - this is what blows past the OSX default max open file limit.

I think this could be a real issue when using Xarray to handle too-big-for-ram datasets - you could easily be trying to access 1000s of files (especially with weather data), so Xarray should limit the number it holds open at any one time during data load. Not being familiar with the internals I'm not sure if this is an issue in Xarray itself or in the Dask backend.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset too many files 94328498
223905394 https://github.com/pydata/xarray/issues/463#issuecomment-223905394 https://api.github.com/repos/pydata/xarray/issues/463 MDEyOklzc3VlQ29tbWVudDIyMzkwNTM5NA== mangecoeur 743508 2016-06-06T09:06:33Z 2016-06-06T09:06:33Z CONTRIBUTOR

@shoyer thanks - here's how I'm using open_mfdataset - not using any options. I'm going to try using the h5netcdf backend to see if I get the same results. I'm still not 100% confident that I'm tracking open files correctly with lsof, so I'm going to try to make a minimal example to investigate.

``` python

def weather_dataset(root_path: Path, *, start_date: datetime = None, end_date: datetime = None):
    flat_files_paths = get_dset_file_paths(root_path, start_date=start_date, end_date=end_date)
    # Convert Paths to list of strings for xarray
    dataset = xr.open_mfdataset([str(f) for f in flat_files_paths])
    return dataset


def cfsr_weather_loader(db, site_lookup_fn=None, dset_start=None, dset_end=None, site_conf=None):
    # Pull values out of the
    dt_conf = site_conf if site_conf else WEATHER_CFSR
    dset_start = dset_start if dset_start else dt_conf['start_dt']
    dset_end = dset_end if dset_end else dt_conf['end_dt']

    if site_lookup_fn is None:
        site_lookup_fn = site_lookup_postcode_district

    def weather_loader(site_id, start_date, end_date, resample=None):
        # using the tuple because always getting mixed up with lon/lat
        geo_lookup = site_lookup_fn(site_id, db)

        # With statement should ensure dset is closed after loading.
        with weather_dataset(WEATHER_CFSR['path'],
                             start_date=dset_start,
                             end_date=dset_end) as weather:
            data = weighted_regional_timeseries(weather, start_date, end_date,
                                                lon=geo_lookup.lon,
                                                lat=geo_lookup.lat,
                                                weights=geo_lookup.weights)

        # RENAME from CFSR standard
        data = data.rename(columns=WEATHER_RENAME)

        if resample is not None:
            data = data.resample(resample).mean()
        data.irradiance /= 1000.0  # convert irradiance to kW
        return data

    return weather_loader

```

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset too many files 94328498
223838593 https://github.com/pydata/xarray/issues/463#issuecomment-223838593 https://api.github.com/repos/pydata/xarray/issues/463 MDEyOklzc3VlQ29tbWVudDIyMzgzODU5Mw== shoyer 1217238 2016-06-05T21:23:41Z 2016-06-05T21:23:41Z MEMBER

@mangecoeur I can take a look. Can you share an example of how you use the with block? Are you using any special options to open_mfdataset?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset too many files 94328498
223837612 https://github.com/pydata/xarray/issues/463#issuecomment-223837612 https://api.github.com/repos/pydata/xarray/issues/463 MDEyOklzc3VlQ29tbWVudDIyMzgzNzYxMg== mangecoeur 743508 2016-06-05T21:05:40Z 2016-06-05T21:05:40Z CONTRIBUTOR

So on investigation, even though my dataset creation is wrapped in a with block, using lsof to check the file handles held by my iPython kernel suggests that all the input files are still open. Are you certain that the backend correctly closes files in a multifile dataset? Is there a way to explicitly force this to happen?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset too many files 94328498
223810723 https://github.com/pydata/xarray/issues/463#issuecomment-223810723 https://api.github.com/repos/pydata/xarray/issues/463 MDEyOklzc3VlQ29tbWVudDIyMzgxMDcyMw== mangecoeur 743508 2016-06-05T12:34:11Z 2016-06-05T12:34:11Z CONTRIBUTOR

I still hit this issue after wrapping my open_mfdataset in a with statement. I suspect it's an OSX problem: MacOS has a very low default max-open-files limit for applications started from the shell (something like 256). It's not yet clear to me whether my datasets are being correctly closed; investigating...

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset too many files 94328498
223687053 https://github.com/pydata/xarray/issues/463#issuecomment-223687053 https://api.github.com/repos/pydata/xarray/issues/463 MDEyOklzc3VlQ29tbWVudDIyMzY4NzA1Mw== mangecoeur 743508 2016-06-03T20:31:56Z 2016-06-03T20:31:56Z CONTRIBUTOR

It seems to happen even with a freshly restarted notebook, but I'll try a with statement to see if it helps. On 3 Jun 2016 19:53, "Stephan Hoyer" notifications@github.com wrote:

I suspect you hit this in IPython after rerunning cells, because file handles are only automatically closed when programs exit. You might find it a good idea to explicitly close files by calling .close() (or using a "with" statement) on Datasets opened with open_mfdataset.

On Fri, Jun 3, 2016 at 11:08 AM, mangecoeur notifications@github.com wrote:

I'm also running into this error - but strangely it only happens when using IPython interactive backend. I have some tests which work fine, but doing the same in IPython fails.

I'm opening a few hundred files (about 10Mb each, one per month across a few variables). I'm using the default NetCDF backend.


{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset too many files 94328498
223663026 https://github.com/pydata/xarray/issues/463#issuecomment-223663026 https://api.github.com/repos/pydata/xarray/issues/463 MDEyOklzc3VlQ29tbWVudDIyMzY2MzAyNg== shoyer 1217238 2016-06-03T18:53:22Z 2016-06-03T18:53:22Z MEMBER

I suspect you hit this in IPython after rerunning cells, because file handles are only automatically closed when programs exit. You might find it a good idea to explicitly close files by calling .close() (or using a "with" statement) on Datasets opened with open_mfdataset.

On Fri, Jun 3, 2016 at 11:08 AM, mangecoeur notifications@github.com wrote:

I'm also running into this error - but strangely it only happens when using IPython interactive backend. I have some tests which work fine, but doing the same in IPython fails.

I'm opening a few hundred files (about 10Mb each, one per month across a few variables). I'm using the default NetCDF backend.


{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset too many files 94328498
223651454 https://github.com/pydata/xarray/issues/463#issuecomment-223651454 https://api.github.com/repos/pydata/xarray/issues/463 MDEyOklzc3VlQ29tbWVudDIyMzY1MTQ1NA== mangecoeur 743508 2016-06-03T18:08:24Z 2016-06-03T18:08:24Z CONTRIBUTOR

I'm also running into this error - but strangely it only happens when using IPython interactive backend. I have some tests which work fine, but doing the same in IPython fails.

I'm opening a few hundred files (about 10Mb each, one per month across a few variables). I'm using the default NetCDF backend.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset too many files 94328498
143382040 https://github.com/pydata/xarray/issues/463#issuecomment-143382040 https://api.github.com/repos/pydata/xarray/issues/463 MDEyOklzc3VlQ29tbWVudDE0MzM4MjA0MA== shoyer 1217238 2015-09-26T00:22:51Z 2015-09-26T00:22:51Z MEMBER

OK, I think you could also just add an ensured_open() to the __repr__() method. Right now that class is inheriting it from NDArrayMixin.


{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset too many files 94328498
143373357 https://github.com/pydata/xarray/issues/463#issuecomment-143373357 https://api.github.com/repos/pydata/xarray/issues/463 MDEyOklzc3VlQ29tbWVudDE0MzM3MzM1Nw== cpaulik 380927 2015-09-25T23:11:39Z 2015-09-25T23:11:39Z NONE

OK, I'll try. Thanks.

But I originally tested if netCDF4 can work with a closed/reopened variable like this:

``` python
In [1]: import netCDF4

In [2]: a = netCDF4.Dataset("temp.nc", mode="w")

In [3]: a.createDimension("lon")
Out[3]: <class 'netCDF4._netCDF4.Dimension'> (unlimited): name = 'lon', size = 0

In [4]: a.createVariable("lon", "f8", dimensions=("lon"))
Out[4]: <class 'netCDF4._netCDF4.Variable'> float64 lon(lon) unlimited dimensions: lon current shape = (0,) filling on, default _FillValue of 9.969209968386869e+36 used

In [5]: v = a.variables['lon']

In [6]: v
Out[6]: <class 'netCDF4._netCDF4.Variable'> float64 lon(lon) unlimited dimensions: lon current shape = (0,) filling on, default _FillValue of 9.969209968386869e+36 used

In [7]: a.close()

In [8]: v
Out[8]:
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
/home/cp/.pyenv/versions/miniconda3-3.16.0/envs/xray-3.5.0/lib/python3.5/site-packages/IPython/core/formatters.py in __call__(self, obj)
    695                 type_pprinters=self.type_printers,
    696                 deferred_pprinters=self.deferred_printers)
--> 697             printer.pretty(obj)
    698             printer.flush()
    699             return stream.getvalue()
/home/cp/.pyenv/versions/miniconda3-3.16.0/envs/xray-3.5.0/lib/python3.5/site-packages/IPython/lib/pretty.py in pretty(self, obj)
    381                 if callable(meth):
    382                     return meth(obj, self, cycle)
--> 383             return _default_pprint(obj, self, cycle)
    384         finally:
    385             self.end_group()
/home/cp/.pyenv/versions/miniconda3-3.16.0/envs/xray-3.5.0/lib/python3.5/site-packages/IPython/lib/pretty.py in _default_pprint(obj, p, cycle)
    501     if _safe_getattr(klass, '__repr__', None) not in _baseclass_reprs:
    502         # A user-provided repr. Find newlines and replace them with p.break()
--> 503         _repr_pprint(obj, p, cycle)
    504         return
    505     p.begin_group(1, '<')
/home/cp/.pyenv/versions/miniconda3-3.16.0/envs/xray-3.5.0/lib/python3.5/site-packages/IPython/lib/pretty.py in _repr_pprint(obj, p, cycle)
    683     """A pprint that just redirects to the normal repr function."""
    684     # Find newlines and replace them with p.break()
--> 685     output = repr(obj)
    686     for idx,output_line in enumerate(output.splitlines()):
    687         if idx:
netCDF4/_netCDF4.pyx in netCDF4._netCDF4.Variable.__repr__ (netCDF4/_netCDF4.c:25045)()
netCDF4/_netCDF4.pyx in netCDF4._netCDF4.Variable.__unicode__ (netCDF4/_netCDF4.c:25243)()
netCDF4/_netCDF4.pyx in netCDF4._netCDF4.Variable.dimensions.__get__ (netCDF4/_netCDF4.c:27486)()
netCDF4/_netCDF4.pyx in netCDF4._netCDF4.Variable._getdims (netCDF4/_netCDF4.c:26297)()
RuntimeError: NetCDF: Not a valid ID

In [9]: a = netCDF4.Dataset("temp.nc")

In [10]: v
Out[10]: <class 'netCDF4._netCDF4.Variable'> float64 lon(lon) unlimited dimensions: lon current shape = (0,) filling on, default _FillValue of 9.969209968386869e+36 used
```

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset too many files 94328498
143347373 https://github.com/pydata/xarray/issues/463#issuecomment-143347373 https://api.github.com/repos/pydata/xarray/issues/463 MDEyOklzc3VlQ29tbWVudDE0MzM0NzM3Mw== shoyer 1217238 2015-09-25T20:35:38Z 2015-09-25T20:35:38Z MEMBER

OK, so the problem is that self.array on NetCDF4ArrayWrapper is retaining a reference to netCDF4.Variable object on the closed dataset. It's not enough to merely ensure that a netCDF4 dataset is opened -- you also need to ensure that no references to variables on the old dataset are still around. So get_variables/open_store_variable may need a refactor to deal with this.
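A hypothetical illustration of that refactor, assuming a store that exposes an `_ensure_open()` guard like the sketch earlier in this thread: the wrapper remembers the variable's name and resolves it against whatever dataset is currently open, instead of holding on to a netCDF4.Variable from a dataset that may since have been closed. This is not xarray's actual NetCDF4ArrayWrapper:

``` python
class ArrayWrapperSketch(object):
    """Keeps no reference to the underlying netCDF4.Variable object."""

    def __init__(self, store, variable_name):
        self.store = store                  # knows how to (re)open its file
        self.variable_name = variable_name  # stored instead of the Variable

    def __getitem__(self, key):
        # Re-resolve the variable on the currently open dataset every time,
        # so a close/reopen cycle in the store cannot leave a stale NetCDF ID.
        with self.store._ensure_open() as ds:
            return ds.variables[self.variable_name][key]
```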

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset too many files 94328498
143338384 https://github.com/pydata/xarray/issues/463#issuecomment-143338384 https://api.github.com/repos/pydata/xarray/issues/463 MDEyOklzc3VlQ29tbWVudDE0MzMzODM4NA== cpaulik 380927 2015-09-25T20:02:42Z 2015-09-25T20:02:42Z NONE

I've only put the try - except there to conditionally set the breakpoint. How does it make a difference whether self.store.close is called? If it is not called, the dataset remains open, which should not cause the weird behaviour reported above?

Nevertheless, I have updated my branch to use a context manager because it is a better solution, but I still see this strange behaviour where merely printing the variable alters the test outcome.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset too many files 94328498
143325053 https://github.com/pydata/xarray/issues/463#issuecomment-143325053 https://api.github.com/repos/pydata/xarray/issues/463 MDEyOklzc3VlQ29tbWVudDE0MzMyNTA1Mw== shoyer 1217238 2015-09-25T19:06:51Z 2015-09-25T19:06:51Z MEMBER

@cpaulik I wonder if the issue is this section in your __getitem__ method:

``` python
data = getitem(self.array, key)
try:
    self.store.ensure_open()
    data = getitem(self.array, key)
except RuntimeError as e:
    raise e
    pass
if self.ndim == 0:
    # work around for netCDF4-python's broken handling of 0-d
    # arrays (slicing them always returns a 1-dimensional array):
    # https://github.com/Unidata/netcdf4-python/pull/220
    data = np.asscalar(data)
self.store.close()
return data
```

I would put self.store.close() in a finally clause following the getitem clause.

Actually, you probably want to put this in a context manager that automatically closes the file, something like:

``` python
with self.store.opened():
    data = getitem(self.array, key)
```

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset too many files 94328498
143222580 https://github.com/pydata/xarray/issues/463#issuecomment-143222580 https://api.github.com/repos/pydata/xarray/issues/463 MDEyOklzc3VlQ29tbWVudDE0MzIyMjU4MA== cpaulik 380927 2015-09-25T13:27:59Z 2015-09-25T13:27:59Z NONE

I've pushed a few commits trying this out to https://github.com/cpaulik/xray/tree/closing_netcdf_backend . I can open a WIP PR if this would be easier to discuss there.

There are, however, a few tests that keep failing, and I cannot figure out why.

e.g.: test_backends.py::NetCDF4ViaDaskDataTest::test_compression_encoding:

If I set a breakpoint at line 941 of dataset.py and just continue, the test fails.

If, however, I evaluate self.variables.items() or even self.variables at the breakpoint, I get the correct output and the test passes when continued. I cannot really see the difference between evaluating this in ipdb and the code that is on that line.

The error I get when running the test without interference is:

``` shell
test_backends.py::NetCDF4ViaDaskDataTest::test_compression_encoding FAILED

====================================================== FAILURES =======================================================
______ NetCDF4ViaDaskDataTest.test_compression_encoding _________

self = <xray.test.test_backends.NetCDF4ViaDaskDataTest testMethod=test_compression_encoding>

    def test_compression_encoding(self):
        data = create_test_data()
        data['var2'].encoding.update({'zlib': True,
                                      'chunksizes': (5, 5),
                                      'fletcher32': True})
        with self.roundtrip(data) as actual:

test_backends.py:502:

/usr/lib/python2.7/contextlib.py:17: in __enter__
    return self.gen.next()
test_backends.py:596: in roundtrip
    yield ds.chunk()
../core/dataset.py:942: in chunk
    for k, v in self.variables.items()])
../core/dataset.py:935: in maybe_chunk
    token2 = tokenize(name, token if token else var._data)
/home/cpa/.virtualenvs/xray/local/lib/python2.7/site-packages/dask/base.py:152: in tokenize
    return md5(str(tuple(map(normalize_token, args))).encode()).hexdigest()
../core/indexing.py:301: in __repr__
    (type(self).__name__, self.array, self.key))
../core/utils.py:377: in __repr__
    return '%s(array=%r)' % (type(self).__name__, self.array)
../core/indexing.py:301: in __repr__
    (type(self).__name__, self.array, self.key))
../core/utils.py:377: in __repr__
    return '%s(array=%r)' % (type(self).__name__, self.array)
netCDF4/_netCDF4.pyx:2931: in netCDF4._netCDF4.Variable.__repr__ (netCDF4/_netCDF4.c:25068)
    ???
netCDF4/_netCDF4.pyx:2938: in netCDF4._netCDF4.Variable.__unicode__ (netCDF4/_netCDF4.c:25243)
    ???
netCDF4/_netCDF4.pyx:3059: in netCDF4._netCDF4.Variable.dimensions.__get__ (netCDF4/_netCDF4.c:27486)
    ???

    ???
E   RuntimeError: NetCDF: Not a valid ID

netCDF4/_netCDF4.pyx:2994: RuntimeError
============================================== 1 failed in 0.50 seconds ===============================================
```

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset too many files 94328498
142675701 https://github.com/pydata/xarray/issues/463#issuecomment-142675701 https://api.github.com/repos/pydata/xarray/issues/463 MDEyOklzc3VlQ29tbWVudDE0MjY3NTcwMQ== shoyer 1217238 2015-09-23T17:41:49Z 2015-09-23T17:41:49Z MEMBER

I think we can actually read all the variable metadata (shape and dtype) in when we open the file -- we already do that for reading in attributes. Something like this prototype, which would also be useful for reading compressed netCDF4 files with multiprocessing: https://github.com/blaze/dask/pull/457#issuecomment-123512166
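A small sketch of that metadata-caching idea, with made-up names: grab each variable's shape and dtype while the file is briefly open, so later attribute access never requires the file handle:

``` python
import netCDF4


def read_variable_metadata(path):
    """Collect per-variable shape and dtype during one short-lived open."""
    metadata = {}
    with netCDF4.Dataset(path, mode='r') as ds:
        for name, var in ds.variables.items():
            metadata[name] = (var.shape, var.dtype)
    return metadata
```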

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset too many files 94328498
142637232 https://github.com/pydata/xarray/issues/463#issuecomment-142637232 https://api.github.com/repos/pydata/xarray/issues/463 MDEyOklzc3VlQ29tbWVudDE0MjYzNzIzMg== cpaulik 380927 2015-09-23T15:19:36Z 2015-09-23T15:19:36Z NONE

I've run into the same problem and have been looking at the netCDF backend. A solution does not seem to be as easy as opening and closing the file in the __getitem__ method, since that also closes the file for any other access, e.g. attributes like shape or dtype.

Short of decorating all the functions of the netCDF4 package, I cannot think of a workable solution to this. But maybe I'm overlooking something fundamental.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset too many files 94328498
120668247 https://github.com/pydata/xarray/issues/463#issuecomment-120668247 https://api.github.com/repos/pydata/xarray/issues/463 MDEyOklzc3VlQ29tbWVudDEyMDY2ODI0Nw== rabernat 1197350 2015-07-11T23:01:38Z 2015-07-11T23:01:38Z MEMBER

8 MB. This is daily satellite data, with one file per time point. (Most satellite data is distributed this way.)

There are many other workarounds to this problem. You can try to increase your ulimits. Or you can join these small netcdf files together into a big one. I had daily data files, and I used NCO to concatenate them into monthly files. That basically solved my problem. But of course that involves going out of xray.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset too many files 94328498
120666380 https://github.com/pydata/xarray/issues/463#issuecomment-120666380 https://api.github.com/repos/pydata/xarray/issues/463 MDEyOklzc3VlQ29tbWVudDEyMDY2NjM4MA== shoyer 1217238 2015-07-11T22:36:30Z 2015-07-11T22:36:30Z MEMBER

Hmm. How big are each of your netCDF files?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset too many files 94328498
120662901 https://github.com/pydata/xarray/issues/463#issuecomment-120662901 https://api.github.com/repos/pydata/xarray/issues/463 MDEyOklzc3VlQ29tbWVudDEyMDY2MjkwMQ== rabernat 1197350 2015-07-11T21:37:42Z 2015-07-11T21:37:42Z MEMBER

I came up with a solution for this, but it is so slow that it is useless.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset too many files 94328498
120449743 https://github.com/pydata/xarray/issues/463#issuecomment-120449743 https://api.github.com/repos/pydata/xarray/issues/463 MDEyOklzc3VlQ29tbWVudDEyMDQ0OTc0Mw== rabernat 1197350 2015-07-10T16:19:15Z 2015-07-10T16:19:15Z MEMBER

Ok, I will have a look at this. I would be happy to contribute to this awesome project.

By the way, by monitoring /proc, I was able to see that the scipy backend actually opens each file TWICE, exacerbating the problem.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset too many files 94328498
120448308 https://github.com/pydata/xarray/issues/463#issuecomment-120448308 https://api.github.com/repos/pydata/xarray/issues/463 MDEyOklzc3VlQ29tbWVudDEyMDQ0ODMwOA== shoyer 1217238 2015-07-10T16:12:52Z 2015-07-10T16:12:52Z MEMBER

Sure, you could do this on the scipy backend -- the logic will be essentially the same on both backends.

I believe your issue with netCDF4 backend is the same as this one: https://github.com/xray/xray/issues/444. This will be fixed in the next release.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset too many files 94328498
120446569 https://github.com/pydata/xarray/issues/463#issuecomment-120446569 https://api.github.com/repos/pydata/xarray/issues/463 MDEyOklzc3VlQ29tbWVudDEyMDQ0NjU2OQ== rabernat 1197350 2015-07-10T16:08:48Z 2015-07-10T16:08:48Z MEMBER

I am using the scipy backend because the netcdf4 backend doesn't work for me at all. It core dumps with the error

```
python: posixio.c:366: px_rel: Assertion `pxp->bf_offset <= offset && offset < pxp->bf_offset + (off_t) pxp->bf_extent' failed.
Aborted (core dumped)
```

Are you suggesting I work on the scipy backend?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset too many files 94328498
120443929 https://github.com/pydata/xarray/issues/463#issuecomment-120443929 https://api.github.com/repos/pydata/xarray/issues/463 MDEyOklzc3VlQ29tbWVudDEyMDQ0MzkyOQ== shoyer 1217238 2015-07-10T15:58:41Z 2015-07-10T15:58:41Z MEMBER

Yes, this is a known issue, and I agree that it is annoying. We could work around this by opening up (and closing) netCDF files inside the __getitem__ call. If you're interested in possibly working on this, take a look at the netCDF4 backend for xray: https://github.com/xray/xray/blob/master/xray/backends/netCDF4_.py

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset too many files 94328498
120442769 https://github.com/pydata/xarray/issues/463#issuecomment-120442769 https://api.github.com/repos/pydata/xarray/issues/463 MDEyOklzc3VlQ29tbWVudDEyMDQ0Mjc2OQ== rabernat 1197350 2015-07-10T15:53:48Z 2015-07-10T15:53:48Z MEMBER

Just a little follow-up... I tried to work around the file limit by serializing the processing of the files and creating xray datasets with fewer files in them. However, I still eventually hit this error, suggesting that the files are never being closed. For example:

I would like to do

``` python
ds = xray.open_mfdataset(ddir + '*.nc' % yr, engine='scipy')
EKE = (ds.variables['u']**2 + ds.variables['v']**2).mean(dim='time').load()
```

This tries to open 8031 files and produces the error: [Errno 24] Too many open files

So then I try to create a new dataset for each year

``` python
EKE = []
for yr in xrange(1993, 2015):
    print yr
    # this opens about 365 files
    ds = xray.open_mfdataset(ddir + '/dt_global_allsat_msla_uv_%04d*.nc' % yr, engine='scipy')
    EKE.append((ds.variables['u']**2 + ds.variables['v']**2).mean(dim='time').load())
```

This works okay for the first two years. However, by the third year, I still get the error: [Errno 24] Too many open files. This is when the ulimit of 1024 files is exceeded.

Using xray version 0.5.1 via conda module.
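For what it's worth, wrapping each yearly dataset in a with block (or calling .close() on it once .load() has pulled the result into memory) releases the file handles between iterations. This is a tweak of the snippet above, assuming a version where datasets support the with statement; otherwise call ds.close() explicitly:

``` python
EKE = []
for yr in xrange(1993, 2015):
    print yr
    with xray.open_mfdataset(ddir + '/dt_global_allsat_msla_uv_%04d*.nc' % yr,
                             engine='scipy') as ds:
        # .load() materializes the result, so the ~365 files opened for this
        # year can be closed as soon as the with block exits.
        EKE.append((ds.variables['u']**2 + ds.variables['v']**2)
                   .mean(dim='time').load())
```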

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset too many files 94328498

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);