
issue_comments

11 rows where author_association = "NONE" and issue = 94328498 sorted by updated_at descending

Columns: id, html_url, issue_url, node_id, user, created_at, updated_at (sorted descending), author_association, body, reactions, performed_via_github_app, issue
347165242 https://github.com/pydata/xarray/issues/463#issuecomment-347165242 https://api.github.com/repos/pydata/xarray/issues/463 MDEyOklzc3VlQ29tbWVudDM0NzE2NTI0Mg== sebhahn 5929935 2017-11-27T12:17:17Z 2017-11-27T12:17:17Z NONE

Thanks, I'll test it!

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset too many files 94328498
347140117 https://github.com/pydata/xarray/issues/463#issuecomment-347140117 https://api.github.com/repos/pydata/xarray/issues/463 MDEyOklzc3VlQ29tbWVudDM0NzE0MDExNw== sebhahn 5929935 2017-11-27T10:26:56Z 2017-11-27T10:26:56Z NONE

OK, I found my problem: I had to increase `ulimit -n`.
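
For reference, the same limit can be checked and raised from Python with the standard-library `resource` module. This is only a minimal sketch, assuming a Unix-like system: `RLIMIT_NOFILE` is the limit that `ulimit -n` reports, the target value of 8192 is arbitrary, and the soft limit can normally only be raised up to the hard limit without elevated privileges.

``` python
# Sketch: inspect and raise the per-process open-file limit (Linux/macOS).
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("open-file limit: soft=%d, hard=%d" % (soft, hard))

# Raise the soft limit (here to 8192, or to the hard limit if that is lower)
# before opening thousands of files.
new_soft = 8192 if hard == resource.RLIM_INFINITY else min(8192, hard)
resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
```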

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset too many files 94328498
347126256 https://github.com/pydata/xarray/issues/463#issuecomment-347126256 https://api.github.com/repos/pydata/xarray/issues/463 MDEyOklzc3VlQ29tbWVudDM0NzEyNjI1Ng== sebhahn 5929935 2017-11-27T09:33:29Z 2017-11-27T09:33:29Z NONE

@shoyer I just ran into this issue again (with 8000 files, each 50 kB). I'm using xarray 0.9.6 and working on some performance tests. Is there an upper limit on the number of files?

File "/home/shahn/.pyenv/versions/warp_conda/envs/pyraster_env/lib/python2.7/site-packages/xarray/backends/api.py", line 505, in open_mfdataset File "/home/shahn/.pyenv/versions/warp_conda/envs/pyraster_env/lib/python2.7/site-packages/xarray/backends/api.py", line 282, in open_dataset File "/home/shahn/.pyenv/versions/warp_conda/envs/pyraster_env/lib/python2.7/site-packages/xarray/backends/netCDF4_.py", line 210, in __init__ File "/home/shahn/.pyenv/versions/warp_conda/envs/pyraster_env/lib/python2.7/site-packages/xarray/backends/netCDF4_.py", line 185, in _open_netcdf4_group File "netCDF4/_netCDF4.pyx", line 1811, in netCDF4._netCDF4.Dataset.__init__ (netCDF4/_netCDF4.c:13231) IOError: Too many open files

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset too many files 94328498
288868053 https://github.com/pydata/xarray/issues/463#issuecomment-288868053 https://api.github.com/repos/pydata/xarray/issues/463 MDEyOklzc3VlQ29tbWVudDI4ODg2ODA1Mw== ajoros 2615433 2017-03-23T21:37:19Z 2017-03-23T21:37:19Z NONE

Yessir @pwolfram, we are in business!

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset too many files 94328498
288835940 https://github.com/pydata/xarray/issues/463#issuecomment-288835940 https://api.github.com/repos/pydata/xarray/issues/463 MDEyOklzc3VlQ29tbWVudDI4ODgzNTk0MA== ajoros 2615433 2017-03-23T19:34:33Z 2017-03-23T19:34:33Z NONE

Thanks @pwolfram ... shot you a follow-up email at your Gmail...

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset too many files 94328498
288829145 https://github.com/pydata/xarray/issues/463#issuecomment-288829145 https://api.github.com/repos/pydata/xarray/issues/463 MDEyOklzc3VlQ29tbWVudDI4ODgyOTE0NQ== ajoros 2615433 2017-03-23T19:08:37Z 2017-03-23T19:08:37Z NONE

Not sure whether this is useful feedback, but I just wanted to provide an additional problematic case from my end that triggers this "too many files" problem:

NOTE: I have the latest xarray package. I have about 365 NetCDF files of roughly 1.7 MB each that I am trying to read with open_mfdataset(), and it continuously gives me the "too many files" error and completely hangs Jupyter notebooks to the point where I have to Ctrl+C out of it. Note that each NetCDF file contains a Dataset that is 195x195x1. Obviously it's not a file-size issue, as I'm not dealing with multiple gigs worth of data. Should I increase the OSX open max file limit, or will that not solve anything in my case?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset too many files 94328498
224049602 https://github.com/pydata/xarray/issues/463#issuecomment-224049602 https://api.github.com/repos/pydata/xarray/issues/463 MDEyOklzc3VlQ29tbWVudDIyNDA0OTYwMg== darothen 4992424 2016-06-06T18:42:06Z 2016-06-06T18:42:06Z NONE

@mangecoeur, although it's not an xarray-based solution, I've found that by far the best solution to this problem is to transform your dataset from the "timeslice" format (which is convenient for models to write out - all the data at a given point in time, often in separate files for each time step) to "timeseries" format - a continuous format, where you have all the data for a single variable in a single (or much smaller collection of) files.

NCAR published a great utility for converting batches of NetCDF output from timeslice to timeseries format here; it's significantly faster than any shell-script/CDO/NCO solution I've ever encountered, and it parallelizes extremely easily.

Adding a simple post-processing step to convert my simulation output to timeseries format dramatically reduced my overall work time. Before, I had a separate handler which re-implemented open_mfdataset(), performed an intermediate reduction (usually extracting a variable), and then concatenated within xarray. This could get around the open file limit, but it wasn't fast. My pre-processed data is often still big - barely fitting within memory - but it's far easier to handle, and you can throw dask at it no problem to get huge speedups in analysis.
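
A minimal sketch of that kind of handler, under assumed names (the glob pattern and the variable name are placeholders): open one timeslice file at a time, extract the variable of interest, close the file, and concatenate at the end, so only a single file is ever open.

``` python
# Sketch of the "open, extract, close, concatenate" workaround described above.
# "output/timeslice_*.nc" and "temperature" are placeholder names.
import glob

import xarray as xr

paths = sorted(glob.glob("output/timeslice_*.nc"))
pieces = []
for path in paths:
    with xr.open_dataset(path) as ds:
        # .load() reads the values into memory so the file handle can be closed.
        pieces.append(ds["temperature"].load())

combined = xr.concat(pieces, dim="time")
```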

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset too many files 94328498
143373357 https://github.com/pydata/xarray/issues/463#issuecomment-143373357 https://api.github.com/repos/pydata/xarray/issues/463 MDEyOklzc3VlQ29tbWVudDE0MzM3MzM1Nw== cpaulik 380927 2015-09-25T23:11:39Z 2015-09-25T23:11:39Z NONE

OK, I'll try. Thanks.

But I originally tested whether netCDF4 can work with a closed/reopened variable, like this:

``` python
In [1]: import netCDF4

In [2]: a = netCDF4.Dataset("temp.nc", mode="w")

In [3]: a.createDimension("lon")
Out[3]: <class 'netCDF4._netCDF4.Dimension'> (unlimited): name = 'lon', size = 0

In [4]: a.createVariable("lon", "f8", dimensions=("lon"))
Out[4]:
<class 'netCDF4._netCDF4.Variable'>
float64 lon(lon)
unlimited dimensions: lon
current shape = (0,)
filling on, default _FillValue of 9.969209968386869e+36 used

In [5]: v = a.variables['lon']

In [6]: v
Out[6]:
<class 'netCDF4._netCDF4.Variable'>
float64 lon(lon)
unlimited dimensions: lon
current shape = (0,)
filling on, default _FillValue of 9.969209968386869e+36 used

In [7]: a.close()

In [8]: v
Out[8]:
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
/home/cp/.pyenv/versions/miniconda3-3.16.0/envs/xray-3.5.0/lib/python3.5/site-packages/IPython/core/formatters.py in __call__(self, obj)
    695                 type_pprinters=self.type_printers,
    696                 deferred_pprinters=self.deferred_printers)
--> 697             printer.pretty(obj)
    698             printer.flush()
    699             return stream.getvalue()

/home/cp/.pyenv/versions/miniconda3-3.16.0/envs/xray-3.5.0/lib/python3.5/site-packages/IPython/lib/pretty.py in pretty(self, obj)
    381                 if callable(meth):
    382                     return meth(obj, self, cycle)
--> 383             return _default_pprint(obj, self, cycle)
    384         finally:
    385             self.end_group()

/home/cp/.pyenv/versions/miniconda3-3.16.0/envs/xray-3.5.0/lib/python3.5/site-packages/IPython/lib/pretty.py in _default_pprint(obj, p, cycle)
    501     if _safe_getattr(klass, '__repr__', None) not in _baseclass_reprs:
    502         # A user-provided repr. Find newlines and replace them with p.break()
--> 503         _repr_pprint(obj, p, cycle)
    504         return
    505     p.begin_group(1, '<')

/home/cp/.pyenv/versions/miniconda3-3.16.0/envs/xray-3.5.0/lib/python3.5/site-packages/IPython/lib/pretty.py in _repr_pprint(obj, p, cycle)
    683     """A pprint that just redirects to the normal repr function."""
    684     # Find newlines and replace them with p.break()
--> 685     output = repr(obj)
    686     for idx,output_line in enumerate(output.splitlines()):
    687         if idx:

netCDF4/_netCDF4.pyx in netCDF4._netCDF4.Variable.__repr__ (netCDF4/_netCDF4.c:25045)()

netCDF4/_netCDF4.pyx in netCDF4._netCDF4.Variable.__unicode__ (netCDF4/_netCDF4.c:25243)()

netCDF4/_netCDF4.pyx in netCDF4._netCDF4.Variable.dimensions.__get__ (netCDF4/_netCDF4.c:27486)()

netCDF4/_netCDF4.pyx in netCDF4._netCDF4.Variable._getdims (netCDF4/_netCDF4.c:26297)()

RuntimeError: NetCDF: Not a valid ID

In [9]: a = netCDF4.Dataset("temp.nc")

In [10]: v
Out[10]:
<class 'netCDF4._netCDF4.Variable'>
float64 lon(lon)
unlimited dimensions: lon
current shape = (0,)
filling on, default _FillValue of 9.969209968386869e+36 used
```

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset too many files 94328498
143338384 https://github.com/pydata/xarray/issues/463#issuecomment-143338384 https://api.github.com/repos/pydata/xarray/issues/463 MDEyOklzc3VlQ29tbWVudDE0MzMzODM4NA== cpaulik 380927 2015-09-25T20:02:42Z 2015-09-25T20:02:42Z NONE

I only put the try/except there to conditionally set the breakpoint. How does it make a difference whether self.store.close is called? If it is not called, the dataset remains open, which should not cause the weird behaviour reported above, should it?

Nevertheless, I have updated my branch to use a context manager because it is a better solution, but I still see the strange behaviour where merely printing the variable alters the test outcome.
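
The context-manager pattern referred to here is roughly the following; this is an illustrative sketch only, not the code from the branch, and the helper name `opened` is made up.

``` python
# Illustrative sketch: open the underlying netCDF file for the duration of an
# operation and guarantee it is closed afterwards, whatever happens.
import contextlib

import netCDF4


@contextlib.contextmanager
def opened(filename):
    ds = netCDF4.Dataset(filename, mode="r")
    try:
        yield ds
    finally:
        ds.close()


# Usage: the file is only held open inside the with-block.
# with opened("temp.nc") as ds:
#     lon = ds.variables["lon"][:]
```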

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset too many files 94328498
143222580 https://github.com/pydata/xarray/issues/463#issuecomment-143222580 https://api.github.com/repos/pydata/xarray/issues/463 MDEyOklzc3VlQ29tbWVudDE0MzIyMjU4MA== cpaulik 380927 2015-09-25T13:27:59Z 2015-09-25T13:27:59Z NONE

I've pushed a few commits trying this out to https://github.com/cpaulik/xray/tree/closing_netcdf_backend . I can open a WIP PR if this would be easier to discuss there.

There are, however, a few tests that keep failing, and I cannot figure out why.

e.g.: test_backends.py::NetCDF4ViaDaskDataTest::test_compression_encoding:

If I set a breakpoint at line 941 of dataset.py and just continue, the test fails.

If, however, I evaluate self.variables.items() or even self.variables at the breakpoint, I get the correct output and the test passes when continued. I cannot really see the difference between evaluating this in ipdb and the code that is on that line.

The error I get when running the test without interference is:

``` shell
test_backends.py::NetCDF4ViaDaskDataTest::test_compression_encoding FAILED

====================================================== FAILURES =======================================================
______ NetCDF4ViaDaskDataTest.test_compression_encoding _________

self = <xray.test.test_backends.NetCDF4ViaDaskDataTest testMethod=test_compression_encoding>

    def test_compression_encoding(self):
        data = create_test_data()
        data['var2'].encoding.update({'zlib': True,
                                      'chunksizes': (5, 5),
                                      'fletcher32': True})
        with self.roundtrip(data) as actual:

test_backends.py:502:

/usr/lib/python2.7/contextlib.py:17: in __enter__
    return self.gen.next()
test_backends.py:596: in roundtrip
    yield ds.chunk()
../core/dataset.py:942: in chunk
    for k, v in self.variables.items()])
../core/dataset.py:935: in maybe_chunk
    token2 = tokenize(name, token if token else var._data)
/home/cpa/.virtualenvs/xray/local/lib/python2.7/site-packages/dask/base.py:152: in tokenize
    return md5(str(tuple(map(normalize_token, args))).encode()).hexdigest()
../core/indexing.py:301: in __repr__
    (type(self).__name__, self.array, self.key))
../core/utils.py:377: in __repr__
    return '%s(array=%r)' % (type(self).__name__, self.array)
../core/indexing.py:301: in __repr__
    (type(self).__name__, self.array, self.key))
../core/utils.py:377: in __repr__
    return '%s(array=%r)' % (type(self).__name__, self.array)
netCDF4/_netCDF4.pyx:2931: in netCDF4._netCDF4.Variable.__repr__ (netCDF4/_netCDF4.c:25068)
    ???
netCDF4/_netCDF4.pyx:2938: in netCDF4._netCDF4.Variable.__unicode__ (netCDF4/_netCDF4.c:25243)
    ???
netCDF4/_netCDF4.pyx:3059: in netCDF4._netCDF4.Variable.dimensions.__get__ (netCDF4/_netCDF4.c:27486)
    ???

    ???
E   RuntimeError: NetCDF: Not a valid ID

netCDF4/_netCDF4.pyx:2994: RuntimeError
============================================== 1 failed in 0.50 seconds ===============================================
```

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset too many files 94328498
142637232 https://github.com/pydata/xarray/issues/463#issuecomment-142637232 https://api.github.com/repos/pydata/xarray/issues/463 MDEyOklzc3VlQ29tbWVudDE0MjYzNzIzMg== cpaulik 380927 2015-09-23T15:19:36Z 2015-09-23T15:19:36Z NONE

I've run into the same problem and have been looking at the netCDF backend. A solution does not seem to be as easy as opening and closing the file in the __getitem__ method, since that closes the file for any other access as well, e.g. attributes like shape or dtype.

Short of decorating all the functions of the netCDF4 package, I cannot think of a workable solution to this. But maybe I'm overlooking something fundamental.
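
One possible direction, sketched below with illustrative names (this is not xarray's actual backend code), is to cache the metadata that attribute access needs up front and reopen the file only while data is actually being read.

``` python
# Sketch: cache shape/dtype once, reopen the file only inside __getitem__,
# so a handle is open only for the duration of a read.
import netCDF4
import numpy as np


class ReopeningVariable(object):
    def __init__(self, filename, varname):
        self.filename = filename
        self.varname = varname
        # Cache metadata so .shape/.dtype work while the file is closed.
        with netCDF4.Dataset(self.filename) as ds:
            var = ds.variables[self.varname]
            self.shape = var.shape
            self.dtype = var.dtype

    def __getitem__(self, key):
        # Reopen for each read and close immediately afterwards.
        with netCDF4.Dataset(self.filename) as ds:
            return np.asarray(ds.variables[self.varname][key])
```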

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset too many files 94328498

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);