
issue_comments


15 rows where author_association = "CONTRIBUTOR" and issue = 94328498 sorted by updated_at descending


Commenters: pwolfram (8), mangecoeur (6), kmpaul (1)
288867744 · pwolfram · 2017-03-23T21:36:07Z · https://github.com/pydata/xarray/issues/463#issuecomment-288867744

@ajoros should correct me if I'm wrong but it sounds like everything is working for his use case.

Reactions: 👍 1
288832707 · pwolfram · 2017-03-23T19:21:57Z · https://github.com/pydata/xarray/issues/463#issuecomment-288832707

@ajoros, #1198 was just merged so the bleeding-edge version of xarray is the one to try!

Reactions: none
288830741 · pwolfram · 2017-03-23T19:14:23Z · https://github.com/pydata/xarray/issues/463#issuecomment-288830741

@ajoros, can you try something like pip -v install --force git+ssh://git@github.com/pwolfram/xarray@fix_too_many_open_files to see whether #1198 fixes the problem with your dataset? Note that you need open_mfdataset(..., autoclose=True).

@shoyer should correct me if I'm wrong, but we are almost ready to merge the code in this PR, and it would be a great "in the field" check if you could try it out soon.
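A minimal sketch of what trying that branch looks like in practice, assuming a hypothetical set of monthly files (the paths and count here are made up):

``` python
import xarray as xr

# Hypothetical list of many NetCDF files, one per month of model output.
paths = [f"output/monthly_{i:04d}.nc" for i in range(1000)]

# autoclose=True (the option added by #1198) makes xarray close each
# underlying file after reading from it and reopen it on demand, keeping
# the number of simultaneously open handles below the OS limit.
ds = xr.open_mfdataset(paths, autoclose=True)
print(ds)
ds.close()
```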

Reactions: none
288414991 · pwolfram · 2017-03-22T14:25:37Z · https://github.com/pydata/xarray/issues/463#issuecomment-288414991

We are very close on #1198 and will be merging soon. This would be a great time for everyone to ensure that #1198 resolves this issue before we merge.

Reactions: none
263723460 · pwolfram · 2016-11-29T22:39:25Z (edited 2016-11-29T23:30:59Z) · https://github.com/pydata/xarray/issues/463#issuecomment-263723460

I just realized I didn't say thank you to @shoyer et al for the advice and help. Please forgive my rudeness.

Reactions: none
263721589 · pwolfram · 2016-11-29T22:31:25Z · https://github.com/pydata/xarray/issues/463#issuecomment-263721589

@shoyer, if I understand correctly, the best approach as you see it is to build on opener via #1128, recognizing that this will essentially be "upgraded" sometime in the future, right?

Reactions: none
263693540 · pwolfram · 2016-11-29T20:46:20Z (edited 2016-11-29T20:47:30Z) · https://github.com/pydata/xarray/issues/463#issuecomment-263693540

@shoyer, you probably have the best feel for which solution to this problem is most efficacious in terms of fixing the issue, performance, longer-term utility, etc. Is there any clear winner among the following, potentially non-exhaustive, options?

  1. LRU cache from #798
  2. Building on opener #1128
  3. New wrapper functionality as discussed above for NcML
  4. Use of PyReshaper (e.g., a short-term acknowledgement that changes to xarray / dask may be somewhat out of scope for current design goals)

My current analysis:

I could see our team using PyReshaper because our data output format already has inertia, but this adds complexity to a workflow that intuitively should be handled inside xarray; it is perhaps the simplest solution, but it is specific to our uses and not necessarily general. However, I think we want to get around the file-number limitation eventually because it is an issue that multiple groups keep bringing up. Toward a general solution, the opener approach would intuitively carry a fixed performance cost, but it may be the simplest and cleanest approach, at least in the short term. However, we may need the LRU cache eventually to bridge xarray / dask-distributed, so implementing opener could turn out to be a deprecated effort in the long term. The NcML approach has the flavor of a solution along the lines of PyReshaper, although my limited experience with PyReshaper and NcML precludes a more rigorous analysis. We can follow up with @kmpaul on this point if that would be helpful moving forward.
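For readers unfamiliar with option 1, here is a minimal sketch of the LRU idea, not the actual #798 implementation: keep at most N files open and close the least recently used handle when the cache fills (thread safety and error handling are omitted):

``` python
from collections import OrderedDict

import netCDF4


class LRUFileCache:
    """Keep at most `maxsize` netCDF files open; close the least
    recently used handle when a new file must be opened."""

    def __init__(self, maxsize=128):
        self.maxsize = maxsize
        self._cache = OrderedDict()  # path -> open netCDF4.Dataset

    def open(self, path):
        if path in self._cache:
            self._cache.move_to_end(path)  # mark as most recently used
        else:
            if len(self._cache) >= self.maxsize:
                _, oldest = self._cache.popitem(last=False)  # evict LRU
                oldest.close()
            self._cache[path] = netCDF4.Dataset(path)
        return self._cache[path]

    def close_all(self):
        for handle in self._cache.values():
            handle.close()
        self._cache.clear()
```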

Reactions: none
263647433 · kmpaul · 2016-11-29T17:59:20Z · https://github.com/pydata/xarray/issues/463#issuecomment-263647433

Sorry for the delay... I saw the reference and then needed to find some time to read back over the issues to get some context.

You are correct. The PyReshaper was designed to address this type of problem, though not exactly the issue with xarray and dask. It's a pretty common problem, and it's the reason that the CESM developers are moving to long-term archival of time-series files ONLY. (In other words, PyReshaper is being incorporated into the automated CESM run-processes.) ...Of course, one could argue that this step shouldn't be necessary with some clever I/O in the models themselves to write time-series directly.

The PyReshaper opens and closes each time-slice file explicitly before and after each read, respectively. And, if fully scaled (i.e., 1 MPI process per output file), you only ever have 2 files open at a time per process. In this particular operation, the overhead associated with open/close on the input files is negligible compared to the total R/W times.

So, anyway, the PyReshaper (https://github.com/NCAR/PyReshaper) can definitely help...though I consider it a stop-gap for the moment. I'm happy to help people figure out how to get it to work for your problems, if that's a path you want to consider.
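The open/close-per-read pattern described above is easy to emulate by hand; a rough sketch, assuming hypothetical time-slice files and a made-up variable name:

``` python
import netCDF4
import numpy as np

# Hypothetical monthly time-slice files and a variable to extract.
paths = [f"cesm/slice_{i:04d}.nc" for i in range(1200)]

# Open each input file only for as long as the read takes, so this
# process never holds more than one input handle open at a time.
pieces = []
for path in paths:
    with netCDF4.Dataset(path) as nc:
        pieces.append(np.asarray(nc.variables["TS"][:]))
series = np.concatenate(pieces)
```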

Reactions: none
263418422 · pwolfram · 2016-11-28T22:42:55Z (edited 2016-11-28T22:43:32Z) · https://github.com/pydata/xarray/issues/463#issuecomment-263418422

We (+ @milenaveneziani and @xylar) are running into this issue again. Ideally, it should be resolved; after following up with everyone on strategy, I may have another look at this issue if it sounds straightforward to fix.

@shoyer and @mrocklin, if I understand correctly, incorporation of the LRU cache could help with this problem assuming time series were sliced into small chunks for access, correct? We would still run into problems, however, if there were say 10^6 files and we wanted to get a time-series spanning these files, right? If so, we may need a more robust solution than just the LRU cache. In the short term, PyReshaper may provide a temporary solution for us. cc @kmpaul to provide some perspective here too regarding use of https://github.com/NCAR/PyReshaper.

Reactions: none
223918870 · mangecoeur · 2016-06-06T10:09:48Z · https://github.com/pydata/xarray/issues/463#issuecomment-223918870

So, using a cleaner minimal example, it does appear that the files are closed after the dataset is closed. However, they are all open during dataset loading, and that is what blows past the OSX default max-open-files limit.

I think this could be a real issue when using xarray to handle too-big-for-RAM datasets: you could easily be trying to access thousands of files (especially with weather data), so xarray should limit the number it holds open at any one time during data load. Not being familiar with the internals, I'm not sure whether this is an issue in xarray itself or in the dask backend.
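One way to verify this kind of observation, assuming the third-party psutil package (lsof gives the same information from the shell), is to count the process's open handles around the load:

``` python
import psutil
import xarray as xr

proc = psutil.Process()  # handle to the current Python process

print("open files before:", len(proc.open_files()))
ds = xr.open_mfdataset("weather/*.nc")  # hypothetical glob over many files
print("open files while the dataset is live:", len(proc.open_files()))
ds.close()
print("open files after close:", len(proc.open_files()))
```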

Reactions: none
223905394 · mangecoeur · 2016-06-06T09:06:33Z · https://github.com/pydata/xarray/issues/463#issuecomment-223905394

@shoyer thanks - here's how I'm using mfdataset, not using any options. I'm going to try using the h5netcdf backend to see if I get the same results. I'm still not 100% confident that I'm tracking open files correctly with lsof, so I'm going to make a minimal example to investigate.

``` python
# get_dset_file_paths, WEATHER_CFSR, site_lookup_postcode_district,
# weighted_regional_timeseries, and WEATHER_RENAME are project-specific
# helpers defined elsewhere.
from datetime import datetime
from pathlib import Path

import xarray as xr


def weather_dataset(root_path: Path, *, start_date: datetime = None, end_date: datetime = None):
    flat_files_paths = get_dset_file_paths(root_path, start_date=start_date, end_date=end_date)
    # Convert Paths to a list of strings for xarray
    dataset = xr.open_mfdataset([str(f) for f in flat_files_paths])
    return dataset


def cfsr_weather_loader(db, site_lookup_fn=None, dset_start=None, dset_end=None, site_conf=None):
    # Pull values out of the config
    dt_conf = site_conf if site_conf else WEATHER_CFSR
    dset_start = dset_start if dset_start else dt_conf['start_dt']
    dset_end = dset_end if dset_end else dt_conf['end_dt']

    if site_lookup_fn is None:
        site_lookup_fn = site_lookup_postcode_district

    def weather_loader(site_id, start_date, end_date, resample=None):
        # using the tuple because always getting mixed up with lon/lat
        geo_lookup = site_lookup_fn(site_id, db)

        # With statement should ensure dset is closed after loading.
        with weather_dataset(WEATHER_CFSR['path'],
                             start_date=dset_start,
                             end_date=dset_end) as weather:
            data = weighted_regional_timeseries(weather, start_date, end_date,
                                                lon=geo_lookup.lon,
                                                lat=geo_lookup.lat,
                                                weights=geo_lookup.weights)

        # RENAME from CFSR standard
        data = data.rename(columns=WEATHER_RENAME)

        if resample is not None:
            data = data.resample(resample).mean()
        data.irradiance /= 1000.0  # convert irradiance to kW
        return data

    return weather_loader
```

Reactions: none
223837612 · mangecoeur · 2016-06-05T21:05:40Z · https://github.com/pydata/xarray/issues/463#issuecomment-223837612

So, on investigation: even though my dataset creation is wrapped in a with block, using lsof to check the file handles held by my IPython kernel suggests that all the input files are still open. Are you certain that the backend correctly closes files in a multifile dataset? Is there a way to explicitly force this to happen?

Reactions: none
223810723 · mangecoeur · 2016-06-05T12:34:11Z · https://github.com/pydata/xarray/issues/463#issuecomment-223810723

I still hit this issue after wrapping my open_mfdataset in a with statement. I suspect this is an OSX problem: MacOS has a very low default max-open-files limit for applications started from the shell (something like 256). It's not yet clear to me whether my datasets are being correctly closed; investigating...
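The limit in question can be inspected, and the soft limit raised, from inside Python via the standard resource module (Unix only); a quick sketch:

``` python
import resource

# Query the current per-process open-file limits.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("soft:", soft, "hard:", hard)  # soft is often 256 on macOS

# Raise the soft limit for this process only; the hard limit can only
# be raised by a privileged user, and the kernel may still cap it.
resource.setrlimit(resource.RLIMIT_NOFILE, (min(4096, hard), hard))
```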

Reactions: none
223687053 · mangecoeur · 2016-06-03T20:31:56Z · https://github.com/pydata/xarray/issues/463#issuecomment-223687053

It seems to happen even with a freshly restarted notebook, but I'll try a with statement to see if it helps. On 3 Jun 2016 19:53, "Stephan Hoyer" notifications@github.com wrote:

I suspect you hit this in IPython after rerunning cells, because file handles are only automatically closed when programs exit. You might find it a good idea to explicitly close files by calling .close() (or using a "with" statement) on Datasets opened with open_mfdataset.
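A minimal illustration of that advice, with a made-up file glob and variable name:

``` python
import xarray as xr

# Option 1: a with statement closes the underlying files on exit.
with xr.open_mfdataset("data/*.nc") as ds:   # hypothetical file glob
    result = ds["t2m"].mean().load()         # hypothetical variable

# Option 2: call .close() explicitly when you are done.
ds = xr.open_mfdataset("data/*.nc")
result = ds["t2m"].mean().load()
ds.close()
```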


Reactions: none
223651454 · mangecoeur · 2016-06-03T18:08:24Z · https://github.com/pydata/xarray/issues/463#issuecomment-223651454

I'm also running into this error - but strangely, it only happens when using the IPython interactive backend. I have some tests which work fine, but doing the same in IPython fails.

I'm opening a few hundred files (about 10Mb each, one per month across a few variables). I'm using the default NetCDF backend.

Reactions: none

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);