
issue_comments


2 rows where author_association = "MEMBER", issue = 479190812 and user = 1217238 sorted by updated_at descending


id: 520182257 · node_id: MDEyOklzc3VlQ29tbWVudDUyMDE4MjI1Nw==
user: shoyer (1217238) · author_association: MEMBER
created_at: 2019-08-10T21:53:39Z · updated_at: 2019-08-10T21:53:39Z
html_url: https://github.com/pydata/xarray/issues/3200#issuecomment-520182257
issue_url: https://api.github.com/repos/pydata/xarray/issues/3200

Also, if you're having memory issues I also would definitely recommend upgrading to a newer version of xarray. There was a recent fix that helps ensure that files get automatically closed when they are garbage collected, even if you don't call close() or use a context manager explicitly.
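For reference, the two cleanup patterns mentioned here look something like this (a minimal sketch; the file name is hypothetical):

```python
import numpy as np
import xarray as xr

# write a small example file to read back (hypothetical name)
xr.Dataset({'v': ('x', np.arange(4.0))}).to_netcdf('example.nc')

# option 1: a context manager closes the file automatically
with xr.open_dataset('example.nc') as ds:
    total = float(ds['v'].sum())

# option 2: call close() explicitly when done
ds = xr.open_dataset('example.nc')
try:
    total_again = float(ds['v'].sum())
finally:
    ds.close()
```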

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
issue: open_mfdataset memory leak, very simple case. v0.12 (479190812)
id: 520182139 · node_id: MDEyOklzc3VlQ29tbWVudDUyMDE4MjEzOQ==
user: shoyer (1217238) · author_association: MEMBER
created_at: 2019-08-10T21:51:25Z · updated_at: 2019-08-10T21:52:24Z
html_url: https://github.com/pydata/xarray/issues/3200#issuecomment-520182139
issue_url: https://api.github.com/repos/pydata/xarray/issues/3200

Thanks for the profiling script. I ran a few permutations of this:

- xarray.open_mfdataset with engine='netcdf4' (default)
- xarray.open_mfdataset with engine='h5netcdf'
- xarray.open_dataset with engine='netcdf4' (default)
- xarray.open_dataset with engine='h5netcdf'

Here are some plots (the plot images were not preserved here):

- xarray.open_mfdataset with engine='netcdf4': pretty noticeable memory leak, about 0.5 MB per open_mfdataset call.
- xarray.open_mfdataset with engine='h5netcdf': looks like a small memory leak, about 0.1 MB per open_mfdataset call.
- xarray.open_dataset with engine='netcdf4' (default): definitely has a memory leak.
- xarray.open_dataset with engine='h5netcdf': does not appear to have a memory leak.

So in conclusion, it looks like there are memory leaks:

1. when using netCDF4-Python (I was also able to confirm these without using xarray at all, just using netCDF4.Dataset)
2. when using xarray.open_mfdataset

(1) looks like by far the bigger issue, which you can work around by switching to scipy or h5netcdf to read your files.
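As a sketch of that workaround, switching backends is just the engine keyword (hypothetical file name; note engine='scipy' handles netCDF3 files only):

```python
import numpy as np
import xarray as xr

# write a netCDF3 file via the scipy backend, then read it back
# with engine='scipy' instead of the default netCDF4 backend
xr.Dataset({'v': ('x', np.arange(3.0))}).to_netcdf('scipy_test.nc',
                                                   engine='scipy')

with xr.open_dataset('scipy_test.nc', engine='scipy') as ds:
    mean = float(ds['v'].mean())
```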

(2) is an issue for xarray. We do some caching, specifically with our backend file manager, but given that issues only seem to appear when using open_mfdataset, I suspect it may have more to do with the interaction with Dask, though to be honest I'm not exactly sure how.

Note: I modified your script to set xarray's file cache size to 1, which helps smooth out the memory usage:

```python
import glob

import numpy as np
import xarray as xr
from memory_profiler import profile  # provides the @profile decorator


def CreateTestFiles():
    # create a bunch of files
    xlen = int(1e2)
    ylen = int(1e2)
    xdim = np.arange(xlen)
    ydim = np.arange(ylen)

    nfiles = 100
    for i in range(nfiles):
        data = np.random.rand(xlen, ylen, 1)
        datafile = xr.DataArray(data, coords=[xdim, ydim, [i]],
                                dims=['x', 'y', 'time'])
        datafile.to_netcdf('testfile_{}.nc'.format(i))


@profile
def ReadFiles():
    # for i in range(100):
    #     ds = xr.open_dataset('testfile_{}.nc'.format(i), engine='netcdf4')
    #     ds.close()
    ds = xr.open_mfdataset(glob.glob('testfile_*'), engine='h5netcdf',
                           concat_dim='time')
    ds.close()


if __name__ == '__main__':
    # write out files for testing
    CreateTestFiles()

    xr.set_options(file_cache_maxsize=1)

    # loop thru file read step
    for i in range(100):
        ReadFiles()
```

{
    "total_count": 2,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 1,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
issue: open_mfdataset memory leak, very simple case. v0.12 (479190812)


CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);
Powered by Datasette · About: xarray-datasette