
issue_comments


7 rows where issue = 479190812 sorted by updated_at descending


id html_url issue_url node_id user created_at updated_at ▲ author_association body reactions performed_via_github_app issue
1416446874 https://github.com/pydata/xarray/issues/3200#issuecomment-1416446874 https://api.github.com/repos/pydata/xarray/issues/3200 IC_kwDOAMm_X85UbUOa deeplycloudy 1325771 2023-02-03T21:52:57Z 2023-02-03T21:52:57Z CONTRIBUTOR

I was iterating today over a large dataset loaded with open_mfdataset, and had been observing memory usage growing from 2GB to 8GB+.

I can confirm that xr.set_options(file_cache_maxsize=1) kept memory use at a steady 2GB, properly releasing memory.
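The way file_cache_maxsize caps memory can be pictured with a toy LRU cache of open file handles. This is a simplified sketch of the idea, not xarray's actual CachingFileManager implementation; the class and method names are made up for illustration:

```python
from collections import OrderedDict

class ToyFileCache:
    """Toy LRU cache of open file handles, loosely modelled on the
    idea behind xarray's file_cache_maxsize option."""

    def __init__(self, maxsize):
        self.maxsize = maxsize
        self._cache = OrderedDict()  # path -> open handle

    def open(self, path, opener):
        if path in self._cache:
            self._cache.move_to_end(path)  # mark as most recently used
            return self._cache[path]
        handle = opener(path)
        self._cache[path] = handle
        while len(self._cache) > self.maxsize:
            _, evicted = self._cache.popitem(last=False)  # evict least recent
            evicted.close()  # closing releases its buffers
        return handle
```

With maxsize=1, at most one handle (and its library-level buffers) stays alive at a time, which is consistent with the steady memory observed after xr.set_options(file_cache_maxsize=1).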

libnetcdf 4.8.1 nompi_h261ec11_106 conda-forge
netcdf4 1.6.0 nompi_py310h0a86a1f_103 conda-forge
xarray 2023.1.0 pyhd8ed1ab_0 conda-forge
dask 2023.1.0 pyhd8ed1ab_0 conda-forge
dask-core 2023.1.0 pyhd8ed1ab_0 conda-forge

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset memory leak, very simple case. v0.12 479190812
530800751 https://github.com/pydata/xarray/issues/3200#issuecomment-530800751 https://api.github.com/repos/pydata/xarray/issues/3200 MDEyOklzc3VlQ29tbWVudDUzMDgwMDc1MQ== floschl 1262767 2019-09-12T12:24:12Z 2019-09-12T12:36:02Z NONE

I have observed a similar memory leak (configuration below). It occurs with both engine=netcdf4 and engine=h5netcdf.

Example loading a 1.2GB netCDF file: the large allocation (2.6GB) is only released by a del ds on the object; a ds.close() has no effect. There is still a "minor" memory leak remaining (~4MB) whenever open_dataset is called. See the output from the memory_profiler package:

```python
Line #    Mem usage    Increment   Line Contents
================================================
    31    168.9 MiB    168.9 MiB   @profile
    32                             def load_and_unload_ds():
    33    173.0 MiB      4.2 MiB       ds = xr.open_dataset(LFS_DATA_DIR + '/dist2coast_1deg_merged.nc')
    34   2645.4 MiB   2472.4 MiB       ds.load()
    35   2645.4 MiB      0.0 MiB       ds.close()
    36    173.5 MiB      0.0 MiB       del ds
```

  • there is no difference when using open_dataset(file, engine='h5netcdf'); the minor memory leak is even larger (~9MB).
  • the memory leak persists if an additional chunks parameter is passed to open_dataset
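The observation that del ds releases the memory while ds.close() does not is consistent with plain CPython reference counting: close() releases the file handle but the loaded data stays referenced by the dataset object. A minimal illustration with stand-in classes (not xarray's real ones):

```python
import weakref

class FakePayload:
    """Stand-in for a large in-memory array."""
    pass

class FakeDataset:
    """Stand-in dataset: close() releases the file handle but the
    loaded data stays referenced; only dropping the dataset frees it."""
    def __init__(self):
        self.data = FakePayload()
    def close(self):
        pass  # releases the file handle, not self.data

ds = FakeDataset()
probe = weakref.ref(ds.data)  # watch when the payload is collected
ds.close()
assert probe() is not None    # close() alone keeps the data alive
del ds
assert probe() is None        # del drops the last reference, freeing it
```

(The final assert relies on CPython's immediate refcount-based collection.)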

Output of xr.show_versions()

```
INSTALLED VERSIONS
------------------
commit: None
python: 3.6.7 | packaged by conda-forge | (default, Jul 2 2019, 02:18:42) [GCC 7.3.0]
python-bits: 64
OS: Linux
OS-release: 4.15.0-62-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.5
libnetcdf: 4.6.2
xarray: 0.12.3
pandas: 0.25.1
numpy: 1.16.4
scipy: 1.2.1
netCDF4: 1.5.1.2
pydap: None
h5netcdf: 0.7.4
h5py: 2.9.0
Nio: None
zarr: None
cftime: 1.0.3.4
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2.3.0
distributed: 2.3.2
matplotlib: 3.1.1
cartopy: 0.17.0
seaborn: None
numbagg: None
setuptools: 41.0.1
pip: 19.2.3
conda: None
pytest: 5.0.1
IPython: 7.7.0
sphinx: None
```
{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset memory leak, very simple case. v0.12 479190812
520571376 https://github.com/pydata/xarray/issues/3200#issuecomment-520571376 https://api.github.com/repos/pydata/xarray/issues/3200 MDEyOklzc3VlQ29tbWVudDUyMDU3MTM3Ng== bsu-wrudisill 19933988 2019-08-12T19:56:09Z 2019-08-12T19:56:09Z NONE

Awesome, thanks @shoyer and @crusaderky for looking into this. I've tested it with the h5netcdf engine and the leak is mostly mitigated... for the simple case at least. Unfortunately, the actual model files that I'm working with do not appear to be compatible with h5py (I believe related to this issue: https://github.com/h5py/h5py/issues/719). But that's another problem entirely!

@crusaderky, I will hopefully get to trying your suggestions 3) and 4). As for your last point, I haven't tested explicitly, but yes, I believe that it does continue to grow linearly with more iterations.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset memory leak, very simple case. v0.12 479190812
520182257 https://github.com/pydata/xarray/issues/3200#issuecomment-520182257 https://api.github.com/repos/pydata/xarray/issues/3200 MDEyOklzc3VlQ29tbWVudDUyMDE4MjI1Nw== shoyer 1217238 2019-08-10T21:53:39Z 2019-08-10T21:53:39Z MEMBER

Also, if you're having memory issues, I would definitely recommend upgrading to a newer version of xarray. There was a recent fix that helps ensure that files get automatically closed when they are garbage collected, even if you don't call close() or use a context manager explicitly.
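The explicit-closing patterns alluded to here look like the sketch below. It uses a stub object with a close() method so the pattern runs without any netCDF files; xarray's open_dataset return value supports the same context-manager protocol:

```python
from contextlib import closing

class StubDataset:
    """Stand-in for an object with a close() method, e.g. an
    xarray.Dataset; lets the pattern run without real files."""
    def __init__(self):
        self.closed = False
    def close(self):
        self.closed = True
    def __enter__(self):
        return self
    def __exit__(self, *exc):
        self.close()

def open_stub(path):
    # hypothetical opener standing in for xr.open_dataset(path)
    return StubDataset()

# Preferred: a context manager guarantees close() even on error.
with open_stub('file.nc') as ds:
    pass
assert ds.closed

# Equivalent, using contextlib.closing for objects that only
# expose close() without the context-manager protocol:
with closing(open_stub('file.nc')) as ds2:
    pass
assert ds2.closed
```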

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset memory leak, very simple case. v0.12 479190812
520182139 https://github.com/pydata/xarray/issues/3200#issuecomment-520182139 https://api.github.com/repos/pydata/xarray/issues/3200 MDEyOklzc3VlQ29tbWVudDUyMDE4MjEzOQ== shoyer 1217238 2019-08-10T21:51:25Z 2019-08-10T21:52:24Z MEMBER

Thanks for the profiling script. I ran a few permutations of this:

  • xarray.open_mfdataset with engine='netcdf4' (default)
  • xarray.open_mfdataset with engine='h5netcdf'
  • xarray.open_dataset with engine='netcdf4' (default)
  • xarray.open_dataset with engine='h5netcdf'

Here are some plots:

  • xarray.open_mfdataset with engine='netcdf4': pretty noticeable memory leak, about 0.5 MB per open_mfdataset call.
  • xarray.open_mfdataset with engine='h5netcdf': looks like a small memory leak, about 0.1 MB per open_mfdataset call.
  • xarray.open_dataset with engine='netcdf4' (default): definitely has a memory leak.
  • xarray.open_dataset with engine='h5netcdf': does not appear to have a memory leak.

So in conclusion, it looks like there are memory leaks:

  1. when using netCDF4-Python (I was also able to confirm these without using xarray at all, just using netCDF4.Dataset)
  2. when using xarray.open_mfdataset

(1) looks like by far the bigger issue, which you can work around by switching to scipy or h5netcdf to read your files.

(2) is an issue for xarray. We do some caching, specifically with our backend file manager, but given that the issues only seem to appear when using open_mfdataset, I suspect it may have more to do with the interaction with Dask, though to be honest I'm not exactly sure how.

Note: I modified your script to set xarray's file cache size to 1, which helps smooth out the memory usage:

```python
import glob

import numpy as np
import xarray as xr
from memory_profiler import profile


def CreateTestFiles():
    # create a bunch of files
    xlen = int(1e2)
    ylen = int(1e2)
    xdim = np.arange(xlen)
    ydim = np.arange(ylen)

    nfiles = 100
    for i in range(nfiles):
        data = np.random.rand(xlen, ylen, 1)
        datafile = xr.DataArray(data, coords=[xdim, ydim, [i]], dims=['x', 'y', 'time'])
        datafile.to_netcdf('testfile_{}.nc'.format(i))


@profile
def ReadFiles():
    # for i in range(100):
    #     ds = xr.open_dataset('testfile_{}.nc'.format(i), engine='netcdf4')
    #     ds.close()
    ds = xr.open_mfdataset(glob.glob('testfile_*'), engine='h5netcdf', concat_dim='time')
    ds.close()


if __name__ == '__main__':
    # write out files for testing
    CreateTestFiles()

    xr.set_options(file_cache_maxsize=1)

    # loop thru file read step
    for i in range(100):
        ReadFiles()
```

{
    "total_count": 2,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 1,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset memory leak, very simple case. v0.12 479190812
520136799 https://github.com/pydata/xarray/issues/3200#issuecomment-520136799 https://api.github.com/repos/pydata/xarray/issues/3200 MDEyOklzc3VlQ29tbWVudDUyMDEzNjc5OQ== crusaderky 6213168 2019-08-10T10:10:11Z 2019-08-10T10:11:18Z MEMBER

Oh, but first and foremost: CPython's memory management is designed so that, when PyMem_Free() is invoked, CPython holds on to the memory rather than invoking the underlying free() call, hoping to reuse it on the next PyMem_Alloc(). An increase in RAM usage from 160 to 200MB could very well be explained by this. Try increasing the number of loops in your test 100-fold and see if you get a 100-fold increase in memory usage too (from 160MB to 1.2GB). If yes, it's a real leak; if it remains much more contained, it's normal CPython behaviour.
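That scaling test can be phrased as a small heuristic. The function name and the 50%-of-scale cut-off below are illustrative choices, not from this thread:

```python
def classify_growth(baseline_mb, after_n_mb, after_100n_mb, scale=100):
    """Classify memory growth from two runs: one with N loop
    iterations and one with N*scale. A genuine leak grows roughly
    linearly with the iteration count, while CPython's allocator
    holding freed blocks for reuse plateaus instead."""
    growth_n = after_n_mb - baseline_mb
    growth_100n = after_100n_mb - baseline_mb
    if growth_n <= 0:
        return 'no growth'
    # Near-linear scaling (ratio close to `scale`) points at a real
    # leak; the scale/2 threshold is an arbitrary illustrative choice.
    return 'real leak' if growth_100n / growth_n > scale / 2 else 'allocator reuse'

print(classify_growth(160, 200, 4160))  # growth scaled 100x -> 'real leak'
print(classify_growth(160, 200, 210))   # growth stayed flat -> 'allocator reuse'
```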

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset memory leak, very simple case. v0.12 479190812
520136482 https://github.com/pydata/xarray/issues/3200#issuecomment-520136482 https://api.github.com/repos/pydata/xarray/issues/3200 MDEyOklzc3VlQ29tbWVudDUyMDEzNjQ4Mg== crusaderky 6213168 2019-08-10T10:06:07Z 2019-08-10T10:06:07Z MEMBER

Hi,

xarray doesn't have any global objects that I know of that can cause the leak - I'm willing to bet on the underlying libraries.

  1. given your installed packages, open_mfdataset should be defaulting to netCDF4. Please try your measurement again after setting it explicitly: open_mfdataset(..., engine='netcdf4')
  2. See if the problem disappears if you pass engine='h5netcdf'
  3. Once you have confirmed the actual underlying library, try using it directly without xarray in your ReadFiles test: for every file returned by glob, open it with the netCDF4 package and load all the coords (not the data) into memory.
  4. Once netCDF4 is confirmed as the culprit, it would be great if you could rewrite the test (only the read part) in C against the netCDF C library, to figure out whether the leak is in the library itself or in the Python wrapper.
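Step 3 might look like the sketch below. The coordinate-variable rule (a 1-D variable named after its dimension) is the common netCDF convention; the netCDF4 calls are standard, but the helper names are made up for illustration:

```python
import glob

def coord_var_names(variables, dimensions):
    """Names of 1-D variables that share a name with a dimension
    (the classic netCDF convention for coordinate variables).
    `variables` maps name -> tuple of dimension names."""
    return [name for name, dims in variables.items()
            if len(dims) == 1 and dims[0] == name and name in dimensions]

def read_coords_only(path):
    """Open one file directly with the netCDF4 package, bypassing
    xarray, and load only the coordinate variables (not the data)."""
    import netCDF4  # requires the netCDF4 package
    with netCDF4.Dataset(path) as nc:
        dims = {name: len(d) for name, d in nc.dimensions.items()}
        varmap = {name: v.dimensions for name, v in nc.variables.items()}
        return {name: nc.variables[name][:]  # loads coords into memory
                for name in coord_var_names(varmap, dims)}

def leak_test(pattern='testfile_*'):
    # mirror the ReadFiles loop from the profiling script
    for path in glob.glob(pattern):
        read_coords_only(path)
```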
{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset memory leak, very simple case. v0.12 479190812

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);
Powered by Datasette · Queries took 429.954ms · About: xarray-datasette