
issue_comments


13 rows where issue = 326533369 sorted by updated_at descending


user 8

  • meridionaljet 4
  • shoyer 3
  • lumbric 1
  • rabernat 1
  • lkilcher 1
  • max-sixty 1
  • Karel-van-de-Plassche 1
  • hmkhatri 1

author_association 3

  • NONE 6
  • MEMBER 5
  • CONTRIBUTOR 2

issue 1

  • Memory leak while looping through a Dataset · 13
id html_url issue_url node_id user created_at updated_at author_association body reactions performed_via_github_app issue
1061602285 https://github.com/pydata/xarray/issues/2186#issuecomment-1061602285 https://api.github.com/repos/pydata/xarray/issues/2186 IC_kwDOAMm_X84_RsPt hmkhatri 17830036 2022-03-08T10:00:07Z 2022-03-08T10:00:07Z NONE

Hello,

I am facing the same memory leak issue. I am using mpirun and dask-mpi on a slurm batch submission (see below). I am running through a time loop to perform some computations. After a few iterations, the code blows up because of an out-of-memory issue. This does not happen if I execute the same code as a serial job.

```python
from dask_mpi import initialize
initialize()

from dask.distributed import Client
client = Client()

# main code goes here

ds = xr.open_mfdataset("*nc")

for i in range(0, len(ds.time)):
    ds1 = ds.isel(time=i)
    # perform some computations here
    ds1.close()

ds.close()
```

I have tried the following:
- explicit ds.close() calls on datasets
- gc.collect()
- client.cancel(vars)

None of the solutions worked for me. I have also tried increasing RAM, but that didn't help either. I was wondering if anyone has found a workaround for this problem. @lumbric @shoyer @lkilcher

I am using dask 2022.2.0, dask-mpi 2021.11.0, and xarray 0.21.1.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Memory leak while looping through a Dataset 326533369
1046665303 https://github.com/pydata/xarray/issues/2186#issuecomment-1046665303 https://api.github.com/repos/pydata/xarray/issues/2186 IC_kwDOAMm_X84-YthX lumbric 691772 2022-02-21T09:41:00Z 2022-02-21T09:41:00Z CONTRIBUTOR

I just stumbled across the same issue and created a minimal example similar to @lkilcher's. I am using xr.open_dataarray() with chunks and do some simple computation. After that, 800 MB of RAM is used, no matter whether I close the file explicitly, delete the xarray objects, or invoke the Python garbage collector.

What seems to work: do not use the threading Dask scheduler. The issue does not seem to occur with the single-threaded or processes scheduler. Also setting MALLOC_MMAP_MAX_=40960 seems to solve the issue as suggested above (disclaimer: I don't fully understand the details here).
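
(For reference, a minimal sketch of what avoiding the threaded scheduler can look like; the file name, chunking, and the mean() computation are placeholders, not taken from this thread.)

```python
# Sketch only: run the computation on the single-threaded or the processes
# scheduler instead of the default threaded one. "example.nc" is a placeholder.
import dask
import xarray as xr

da = xr.open_dataarray("example.nc", chunks={"time": 100})

# Option 1: force the synchronous (single-threaded) scheduler for one compute
result = da.mean().compute(scheduler="synchronous")

# Option 2: switch to the multiprocessing scheduler for a block of code
with dask.config.set(scheduler="processes"):
    result = da.mean().compute()

da.close()
```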

If I understand things correctly, this indicates that the issue is a consequence of dask/dask#3530. Not sure if there is anything to be fixed on the xarray side or what would be the best work around. I will try to use the processes scheduler.

I can create a new (xarray) ticket with all the details of the minimal example, if anyone thinks that might be helpful (to collect workarounds or discuss fixes on the xarray side).

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Memory leak while looping through a Dataset 326533369
1035611864 https://github.com/pydata/xarray/issues/2186#issuecomment-1035611864 https://api.github.com/repos/pydata/xarray/issues/2186 IC_kwDOAMm_X849ui7Y shoyer 1217238 2022-02-10T22:49:40Z 2022-02-10T22:50:01Z MEMBER

For what it's worth, the recommended way to do this is to explicitly close the Dataset with ds.close() rather than using del ds.

Or with a context manager, e.g.:

```python
for num in range(100):
    with xr.open_dataset('data.{}.nc'.format(num)) as ds:
        # do some stuff, but NOT assigning any data in ds to new variables
        ...
```

{
    "total_count": 2,
    "+1": 2,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Memory leak while looping through a Dataset 326533369
1035593183 https://github.com/pydata/xarray/issues/2186#issuecomment-1035593183 https://api.github.com/repos/pydata/xarray/issues/2186 IC_kwDOAMm_X849ueXf lkilcher 2273361 2022-02-10T22:24:37Z 2022-02-10T22:24:37Z NONE

Hey folks, I ran into a similar memory leak issue. In my case I had the following:

```python
for num in range(100):
    ds = xr.open_dataset('data.{}.nc'.format(num))  # This data was compressed with zlib, not sure if that matters

    # do some stuff, but NOT assigning any data in ds to new variables

    del ds
```

For some reason (maybe having to do with the # do some stuff), ds wasn't actually getting cleared. I was able to fix the problem by manually triggering garbage collection (import gc, and gc.collect() after the del ds statement). Perhaps this will help others who end up here...
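
(A minimal sketch of the workaround described above, with the gc.collect() call placed after del ds; the file pattern is the same placeholder used in the comment.)

```python
import gc

import xarray as xr

for num in range(100):
    ds = xr.open_dataset('data.{}.nc'.format(num))

    # do some stuff, but NOT assigning any data in ds to new variables

    del ds
    gc.collect()  # manually trigger garbage collection after dropping the reference
```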

{
    "total_count": 2,
    "+1": 2,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Memory leak while looping through a Dataset 326533369
454162269 https://github.com/pydata/xarray/issues/2186#issuecomment-454162269 https://api.github.com/repos/pydata/xarray/issues/2186 MDEyOklzc3VlQ29tbWVudDQ1NDE2MjI2OQ== max-sixty 5635139 2019-01-14T21:09:36Z 2019-01-14T21:09:36Z MEMBER

In an effort to reduce the issue backlog, I'll close this, but please reopen if you disagree

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Memory leak while looping through a Dataset 326533369
393996561 https://github.com/pydata/xarray/issues/2186#issuecomment-393996561 https://api.github.com/repos/pydata/xarray/issues/2186 MDEyOklzc3VlQ29tbWVudDM5Mzk5NjU2MQ== shoyer 1217238 2018-06-01T20:13:18Z 2018-06-01T20:13:18Z MEMBER

This might be the same issue as https://github.com/dask/dask/issues/3530

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Memory leak while looping through a Dataset 326533369
393846595 https://github.com/pydata/xarray/issues/2186#issuecomment-393846595 https://api.github.com/repos/pydata/xarray/issues/2186 MDEyOklzc3VlQ29tbWVudDM5Mzg0NjU5NQ== Karel-van-de-Plassche 6404167 2018-06-01T10:57:09Z 2018-06-01T10:57:09Z CONTRIBUTOR

@meridionaljet I might've run into the same issue, but I'm not 100% sure. In my case I'm looping over a Dataset containing variables from 3 different files, all of them with a .sel and some of them with a more complicated (dask) calculation (still, mostly sums and divisions). The leak seems to happen mostly for those with the calculation.

Can you see what happens when using the distributed client? Put client = dask.distributed.Client() in front of your code. This leads to many distributed.utils_perf - WARNING - full garbage collections took 40% CPU time recently (threshold: 10%) messages being shown for me, indeed pointing to something garbage-collecty.
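
(A minimal sketch of running under the distributed client as suggested; the file pattern and the computation are placeholders, and the dashboard link is just a convenient way to watch per-worker memory.)

```python
# Sketch only: create a local distributed client so its diagnostics (GC warnings,
# per-worker memory) are visible while the loop runs.
import dask.distributed
import xarray as xr

client = dask.distributed.Client()    # local cluster with default settings
print(client.dashboard_link)          # open this URL to watch worker memory

ds = xr.open_mfdataset("*.nc")        # placeholder file pattern
for i in range(len(ds.time)):
    ds.isel(time=i).mean().compute()  # placeholder computation
ds.close()
client.close()
```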

Also, for me the memory behaviour looks very different between the threaded and multi-process scheduler, although they both leak. (I'm not sure if leaking is the right term here). Maybe you can try memory_profiler?

I've tried without success:
- explicitly deleting ds[varname] and running gc.collect()
- explicitly clearing the dask cache with client.cancel and client.restart
- moving the leaky code into its own function (should not matter, but I seemed to remember that it sometimes helps garbage collection in edge cases)
- explicitly triggering computation with either dask persist or xarray load and then explicitly deleting the result

For my messy and very much work-in-progress code, look here: https://github.com/Karel-van-de-Plassche/QLKNN-develop/blob/master/qlknn/dataset/hypercube_to_pandas.py

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Memory leak while looping through a Dataset 326533369
392235000 https://github.com/pydata/xarray/issues/2186#issuecomment-392235000 https://api.github.com/repos/pydata/xarray/issues/2186 MDEyOklzc3VlQ29tbWVudDM5MjIzNTAwMA== meridionaljet 12929327 2018-05-26T04:11:18Z 2018-05-26T04:11:18Z NONE

Using autoclose=True doesn't seem to make a difference. My test only uses 4 files anyway.

Thanks for the explanation of open_dataset() - that makes sense.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Memory leak while looping through a Dataset 326533369
392234293 https://github.com/pydata/xarray/issues/2186#issuecomment-392234293 https://api.github.com/repos/pydata/xarray/issues/2186 MDEyOklzc3VlQ29tbWVudDM5MjIzNDI5Mw== shoyer 1217238 2018-05-26T03:58:14Z 2018-05-26T03:58:14Z MEMBER

I might try experimenting with setting autoclose=True in open_mfdataset(). It's a bit of a shot in the dark, but that might help here.

Memory growth with xr.open_dataset('data.nc', chunks=None) is expected, because by default we set cache=True when not using dask. This means that variables get cached in memory as NumPy arrays.
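
(A minimal sketch of the caching behaviour described above; 'data.nc' is a placeholder, and whether disabling the cache helps with the leak reported here is exactly the open question.)

```python
import xarray as xr

# Default (no dask, cache=True): variables are cached in memory as NumPy arrays
# once accessed, so memory growth across repeated accesses is expected.
ds_cached = xr.open_dataset('data.nc')

# cache=False avoids keeping loaded variables in memory between accesses.
ds_uncached = xr.open_dataset('data.nc', cache=False)
```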

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Memory leak while looping through a Dataset 326533369
392226004 https://github.com/pydata/xarray/issues/2186#issuecomment-392226004 https://api.github.com/repos/pydata/xarray/issues/2186 MDEyOklzc3VlQ29tbWVudDM5MjIyNjAwNA== meridionaljet 12929327 2018-05-26T01:35:36Z 2018-05-26T01:35:36Z NONE

I've discovered that setting the environment variable MALLOC_MMAP_MAX_ to a reasonably small value can partially mitigate this memory fragmentation.

Performing 4 iterations over dataset slices of shape ~(5424, 5424) without this tweak was yielding >800MB of memory usage (an increase of ~400MB over the first iteration).

Setting MALLOC_MMAP_MAX_=40960 yielded ~410 MB of memory usage (an increase of only ~130MB over the first iteration).

This level of fragmentation is still offensive, but this does suggest the problem may lie deeper within the entire unix, glibc, Python, xarray, dask ecosystem.
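
(A minimal sketch of applying the tweak: glibc reads MALLOC_MMAP_MAX_ when the allocator initializes at process startup, so it is set here in the environment of a child process rather than in the already-running interpreter. 'process_data.py' is a hypothetical script containing the xarray loop.)

```python
import os
import subprocess
import sys

# Launch the processing script with the allocator tweak in its environment.
env = dict(os.environ, MALLOC_MMAP_MAX_="40960")
subprocess.run([sys.executable, "process_data.py"], env=env, check=True)
```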

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Memory leak while looping through a Dataset 326533369
392217441 https://github.com/pydata/xarray/issues/2186#issuecomment-392217441 https://api.github.com/repos/pydata/xarray/issues/2186 MDEyOklzc3VlQ29tbWVudDM5MjIxNzQ0MQ== meridionaljet 12929327 2018-05-26T00:03:59Z 2018-05-26T00:03:59Z NONE

I'm now wondering if this issue is in dask land, based on this issue: https://github.com/dask/dask/issues/3247

It has been suggested in other places to get around the memory accumulation by running each loop iteration in a forked process:

```python
def worker(ds, k):
    print('accessing data')
    data = ds.datavar[k,:,:].values
    print('data acquired')

for k in range(ds.dims['t']):
    p = multiprocessing.Process(target=worker, args=(ds, k))
    p.start()
    p.join()
```

But apparently one can't access dask-wrapped xarray datasets in subprocesses without a deadlock. I don't know enough about the internals to understand why.
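
(One common way around that deadlock, sketched here as an assumption rather than something tested in this thread, is to open the file inside the child process instead of passing the dask-backed dataset across the process boundary; 'data.nc' and 'datavar' are placeholders.)

```python
import multiprocessing

def worker(path, k):
    # Import and open inside the child so no dask-backed objects cross processes.
    import xarray as xr
    with xr.open_dataset(path) as ds:
        data = ds['datavar'][k, :, :].values
        print('iteration', k, 'loaded array of shape', data.shape)

if __name__ == '__main__':
    ctx = multiprocessing.get_context('spawn')
    for k in range(10):                       # placeholder number of time steps
        p = ctx.Process(target=worker, args=('data.nc', k))
        p.start()
        p.join()                              # memory is released when the child exits
```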

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Memory leak while looping through a Dataset 326533369
392110253 https://github.com/pydata/xarray/issues/2186#issuecomment-392110253 https://api.github.com/repos/pydata/xarray/issues/2186 MDEyOklzc3VlQ29tbWVudDM5MjExMDI1Mw== meridionaljet 12929327 2018-05-25T16:23:55Z 2018-05-25T16:24:33Z NONE

Yes, I understand the garbage collection. The problem I'm struggling with here is that normally when working with arrays, maintaining only one reference to an array and replacing the data that reference points to within a loop does not result in memory accumulation because GC is triggered on the prior, now dereferenced array from the previous iteration.

Here, it seems that under the hood, references to arrays have been created other than my "data" variable that are not being dereferenced when I reassign to "data," so stuff is accumulating in memory.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Memory leak while looping through a Dataset 326533369
392108417 https://github.com/pydata/xarray/issues/2186#issuecomment-392108417 https://api.github.com/repos/pydata/xarray/issues/2186 MDEyOklzc3VlQ29tbWVudDM5MjEwODQxNw== rabernat 1197350 2018-05-25T16:17:15Z 2018-05-25T16:17:15Z MEMBER

The memory management here is handled by python, not xarray. Python decides when to perform garbage collection. I know that doesn't help solve your problem...

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Memory leak while looping through a Dataset 326533369


CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);