issue_comments

4 rows where user = 12929327 sorted by updated_at descending

id: 392235000 · user: meridionaljet (12929327) · created_at: 2018-05-26T04:11:18Z · updated_at: 2018-05-26T04:11:18Z · author_association: NONE
html_url: https://github.com/pydata/xarray/issues/2186#issuecomment-392235000
issue_url: https://api.github.com/repos/pydata/xarray/issues/2186
node_id: MDEyOklzc3VlQ29tbWVudDM5MjIzNTAwMA==

Using autoclose=True doesn't seem to make a difference. My test only uses 4 files anyway.

Thanks for the explanation of open_dataset() - that makes sense.
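
For context, a minimal sketch of how this flag was passed at the time ('files_*.nc' is a placeholder pattern; autoclose was later deprecated once xarray began managing file handles automatically):

```python
import xarray as xr

# autoclose=True asks xarray to close each underlying file after every read
# instead of keeping all file handles open for the dataset's lifetime.
ds = xr.open_mfdataset('files_*.nc', autoclose=True)
```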

reactions: none (all counts 0) · issue: Memory leak while looping through a Dataset (326533369)

id: 392226004 · user: meridionaljet (12929327) · created_at: 2018-05-26T01:35:36Z · updated_at: 2018-05-26T01:35:36Z · author_association: NONE
html_url: https://github.com/pydata/xarray/issues/2186#issuecomment-392226004
issue_url: https://api.github.com/repos/pydata/xarray/issues/2186
node_id: MDEyOklzc3VlQ29tbWVudDM5MjIyNjAwNA==

I've discovered that setting the environment variable MALLOC_MMAP_MAX_ to a reasonably small value can partially mitigate this memory fragmentation.

Performing 4 iterations over dataset slices of shape ~(5424, 5424) without this tweak was yielding >800MB of memory usage (an increase of ~400MB over the first iteration).

Setting MALLOC_MMAP_MAX_=40960 yielded ~410 MB of memory usage (an increase of only ~130MB over the first iteration).

This level of fragmentation is still substantial, but it does suggest the problem may lie deeper in the stack, somewhere within the Unix/glibc/Python/xarray/dask ecosystem.
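
A minimal sketch of applying this tweak (assuming Linux with glibc; 'loop_test.py' is a placeholder script name). glibc reads its malloc tunables at process startup, so the variable has to be set in the environment of a fresh process:

```python
import os
import subprocess

# MALLOC_MMAP_MAX_ caps how many allocations glibc will serve via mmap(2).
# glibc reads it at startup, so launch a child process with the variable set
# rather than changing it inside the already-running interpreter.
env = dict(os.environ, MALLOC_MMAP_MAX_="40960")
subprocess.run(["python", "loop_test.py"], env=env, check=True)
```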

reactions: none (all counts 0) · issue: Memory leak while looping through a Dataset (326533369)

id: 392217441 · user: meridionaljet (12929327) · created_at: 2018-05-26T00:03:59Z · updated_at: 2018-05-26T00:03:59Z · author_association: NONE
html_url: https://github.com/pydata/xarray/issues/2186#issuecomment-392217441
issue_url: https://api.github.com/repos/pydata/xarray/issues/2186
node_id: MDEyOklzc3VlQ29tbWVudDM5MjIxNzQ0MQ==

I'm now wondering if this issue is in dask land, based on this dask report: https://github.com/dask/dask/issues/3247

It has been suggested elsewhere that the memory accumulation can be worked around by running each loop iteration in a forked process:

```python
import multiprocessing

def worker(ds, k):
    print('accessing data')
    data = ds.datavar[k, :, :].values
    print('data acquired')

for k in range(ds.dims['t']):
    p = multiprocessing.Process(target=worker, args=(ds, k))
    p.start()
    p.join()
```

But apparently one can't access dask-wrapped xarray datasets in subprocesses without a deadlock. I don't know enough about the internals to understand why.
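
One commonly suggested variant of this workaround (a sketch, not from the original comment; 'file.nc' and the variable name datavar are placeholders) is to pass the file path rather than the open dataset, so each subprocess opens its own handle and no dask or HDF5 state is shared across the fork:

```python
import multiprocessing

import xarray as xr

def worker(path, k):
    # Open the dataset inside the subprocess so no file handles or dask
    # graphs are inherited from the parent across the fork boundary.
    with xr.open_dataset(path) as ds:
        data = ds.datavar[k, :, :].values
        print('data acquired', data.shape)

if __name__ == '__main__':
    path = 'file.nc'  # placeholder input file
    with xr.open_dataset(path) as ds:
        nt = ds.dims['t']  # number of time steps
    for k in range(nt):
        p = multiprocessing.Process(target=worker, args=(path, k))
        p.start()
        p.join()
```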

reactions: none (all counts 0) · issue: Memory leak while looping through a Dataset (326533369)

id: 392110253 · user: meridionaljet (12929327) · created_at: 2018-05-25T16:23:55Z · updated_at: 2018-05-25T16:24:33Z · author_association: NONE
html_url: https://github.com/pydata/xarray/issues/2186#issuecomment-392110253
issue_url: https://api.github.com/repos/pydata/xarray/issues/2186
node_id: MDEyOklzc3VlQ29tbWVudDM5MjExMDI1Mw==

Yes, I understand the garbage collection. The problem I'm struggling with is that, normally, keeping a single reference to an array and rebinding it inside a loop does not accumulate memory: each reassignment drops the last reference to the previous iteration's array, which is then collected.

Here it seems that, under the hood, references to the arrays exist beyond my "data" variable, and they are not released when I rebind "data", so memory keeps accumulating.
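
A minimal illustration of the normal pattern being described here (plain NumPy; the shape is taken from the figures earlier in the thread):

```python
import numpy as np

# Rebinding `data` drops the only reference to the previous array, so
# CPython frees it immediately and steady-state memory stays at roughly
# one array's worth, no matter how many iterations run.
data = None
for k in range(4):
    data = np.zeros((5424, 5424))
```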

reactions: none (all counts 0) · issue: Memory leak while looping through a Dataset (326533369)

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);