
issue_comments


4 rows where user = 6404167 sorted by updated_at descending


issue (4 values)

  • Should we be testing against multiple dask schedulers? (1)
  • Memory leak while looping through a Dataset (1)
  • Parallel non-locked read using dask.Client crashes (1)
  • DataArray.encoding['chunksizes'] not respected in to_netcdf (1)

user (1 value)

  • Karel-van-de-Plassche (4)

author_association (1 value)

  • CONTRIBUTOR (4)
id: 393846595
html_url: https://github.com/pydata/xarray/issues/2186#issuecomment-393846595
issue_url: https://api.github.com/repos/pydata/xarray/issues/2186
node_id: MDEyOklzc3VlQ29tbWVudDM5Mzg0NjU5NQ==
user: Karel-van-de-Plassche (6404167)
created_at: 2018-06-01T10:57:09Z
updated_at: 2018-06-01T10:57:09Z
author_association: CONTRIBUTOR
body:

@meridionaljet I might've run into the same issue, but I'm not 100% sure. In my case I'm looping over a Dataset containing variables from 3 different files, all of them with a .sel and some of them with a more complicated (dask) calculation (still mostly sums and divisions). The leak seems to happen mostly for the variables with the calculation.

Can you see what happens when using the distributed client? Put client = dask.distributed.Client() in front of your code. For me this leads to many distributed.utils_perf - WARNING - full garbage collections took 40% CPU time recently (threshold: 10%) messages being shown, indeed pointing to something garbage-collection-related.
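A minimal sketch of that setup, assuming a hypothetical file name and a simple per-variable computation:

import dask.distributed
import xarray as xr

# Start a local distributed client before the rest of the code; with it,
# the scheduler prints the garbage-collection warnings quoted above.
client = dask.distributed.Client()

ds = xr.open_dataset("data.nc", chunks={"time": 1000})  # hypothetical file
for name in ds.data_vars:
    result = ds[name].sel(time=0).sum().compute()  # hypothetical per-variable work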

Also, for me the memory behaviour looks very different between the threaded and the multi-process scheduler, although both leak (I'm not sure leaking is the right term here). Maybe you can try memory_profiler?
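A hedged sketch of how memory_profiler is typically used; the function and script names here are hypothetical:

from memory_profiler import profile

@profile  # prints a line-by-line memory report when the function runs
def leaky_loop(ds):
    for name in list(ds.data_vars):
        _ = ds[name].sum().compute()

# Or sample memory usage over time from the shell:
#   mprof run script.py
#   mprof plot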

I've tried the following without success (a combined sketch follows the list):

  • explicitly deleting ds[varname] and running gc.collect()
  • explicitly clearing the dask cache with client.cancel and client.restart
  • moving the leaky code into its own function (this should not matter, but I seem to remember it sometimes helps garbage collection in edge cases)
  • explicitly triggering computation with either dask persist or xarray load, then explicitly deleting the result
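A hedged sketch of those attempts combined, with hypothetical file and variable names:

import gc
import dask.distributed
import xarray as xr

client = dask.distributed.Client()
ds = xr.open_dataset("data.nc", chunks={"time": 1000})  # hypothetical file

for varname in list(ds.data_vars):
    result = ds[varname].sum().load()  # explicitly trigger the computation
    del result                         # explicitly delete the result
    del ds[varname]                    # explicitly drop the variable
    gc.collect()                       # force a garbage-collection pass

client.restart()  # restart workers, dropping any data they still hold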

For my messy and very much work-in-progress code, look here: https://github.com/Karel-van-de-Plassche/QLKNN-develop/blob/master/qlknn/dataset/hypercube_to_pandas.py

reactions: none
issue: Memory leak while looping through a Dataset (326533369)
id: 393066421
html_url: https://github.com/pydata/xarray/issues/2198#issuecomment-393066421
issue_url: https://api.github.com/repos/pydata/xarray/issues/2198
node_id: MDEyOklzc3VlQ29tbWVudDM5MzA2NjQyMQ==
user: Karel-van-de-Plassche (6404167)
created_at: 2018-05-30T07:56:52Z
updated_at: 2018-05-30T08:04:34Z
author_association: CONTRIBUTOR
body:

Might be related to:

  • https://github.com/pydata/xarray/issues/1225#issuecomment-307519054
  • https://github.com/pydata/xarray/issues/628

reactions: none
issue: DataArray.encoding['chunksizes'] not respected in to_netcdf (327613219)
id: 392666250
html_url: https://github.com/pydata/xarray/issues/2190#issuecomment-392666250
issue_url: https://api.github.com/repos/pydata/xarray/issues/2190
node_id: MDEyOklzc3VlQ29tbWVudDM5MjY2NjI1MA==
user: Karel-van-de-Plassche (6404167)
created_at: 2018-05-29T06:27:52Z
updated_at: 2018-05-29T06:35:02Z
author_association: CONTRIBUTOR
body:

@shoyer Thanks for your answer. Too bad. Maybe this could be documented in the 'dask' chapter? Or maybe even raise a warning when open_dataset is used with lock=False on a netCDF4 file?
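For reference, a hedged sketch of the pattern in question; lock was an open_dataset keyword in the xarray of this era, and the file name is hypothetical:

from dask.distributed import Client
import xarray as xr

client = Client()
# lock=False disables the per-file read lock; this is the combination
# reported to crash on netCDF4 files.
ds = xr.open_dataset("data.nc", chunks={"time": 1000}, lock=False)
ds.mean().compute()  # parallel, non-locked reads happen here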

Unfortunately there seems to be some conflicting information floating around, which is hard to spot for a non-expert like me. It might of course just be that xarray doesn't support it (yet). I think MPI-style opening is a whole different beast, right? For example:

  • python-netcdf4 supports parallel reads in threads: https://github.com/Unidata/netcdf4-python/issues/536
  • python-netcdf4 MPI parallel write/read: https://github.com/Unidata/netcdf4-python/blob/master/examples/mpi_example.py http://unidata.github.io/netcdf4-python/#section13
  • Using h5py directly (not supported by xarray I think): http://docs.h5py.org/en/latest/mpi.html
  • Seems to suggest multiple reads are fine: https://github.com/dask/dask/issues/3074#issuecomment-359030028

You might have better luck using dask.distributed with multiple processes, but then you'll run into other bottlenecks from data transfer.

I'll do some more experiments, thanks for the suggestion. I'm not bound to netCDF4 (although I need the compression, so unfortunately no netCDF3), so would moving to Zarr help improve IO performance? I'd really like to keep using xarray, thanks for this awesome library! Even with the disk IO performance hit, it's still more than worth it.
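A hedged sketch of the Zarr route (the paths are hypothetical); Zarr compresses chunks by default, so the compression requirement would still be met:

import xarray as xr

ds = xr.open_dataset("data.nc")   # hypothetical netCDF4 source
ds.to_zarr("data.zarr")           # chunks are compressed (Blosc) by default
ds2 = xr.open_zarr("data.zarr")   # dask-backed reads, friendlier to parallel IO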

reactions: none
issue: Parallel non-locked read using dask.Client crashes (327064908)
id: 392572591
html_url: https://github.com/pydata/xarray/issues/1971#issuecomment-392572591
issue_url: https://api.github.com/repos/pydata/xarray/issues/1971
node_id: MDEyOklzc3VlQ29tbWVudDM5MjU3MjU5MQ==
user: Karel-van-de-Plassche (6404167)
created_at: 2018-05-28T17:12:51Z
updated_at: 2018-05-28T17:13:56Z
author_association: CONTRIBUTOR
body:

It seems the distributed scheduler is the advised one to use in general, so maybe some tests could be added for it (see the sketch after the quote below). Especially for disk IO, it would be interesting to see the difference.

http://dask.pydata.org/en/latest/setup.html

Note that the newer dask.distributed scheduler is often preferable even on single workstations. It contains many diagnostics and features not found in the older single-machine scheduler. The following pages explain in more detail how to set up Dask on a variety of local and distributed hardware.
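A hedged sketch of what such tests could look like, parametrizing over dask's single-machine scheduler names with pytest; the fixture and test names are hypothetical, not xarray's actual suite:

import dask
import numpy as np
import pytest
import xarray as xr

@pytest.fixture(params=["synchronous", "threads", "processes"])
def scheduler(request):
    # Run the test body under each scheduler in turn.
    with dask.config.set(scheduler=request.param):
        yield request.param

def test_chunked_mean_matches_numpy(scheduler):
    data = np.arange(12.0).reshape(3, 4)
    da = xr.DataArray(data).chunk({"dim_0": 1})
    assert float(da.mean().compute()) == data.mean()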

reactions: none
issue: Should we be testing against multiple dask schedulers? (302930480)


CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);
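
To reproduce this page's query ("4 rows where user = 6404167 sorted by updated_at descending") directly against the underlying SQLite database, a hedged sketch; the database filename is an assumption:

import sqlite3

conn = sqlite3.connect("github.db")  # hypothetical path to the Datasette database
rows = conn.execute(
    """
    SELECT id, issue, created_at, updated_at
    FROM issue_comments
    WHERE user = 6404167
    ORDER BY updated_at DESC
    """
).fetchall()
for row in rows:
    print(row)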