github: issue_comments: 23 rows where author_association = "MEMBER", issue = 142498006 and user = 306380 sorted by updated

23 rows where author_association = "MEMBER", issue = 142498006 and user = 306380 sorted by updated_at descending

id	html_url	issue_url	node_id	user	created_at	updated_at ▲	author_association	body	reactions	issue
305506896	https://github.com/pydata/xarray/issues/798#issuecomment-305506896	https://api.github.com/repos/pydata/xarray/issues/798	MDEyOklzc3VlQ29tbWVudDMwNTUwNjg5Ng==	mrocklin 306380	2017-06-01T14:17:11Z	2017-06-01T14:17:11Z	MEMBER	@shoyer regarding per-file locking this probably only matters if we are writing as well, yes? Here is a small implementation of a generic file-open cache. I haven't yet decided on an eviction policy but either LRU or random (filtered by closeable files) should work OK. ```python from collections import defaultdict from contextlib import contextmanager import threading import h5py class OpenCache(object): def __init__(self, maxsize=100): self.refcount = defaultdict(lambda: 0) self.maxsize = maxsize self.cache = {} self.i = 0 self.lock = threading.Lock() @contextmanager def open(self, myopen, fn, mode='r'): assert 'r' in mode key = (myopen, fn, mode) with self.lock: try: file = self.cache[key] except KeyError: file = myopen(fn, mode=mode) self.cache[key] = file self.refcount[key] += 1 if len(self.cache) > self.maxsize: pass # Clear old files intelligently try: yield file finally: with self.lock: self.refcount[key] -= 1 cache = OpenCache() with cache.open(h5py.File, 'myfile.hdf5') as f: x = f['/data/x'] y = x[:1000, :1000] ``` Is this still useful? I'm curious to hear from users like @pwolfram and @rabernat who may be running into the many file problem about what the current pain points are.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Integration with dask/distributed (xarray backend design) 142498006
288415152	https://github.com/pydata/xarray/issues/798#issuecomment-288415152	https://api.github.com/repos/pydata/xarray/issues/798	MDEyOklzc3VlQ29tbWVudDI4ODQxNTE1Mg==	mrocklin 306380	2017-03-22T14:26:08Z	2017-03-22T14:26:08Z	MEMBER	Has anyone used XArray on NetCDF data on a cluster without resorting to any tricks?	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Integration with dask/distributed (xarray backend design) 142498006
263470325	https://github.com/pydata/xarray/issues/798#issuecomment-263470325	https://api.github.com/repos/pydata/xarray/issues/798	MDEyOklzc3VlQ29tbWVudDI2MzQ3MDMyNQ==	mrocklin 306380	2016-11-29T04:02:05Z	2016-11-29T04:02:05Z	MEMBER	A lock on the LRU cache makes sense to me. We need separate, per file locks, to ensure that we don't evict files in the process of reading or writing data from them (which would cause segfaults). As a stop-gap measure, we could simply refuse to evict files until we can acquire a lock, but more broadly this suggests that strict LRU is not quite right. Instead, we want to evict the least-recently-used unlocked item. If it were me I would just block on the evicted file until it becomes available (the stop-gap measure) until it became a performance problem.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Integration with dask/distributed (xarray backend design) 142498006
262236329	https://github.com/pydata/xarray/issues/798#issuecomment-262236329	https://api.github.com/repos/pydata/xarray/issues/798	MDEyOklzc3VlQ29tbWVudDI2MjIzNjMyOQ==	mrocklin 306380	2016-11-22T13:10:03Z	2016-11-22T13:11:48Z	MEMBER	One solution is to create protocols on the Dask side to enable `dask.distributed.Client.persist` itself to work on XArray objects. This keeps the scheduler specific details like persist on the scheduler.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Integration with dask/distributed (xarray backend design) 142498006
259277405	https://github.com/pydata/xarray/issues/798#issuecomment-259277405	https://api.github.com/repos/pydata/xarray/issues/798	MDEyOklzc3VlQ29tbWVudDI1OTI3NzQwNQ==	mrocklin 306380	2016-11-08T22:18:42Z	2016-11-08T22:18:42Z	MEMBER	Yes. On Tue, Nov 8, 2016 at 5:17 PM, Florian Rathgeber notifications@github.com wrote: Great to see this moving! I take it the workshop was productive? How does #1095 https://github.com/pydata/xarray/pull/1095 work in the scenario of a distributed scheduler with remote workers? Do I understand correctly that all workers and the client would need to see the same shared filesystem from where NetCDF files are read?	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Integration with dask/distributed (xarray backend design) 142498006
259181856	https://github.com/pydata/xarray/issues/798#issuecomment-259181856	https://api.github.com/repos/pydata/xarray/issues/798	MDEyOklzc3VlQ29tbWVudDI1OTE4MTg1Ng==	mrocklin 306380	2016-11-08T16:17:20Z	2016-11-08T16:17:20Z	MEMBER	FYI Dask is committed to maintaining this: https://github.com/dask/zict/blob/master/zict/lru.py	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Integration with dask/distributed (xarray backend design) 142498006
257918615	https://github.com/pydata/xarray/issues/798#issuecomment-257918615	https://api.github.com/repos/pydata/xarray/issues/798	MDEyOklzc3VlQ29tbWVudDI1NzkxODYxNQ==	mrocklin 306380	2016-11-02T16:27:45Z	2016-11-02T16:27:45Z	MEMBER	Custom serialization is in dask/distributed. This allows for us to build custom serialization solutions like the following for `h5py.Dataset`: https://github.com/dask/distributed/pull/620/files Any concerns would be very welcome. Earlier is better.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Integration with dask/distributed (xarray backend design) 142498006
257292279	https://github.com/pydata/xarray/issues/798#issuecomment-257292279	https://api.github.com/repos/pydata/xarray/issues/798	MDEyOklzc3VlQ29tbWVudDI1NzI5MjI3OQ==	mrocklin 306380	2016-10-31T13:24:01Z	2016-10-31T14:49:31Z	MEMBER	I may have a solution to this in https://github.com/dask/distributed/pull/606, which allows for custom serialization formats to be registered with dask.distributed. We would register serialize and deserialize functions for the various netCDF objects. Something like the following might work for h5py: ``` python def serialize_dataset(dset): header = {} frames = [dset.filename.encode(), dset.datapath.encode()] return header, frames def deserialize_dataset(header, frames): filename, datapath = frames f = h5py.File(filename.decode()) dset = f[datapath.decode()] return dset register_serialization(h5py.Dataset, serialize_dataset, deserialize_dataset) ``` We still have lingering open files but not too many per machine. They'll move around the network, but only as necessary.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Integration with dask/distributed (xarray backend design) 142498006
257063168	https://github.com/pydata/xarray/issues/798#issuecomment-257063168	https://api.github.com/repos/pydata/xarray/issues/798	MDEyOklzc3VlQ29tbWVudDI1NzA2MzE2OA==	mrocklin 306380	2016-10-29T01:37:11Z	2016-10-29T01:37:11Z	MEMBER	We could pull data from OpenDAP. Actually computing on those workers would probably be hard to integrate. Distributed Dask.array could possibly replace OpenDAP in some settings though, serving not only data, but also computation.	{ "total_count": 1, "+1": 1, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Integration with dask/distributed (xarray backend design) 142498006
256121613	https://github.com/pydata/xarray/issues/798#issuecomment-256121613	https://api.github.com/repos/pydata/xarray/issues/798	MDEyOklzc3VlQ29tbWVudDI1NjEyMTYxMw==	mrocklin 306380	2016-10-25T18:20:58Z	2016-10-25T18:20:58Z	MEMBER	You wouldn't. On Tue, Oct 25, 2016 at 9:43 AM, Florian Rathgeber <notifications@github.com> wrote: For the case where NetCDF / HDF5 files are only available on the distributed workers and not directly accessible from the client, how would you get the necessary metadata (coords, dims etc.) to construct the xarray.Dataset?	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Integration with dask/distributed (xarray backend design) 142498006
255800363	https://github.com/pydata/xarray/issues/798#issuecomment-255800363	https://api.github.com/repos/pydata/xarray/issues/798	MDEyOklzc3VlQ29tbWVudDI1NTgwMDM2Mw==	mrocklin 306380	2016-10-24T17:00:58Z	2016-10-24T17:00:58Z	MEMBER	One alternative would be to define custom serialization for `netCDF4.Dataset` objects. I've been toying with the idea of custom serialization for dask.distributed recently. This was originally intended to let Dask make some opinionated serialization choices for some common formats (usually so that we can serialize numpy arrays and pandas dataframes faster than their generic pickle implementations allow) but this might also be helpful here to allow us to serialize netCDF4.Dataset objects and friends. We would define custom dumps and loads functions for netCDF4.Dataset objects that would presumably encode them as a filename and datapath. This would get around the open-many-files issue because the dataset would stay in the worker's `.data` dictionary while it was needed. One concern is that there are reasons why netCDF4.Dataset objects are not serializable (see https://github.com/h5py/h5py/issues/531). I'm not sure if this would affect XArray workloads.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Integration with dask/distributed (xarray backend design) 142498006
255796531	https://github.com/pydata/xarray/issues/798#issuecomment-255796531	https://api.github.com/repos/pydata/xarray/issues/798	MDEyOklzc3VlQ29tbWVudDI1NTc5NjUzMQ==	mrocklin 306380	2016-10-24T16:46:39Z	2016-10-24T16:46:39Z	MEMBER	We seem to be making good progress here on the issue. I'm also happy to switch to real-time voice at any point today or tomorrow if people prefer.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Integration with dask/distributed (xarray backend design) 142498006
255795874	https://github.com/pydata/xarray/issues/798#issuecomment-255795874	https://api.github.com/repos/pydata/xarray/issues/798	MDEyOklzc3VlQ29tbWVudDI1NTc5NTg3NA==	mrocklin 306380	2016-10-24T16:44:10Z	2016-10-24T16:44:10Z	MEMBER	We could possibly make an object that was API compatible with the subset of netCDF4.Dataset that you needed, but opened and closed the file whenever it actually pulled data. We would keep an LRU cache of open files around for efficiency as discussed earlier. In this case we could possibly optionally swap out the current netCDF4.Dataset object with this thing without much refactoring?	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Integration with dask/distributed (xarray backend design) 142498006
255788269	https://github.com/pydata/xarray/issues/798#issuecomment-255788269	https://api.github.com/repos/pydata/xarray/issues/798	MDEyOklzc3VlQ29tbWVudDI1NTc4ODI2OQ==	mrocklin 306380	2016-10-24T16:15:59Z	2016-10-24T16:15:59Z	MEMBER	The `futures_to_dask_arrays` function has been deprecated at this point. The standard way to produce a distributed dask.array from custom functions is as follows: - Use dask.delayed to construct many lazy numpy arrays individually - Wrap each of these into a single-chunk dask.array using `da.from_delayed(lazy_value, shape=..., dtype=...)` - Use `da.stack` or `da.concatenate` to arrange these single-chunk dask.arrays into a larger dask.array. The same approach could be used with XArray except that presumably we would need to do this for every relevant dataset within the NetCDF file.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Integration with dask/distributed (xarray backend design) 142498006
255201276	https://github.com/pydata/xarray/issues/798#issuecomment-255201276	https://api.github.com/repos/pydata/xarray/issues/798	MDEyOklzc3VlQ29tbWVudDI1NTIwMTI3Ng==	mrocklin 306380	2016-10-20T19:16:59Z	2016-10-20T19:16:59Z	MEMBER	I agree that we should discuss it at the workshop. I also think it's possible that this could be accomplished by the right person (or combination of people) in a few hours. If so I think that we should come with it in hand as a capability that exists rather than a capability that should exist.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Integration with dask/distributed (xarray backend design) 142498006
255194191	https://github.com/pydata/xarray/issues/798#issuecomment-255194191	https://api.github.com/repos/pydata/xarray/issues/798	MDEyOklzc3VlQ29tbWVudDI1NTE5NDE5MQ==	mrocklin 306380	2016-10-20T18:48:53Z	2016-10-20T18:48:53Z	MEMBER	I agree that this conversation needs expertise from a core xarray developer. I suspect that this change is more likely to happen in xarray than in dask.array. Happy to continue the conversation wherever. I do have a slight preference to switch to real-time at some point though. I suspect that we can hash this out in a moderate number of minutes.	{ "total_count": 1, "+1": 1, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Integration with dask/distributed (xarray backend design) 142498006
255190556	https://github.com/pydata/xarray/issues/798#issuecomment-255190556	https://api.github.com/repos/pydata/xarray/issues/798	MDEyOklzc3VlQ29tbWVudDI1NTE5MDU1Ng==	mrocklin 306380	2016-10-20T18:35:27Z	2016-10-20T18:35:27Z	MEMBER	If XArray devs want to chat sometime I suspect we could hammer out a plan fairly quickly. My hope is that once a plan exists then a developer will arise to implement that plan. I'm free all of today and tomorrow.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Integration with dask/distributed (xarray backend design) 142498006
255190289	https://github.com/pydata/xarray/issues/798#issuecomment-255190289	https://api.github.com/repos/pydata/xarray/issues/798	MDEyOklzc3VlQ29tbWVudDI1NTE5MDI4OQ==	mrocklin 306380	2016-10-20T18:34:35Z	2016-10-20T18:34:35Z	MEMBER	Definitely happy to support from the Dask side. I think that the LRU method described above is feasible.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Integration with dask/distributed (xarray backend design) 142498006
255187606	https://github.com/pydata/xarray/issues/798#issuecomment-255187606	https://api.github.com/repos/pydata/xarray/issues/798	MDEyOklzc3VlQ29tbWVudDI1NTE4NzYwNg==	mrocklin 306380	2016-10-20T18:24:10Z	2016-10-20T18:24:10Z	MEMBER	I haven't worked on this but agree that it is important.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Integration with dask/distributed (xarray backend design) 142498006
209005106	https://github.com/pydata/xarray/issues/798#issuecomment-209005106	https://api.github.com/repos/pydata/xarray/issues/798	MDEyOklzc3VlQ29tbWVudDIwOTAwNTEwNg==	mrocklin 306380	2016-04-12T16:55:02Z	2016-04-12T16:55:02Z	MEMBER	It's probably best to avoid futures within `xarray`, so far they're only in the distributed memory scheduler. I think that ideally we create graphs that can be used robustly in either. I think that the memoized `netCDF4_Dataset` approach can probably do this just fine. Is there anything that is needed from me to help push this forward?	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Integration with dask/distributed (xarray backend design) 142498006
204813232	https://github.com/pydata/xarray/issues/798#issuecomment-204813232	https://api.github.com/repos/pydata/xarray/issues/798	MDEyOklzc3VlQ29tbWVudDIwNDgxMzIzMg==	mrocklin 306380	2016-04-02T22:29:04Z	2016-04-02T22:29:04Z	MEMBER	FWIW I've uploaded a tiny LRU dict implementation to a new `zict` project (which also has some other stuff): http://zict.readthedocs.org/en/latest/ `pip install zict` `python from zict import LRU d = LRU(100, dict())` There are a number of good alternatives out there though for LRU dictionaries.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Integration with dask/distributed (xarray backend design) 142498006
200901275	https://github.com/pydata/xarray/issues/798#issuecomment-200901275	https://api.github.com/repos/pydata/xarray/issues/798	MDEyOklzc3VlQ29tbWVudDIwMDkwMTI3NQ==	mrocklin 306380	2016-03-24T16:00:52Z	2016-03-24T16:00:52Z	MEMBER	I believe that robustly supporting HDF/NetCDF reads with the mechanism mentioned above will resolve most problems from a dask.array perspective. I have no doubt that other things will arise though. Switching from shared to distributed memory always comes with (surmountable) obstacles.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Integration with dask/distributed (xarray backend design) 142498006
199545836	https://github.com/pydata/xarray/issues/798#issuecomment-199545836	https://api.github.com/repos/pydata/xarray/issues/798	MDEyOklzc3VlQ29tbWVudDE5OTU0NTgzNg==	mrocklin 306380	2016-03-21T23:59:18Z	2016-03-21T23:59:18Z	MEMBER	Copying over a comment from that issue: Yes, so the problem as I see it is that, for serialization and open-file reasons we want to use a function like the following: `python def get_chunk_of_array(filename, datapath, slice): with netCDF4.Dataset(filename) as f: return f.variables[datapath][slice]` However, this opens and closes many files, which while robust, is slow. We can alleviate this by maintaining an LRU cache in a global variable so that it is created separately per process. ``` python from toolz import memoize cache = LRUDict(size=100, on_eviction=lambda file: file.close()) netCDF4_Dataset = memoize(netCDF4.Dataset, cache=cache) def get_chunk_of_array(filename, datapath, slice): f = netCDF4_Dataset(filename) return f.variables[datapath][slice] ``` I'm happy to supply the `memoize` function with `toolz` and an appropriate `LRUDict` object with other microprojects that I can publish if necessary. We would then need to use such a function within the dask.array and xarray codebases. Anyway, that's one approach. Thoughts welcome.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Integration with dask/distributed (xarray backend design) 142498006
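
Several comments above discuss a file-open cache that keeps handles alive across reads and evicts the least-recently-used file that is not currently in use. The following is a minimal sketch of that idea, not the implementation that xarray or dask adopted. It assumes that a reference count of zero is a sufficient signal that a file is safe to close; the `_evict` helper and the choice of `OrderedDict` for recency tracking are illustrative assumptions rather than anything stated in the thread.

```python
import threading
from collections import OrderedDict, defaultdict
from contextlib import contextmanager


class OpenCache:
    """Cache of open file handles that evicts least-recently-used idle files."""

    def __init__(self, maxsize=100):
        self.maxsize = maxsize
        self.cache = OrderedDict()        # key -> open file, oldest first
        self.refcount = defaultdict(int)  # key -> number of active users
        self.lock = threading.Lock()

    @contextmanager
    def open(self, myopen, fn, mode="r"):
        assert "r" in mode  # read-only, as in the original comment
        key = (myopen, fn, mode)
        with self.lock:
            try:
                file = self.cache[key]
                self.cache.move_to_end(key)  # mark as most recently used
            except KeyError:
                file = myopen(fn, mode=mode)
                self.cache[key] = file
            self.refcount[key] += 1
            self._evict()
        try:
            yield file
        finally:
            with self.lock:
                self.refcount[key] -= 1
                self._evict()

    def _evict(self):
        # Hypothetical helper; the caller must hold self.lock.
        # Close idle files, oldest first, until the cache fits.
        # Files that are still in use are never closed, so the cache
        # may temporarily exceed maxsize.
        for key in list(self.cache):
            if len(self.cache) <= self.maxsize:
                break
            if self.refcount[key] == 0:
                self.cache.pop(key).close()
                del self.refcount[key]
```

Usage would follow the pattern in the 2017-06-01 comment, for example `with cache.open(h5py.File, 'myfile.hdf5') as f:`. Skipping in-use files instead of blocking on them is one of the two options raised in the thread; blocking until the file is released would be the simpler stop-gap.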


CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);
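
The filter described at the top of this page can be reproduced against a table with the schema above. This is a sketch using Python's standard `sqlite3` module; the function name `member_comments` is hypothetical, and it assumes you already have a connection to a database containing the `issue_comments` table.

```python
import sqlite3


def member_comments(conn, issue_id, user_id):
    """Return (id, updated_at, body) for MEMBER comments, newest first."""
    query = """
        SELECT [id], [updated_at], [body]
        FROM [issue_comments]
        WHERE [author_association] = 'MEMBER'
          AND [issue] = ?
          AND [user] = ?
        ORDER BY [updated_at] DESC
    """
    return conn.execute(query, (issue_id, user_id)).fetchall()
```

For the rows listed here the call would be `member_comments(conn, 142498006, 306380)`. Because `updated_at` is stored as ISO 8601 text, sorting it as a string matches chronological order.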