issue_comments
59 rows where issue = 142498006 sorted by updated_at descending
id | html_url | issue_url | node_id | user | created_at | updated_at ▲ | author_association | body | reactions | performed_via_github_app | issue |
---|---|---|---|---|---|---|---|---|---|---|---|
453800654 | https://github.com/pydata/xarray/issues/798#issuecomment-453800654 | https://api.github.com/repos/pydata/xarray/issues/798 | MDEyOklzc3VlQ29tbWVudDQ1MzgwMDY1NA== | jhamman 2443309 | 2019-01-13T04:12:32Z | 2019-01-13T04:12:32Z | MEMBER | Closing this old issue. The final checkbox in @pwolfram's original post was completed in #2261. |
{ "total_count": 1, "+1": 0, "-1": 0, "laugh": 0, "hooray": 1, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
Integration with dask/distributed (xarray backend design) 142498006 | |
305506896 | https://github.com/pydata/xarray/issues/798#issuecomment-305506896 | https://api.github.com/repos/pydata/xarray/issues/798 | MDEyOklzc3VlQ29tbWVudDMwNTUwNjg5Ng== | mrocklin 306380 | 2017-06-01T14:17:11Z | 2017-06-01T14:17:11Z | MEMBER | @shoyer regarding per-file locking this probably only matters if we are writing as well, yes? Here is a small implementation of a generic file-open cache. I haven't yet decided on an eviction policy but either LRU or random (filtered by closeable files) should work OK.

```python
from collections import defaultdict
from contextlib import contextmanager
import threading


class OpenCache(object):
    def __init__(self, maxsize=100):
        self.refcount = defaultdict(lambda: 0)
        self.maxsize = maxsize
        self.cache = {}
        self.i = 0
        self.lock = threading.Lock()
```

```python
cache = OpenCache()
with cache.open(h5py.File, 'myfile.hdf5') as f:
    x = f['/data/x']
    y = x[:1000, :1000]
```

Is this still useful? I'm curious to hear from users like @pwolfram and @rabernat who may be running into the many-file problem about what the current pain points are. |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
Integration with dask/distributed (xarray backend design) 142498006 | |
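The `cache.open` context manager used in the snippet above is not shown in the comment. A self-contained sketch of what such a file-open cache might look like is below; the refcount-based `_evict_one` policy (close an arbitrary file that no task is currently using) is an assumption, not the implementation the comment had in mind.

```python
from collections import defaultdict
from contextlib import contextmanager
import threading


class OpenCache(object):
    """Cache of open file-like objects, keyed by (opener, path)."""

    def __init__(self, maxsize=100):
        self.refcount = defaultdict(int)  # how many tasks hold each file
        self.maxsize = maxsize
        self.cache = {}                   # (opener, path) -> open file
        self.lock = threading.Lock()

    @contextmanager
    def open(self, opener, path, *args, **kwargs):
        key = (opener, path)
        with self.lock:
            if key not in self.cache:
                if len(self.cache) >= self.maxsize:
                    self._evict_one()
                self.cache[key] = opener(path, *args, **kwargs)
            self.refcount[key] += 1
        try:
            yield self.cache[key]
        finally:
            with self.lock:
                self.refcount[key] -= 1

    def _evict_one(self):
        # Close an arbitrary cached file that no task is currently using.
        for key, f in list(self.cache.items()):
            if self.refcount[key] == 0:
                f.close()
                del self.cache[key]
                return
```

With `maxsize=1`, opening a second path evicts and closes the first file as soon as it is no longer in use, which is the behaviour the thread is after for the many-open-files problem.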
288415152 | https://github.com/pydata/xarray/issues/798#issuecomment-288415152 | https://api.github.com/repos/pydata/xarray/issues/798 | MDEyOklzc3VlQ29tbWVudDI4ODQxNTE1Mg== | mrocklin 306380 | 2017-03-22T14:26:08Z | 2017-03-22T14:26:08Z | MEMBER | Has anyone used XArray on NetCDF data on a cluster without resorting to any tricks? |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
Integration with dask/distributed (xarray backend design) 142498006 | |
288414396 | https://github.com/pydata/xarray/issues/798#issuecomment-288414396 | https://api.github.com/repos/pydata/xarray/issues/798 | MDEyOklzc3VlQ29tbWVudDI4ODQxNDM5Ng== | pwolfram 4295853 | 2017-03-22T14:23:45Z | 2017-03-22T14:23:45Z | CONTRIBUTOR | @mrocklin and @shoyer, we now have dask.distributed and xarray support. Should this issue be closed? |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
Integration with dask/distributed (xarray backend design) 142498006 | |
263470325 | https://github.com/pydata/xarray/issues/798#issuecomment-263470325 | https://api.github.com/repos/pydata/xarray/issues/798 | MDEyOklzc3VlQ29tbWVudDI2MzQ3MDMyNQ== | mrocklin 306380 | 2016-11-29T04:02:05Z | 2016-11-29T04:02:05Z | MEMBER | A lock on the LRU cache makes sense to me.
If it were me I would just block on the evicted file until it becomes available (the stop-gap measure) until it became a performance problem. |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
Integration with dask/distributed (xarray backend design) 142498006 | |
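The stop-gap measure described above — evicting under a lock, and simply blocking on the evicted file until it is free — can be sketched as follows. `LockedLRU`, `acquire`, and `release` are hypothetical names; this is an illustration of the blocking behaviour, not code from the thread.

```python
import threading
from collections import OrderedDict


class LockedLRU:
    """LRU map of path -> open file. Eviction blocks while the
    least-recently-used file is still in use (the stop-gap behaviour)."""

    def __init__(self, maxsize):
        self.maxsize = maxsize
        self.data = OrderedDict()          # path -> (file, in_use_count)
        self.cond = threading.Condition()

    def acquire(self, path, opener):
        with self.cond:
            if path in self.data:
                f, n = self.data.pop(path)
                self.data[path] = (f, n + 1)   # move to most-recent end
                return f
            while len(self.data) >= self.maxsize:
                victim, (vf, vn) = next(iter(self.data.items()))
                if vn == 0:
                    self.data.pop(victim)
                    vf.close()
                else:
                    self.cond.wait()           # block until someone releases
            f = opener(path)
            self.data[path] = (f, 1)
            return f

    def release(self, path):
        with self.cond:
            f, n = self.data[path]
            self.data[path] = (f, n - 1)
            self.cond.notify_all()
```

This trades throughput for safety exactly as the comment suggests: a worker that needs a new slot waits on the condition variable instead of closing a file out from under another task.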
263431065 | https://github.com/pydata/xarray/issues/798#issuecomment-263431065 | https://api.github.com/repos/pydata/xarray/issues/798 | MDEyOklzc3VlQ29tbWVudDI2MzQzMTA2NQ== | shoyer 1217238 | 2016-11-28T23:42:54Z | 2016-11-28T23:42:54Z | MEMBER | @mrocklin Any thoughts on my thread safety concerns (https://github.com/pydata/xarray/issues/798#issuecomment-259202265) for the LRU cache? I suppose the simplest thing to do is to simply refuse to evict a file until the per-file lock is released, but I can see that strategy failing pretty badly in edge cases. |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
Integration with dask/distributed (xarray backend design) 142498006 | |
262236329 | https://github.com/pydata/xarray/issues/798#issuecomment-262236329 | https://api.github.com/repos/pydata/xarray/issues/798 | MDEyOklzc3VlQ29tbWVudDI2MjIzNjMyOQ== | mrocklin 306380 | 2016-11-22T13:10:03Z | 2016-11-22T13:11:48Z | MEMBER | One solution is to create protocols on the Dask side to enable |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
Integration with dask/distributed (xarray backend design) 142498006 | |
262214999 | https://github.com/pydata/xarray/issues/798#issuecomment-262214999 | https://api.github.com/repos/pydata/xarray/issues/798 | MDEyOklzc3VlQ29tbWVudDI2MjIxNDk5OQ== | kynan 346079 | 2016-11-22T11:18:56Z | 2016-11-22T11:18:56Z | NONE | When using xarray with the There could be a
(Could create a separate issue for this if preferred). |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
Integration with dask/distributed (xarray backend design) 142498006 | |
259202265 | https://github.com/pydata/xarray/issues/798#issuecomment-259202265 | https://api.github.com/repos/pydata/xarray/issues/798 | MDEyOklzc3VlQ29tbWVudDI1OTIwMjI2NQ== | shoyer 1217238 | 2016-11-08T17:27:55Z | 2016-11-08T22:19:11Z | MEMBER | A few other thoughts on thread safety with the LRU approach:
1. We need a global lock to ensure internal consistency of the LRU cache, and so that we don't overwrite files without closing them. It probably makes sense to put this in |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
Integration with dask/distributed (xarray backend design) 142498006 | |
259277405 | https://github.com/pydata/xarray/issues/798#issuecomment-259277405 | https://api.github.com/repos/pydata/xarray/issues/798 | MDEyOklzc3VlQ29tbWVudDI1OTI3NzQwNQ== | mrocklin 306380 | 2016-11-08T22:18:42Z | 2016-11-08T22:18:42Z | MEMBER | Yes. |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
Integration with dask/distributed (xarray backend design) 142498006 | |
259277067 | https://github.com/pydata/xarray/issues/798#issuecomment-259277067 | https://api.github.com/repos/pydata/xarray/issues/798 | MDEyOklzc3VlQ29tbWVudDI1OTI3NzA2Nw== | kynan 346079 | 2016-11-08T22:17:14Z | 2016-11-08T22:17:14Z | NONE | Great to see this moving! I take it the workshop was productive? How does #1095 work in the scenario of a distributed scheduler with remote workers? Do I understand correctly that all workers and the client would need to see the same shared filesystem from where NetCDF files are read? |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
Integration with dask/distributed (xarray backend design) 142498006 | |
259185165 | https://github.com/pydata/xarray/issues/798#issuecomment-259185165 | https://api.github.com/repos/pydata/xarray/issues/798 | MDEyOklzc3VlQ29tbWVudDI1OTE4NTE2NQ== | shoyer 1217238 | 2016-11-08T16:28:13Z | 2016-11-08T16:28:13Z | MEMBER | One slight subtlety is writes. We'll need to switch from 'w' to 'a' mode the second time we open a file. |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
Integration with dask/distributed (xarray backend design) 142498006 | |
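The mode-switch subtlety @shoyer raises can be illustrated with a tiny helper: open a path with `'w'` (truncate) the first time it is seen and `'a'` (append) thereafter, so that reopening after an LRU eviction does not clobber earlier writes. The example uses the builtin `open` rather than a netCDF writer, and `open_for_write` is a hypothetical name.

```python
# Hypothetical helper: 'w' on first open of a path, 'a' on later opens.
_seen = set()


def open_for_write(path):
    mode = "a" if path in _seen else "w"
    _seen.add(path)
    return open(path, mode)
```

Without the switch, a second open in `'w'` mode would truncate the file and silently discard everything written before the eviction.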
259181856 | https://github.com/pydata/xarray/issues/798#issuecomment-259181856 | https://api.github.com/repos/pydata/xarray/issues/798 | MDEyOklzc3VlQ29tbWVudDI1OTE4MTg1Ng== | mrocklin 306380 | 2016-11-08T16:17:20Z | 2016-11-08T16:17:20Z | MEMBER | FYI Dask is committed to maintaining this: https://github.com/dask/zict/blob/master/zict/lru.py |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
Integration with dask/distributed (xarray backend design) 142498006 | |
259181526 | https://github.com/pydata/xarray/issues/798#issuecomment-259181526 | https://api.github.com/repos/pydata/xarray/issues/798 | MDEyOklzc3VlQ29tbWVudDI1OTE4MTUyNg== | shoyer 1217238 | 2016-11-08T16:16:15Z | 2016-11-08T16:16:15Z | MEMBER | We have something very hacky working with https://github.com/pydata/xarray/pull/1095 I'm also going to see if I can get something working with the LRU cache, since that seems closer to the solution we want eventually. |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
Integration with dask/distributed (xarray backend design) 142498006 | |
257918615 | https://github.com/pydata/xarray/issues/798#issuecomment-257918615 | https://api.github.com/repos/pydata/xarray/issues/798 | MDEyOklzc3VlQ29tbWVudDI1NzkxODYxNQ== | mrocklin 306380 | 2016-11-02T16:27:45Z | 2016-11-02T16:27:45Z | MEMBER | Custom serialization is in dask/distributed. This allows us to build custom serialization solutions like the following for Any concerns would be very welcome. Earlier is better. |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
Integration with dask/distributed (xarray backend design) 142498006 | |
257292279 | https://github.com/pydata/xarray/issues/798#issuecomment-257292279 | https://api.github.com/repos/pydata/xarray/issues/798 | MDEyOklzc3VlQ29tbWVudDI1NzI5MjI3OQ== | mrocklin 306380 | 2016-10-31T13:24:01Z | 2016-10-31T14:49:31Z | MEMBER | I may have a solution to this in https://github.com/dask/distributed/pull/606, which allows for custom serialization formats to be registered with dask.distributed. We would register serialize and deserialize functions for the various netCDF objects. Something like the following might work for h5py:

```python
def serialize_dataset(dset):
    header = {}
    frames = [dset.filename.encode(), dset.datapath.encode()]
    return header, frames


def deserialize_dataset(header, frames):
    filename, datapath = frames
    f = h5py.File(filename.decode())
    dset = f[datapath.decode()]
    return dset


register_serialization(h5py.Dataset, serialize_dataset, deserialize_dataset)
```

We still have lingering open files but not too many per machine. They'll move around the network, but only as necessary. |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
Integration with dask/distributed (xarray backend design) 142498006 | |
257063608 | https://github.com/pydata/xarray/issues/798#issuecomment-257063608 | https://api.github.com/repos/pydata/xarray/issues/798 | MDEyOklzc3VlQ29tbWVudDI1NzA2MzYwOA== | shoyer 1217238 | 2016-10-29T01:45:09Z | 2016-10-29T01:45:09Z | MEMBER |
Yes, this sounds quite promising to me. Using OpenDAP for communication is also possible, but if all we need to do is pass around serialized |
{ "total_count": 1, "+1": 1, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
Integration with dask/distributed (xarray backend design) 142498006 | |
257063168 | https://github.com/pydata/xarray/issues/798#issuecomment-257063168 | https://api.github.com/repos/pydata/xarray/issues/798 | MDEyOklzc3VlQ29tbWVudDI1NzA2MzE2OA== | mrocklin 306380 | 2016-10-29T01:37:11Z | 2016-10-29T01:37:11Z | MEMBER | We could pull data from OpenDAP. Actually computing on those workers would probably be hard to integrate. Distributed Dask.array could possibly replace OpenDAP in some settings though, serving not only data, but also computation. |
{ "total_count": 1, "+1": 1, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
Integration with dask/distributed (xarray backend design) 142498006 | |
257062571 | https://github.com/pydata/xarray/issues/798#issuecomment-257062571 | https://api.github.com/repos/pydata/xarray/issues/798 | MDEyOklzc3VlQ29tbWVudDI1NzA2MjU3MQ== | delgadom 3698640 | 2016-10-29T01:26:47Z | 2016-10-29T01:26:47Z | CONTRIBUTOR | Could this extend the OPeNDAP interface? That would solve the metadata problem and would provide quick access to the distributed workers. |
{ "total_count": 1, "+1": 1, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
Integration with dask/distributed (xarray backend design) 142498006 | |
256121613 | https://github.com/pydata/xarray/issues/798#issuecomment-256121613 | https://api.github.com/repos/pydata/xarray/issues/798 | MDEyOklzc3VlQ29tbWVudDI1NjEyMTYxMw== | mrocklin 306380 | 2016-10-25T18:20:58Z | 2016-10-25T18:20:58Z | MEMBER | You wouldn't |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
Integration with dask/distributed (xarray backend design) 142498006 | |
256038226 | https://github.com/pydata/xarray/issues/798#issuecomment-256038226 | https://api.github.com/repos/pydata/xarray/issues/798 | MDEyOklzc3VlQ29tbWVudDI1NjAzODIyNg== | kynan 346079 | 2016-10-25T13:43:32Z | 2016-10-25T13:43:32Z | NONE | For the case where NetCDF / HDF5 files are only available on the distributed workers and not directly accessible from the client, how would you get the necessary metadata (coords, dims etc.) to construct the |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
Integration with dask/distributed (xarray backend design) 142498006 | |
255800363 | https://github.com/pydata/xarray/issues/798#issuecomment-255800363 | https://api.github.com/repos/pydata/xarray/issues/798 | MDEyOklzc3VlQ29tbWVudDI1NTgwMDM2Mw== | mrocklin 306380 | 2016-10-24T17:00:58Z | 2016-10-24T17:00:58Z | MEMBER | One alternative would be to define custom serialization for I've been toying with the idea of custom serialization for dask.distributed recently. This was originally intended to let Dask make some opinionated serialization choices for some common formats (usually so that we can serialize numpy arrays and pandas dataframes faster than their generic pickle implementations allow) but this might also be helpful here to allow us to serialize netCDF4.Dataset objects and friends. We would define custom dumps and loads functions for netCDF4.Dataset objects that would presumably encode them as a filename and datapath. This would get around the open-many-files issue because the dataset would stay in the worker's One concern is that there are reasons why netCDF4.Dataset objects are not serializable (see https://github.com/h5py/h5py/issues/531). I'm not sure if this would affect XArray workloads. |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
Integration with dask/distributed (xarray backend design) 142498006 | |
255797423 | https://github.com/pydata/xarray/issues/798#issuecomment-255797423 | https://api.github.com/repos/pydata/xarray/issues/798 | MDEyOklzc3VlQ29tbWVudDI1NTc5NzQyMw== | shoyer 1217238 | 2016-10-24T16:50:15Z | 2016-10-24T16:50:15Z | MEMBER |
Yes, this could work for a proof of concept. In the long term, it would be good to integrate this into xarray so we can support alternative backends (e.g., h5netcdf, scipy, pynio, loaders for custom file formats like @rabernat and @pwolfram work with) in a fully consistent fashion without needing to make a separate wrapper for each. |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
Integration with dask/distributed (xarray backend design) 142498006 | |
255796531 | https://github.com/pydata/xarray/issues/798#issuecomment-255796531 | https://api.github.com/repos/pydata/xarray/issues/798 | MDEyOklzc3VlQ29tbWVudDI1NTc5NjUzMQ== | mrocklin 306380 | 2016-10-24T16:46:39Z | 2016-10-24T16:46:39Z | MEMBER | We seem to be making good progress here on the issue. I'm also happy to switch to real-time voice at any point today or tomorrow if people prefer. |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
Integration with dask/distributed (xarray backend design) 142498006 | |
255795874 | https://github.com/pydata/xarray/issues/798#issuecomment-255795874 | https://api.github.com/repos/pydata/xarray/issues/798 | MDEyOklzc3VlQ29tbWVudDI1NTc5NTg3NA== | mrocklin 306380 | 2016-10-24T16:44:10Z | 2016-10-24T16:44:10Z | MEMBER | We could possibly make an object that was API compatible with the subset of netCDF4.Dataset that you needed, but opened and closed the file whenever it actually pulled data. We would keep an LRU cache of open files around for efficiency as discussed earlier. In this case we could possibly optionally swap out the current netCDF4.Dataset object with this thing without much refactoring? |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
Integration with dask/distributed (xarray backend design) 142498006 | |
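The API-compatible wrapper @mrocklin describes — an object that mimics the needed subset of `netCDF4.Dataset` but holds only a path, opening and closing the real file whenever data is actually pulled — might look like the sketch below (without the LRU layer). `LazyDataset` is a hypothetical name, and the toy opener in the test stands in for a real netCDF/HDF5 reader.

```python
class LazyDataset:
    """Sketch of the wrapper described above: supports item access like an
    open dataset, but stores only a path and an opener, and opens and
    closes the underlying file on every actual data pull."""

    def __init__(self, path, opener):
        self.path = path        # cheap to serialize: just a string
        self.opener = opener    # e.g. netCDF4.Dataset or h5py.File

    def __getitem__(self, key):
        # An LRU cache of open files could sit here for efficiency.
        f = self.opener(self.path)
        try:
            return f[key]
        finally:
            f.close()
```

Because the object carries no open file handle between accesses, it is trivially picklable and can move between distributed workers, which is exactly the property the thread is after.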
255794868 | https://github.com/pydata/xarray/issues/798#issuecomment-255794868 | https://api.github.com/repos/pydata/xarray/issues/798 | MDEyOklzc3VlQ29tbWVudDI1NTc5NDg2OA== | shoyer 1217238 | 2016-10-24T16:40:09Z | 2016-10-24T16:40:09Z | MEMBER | @mrocklin OK, that makes sense. In that case, we might indeed need to thread this through xarray's backends. Currently, backends open a file (e.g., with This currently doesn't use It seems like a version of xarray's backends that doesn't always open files immediately would make it suitable for use in dask.distributed. So indeed, we'll need to do some serious refactoring. One other thing that will need to be tackled eventually: |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
Integration with dask/distributed (xarray backend design) 142498006 | |
255788269 | https://github.com/pydata/xarray/issues/798#issuecomment-255788269 | https://api.github.com/repos/pydata/xarray/issues/798 | MDEyOklzc3VlQ29tbWVudDI1NTc4ODI2OQ== | mrocklin 306380 | 2016-10-24T16:15:59Z | 2016-10-24T16:15:59Z | MEMBER | The The same approach could be used with XArray except that presumably we would need to do this for every relevant dataset within the NetCDF file. |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
Integration with dask/distributed (xarray backend design) 142498006 | |
255786548 | https://github.com/pydata/xarray/issues/798#issuecomment-255786548 | https://api.github.com/repos/pydata/xarray/issues/798 | MDEyOklzc3VlQ29tbWVudDI1NTc4NjU0OA== | shoyer 1217238 | 2016-10-24T16:10:14Z | 2016-10-24T16:10:40Z | MEMBER | I'm happy to help work out a plan here. It seems like there are basically two steps we need to make this happen:
1. Write the equivalent of It's an open question to what extent this needs to interact with xarray's internal |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
Integration with dask/distributed (xarray backend design) 142498006 | |
255207705 | https://github.com/pydata/xarray/issues/798#issuecomment-255207705 | https://api.github.com/repos/pydata/xarray/issues/798 | MDEyOklzc3VlQ29tbWVudDI1NTIwNzcwNQ== | kynan 346079 | 2016-10-20T19:42:41Z | 2016-10-20T19:42:41Z | NONE | I'm probably not familiar enough with either the xarray or dask / distributed codebases to provide much input but would be happy to contribute if / where it makes sense. Would also be happy to be part of a some real-time discussion if feasible (based in the UK, so wouldn't be able to attend the workshop). |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
Integration with dask/distributed (xarray backend design) 142498006 | |
255201276 | https://github.com/pydata/xarray/issues/798#issuecomment-255201276 | https://api.github.com/repos/pydata/xarray/issues/798 | MDEyOklzc3VlQ29tbWVudDI1NTIwMTI3Ng== | mrocklin 306380 | 2016-10-20T19:16:59Z | 2016-10-20T19:16:59Z | MEMBER | I agree that we should discuss it at the workshop. I also think it's possible that this could be accomplished by the right person (or combination of people) in a few hours. If so I think that we should come with it in hand as a capability that exists rather than a capability that should exist. |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
Integration with dask/distributed (xarray backend design) 142498006 | |
255200677 | https://github.com/pydata/xarray/issues/798#issuecomment-255200677 | https://api.github.com/repos/pydata/xarray/issues/798 | MDEyOklzc3VlQ29tbWVudDI1NTIwMDY3Nw== | rabernat 1197350 | 2016-10-20T19:14:34Z | 2016-10-20T19:14:34Z | MEMBER | This is a really important idea that has the potential to accelerate xarray from "medium data" to "big data". It should be planned out thoughtfully. My view is that we should implement a new DataStore class to handle distributed datasets. This could live in the xarray backend, or it could be a standalone package. Such a data store could be the foundation of a powerful platform for big-climate-data analysis. (Or maybe I am thinking too ambitiously.) I think the upcoming aospy workshop will be an ideal opportunity to discuss this, since many of the people on this thread will be face-to-face. |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
Integration with dask/distributed (xarray backend design) 142498006 | |
255194191 | https://github.com/pydata/xarray/issues/798#issuecomment-255194191 | https://api.github.com/repos/pydata/xarray/issues/798 | MDEyOklzc3VlQ29tbWVudDI1NTE5NDE5MQ== | mrocklin 306380 | 2016-10-20T18:48:53Z | 2016-10-20T18:48:53Z | MEMBER | I agree that this conversation needs expertise from a core xarray developer. I suspect that this change is more likely to happen in xarray than in dask.array. Happy to continue the conversation wherever. I do have a slight preference to switch to real-time at some point though. I suspect that we can hash this out in a moderate number of minutes. |
{ "total_count": 1, "+1": 1, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
Integration with dask/distributed (xarray backend design) 142498006 | |
255192875 | https://github.com/pydata/xarray/issues/798#issuecomment-255192875 | https://api.github.com/repos/pydata/xarray/issues/798 | MDEyOklzc3VlQ29tbWVudDI1NTE5Mjg3NQ== | pwolfram 4295853 | 2016-10-20T18:44:03Z | 2016-10-20T18:44:03Z | CONTRIBUTOR | @mrocklin, I would be happy to chat because I am interested in seeing this happen (e.g., eventually contributing code). The question is whether we need additional expertise from @shoyer, @jhamman, @rabernat etc who likely have a greater in-depth understanding of xarray than me. Perhaps this warrants an email to the wider list? |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
Integration with dask/distributed (xarray backend design) 142498006 | |
255190556 | https://github.com/pydata/xarray/issues/798#issuecomment-255190556 | https://api.github.com/repos/pydata/xarray/issues/798 | MDEyOklzc3VlQ29tbWVudDI1NTE5MDU1Ng== | mrocklin 306380 | 2016-10-20T18:35:27Z | 2016-10-20T18:35:27Z | MEMBER | If XArray devs want to chat sometime I suspect we could hammer out a plan fairly quickly. My hope is that once a plan exists then a developer will arise to implement that plan. I'm free all of today and tomorrow. |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
Integration with dask/distributed (xarray backend design) 142498006 | |
255190289 | https://github.com/pydata/xarray/issues/798#issuecomment-255190289 | https://api.github.com/repos/pydata/xarray/issues/798 | MDEyOklzc3VlQ29tbWVudDI1NTE5MDI4OQ== | mrocklin 306380 | 2016-10-20T18:34:35Z | 2016-10-20T18:34:35Z | MEMBER | Definitely happy to support from the Dask side. I think that the LRU method described above is feasible. |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
Integration with dask/distributed (xarray backend design) 142498006 | |
255188697 | https://github.com/pydata/xarray/issues/798#issuecomment-255188697 | https://api.github.com/repos/pydata/xarray/issues/798 | MDEyOklzc3VlQ29tbWVudDI1NTE4ODY5Nw== | pwolfram 4295853 | 2016-10-20T18:28:24Z | 2016-10-20T18:28:24Z | CONTRIBUTOR | @kynan, I'm still interested in this but have not had time to advance this further. Are you interested in contributing to this too? I view this as a key component of future climate analysis workflows. This may also be something that is addressed at the upcoming hackathon at Columbia with @rabernat early next month. Also, I suspect that both @mrocklin and @shoyer would be willing to continue to provide key advice because this appears to be aligned with their interests too (please correct me if I'm wrong in this assessment). |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
Integration with dask/distributed (xarray backend design) 142498006 | |
255187606 | https://github.com/pydata/xarray/issues/798#issuecomment-255187606 | https://api.github.com/repos/pydata/xarray/issues/798 | MDEyOklzc3VlQ29tbWVudDI1NTE4NzYwNg== | mrocklin 306380 | 2016-10-20T18:24:10Z | 2016-10-20T18:24:10Z | MEMBER | I haven't worked on this but agree that it is important. |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
Integration with dask/distributed (xarray backend design) 142498006 | |
255184991 | https://github.com/pydata/xarray/issues/798#issuecomment-255184991 | https://api.github.com/repos/pydata/xarray/issues/798 | MDEyOklzc3VlQ29tbWVudDI1NTE4NDk5MQ== | kynan 346079 | 2016-10-20T18:14:38Z | 2016-10-20T18:14:38Z | NONE | Has this issue progressed since? Being able to distribute loading of files to a dask cluster and composing an xarray Is @mrocklin's blog post from Feb 2016 still the reference for remote data loading on a cluster? Adapting it to loading xarray Datasets rather than plain arrays is not straightforward since there is no way to combine futures representing Datasets out of the box. |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
Integration with dask/distributed (xarray backend design) 142498006 | |
209005106 | https://github.com/pydata/xarray/issues/798#issuecomment-209005106 | https://api.github.com/repos/pydata/xarray/issues/798 | MDEyOklzc3VlQ29tbWVudDIwOTAwNTEwNg== | mrocklin 306380 | 2016-04-12T16:55:02Z | 2016-04-12T16:55:02Z | MEMBER | It's probably best to avoid futures within |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
Integration with dask/distributed (xarray backend design) 142498006 | |
205492861 | https://github.com/pydata/xarray/issues/798#issuecomment-205492861 | https://api.github.com/repos/pydata/xarray/issues/798 | MDEyOklzc3VlQ29tbWVudDIwNTQ5Mjg2MQ== | shoyer 1217238 | 2016-04-04T20:54:42Z | 2016-04-04T20:54:42Z | MEMBER |
Yes, this is correct. In principle, if we have a very large number of files containing many variables each, we might want to do the read in parallel using futures, and then use something like |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
Integration with dask/distributed (xarray backend design) 142498006 | |
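The pattern @shoyer sketches — read metadata from many files in parallel using futures, then gather the results to build one combined object — can be illustrated locally with the standard library's `concurrent.futures` standing in for a `distributed.Client`. `read_metadata` and `gather_metadata` are hypothetical placeholders, not xarray API.

```python
from concurrent.futures import ThreadPoolExecutor


def read_metadata(path):
    # Placeholder: in practice this would open a netCDF file and pull
    # out its dims, coords, and variable names.
    return {"path": path, "dims": {"time": 10}}


def gather_metadata(paths):
    # submit one future per file, then gather the results in order
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(read_metadata, p) for p in paths]
        return [f.result() for f in futures]
```

With `distributed`, the same shape of code would use `client.submit` and `client.gather`, keeping the expensive per-file opens on the workers.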
205484614 | https://github.com/pydata/xarray/issues/798#issuecomment-205484614 | https://api.github.com/repos/pydata/xarray/issues/798 | MDEyOklzc3VlQ29tbWVudDIwNTQ4NDYxNA== | shoyer 1217238 | 2016-04-04T20:40:58Z | 2016-04-04T20:40:58Z | MEMBER | @pwolfram I was referring to this comment for @mrocklin's |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
Integration with dask/distributed (xarray backend design) 142498006 | |
205481557 | https://github.com/pydata/xarray/issues/798#issuecomment-205481557 | https://api.github.com/repos/pydata/xarray/issues/798 | MDEyOklzc3VlQ29tbWVudDIwNTQ4MTU1Nw== | pwolfram 4295853 | 2016-04-04T20:32:23Z | 2016-04-04T20:32:23Z | CONTRIBUTOR | @shoyer, if we are happy to open all netCDF files and read out the metadata from a master process, that would imply that we would open a file, read the metadata, and then close it, correct? Array access should then follow something like @mrocklin's |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
Integration with dask/distributed (xarray backend design) 142498006 | |
205481269 | https://github.com/pydata/xarray/issues/798#issuecomment-205481269 | https://api.github.com/repos/pydata/xarray/issues/798 | MDEyOklzc3VlQ29tbWVudDIwNTQ4MTI2OQ== | pwolfram 4295853 | 2016-04-04T20:31:24Z | 2016-04-04T20:31:24Z | CONTRIBUTOR | @fmaussion, for
1. The LRU cache should be used serially for the read initially, but something more like @mrocklin's |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
Integration with dask/distributed (xarray backend design) 142498006 | |
205478991 | https://github.com/pydata/xarray/issues/798#issuecomment-205478991 | https://api.github.com/repos/pydata/xarray/issues/798 | MDEyOklzc3VlQ29tbWVudDIwNTQ3ODk5MQ== | pwolfram 4295853 | 2016-04-04T20:24:41Z | 2016-04-04T20:24:41Z | CONTRIBUTOR | Just to be clear, we are talking about this https://github.com/mrocklin/hdf5lazy/blob/master/hdf5lazy/core.py#L83 for @mrocklin's |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
Integration with dask/distributed (xarray backend design) 142498006 | |
205375803 | https://github.com/pydata/xarray/issues/798#issuecomment-205375803 | https://api.github.com/repos/pydata/xarray/issues/798 | MDEyOklzc3VlQ29tbWVudDIwNTM3NTgwMw== | shoyer 1217238 | 2016-04-04T16:25:03Z | 2016-04-04T16:25:03Z | MEMBER |
Correct, the LRU dict should be global. I believe the file restriction is generally per-process, and creating a global dict should assure that works properly.
The challenge is that we only call the My bigger concern was how to make use of a method like |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
Integration with dask/distributed (xarray backend design) 142498006 | |
205370162 | https://github.com/pydata/xarray/issues/798#issuecomment-205370162 | https://api.github.com/repos/pydata/xarray/issues/798 | MDEyOklzc3VlQ29tbWVudDIwNTM3MDE2Mg== | fmaussion 10050469 | 2016-04-04T16:08:57Z | 2016-04-04T16:08:57Z | MEMBER | Sorry if I am just producing noise here (I am not a specialist), but I have two naive questions: To 1. how will you handle concurrent access to the LRU cache if it's a global variable? To 2. Once the file has been closed by the LRU, won't it also be erased from it? So that a simple |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
Integration with dask/distributed (xarray backend design) 142498006 | |
205133433 | https://github.com/pydata/xarray/issues/798#issuecomment-205133433 | https://api.github.com/repos/pydata/xarray/issues/798 | MDEyOklzc3VlQ29tbWVudDIwNTEzMzQzMw== | pwolfram 4295853 | 2016-04-04T04:35:09Z | 2016-04-04T04:35:09Z | CONTRIBUTOR | Thanks @mrocklin! This has been really helpful and was what I needed to get going. A prelim design I'm seeing is to modify the

A clean way to do this is just to make sure that each time

Unless I'm missing something big, I don't think this change will require a large refactor, but it is quite possible I overlooked something important. @shoyer and @mrocklin, do you see any obvious pitfalls in this scope? If not, it shouldn't be too hard to implement. |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
Integration with dask/distributed (xarray backend design) 142498006 | |
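The design sketched in the comment above — re-acquiring the file through a process-global cache on every access instead of holding a handle open — could look roughly like this (all names hypothetical):

```python
class CachedArrayWrapper:
    """Lazy array that routes every access through a cached opener.

    `get_file` stands in for any process-global cached opener (e.g. an
    LRU of open netCDF4.Dataset objects). Because no open handle is
    stored on the wrapper itself, it stays picklable and can be shipped
    to other processes.
    """

    def __init__(self, filename, variable, get_file):
        self.filename = filename
        self.variable = variable
        self.get_file = get_file

    def __getitem__(self, key):
        # Re-open (or fetch from the cache) at access time.
        f = self.get_file(self.filename)
        return f[self.variable][key]
```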
204813232 | https://github.com/pydata/xarray/issues/798#issuecomment-204813232 | https://api.github.com/repos/pydata/xarray/issues/798 | MDEyOklzc3VlQ29tbWVudDIwNDgxMzIzMg== | mrocklin 306380 | 2016-04-02T22:29:04Z | 2016-04-02T22:29:04Z | MEMBER | FWIW I've uploaded a tiny LRU dict implementation to a new project: http://zict.readthedocs.org/en/latest/
There are a number of good alternatives out there though for LRU dictionaries. |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
Integration with dask/distributed (xarray backend design) 142498006 | |
204770198 | https://github.com/pydata/xarray/issues/798#issuecomment-204770198 | https://api.github.com/repos/pydata/xarray/issues/798 | MDEyOklzc3VlQ29tbWVudDIwNDc3MDE5OA== | pwolfram 4295853 | 2016-04-02T18:25:18Z | 2016-04-02T18:25:18Z | CONTRIBUTOR | Another note in support of this PR, especially "robustly support HDF/NetCDF reads": I am having problems with |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
Integration with dask/distributed (xarray backend design) 142498006 | |
202696169 | https://github.com/pydata/xarray/issues/798#issuecomment-202696169 | https://api.github.com/repos/pydata/xarray/issues/798 | MDEyOklzc3VlQ29tbWVudDIwMjY5NjE2OQ== | pwolfram 4295853 | 2016-03-29T03:49:11Z | 2016-03-29T14:24:02Z | CONTRIBUTOR | Thanks @shoyer. If you can provide some guidance on bounds for the reorganization that would be really great. I want your and @jhamman's feedback on this before I try a solution. The trick is just to make the time, as always, and I may have some time this coming weekend. |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
Integration with dask/distributed (xarray backend design) 142498006 | |
201134785 | https://github.com/pydata/xarray/issues/798#issuecomment-201134785 | https://api.github.com/repos/pydata/xarray/issues/798 | MDEyOklzc3VlQ29tbWVudDIwMTEzNDc4NQ== | shoyer 1217238 | 2016-03-25T04:54:09Z | 2016-03-25T04:54:09Z | MEMBER | I agree with @mrocklin that the LRUCache for file-like objects should take care of things from the dask.array perspective. It should also solve https://github.com/pydata/xarray/issues/463 in a very clean way. We'll just need to reorganize things a bit to make use of it. |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
Integration with dask/distributed (xarray backend design) 142498006 | |
200901275 | https://github.com/pydata/xarray/issues/798#issuecomment-200901275 | https://api.github.com/repos/pydata/xarray/issues/798 | MDEyOklzc3VlQ29tbWVudDIwMDkwMTI3NQ== | mrocklin 306380 | 2016-03-24T16:00:52Z | 2016-03-24T16:00:52Z | MEMBER | I believe that robustly supporting HDF/NetCDF reads with the mechanism mentioned above will resolve most problems from a dask.array perspective. I have no doubt that other things will arise though. Switching from shared to distributed memory always comes with (surmountable) obstacles. |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
Integration with dask/distributed (xarray backend design) 142498006 | |
200878845 | https://github.com/pydata/xarray/issues/798#issuecomment-200878845 | https://api.github.com/repos/pydata/xarray/issues/798 | MDEyOklzc3VlQ29tbWVudDIwMDg3ODg0NQ== | pwolfram 4295853 | 2016-03-24T15:09:42Z | 2016-03-24T15:13:18Z | CONTRIBUTOR | This issue of connecting to dask/distributed may also be connected with https://github.com/pydata/xarray/issues/463, https://github.com/pydata/xarray/issues/591, and https://github.com/pydata/xarray/pull/524. |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
Integration with dask/distributed (xarray backend design) 142498006 | |
200633312 | https://github.com/pydata/xarray/issues/798#issuecomment-200633312 | https://api.github.com/repos/pydata/xarray/issues/798 | MDEyOklzc3VlQ29tbWVudDIwMDYzMzMxMg== | pwolfram 4295853 | 2016-03-24T03:04:25Z | 2016-03-24T03:04:25Z | CONTRIBUTOR | Repeating @mrocklin:
|
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
Integration with dask/distributed (xarray backend design) 142498006 | |
200632521 | https://github.com/pydata/xarray/issues/798#issuecomment-200632521 | https://api.github.com/repos/pydata/xarray/issues/798 | MDEyOklzc3VlQ29tbWVudDIwMDYzMjUyMQ== | pwolfram 4295853 | 2016-03-24T03:02:19Z | 2016-03-24T03:02:19Z | CONTRIBUTOR | @shoyer and @mrocklin, I've updated the summary above in the PR description with a to do list. Do either of you see any obvious tasks I missed on the list in the PR description? If so, can you please update the to do list so that I can see what needs done to modify the backend for the dask/distributed integration? |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
Integration with dask/distributed (xarray backend design) 142498006 | |
199547374 | https://github.com/pydata/xarray/issues/798#issuecomment-199547374 | https://api.github.com/repos/pydata/xarray/issues/798 | MDEyOklzc3VlQ29tbWVudDE5OTU0NzM3NA== | pwolfram 4295853 | 2016-03-22T00:01:55Z | 2016-03-22T00:02:11Z | CONTRIBUTOR | Here is an example of a use case for a |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
Integration with dask/distributed (xarray backend design) 142498006 | |
199545836 | https://github.com/pydata/xarray/issues/798#issuecomment-199545836 | https://api.github.com/repos/pydata/xarray/issues/798 | MDEyOklzc3VlQ29tbWVudDE5OTU0NTgzNg== | mrocklin 306380 | 2016-03-21T23:59:18Z | 2016-03-21T23:59:18Z | MEMBER | Copying over a comment from that issue: Yes, so the problem as I see it is that, for serialization and open-file reasons we want to use a function like the following:
However, this opens and closes many files, which, while robust, is slow. We can alleviate this by maintaining an LRU cache in a global variable so that it is created separately per process.

```python
from toolz import memoize

# LRUDict: an LRU mapping that calls on_eviction when entries fall out,
# here closing the file so the number of open handles stays bounded.
cache = LRUDict(size=100, on_eviction=lambda file: file.close())
netCDF4_Dataset = memoize(netCDF4.Dataset, cache=cache)

def get_chunk_of_array(filename, datapath, slice):
    f = netCDF4_Dataset(filename)
    return f.variables[datapath][slice]
```

I'm happy to supply the

We would then need to use such a function within the dask.array and xarray codebases. Anyway, that's one approach. Thoughts welcome. |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
Integration with dask/distributed (xarray backend design) 142498006 | |
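To make the "use such a function within dask.array" step concrete: a dask graph is just a plain dict mapping keys to tasks (tuples of a callable and its arguments), so each chunk-read task can simply call the cached reader. A sketch with a stub reader standing in for the netCDF-backed `get_chunk_of_array`; all names are hypothetical:

```python
# Stub reader: pretend each file variable holds range(0, 4000).
def read_chunk(filename, varname, sl):
    data = list(range(4000))
    return data[sl]

# One task per 1000-element chunk; a real graph would call the cached,
# lock-guarded reader instead of the stub.
dsk = {
    ('x', i): (read_chunk, 'myfile.nc', 'temperature',
               slice(i * 1000, (i + 1) * 1000))
    for i in range(4)
}

# A dask scheduler (threaded, multiprocessing, distributed) would run
# these tasks; executed naively for illustration:
results = {k: fn(*args) for k, (fn, *args) in dsk.items()}
```

Because each task re-opens its file through the per-process cache, workers never exceed the cache's limit on simultaneously open files.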
199544731 | https://github.com/pydata/xarray/issues/798#issuecomment-199544731 | https://api.github.com/repos/pydata/xarray/issues/798 | MDEyOklzc3VlQ29tbWVudDE5OTU0NDczMQ== | pwolfram 4295853 | 2016-03-21T23:57:27Z | 2016-03-21T23:57:27Z | CONTRIBUTOR | { "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
Integration with dask/distributed (xarray backend design) 142498006 | ||
199532452 | https://github.com/pydata/xarray/issues/798#issuecomment-199532452 | https://api.github.com/repos/pydata/xarray/issues/798 | MDEyOklzc3VlQ29tbWVudDE5OTUzMjQ1Mg== | pwolfram 4295853 | 2016-03-21T23:21:07Z | 2016-03-21T23:21:07Z | CONTRIBUTOR | The full mailing list discussion is at https://groups.google.com/d/msgid/xarray/CAJ8oX-E7Xx6NT4F6J8B4__Q-kBazoob9_qe_oFLi5hany9-%3DKQ%40mail.gmail.com?utm_medium=email&utm_source=footer |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
Integration with dask/distributed (xarray backend design) 142498006 |