html_url,issue_url,id,node_id,user,created_at,updated_at,author_association,body,reactions,performed_via_github_app,issue
https://github.com/pydata/xarray/issues/798#issuecomment-453800654,https://api.github.com/repos/pydata/xarray/issues/798,453800654,MDEyOklzc3VlQ29tbWVudDQ1MzgwMDY1NA==,2443309,2019-01-13T04:12:32Z,2019-01-13T04:12:32Z,MEMBER,Closing this old issue. The final checkbox in @pwolfram's original post was completed in #2261. ,"{""total_count"": 1, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 1, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,142498006
https://github.com/pydata/xarray/issues/798#issuecomment-305506896,https://api.github.com/repos/pydata/xarray/issues/798,305506896,MDEyOklzc3VlQ29tbWVudDMwNTUwNjg5Ng==,306380,2017-06-01T14:17:11Z,2017-06-01T14:17:11Z,MEMBER,"@shoyer regarding per-file locking, this probably only matters if we are writing as well, yes?
Here is a small implementation of a generic file-open cache. I haven't yet decided on an eviction policy, but either LRU or random (filtered by closeable files) should work OK.
```python
from collections import defaultdict
from contextlib import contextmanager
import threading

import h5py


class OpenCache(object):
    def __init__(self, maxsize=100):
        self.refcount = defaultdict(lambda: 0)
        self.maxsize = maxsize
        self.cache = {}
        self.lock = threading.Lock()

    @contextmanager
    def open(self, myopen, fn, mode='r'):
        assert 'r' in mode
        key = (myopen, fn, mode)
        with self.lock:
            try:
                file = self.cache[key]
            except KeyError:
                file = myopen(fn, mode=mode)
                self.cache[key] = file
            self.refcount[key] += 1
            if len(self.cache) > self.maxsize:
                # Clear old files intelligently: eviction policy TBD,
                # restricted to entries with refcount == 0
                pass
        try:
            yield file
        finally:
            with self.lock:
                self.refcount[key] -= 1


cache = OpenCache()

with cache.open(h5py.File, 'myfile.hdf5') as f:
    x = f['/data/x']
    y = x[:1000, :1000]
```
Is this still useful?
I'm curious to hear from users like @pwolfram and @rabernat, who may be running into the many-files problem, about what the current pain points are.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,142498006
https://github.com/pydata/xarray/issues/798#issuecomment-288415152,https://api.github.com/repos/pydata/xarray/issues/798,288415152,MDEyOklzc3VlQ29tbWVudDI4ODQxNTE1Mg==,306380,2017-03-22T14:26:08Z,2017-03-22T14:26:08Z,MEMBER,Has anyone used XArray on NetCDF data on a cluster without resorting to any tricks?,"{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,142498006
https://github.com/pydata/xarray/issues/798#issuecomment-263470325,https://api.github.com/repos/pydata/xarray/issues/798,263470325,MDEyOklzc3VlQ29tbWVudDI2MzQ3MDMyNQ==,306380,2016-11-29T04:02:05Z,2016-11-29T04:02:05Z,MEMBER,"A lock on the LRU cache makes sense to me.
> We need separate, per-file locks to ensure that we don't evict files that are in the process of reading or writing data (which would cause segfaults). As a stop-gap measure, we could simply refuse to evict files until we can acquire a lock, but more broadly this suggests that strict LRU is not quite right. Instead, we want to evict the least-recently-used unlocked item.
If it were me, I would just block on the evicted file until it becomes available (the stop-gap measure), at least until that became a performance problem.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,142498006
https://github.com/pydata/xarray/issues/798#issuecomment-263431065,https://api.github.com/repos/pydata/xarray/issues/798,263431065,MDEyOklzc3VlQ29tbWVudDI2MzQzMTA2NQ==,1217238,2016-11-28T23:42:54Z,2016-11-28T23:42:54Z,MEMBER,"@mrocklin Any thoughts on my thread safety concerns (https://github.com/pydata/xarray/issues/798#issuecomment-259202265) for the LRU cache? I suppose the simplest thing to do is to simply refuse to evict a file until the per-file lock is released, but I can see that strategy failing pretty badly in edge cases.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,142498006
https://github.com/pydata/xarray/issues/798#issuecomment-262236329,https://api.github.com/repos/pydata/xarray/issues/798,262236329,MDEyOklzc3VlQ29tbWVudDI2MjIzNjMyOQ==,306380,2016-11-22T13:10:03Z,2016-11-22T13:11:48Z,MEMBER,One solution is to create protocols on the Dask side to enable `dask.distributed.Client.persist` itself to work on XArray objects. This keeps scheduler-specific details like `persist` on the scheduler.,"{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,142498006
https://github.com/pydata/xarray/issues/798#issuecomment-259202265,https://api.github.com/repos/pydata/xarray/issues/798,259202265,MDEyOklzc3VlQ29tbWVudDI1OTIwMjI2NQ==,1217238,2016-11-08T17:27:55Z,2016-11-08T22:19:11Z,MEMBER,"A few other thoughts on thread safety with the LRU approach:
1. We need a global lock to ensure internal consistency of the LRU cache, and so that we don't overwrite files without closing them. It probably makes sense to put this in the `memoize` function.
2. We need separate, per-file locks to ensure that we don't evict files that are in the process of reading or writing data (which would cause segfaults). As a stop-gap measure, we could simply refuse to evict files until we can acquire a lock, but more broadly this suggests that strict LRU is not quite right. Instead, we want to evict the least-recently-used unlocked item, as sketched below.
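A minimal sketch of that least-recently-used-unlocked-item policy, assuming each cache entry stores its open file together with a hypothetical per-file `threading.Lock` (names here are illustrative, not actual xarray internals):

```python
import threading
from collections import OrderedDict

class LockAwareLRU(object):
    # Hypothetical cache that never evicts an entry whose per-file lock is held

    def __init__(self, maxsize, on_evict):
        self.maxsize = maxsize
        self.on_evict = on_evict           # e.g. lambda f: f.close()
        self.entries = OrderedDict()       # key -> (file, lock), oldest first
        self.lock = threading.Lock()       # the global lock from point 1

    def __setitem__(self, key, value):
        with self.lock:
            self.entries[key] = value      # value is a (file, lock) pair
            self.entries.move_to_end(key)
            if len(self.entries) > self.maxsize:
                # Walk from least- to most-recently used and evict the first
                # entry whose per-file lock (point 2) is free; if every file
                # is busy, temporarily exceed maxsize rather than block.
                for k, (file, file_lock) in list(self.entries.items()):
                    if file_lock.acquire(blocking=False):
                        try:
                            del self.entries[k]
                            self.on_evict(file)
                        finally:
                            file_lock.release()
                        break
```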
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,142498006
https://github.com/pydata/xarray/issues/798#issuecomment-259277405,https://api.github.com/repos/pydata/xarray/issues/798,259277405,MDEyOklzc3VlQ29tbWVudDI1OTI3NzQwNQ==,306380,2016-11-08T22:18:42Z,2016-11-08T22:18:42Z,MEMBER,"Yes.
On Tue, Nov 8, 2016 at 5:17 PM, Florian Rathgeber wrote:
> Great to see this moving! I take it the workshop was productive?
>
> How does #1095 (https://github.com/pydata/xarray/pull/1095) work in the scenario of a distributed scheduler with remote workers? Do I understand correctly that all workers and the client would need to see the same shared filesystem from where NetCDF files are read?
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,142498006
https://github.com/pydata/xarray/issues/798#issuecomment-259185165,https://api.github.com/repos/pydata/xarray/issues/798,259185165,MDEyOklzc3VlQ29tbWVudDI1OTE4NTE2NQ==,1217238,2016-11-08T16:28:13Z,2016-11-08T16:28:13Z,MEMBER,"One slight subtlety is writes: we'll need to switch from 'w' to 'a' mode the second time we open a file.
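A minimal sketch of that subtlety (the registry and function names are hypothetical, not xarray API):

```python
import netCDF4

_already_created = set()  # hypothetical per-process registry of written files

def open_for_write(filename):
    # The first open truncates ('w'); any reopen after an LRU eviction must
    # append ('a'), or we would clobber data already written.
    mode = 'a' if filename in _already_created else 'w'
    _already_created.add(filename)
    return netCDF4.Dataset(filename, mode=mode)
```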
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,142498006
https://github.com/pydata/xarray/issues/798#issuecomment-259181856,https://api.github.com/repos/pydata/xarray/issues/798,259181856,MDEyOklzc3VlQ29tbWVudDI1OTE4MTg1Ng==,306380,2016-11-08T16:17:20Z,2016-11-08T16:17:20Z,MEMBER,"FYI Dask is committed to maintaining this: https://github.com/dask/zict/blob/master/zict/lru.py
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,142498006
https://github.com/pydata/xarray/issues/798#issuecomment-259181526,https://api.github.com/repos/pydata/xarray/issues/798,259181526,MDEyOklzc3VlQ29tbWVudDI1OTE4MTUyNg==,1217238,2016-11-08T16:16:15Z,2016-11-08T16:16:15Z,MEMBER,"We have something very hacky working with https://github.com/pydata/xarray/pull/1095
I'm also going to see if I can get something working with the LRU cache, since that seems closer to the solution we want eventually.
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,142498006
https://github.com/pydata/xarray/issues/798#issuecomment-257918615,https://api.github.com/repos/pydata/xarray/issues/798,257918615,MDEyOklzc3VlQ29tbWVudDI1NzkxODYxNQ==,306380,2016-11-02T16:27:45Z,2016-11-02T16:27:45Z,MEMBER,"Custom serialization is in dask/distributed. This allows us to build custom serialization solutions like the following for `h5py.Dataset`: https://github.com/dask/distributed/pull/620/files
Any concerns would be very welcome. Earlier is better.
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,142498006
https://github.com/pydata/xarray/issues/798#issuecomment-257292279,https://api.github.com/repos/pydata/xarray/issues/798,257292279,MDEyOklzc3VlQ29tbWVudDI1NzI5MjI3OQ==,306380,2016-10-31T13:24:01Z,2016-10-31T14:49:31Z,MEMBER,"I may have a solution to this in https://github.com/dask/distributed/pull/606, which allows for custom serialization formats to be registered with dask.distributed. We would register serialize and deserialize functions for the various netCDF objects. Something like the following might work for h5py:
``` python
import h5py

# register_serialization comes from the custom-serialization work in
# dask/distributed linked above

def serialize_dataset(dset):
    header = {}
    # an h5py.Dataset knows its file and its path within that file
    frames = [dset.file.filename.encode(), dset.name.encode()]
    return header, frames

def deserialize_dataset(header, frames):
    filename, datapath = frames
    f = h5py.File(filename.decode())
    dset = f[datapath.decode()]
    return dset

register_serialization(h5py.Dataset, serialize_dataset, deserialize_dataset)
```
We still have lingering open files but not too many per machine. They'll move around the network, but only as necessary.
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,142498006
https://github.com/pydata/xarray/issues/798#issuecomment-257063608,https://api.github.com/repos/pydata/xarray/issues/798,257063608,MDEyOklzc3VlQ29tbWVudDI1NzA2MzYwOA==,1217238,2016-10-29T01:45:09Z,2016-10-29T01:45:09Z,MEMBER,"> Distributed Dask.array could possibly replace OpenDAP in some settings though
Yes, this sounds quite promising to me.
Using OpenDAP for communication is also possible, but if all we need to do is pass around serialized `xarray.Dataset` objects, using pickle or even raw bytes from netCDF files seems more promising.
","{""total_count"": 1, ""+1"": 1, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,142498006
https://github.com/pydata/xarray/issues/798#issuecomment-257063168,https://api.github.com/repos/pydata/xarray/issues/798,257063168,MDEyOklzc3VlQ29tbWVudDI1NzA2MzE2OA==,306380,2016-10-29T01:37:11Z,2016-10-29T01:37:11Z,MEMBER,"We could pull data from OpenDAP. Actually computing on those workers would probably be hard to integrate. Distributed Dask.array could possibly replace OpenDAP in some settings though, serving not only data, but also computation.
","{""total_count"": 1, ""+1"": 1, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,142498006
https://github.com/pydata/xarray/issues/798#issuecomment-256121613,https://api.github.com/repos/pydata/xarray/issues/798,256121613,MDEyOklzc3VlQ29tbWVudDI1NjEyMTYxMw==,306380,2016-10-25T18:20:58Z,2016-10-25T18:20:58Z,MEMBER,"You wouldn't.
On Tue, Oct 25, 2016 at 9:43 AM, Florian Rathgeber wrote:
> For the case where NetCDF / HDF5 files are only available on the distributed workers and not directly accessible from the client, how would you get the necessary metadata (coords, dims etc.) to construct the xarray.Dataset?
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,142498006
https://github.com/pydata/xarray/issues/798#issuecomment-255800363,https://api.github.com/repos/pydata/xarray/issues/798,255800363,MDEyOklzc3VlQ29tbWVudDI1NTgwMDM2Mw==,306380,2016-10-24T17:00:58Z,2016-10-24T17:00:58Z,MEMBER,"One alternative would be to define custom serialization for `netCDF4.Dataset` objects.
I've been toying with the idea of custom serialization for dask.distributed recently. This was originally intended to let Dask make some opinionated serialization choices for some common formats (usually so that we can serialize numpy arrays and pandas dataframes faster than their generic pickle implementations allow) but this might also be helpful here to allow us to serialize netCDF4.Dataset objects and friends.
We would define custom dumps and loads functions for netCDF4.Dataset objects that would presumably encode them as a filename and datapath. This would get around the open-many-files issue because the dataset would stay in the worker's `.data` dictionary while it was needed.
One concern is that there are reasons why netCDF4.Dataset objects are not serializable (see https://github.com/h5py/h5py/issues/531). I'm not sure if this would affect XArray workloads.
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,142498006
https://github.com/pydata/xarray/issues/798#issuecomment-255797423,https://api.github.com/repos/pydata/xarray/issues/798,255797423,MDEyOklzc3VlQ29tbWVudDI1NTc5NzQyMw==,1217238,2016-10-24T16:50:15Z,2016-10-24T16:50:15Z,MEMBER,"> We could possibly make an object that was API compatible with the subset of netCDF4.Dataset that you needed, but opened and closed the file whenever it actually pulled data. We would keep an LRU cache of open files around for efficiency as discussed earlier. In this case we could possibly optionally swap out the current netCDF4.Dataset object with this thing without much refactoring?
Yes, this could work for a proof of concept.
In the long term, it would be good to integrate this into xarray so we can support alternative backends (e.g., h5netcdf, scipy, pynio, loaders for the custom file formats that @rabernat and @pwolfram work with) in a fully consistent fashion without needing to make a separate wrapper for each.
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,142498006
https://github.com/pydata/xarray/issues/798#issuecomment-255796531,https://api.github.com/repos/pydata/xarray/issues/798,255796531,MDEyOklzc3VlQ29tbWVudDI1NTc5NjUzMQ==,306380,2016-10-24T16:46:39Z,2016-10-24T16:46:39Z,MEMBER,"We seem to be making good progress here on the issue. I'm also happy to switch to real-time voice at any point today or tomorrow if people prefer.
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,142498006
https://github.com/pydata/xarray/issues/798#issuecomment-255795874,https://api.github.com/repos/pydata/xarray/issues/798,255795874,MDEyOklzc3VlQ29tbWVudDI1NTc5NTg3NA==,306380,2016-10-24T16:44:10Z,2016-10-24T16:44:10Z,MEMBER,"We could possibly make an object that was API compatible with the subset of netCDF4.Dataset that you needed, but that opened and closed the file whenever it actually pulled data. We would keep an LRU cache of open files around for efficiency, as discussed earlier. In this case we could optionally swap out the current netCDF4.Dataset object for this thing without much refactoring?
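A minimal sketch of such a wrapper (hypothetical names, not actual xarray or netCDF4 API), built on a memoized opener along the lines discussed earlier in this thread:

```python
import netCDF4
from toolz import memoize

# Hypothetical memoized opener; a real version would sit behind an LRU cache
netCDF4_Dataset = memoize(netCDF4.Dataset)

class LazyNetCDFVariable(object):
    # Indexes like netCDF4.Variable, but holds no open file of its own

    def __init__(self, filename, name):
        self.filename = filename
        self.name = name

    def __getitem__(self, key):
        # (Re)open on demand; memoization makes repeated opens cheap
        ds = netCDF4_Dataset(self.filename)
        return ds.variables[self.name][key]
```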
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,142498006
https://github.com/pydata/xarray/issues/798#issuecomment-255794868,https://api.github.com/repos/pydata/xarray/issues/798,255794868,MDEyOklzc3VlQ29tbWVudDI1NTc5NDg2OA==,1217238,2016-10-24T16:40:09Z,2016-10-24T16:40:09Z,MEMBER,"@mrocklin OK, that makes sense. In that case, we might indeed need to thread this through xarray's backends.
Currently, backends open a file (e.g., with `netCDF4.Dataset`) and create an OrderedDict of `xarray.Variable` objects with lazy arrays that load from the file on demand. To load this data with dask, we pass these lazy arrays into `dask.array.from_array`.
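For illustration, a minimal sketch of that pattern (the lazy-array class below is a stand-in, not xarray's actual `LazilyIndexedArray`):

```python
import dask.array as da
import netCDF4
import numpy as np

class LazyVariable(object):
    # Stand-in lazy array: opens its file only when actually indexed

    def __init__(self, filename, name, shape, dtype):
        self.filename, self.name = filename, name
        self.shape, self.dtype = shape, np.dtype(dtype)
        self.ndim = len(shape)

    def __getitem__(self, key):
        with netCDF4.Dataset(self.filename) as ds:
            return ds.variables[self.name][key]

lazy = LazyVariable('myfile.nc', 'temperature', (365, 180, 360), 'f4')
arr = da.from_array(lazy, chunks=(30, 180, 360))  # each chunk indexes lazily
```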
This currently doesn't use `dask.delayed` for three reasons:
1. Historical: we wrote this system before dask existed.
2. Performance: our `LazilyIndexedArray` class is still more selective than `dask.array` for subsetting data from large chunks, which is essential for many interactive use cases. Despite getitem fusing, dask will sometimes load complete chunks. This is particularly true if we do some transformation of the array, of the sort that could be accomplished with dask's `map_blocks`. Using `LazilyIndexedArray` ensures that this only gets applied to loaded data. There are also performance benefits to keeping files open when possible (discussed above).
3. Dependencies: dask is still an optional dependency for xarray. I'd like to keep it that way, if possible.
It seems like a version of xarray's backends that doesn't always open files immediately would make them suitable for use in dask.distributed. So indeed, we'll need to do some serious refactoring.
One other thing that will need to be tackled eventually: `xarray.merge` and `xarray.concat` (used in `open_mfdataset`) still have some steps (checking for equality between arrays) that are applied sequentially. This is going to be a performance bottleneck when we start working with very large arrays. This really should be refactored such that dask can do these evaluations in a single step, rather than once per object. For now, this can be avoided in `concat` by using the `data_vars`/`coords` options.
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,142498006
https://github.com/pydata/xarray/issues/798#issuecomment-255788269,https://api.github.com/repos/pydata/xarray/issues/798,255788269,MDEyOklzc3VlQ29tbWVudDI1NTc4ODI2OQ==,306380,2016-10-24T16:15:59Z,2016-10-24T16:15:59Z,MEMBER,"The `futures_to_dask_arrays` function has been deprecated at this point. The standard way to produce a distributed dask.array from custom functions is as follows:
- Use dask.delayed to construct many lazy numpy arrays individually
- Wrap each of these into a single-chunk dask.array using `da.from_delayed(lazy_value, shape=..., dtype=...)`
- Use `da.stack` or `da.concatenate` to arrange these single-chunk dask.arrays into a larger dask.array.
The same approach could be used with XArray except that presumably we would need to do this for every relevant dataset within the NetCDF file.
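A minimal sketch of those three steps, assuming a list of netCDF files that each hold one (1000, 1000) piece of a variable named `data` (filenames and shapes illustrative):

```python
import dask
import dask.array as da
import netCDF4

def load(filename):
    # Open, read one whole chunk as a numpy array, then close: slow but
    # robust, and safe to serialize to remote workers
    with netCDF4.Dataset(filename) as ds:
        return ds.variables['data'][:]

filenames = ['file_%d.nc' % i for i in range(10)]
lazy_values = [dask.delayed(load)(fn) for fn in filenames]        # step 1
arrays = [da.from_delayed(v, shape=(1000, 1000), dtype='f8')
          for v in lazy_values]                                   # step 2
stacked = da.stack(arrays, axis=0)                                # step 3
```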
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,142498006
https://github.com/pydata/xarray/issues/798#issuecomment-255786548,https://api.github.com/repos/pydata/xarray/issues/798,255786548,MDEyOklzc3VlQ29tbWVudDI1NTc4NjU0OA==,1217238,2016-10-24T16:10:14Z,2016-10-24T16:10:40Z,MEMBER,"I'm happy to help work out a plan here.
It seems like there are basically two steps we need to make this happen:
1. Write the equivalent of `futures_to_dask_arrays` for `xarray.Dataset`, i.e., `futures_to_xarray_datasets_of_dask_arrays`.
2. Integrate this into xarray's higher level utility functions like `open_mfdataset`. This should be pretty easy after we have `futures_to_xarray_datasets_of_dask_arrays`.
It's an open question to what extent this needs to interact with xarray's internal `backends.DataStore` API, which handles the details of decoding files on disk to `xarray.Dataset` objects. I'm hopeful the answer is ""not very much"". The `DataStore` API is a bit cumbersome and overly complex, and could use a refactoring.
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,142498006
https://github.com/pydata/xarray/issues/798#issuecomment-255201276,https://api.github.com/repos/pydata/xarray/issues/798,255201276,MDEyOklzc3VlQ29tbWVudDI1NTIwMTI3Ng==,306380,2016-10-20T19:16:59Z,2016-10-20T19:16:59Z,MEMBER,"I agree that we should discuss it at the workshop. I also think it's possible that this could be accomplished by the right person (or combination of people) in a few hours. If so, I think we should come with it in hand as a capability that exists rather than a capability that should exist.
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,142498006
https://github.com/pydata/xarray/issues/798#issuecomment-255200677,https://api.github.com/repos/pydata/xarray/issues/798,255200677,MDEyOklzc3VlQ29tbWVudDI1NTIwMDY3Nw==,1197350,2016-10-20T19:14:34Z,2016-10-20T19:14:34Z,MEMBER,"This is a really important idea that has the potential to accelerate xarray from ""medium data"" to ""big data"". It should be planned out thoughtfully.
My view is that we should implement a new DataStore class to handle distributed datasets. This could live in the xarray backend, or it could be a standalone package. Such a data store could be the foundation of a powerful platform for big-climate-data analysis. (Or maybe I am thinking too ambitiously.)
I think the upcoming [aospy workshop](https://rabernat.github.io/aospy-workshop/) will be an ideal opportunity to discuss this, since many of the people on this thread will be face-to-face.
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,142498006
https://github.com/pydata/xarray/issues/798#issuecomment-255194191,https://api.github.com/repos/pydata/xarray/issues/798,255194191,MDEyOklzc3VlQ29tbWVudDI1NTE5NDE5MQ==,306380,2016-10-20T18:48:53Z,2016-10-20T18:48:53Z,MEMBER,"I agree that this conversation needs expertise from a core xarray developer. I suspect that this change is more likely to happen in xarray than in dask.array. Happy to continue the conversation wherever. I do have a slight preference to switch to real-time at some point though. I suspect that we can hash this out in a moderate number of minutes.
","{""total_count"": 1, ""+1"": 1, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,142498006
https://github.com/pydata/xarray/issues/798#issuecomment-255190556,https://api.github.com/repos/pydata/xarray/issues/798,255190556,MDEyOklzc3VlQ29tbWVudDI1NTE5MDU1Ng==,306380,2016-10-20T18:35:27Z,2016-10-20T18:35:27Z,MEMBER,"If XArray devs want to chat sometime I suspect we could hammer out a plan fairly quickly. My hope is that once a plan exists then a developer will arise to implement that plan. I'm free all of today and tomorrow.
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,142498006
https://github.com/pydata/xarray/issues/798#issuecomment-255190289,https://api.github.com/repos/pydata/xarray/issues/798,255190289,MDEyOklzc3VlQ29tbWVudDI1NTE5MDI4OQ==,306380,2016-10-20T18:34:35Z,2016-10-20T18:34:35Z,MEMBER,"Definitely happy to support from the Dask side.
I think that the LRU method described above is feasible.
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,142498006
https://github.com/pydata/xarray/issues/798#issuecomment-255187606,https://api.github.com/repos/pydata/xarray/issues/798,255187606,MDEyOklzc3VlQ29tbWVudDI1NTE4NzYwNg==,306380,2016-10-20T18:24:10Z,2016-10-20T18:24:10Z,MEMBER,"I haven't worked on this but agree that it is important.
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,142498006
https://github.com/pydata/xarray/issues/798#issuecomment-209005106,https://api.github.com/repos/pydata/xarray/issues/798,209005106,MDEyOklzc3VlQ29tbWVudDIwOTAwNTEwNg==,306380,2016-04-12T16:55:02Z,2016-04-12T16:55:02Z,MEMBER,"It's probably best to avoid futures within `xarray`; so far they're only in the distributed memory scheduler. I think that ideally we create graphs that can be used robustly in either scheduler. I think that the memoized `netCDF4_Dataset` approach can _probably_ do this just fine. Is there anything that is needed from me to help push this forward?
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,142498006
https://github.com/pydata/xarray/issues/798#issuecomment-205492861,https://api.github.com/repos/pydata/xarray/issues/798,205492861,MDEyOklzc3VlQ29tbWVudDIwNTQ5Mjg2MQ==,1217238,2016-04-04T20:54:42Z,2016-04-04T20:54:42Z,MEMBER,"> @shoyer, if we are happy to open all netCDF files and read out the metadata from a master process, that would imply that we would open a file, read the metadata, and then close it, correct?
>
> Array access should then follow something like @mrocklin's netcdf_Dataset approach, right?
Yes, this is correct.
In principle, if we have a very large number of files containing many variables each, we might want to do the read in parallel using futures, and then use something like `futures_to_dask_arrays` to bring them together. That seems much trickier to integrate into our current backend approach.
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,142498006
https://github.com/pydata/xarray/issues/798#issuecomment-205484614,https://api.github.com/repos/pydata/xarray/issues/798,205484614,MDEyOklzc3VlQ29tbWVudDIwNTQ4NDYxNA==,1217238,2016-04-04T20:40:58Z,2016-04-04T20:40:58Z,MEMBER,"@pwolfram I was referring to [this comment](https://github.com/pydata/xarray/issues/798#issuecomment-199545836) for @mrocklin's `netCDF4_Dataset`.
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,142498006
https://github.com/pydata/xarray/issues/798#issuecomment-205375803,https://api.github.com/repos/pydata/xarray/issues/798,205375803,MDEyOklzc3VlQ29tbWVudDIwNTM3NTgwMw==,1217238,2016-04-04T16:25:03Z,2016-04-04T16:25:03Z,MEMBER,"> I think the LRU dict has to be a global because the file restriction is an attribute of the system, correct?
Correct, the LRU dict should be global. I believe the file restriction is generally per-process, and creating a global dict should ensure that this works properly.
> For each read from a file, ensure it hasn't been closed via a @ds.getter property method. If so, reopen it via the LRU cache. This is ok because for a read the file is essentially read-only. The LRU closes out stale entries to prevent the too many open file errors. Checking this should be fast.
The challenge is that we only call the `.get_variables()` method (and hence `self.ds`) once on a DataStore when a Dataset is opened from disk. I think we need to refactor `NetCDF4ArrayWrapper` to take a filename instead, and use something like @mrocklin's `netcdf_Dataset`.
My bigger concern was how to make use of a method like `futures_to_dask_arrays`. But it looks like that may actually not be necessary, at least if we are happy to open all netCDF files (and read out the metadata) from a master process.
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,142498006
https://github.com/pydata/xarray/issues/798#issuecomment-205370162,https://api.github.com/repos/pydata/xarray/issues/798,205370162,MDEyOklzc3VlQ29tbWVudDIwNTM3MDE2Mg==,10050469,2016-04-04T16:08:57Z,2016-04-04T16:08:57Z,MEMBER,"Sorry if I am just producing noise here (I am not a specialist), but I have two naive questions:
To 1: how will you handle concurrent access to the LRU cache if it's a global variable?
To 2: once a file has been closed by the LRU, won't it also be erased from it, so that a simple `if file in LRU:` would suffice as a test of whether the file has been closed?
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,142498006
https://github.com/pydata/xarray/issues/798#issuecomment-204813232,https://api.github.com/repos/pydata/xarray/issues/798,204813232,MDEyOklzc3VlQ29tbWVudDIwNDgxMzIzMg==,306380,2016-04-02T22:29:04Z,2016-04-02T22:29:04Z,MEMBER,"FWIW I've uploaded a tiny LRU dict implementation to a new `zict` project (which also has some other stuff):
http://zict.readthedocs.org/en/latest/
```
pip install zict
```
``` python
from zict import LRU
d = LRU(100, dict())
```
There are a number of good alternatives out there for LRU dictionaries, though.
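For the use case in this thread, a sketch of how `zict.LRU` might close files on eviction (this assumes zict's `on_evict` callback, which receives the evicted key and value; the exact API may differ):

```python
import netCDF4
from zict import LRU

# Values are open netCDF4.Dataset handles; close each one as it is evicted
open_files = LRU(100, dict(), on_evict=lambda key, f: f.close())

def get_dataset(filename):
    if filename not in open_files:
        open_files[filename] = netCDF4.Dataset(filename)
    return open_files[filename]  # lookup also refreshes recency
```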
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,142498006
https://github.com/pydata/xarray/issues/798#issuecomment-201134785,https://api.github.com/repos/pydata/xarray/issues/798,201134785,MDEyOklzc3VlQ29tbWVudDIwMTEzNDc4NQ==,1217238,2016-03-25T04:54:09Z,2016-03-25T04:54:09Z,MEMBER,"I agree with @mrocklin that the LRUCache for file-like objects should take care of things from the dask.array perspective. It should also solve https://github.com/pydata/xarray/issues/463 in a very clean way. We'll just need to reorganize things a bit to make use of it.
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,142498006
https://github.com/pydata/xarray/issues/798#issuecomment-200901275,https://api.github.com/repos/pydata/xarray/issues/798,200901275,MDEyOklzc3VlQ29tbWVudDIwMDkwMTI3NQ==,306380,2016-03-24T16:00:52Z,2016-03-24T16:00:52Z,MEMBER,"I believe that robustly supporting HDF/NetCDF reads with the mechanism mentioned above will resolve most problems from a dask.array perspective. I have no doubt that other things will arise though. Switching from shared to distributed memory always comes with (surmountable) obstacles.
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,142498006
https://github.com/pydata/xarray/issues/798#issuecomment-199545836,https://api.github.com/repos/pydata/xarray/issues/798,199545836,MDEyOklzc3VlQ29tbWVudDE5OTU0NTgzNg==,306380,2016-03-21T23:59:18Z,2016-03-21T23:59:18Z,MEMBER,"Copying over [a comment](https://github.com/dask/dask/issues/922#issuecomment-199085431) from that issue:
Yes, so the problem as I see it is that, for serialization and open-file reasons, we want to use a function like the following:
``` python
import netCDF4

def get_chunk_of_array(filename, datapath, slice):
    with netCDF4.Dataset(filename) as f:
        return f.variables[datapath][slice]
```
However, this opens and closes many files, which, while robust, is slow. We can alleviate this by maintaining an LRU cache in a global variable so that it is created separately per process.
``` python
import netCDF4
from toolz import memoize

# LRUDict: a bounded dict with an eviction callback (implementation TBD)
cache = LRUDict(size=100, on_eviction=lambda file: file.close())

netCDF4_Dataset = memoize(netCDF4.Dataset, cache=cache)

def get_chunk_of_array(filename, datapath, slice):
    f = netCDF4_Dataset(filename)
    return f.variables[datapath][slice]
I'm happy to supply the `memoize` function via `toolz` and an appropriate `LRUDict` object via other microprojects that I can publish if necessary.
We would then need to use such a function within the dask.array and xarray codebases.
Anyway, that's one approach. Thoughts welcome.
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,142498006