issue_comments

59 rows where issue = 142498006 sorted by updated_at descending


user 8

  • mrocklin 23
  • pwolfram 15
  • shoyer 12
  • kynan 5
  • rabernat 1
  • jhamman 1
  • delgadom 1
  • fmaussion 1

author_association 3

  • MEMBER 38
  • CONTRIBUTOR 16
  • NONE 5

issue 1

  • Integration with dask/distributed (xarray backend design) · 59
id html_url issue_url node_id user created_at updated_at ▲ author_association body reactions performed_via_github_app issue
453800654 https://github.com/pydata/xarray/issues/798#issuecomment-453800654 https://api.github.com/repos/pydata/xarray/issues/798 MDEyOklzc3VlQ29tbWVudDQ1MzgwMDY1NA== jhamman 2443309 2019-01-13T04:12:32Z 2019-01-13T04:12:32Z MEMBER

Closing this old issue. The final checkbox in @pwolfram's original post was completed in #2261.

{
    "total_count": 1,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 1,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Integration with dask/distributed (xarray backend design) 142498006
305506896 https://github.com/pydata/xarray/issues/798#issuecomment-305506896 https://api.github.com/repos/pydata/xarray/issues/798 MDEyOklzc3VlQ29tbWVudDMwNTUwNjg5Ng== mrocklin 306380 2017-06-01T14:17:11Z 2017-06-01T14:17:11Z MEMBER

@shoyer regarding per-file locking this probably only matters if we are writing as well, yes?

Here is a small implementation of a generic file-open cache. I haven't yet decided on an eviction policy, but either LRU or random (filtered by closeable files) should work OK.

```python
from collections import defaultdict
from contextlib import contextmanager
import threading

import h5py

class OpenCache(object):
    def __init__(self, maxsize=100):
        self.refcount = defaultdict(lambda: 0)
        self.maxsize = maxsize
        self.cache = {}
        self.i = 0
        self.lock = threading.Lock()

    @contextmanager
    def open(self, myopen, fn, mode='r'):
        assert 'r' in mode
        key = (myopen, fn, mode)
        with self.lock:
            try:
                file = self.cache[key]
            except KeyError:
                file = myopen(fn, mode=mode)
                self.cache[key] = file

            self.refcount[key] += 1

            if len(self.cache) > self.maxsize:
                pass  # TODO: clear old (closeable) files intelligently

        try:
            yield file
        finally:
            with self.lock:
                self.refcount[key] -= 1

cache = OpenCache()
with cache.open(h5py.File, 'myfile.hdf5') as f:
    x = f['/data/x']
    y = x[:1000, :1000]
```

Is this still useful?

I'm curious to hear from users like @pwolfram and @rabernat who may be running into the many file problem about what the current pain points are.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Integration with dask/distributed (xarray backend design) 142498006
288415152 https://github.com/pydata/xarray/issues/798#issuecomment-288415152 https://api.github.com/repos/pydata/xarray/issues/798 MDEyOklzc3VlQ29tbWVudDI4ODQxNTE1Mg== mrocklin 306380 2017-03-22T14:26:08Z 2017-03-22T14:26:08Z MEMBER

Has anyone used XArray on NetCDF data on a cluster without resorting to any tricks?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Integration with dask/distributed (xarray backend design) 142498006
288414396 https://github.com/pydata/xarray/issues/798#issuecomment-288414396 https://api.github.com/repos/pydata/xarray/issues/798 MDEyOklzc3VlQ29tbWVudDI4ODQxNDM5Ng== pwolfram 4295853 2017-03-22T14:23:45Z 2017-03-22T14:23:45Z CONTRIBUTOR

@mrocklin and @shoyer, we now have dask.distributed and xarray support. Should this issue be closed?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Integration with dask/distributed (xarray backend design) 142498006
263470325 https://github.com/pydata/xarray/issues/798#issuecomment-263470325 https://api.github.com/repos/pydata/xarray/issues/798 MDEyOklzc3VlQ29tbWVudDI2MzQ3MDMyNQ== mrocklin 306380 2016-11-29T04:02:05Z 2016-11-29T04:02:05Z MEMBER

A lock on the LRU cache makes sense to me.

We need separate, per-file locks to ensure that we don't evict files in the process of reading or writing data from them (which would cause segfaults). As a stop-gap measure, we could simply refuse to evict files until we can acquire a lock, but more broadly this suggests that strict LRU is not quite right. Instead, we want to evict the least-recently-used unlocked item.

If it were me I would just block on the evicted file until it becomes available (the stop-gap measure) until it became a performance problem.
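A minimal sketch of that eviction policy, assuming per-file threading.Lock objects kept alongside the cache (all names here are illustrative, not from xarray or dask):

```python
import threading
from collections import OrderedDict

class LockAwareLRU:
    """Sketch: evict the least-recently-used entry whose lock is free."""

    def __init__(self, maxsize):
        self.maxsize = maxsize
        self.cache = OrderedDict()  # key -> open file handle, in LRU order
        self.locks = {}             # key -> threading.Lock guarding that file

    def evict_one(self):
        # Walk from least- to most-recently-used, skipping locked files.
        for key in list(self.cache):
            lock = self.locks[key]
            if lock.acquire(blocking=False):  # only evict unlocked files
                try:
                    self.cache.pop(key).close()
                finally:
                    lock.release()
                return True
        return False  # every open file is locked; caller may block and retry
```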

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Integration with dask/distributed (xarray backend design) 142498006
263431065 https://github.com/pydata/xarray/issues/798#issuecomment-263431065 https://api.github.com/repos/pydata/xarray/issues/798 MDEyOklzc3VlQ29tbWVudDI2MzQzMTA2NQ== shoyer 1217238 2016-11-28T23:42:54Z 2016-11-28T23:42:54Z MEMBER

@mrocklin Any thoughts on my thread safety concerns (https://github.com/pydata/xarray/issues/798#issuecomment-259202265) for the LRU cache? I suppose the simplest thing to do is to simply refuse to evict a file until the per-file lock is released, but I can see that strategy failing pretty badly in edge cases.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Integration with dask/distributed (xarray backend design) 142498006
262236329 https://github.com/pydata/xarray/issues/798#issuecomment-262236329 https://api.github.com/repos/pydata/xarray/issues/798 MDEyOklzc3VlQ29tbWVudDI2MjIzNjMyOQ== mrocklin 306380 2016-11-22T13:10:03Z 2016-11-22T13:11:48Z MEMBER

One solution is to create protocols on the Dask side to enable dask.distributed.Client.persist itself to work on XArray objects. This keeps scheduler-specific details like persist on the scheduler.
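For illustration, here is roughly what that would look like from the user's side. This is a hedged sketch assuming xarray objects implement a dask collection protocol (which did not exist at the time of this comment), so Client.persist can operate on a chunked Dataset directly:

```python
import numpy as np
import xarray as xr
from dask.distributed import Client

client = Client()  # local cluster, for illustration only

# A small Dataset backed by dask arrays.
ds = xr.Dataset({"t": ("x", np.arange(10))}).chunk({"x": 5})

# With such a protocol in place, persist keeps the computed chunks
# in distributed worker memory and returns a new Dataset pointing at them.
ds_persisted = client.persist(ds)
```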

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Integration with dask/distributed (xarray backend design) 142498006
262214999 https://github.com/pydata/xarray/issues/798#issuecomment-262214999 https://api.github.com/repos/pydata/xarray/issues/798 MDEyOklzc3VlQ29tbWVudDI2MjIxNDk5OQ== kynan 346079 2016-11-22T11:18:56Z 2016-11-22T11:18:56Z NONE

When using xarray with the dask.distributed scheduler it would be useful to be able to persist intermediate DataArrays / Datasets on remote workers.

There could be a persist method analogous to the compute method introduced in #1024. Potential issues with this approach are:

  1. What are the semantics of this operation for the general case where dask or distributed are not used?
  2. Is it justified to add an operation which is rather specific to the distributed scheduler?

(Could create a separate issue for this if preferred).

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Integration with dask/distributed (xarray backend design) 142498006
259202265 https://github.com/pydata/xarray/issues/798#issuecomment-259202265 https://api.github.com/repos/pydata/xarray/issues/798 MDEyOklzc3VlQ29tbWVudDI1OTIwMjI2NQ== shoyer 1217238 2016-11-08T17:27:55Z 2016-11-08T22:19:11Z MEMBER

A few other thoughts on thread safety with the LRU approach:

  1. We need a global lock to ensure internal consistency of the LRU cache, and so that we don't overwrite files without closing them. It probably makes sense to put this in the memoize function.
  2. We need separate, per-file locks to ensure that we don't evict files in the process of reading or writing data from them (which would cause segfaults). As a stop-gap measure, we could simply refuse to evict files until we can acquire a lock, but more broadly this suggests that strict LRU is not quite right. Instead, we want to evict the least-recently-used unlocked item.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Integration with dask/distributed (xarray backend design) 142498006
259277405 https://github.com/pydata/xarray/issues/798#issuecomment-259277405 https://api.github.com/repos/pydata/xarray/issues/798 MDEyOklzc3VlQ29tbWVudDI1OTI3NzQwNQ== mrocklin 306380 2016-11-08T22:18:42Z 2016-11-08T22:18:42Z MEMBER

Yes.

On Tue, Nov 8, 2016 at 5:17 PM, Florian Rathgeber notifications@github.com wrote:

Great to see this moving! I take it the workshop was productive?

How does #1095 https://github.com/pydata/xarray/pull/1095 work in the scenario of a distributed scheduler with remote workers? Do I understand correctly that all workers and the client would need to see the same shared filesystem from where NetCDF files are read?


{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Integration with dask/distributed (xarray backend design) 142498006
259277067 https://github.com/pydata/xarray/issues/798#issuecomment-259277067 https://api.github.com/repos/pydata/xarray/issues/798 MDEyOklzc3VlQ29tbWVudDI1OTI3NzA2Nw== kynan 346079 2016-11-08T22:17:14Z 2016-11-08T22:17:14Z NONE

Great to see this moving! I take it the workshop was productive?

How does #1095 work in the scenario of a distributed scheduler with remote workers? Do I understand correctly that all workers and the client would need to see the same shared filesystem from where NetCDF files are read?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Integration with dask/distributed (xarray backend design) 142498006
259185165 https://github.com/pydata/xarray/issues/798#issuecomment-259185165 https://api.github.com/repos/pydata/xarray/issues/798 MDEyOklzc3VlQ29tbWVudDI1OTE4NTE2NQ== shoyer 1217238 2016-11-08T16:28:13Z 2016-11-08T16:28:13Z MEMBER

One slight subtlety is writes. We'll need to switch from 'w' to 'a' mode the second time we open a file.

On Tue, Nov 8, 2016 at 8:17 AM, Matthew Rocklin notifications@github.com wrote:

FYI Dask is committed to maintaining this: https://github.com/dask/zict/blob/master/zict/lru.py

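To illustrate the 'w'-to-'a' subtlety above, a minimal hedged sketch (names are illustrative; opener stands in for something like netCDF4.Dataset):

```python
# Track which files this process has already truncated, so a reopen
# after cache eviction appends instead of clobbering earlier writes.
opened_before = set()

def open_for_write(opener, filename):
    mode = 'a' if filename in opened_before else 'w'
    opened_before.add(filename)
    return opener(filename, mode=mode)
```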

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Integration with dask/distributed (xarray backend design) 142498006
259181856 https://github.com/pydata/xarray/issues/798#issuecomment-259181856 https://api.github.com/repos/pydata/xarray/issues/798 MDEyOklzc3VlQ29tbWVudDI1OTE4MTg1Ng== mrocklin 306380 2016-11-08T16:17:20Z 2016-11-08T16:17:20Z MEMBER

FYI Dask is committed to maintaining this: https://github.com/dask/zict/blob/master/zict/lru.py

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Integration with dask/distributed (xarray backend design) 142498006
259181526 https://github.com/pydata/xarray/issues/798#issuecomment-259181526 https://api.github.com/repos/pydata/xarray/issues/798 MDEyOklzc3VlQ29tbWVudDI1OTE4MTUyNg== shoyer 1217238 2016-11-08T16:16:15Z 2016-11-08T16:16:15Z MEMBER

We have something very hacky working with https://github.com/pydata/xarray/pull/1095

I'm also going to see if I can get something working with the LRU cache, since that seems closer to the solution we want eventually.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Integration with dask/distributed (xarray backend design) 142498006
257918615 https://github.com/pydata/xarray/issues/798#issuecomment-257918615 https://api.github.com/repos/pydata/xarray/issues/798 MDEyOklzc3VlQ29tbWVudDI1NzkxODYxNQ== mrocklin 306380 2016-11-02T16:27:45Z 2016-11-02T16:27:45Z MEMBER

Custom serialization is in dask/distributed. This allows us to build custom serialization solutions like the following for h5py.Dataset: https://github.com/dask/distributed/pull/620/files

Any concerns would be very welcome. Earlier is better.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Integration with dask/distributed (xarray backend design) 142498006
257292279 https://github.com/pydata/xarray/issues/798#issuecomment-257292279 https://api.github.com/repos/pydata/xarray/issues/798 MDEyOklzc3VlQ29tbWVudDI1NzI5MjI3OQ== mrocklin 306380 2016-10-31T13:24:01Z 2016-10-31T14:49:31Z MEMBER

I may have a solution to this in https://github.com/dask/distributed/pull/606, which allows for custom serialization formats to be registered with dask.distributed. We would register serialize and deserialize functions for the various netCDF objects. Something like the following might work for h5py:

```python
def serialize_dataset(dset):
    header = {}
    frames = [dset.filename.encode(), dset.datapath.encode()]
    return header, frames

def deserialize_dataset(header, frames):
    filename, datapath = frames
    f = h5py.File(filename.decode())
    dset = f[datapath.decode()]
    return dset

register_serialization(h5py.Dataset, serialize_dataset, deserialize_dataset)
```

We still have lingering open files but not too many per machine. They'll move around the network, but only as necessary.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Integration with dask/distributed (xarray backend design) 142498006
257063608 https://github.com/pydata/xarray/issues/798#issuecomment-257063608 https://api.github.com/repos/pydata/xarray/issues/798 MDEyOklzc3VlQ29tbWVudDI1NzA2MzYwOA== shoyer 1217238 2016-10-29T01:45:09Z 2016-10-29T01:45:09Z MEMBER

Distributed Dask.array could possibly replace OpenDAP in some settings though

Yes, this sounds quite promising to me.

Using OpenDAP for communication is also possible, but if all we need to do is pass around serialized xarray.Dataset objects, using pickle or even bytes from netCDF files seems more promising.

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Integration with dask/distributed (xarray backend design) 142498006
257063168 https://github.com/pydata/xarray/issues/798#issuecomment-257063168 https://api.github.com/repos/pydata/xarray/issues/798 MDEyOklzc3VlQ29tbWVudDI1NzA2MzE2OA== mrocklin 306380 2016-10-29T01:37:11Z 2016-10-29T01:37:11Z MEMBER

We could pull data from OpenDAP. Actually computing on those workers would probably be hard to integrate. Distributed Dask.array could possibly replace OpenDAP in some settings though, serving not only data, but also computation.

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Integration with dask/distributed (xarray backend design) 142498006
257062571 https://github.com/pydata/xarray/issues/798#issuecomment-257062571 https://api.github.com/repos/pydata/xarray/issues/798 MDEyOklzc3VlQ29tbWVudDI1NzA2MjU3MQ== delgadom 3698640 2016-10-29T01:26:47Z 2016-10-29T01:26:47Z CONTRIBUTOR

Could this extend the OPeNDAP interface? That would solve the metadata problem and would provide quick access to the distributed workers.

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Integration with dask/distributed (xarray backend design) 142498006
256121613 https://github.com/pydata/xarray/issues/798#issuecomment-256121613 https://api.github.com/repos/pydata/xarray/issues/798 MDEyOklzc3VlQ29tbWVudDI1NjEyMTYxMw== mrocklin 306380 2016-10-25T18:20:58Z 2016-10-25T18:20:58Z MEMBER

You wouldn't

On Tue, Oct 25, 2016 at 9:43 AM, Florian Rathgeber <notifications@github.com> wrote:

For the case where NetCDF / HDF5 files are only available on the distributed workers and not directly accessible from the client, how would you get the necessary metadata (coords, dims etc.) to construct the xarray.Dataset?


{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Integration with dask/distributed (xarray backend design) 142498006
256038226 https://github.com/pydata/xarray/issues/798#issuecomment-256038226 https://api.github.com/repos/pydata/xarray/issues/798 MDEyOklzc3VlQ29tbWVudDI1NjAzODIyNg== kynan 346079 2016-10-25T13:43:32Z 2016-10-25T13:43:32Z NONE

For the case where NetCDF / HDF5 files are only available on the distributed workers and not directly accessible from the client, how would you get the necessary metadata (coords, dims etc.) to construct the xarray.Dataset?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Integration with dask/distributed (xarray backend design) 142498006
255800363 https://github.com/pydata/xarray/issues/798#issuecomment-255800363 https://api.github.com/repos/pydata/xarray/issues/798 MDEyOklzc3VlQ29tbWVudDI1NTgwMDM2Mw== mrocklin 306380 2016-10-24T17:00:58Z 2016-10-24T17:00:58Z MEMBER

One alternative would be to define custom serialization for netCDF4.Dataset objects.

I've been toying with the idea of custom serialization for dask.distributed recently. This was originally intended to let Dask make some opinionated serialization choices for some common formats (usually so that we can serialize numpy arrays and pandas dataframes faster than their generic pickle implementations allow) but this might also be helpful here to allow us to serialize netCDF4.Dataset objects and friends.

We would define custom dumps and loads functions for netCDF4.Dataset objects that would presumably encode them as a filename and datapath. This would get around the open-many-files issue because the dataset would stay in the worker's .data dictionary while it was needed.

One concern is that there are reasons why netCDF4.Dataset objects are not serializable (see https://github.com/h5py/h5py/issues/531). I'm not sure if this would affect XArray workloads.
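A hedged sketch of that dumps/loads idea, encoding an open netCDF4.Dataset as just its on-disk path (function names are illustrative, and filepath() requires a reasonably recent netCDF library):

```python
import netCDF4

def dumps_dataset(ds):
    # Encode an open Dataset as just its path; no data moves over the wire.
    return ds.filepath().encode()

def loads_dataset(payload):
    # Reopen the same file on the receiving worker.
    return netCDF4.Dataset(payload.decode())
```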

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Integration with dask/distributed (xarray backend design) 142498006
255797423 https://github.com/pydata/xarray/issues/798#issuecomment-255797423 https://api.github.com/repos/pydata/xarray/issues/798 MDEyOklzc3VlQ29tbWVudDI1NTc5NzQyMw== shoyer 1217238 2016-10-24T16:50:15Z 2016-10-24T16:50:15Z MEMBER

We could possibly make an object that was API compatible with the subset of netCDF4.Dataset that you needed, but opened and closed the file whenever it actually pulled data. We would keep an LRU cache of open files around for efficiency as discussed earlier. In this case we could possibly optionally swap out the current netCDF4.Dataset object with this thing without much refactoring?

Yes, this could work for a proof of concept.

In the long term, it would be good to integrate this into xarray so we can support alternative backends (e.g., h5netcdf, scipy, pynio, loaders for the custom file formats that @rabernat and @pwolfram work with) in a fully consistent fashion without needing to make a separate wrapper for each.
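A minimal sketch of the wrapper idea under discussion (class and method names are hypothetical; a real version would route the open through the LRU cache of open files rather than opening and closing on every access):

```python
import netCDF4

class ReopeningDataset:
    """Mimics the needed subset of netCDF4.Dataset, but holds no open
    handle: the file is opened only while data is actually being pulled."""

    def __init__(self, filename):
        self.filename = filename

    def get_values(self, variable_name, key):
        f = netCDF4.Dataset(self.filename, mode='r')
        try:
            return f.variables[variable_name][key]
        finally:
            f.close()
```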

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Integration with dask/distributed (xarray backend design) 142498006
255796531 https://github.com/pydata/xarray/issues/798#issuecomment-255796531 https://api.github.com/repos/pydata/xarray/issues/798 MDEyOklzc3VlQ29tbWVudDI1NTc5NjUzMQ== mrocklin 306380 2016-10-24T16:46:39Z 2016-10-24T16:46:39Z MEMBER

We seem to be making good progress here on the issue. I'm also happy to switch to real-time voice at any point today or tomorrow if people prefer.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Integration with dask/distributed (xarray backend design) 142498006
255795874 https://github.com/pydata/xarray/issues/798#issuecomment-255795874 https://api.github.com/repos/pydata/xarray/issues/798 MDEyOklzc3VlQ29tbWVudDI1NTc5NTg3NA== mrocklin 306380 2016-10-24T16:44:10Z 2016-10-24T16:44:10Z MEMBER

We could possibly make an object that was API compatible with the subset of netCDF4.Dataset that you needed, but opened and closed the file whenever it actually pulled data. We would keep an LRU cache of open files around for efficiency as discussed earlier. In this case we could possibly optionally swap out the current netCDF4.Dataset object with this thing without much refactoring?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Integration with dask/distributed (xarray backend design) 142498006
255794868 https://github.com/pydata/xarray/issues/798#issuecomment-255794868 https://api.github.com/repos/pydata/xarray/issues/798 MDEyOklzc3VlQ29tbWVudDI1NTc5NDg2OA== shoyer 1217238 2016-10-24T16:40:09Z 2016-10-24T16:40:09Z MEMBER

@mrocklin OK, that makes sense. In that case, we might indeed need to thread this through xarray's backends.

Currently, backends open a file (e.g., with netCDF4.Dataset) and create an OrderedDict of xarray.Variable objects with lazy arrays that load from the file on demand. To load this data with dask, pass these lazy arrays into dask.array.from_array.
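For example (a sketch: the numpy array below is a stand-in for one of those lazy backend arrays, since anything with shape, dtype, and getitem works):

```python
import numpy as np
import dask.array as da

lazy_array = np.arange(1_000_000).reshape(1000, 1000)  # stand-in for a lazy array
x = da.from_array(lazy_array, chunks=(250, 250))  # dask slices it on demand
```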

This currently doesn't use dask.delayed for three reasons:

  1. Historical: we wrote this system before dask existed.
  2. Performance: our LazilyIndexedArray class is still more selective than dask.array for subsetting data from large chunks, which is essential for many interactive use cases. Despite getitem fusing, dask will sometimes load complete chunks. This is particularly true if we do some transformation of the array, of the sort that could be accomplished with dask's map_blocks. Using LazilyIndexedArray ensures that this only gets applied to loaded data. There are also performance benefits to keeping files open when possible (discussed above).
  3. Dependencies: dask is still an optional dependency for xarray. I'd like to keep it that way, if possible.

It seems like a version of xarray's backends that doesn't always open files immediately would make it suitable for use in dask.distributed. So indeed, we'll need to do some serious refactoring.

One other thing that will need to be tackled eventually: xarray.merge and xarray.concat (used in open_mfdataset) still have some steps (checking for equality between arrays) that are applied sequentially. This is going to be a performance bottleneck when we start working with very large arrays. This really should be refactored such that dask can do these evaluations in a single step, rather than once per object. For now, this can be avoided in concat by using the data_vars/coords options.
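A hedged sketch of the concat workaround mentioned above (the datasets here are toy stand-ins):

```python
import numpy as np
import xarray as xr

parts = [xr.Dataset({"t": ("x", np.arange(5)), "flag": ("y", np.zeros(3))})
         for _ in range(4)]

# data_vars/coords='minimal' only concatenates variables that contain the
# concat dimension, avoiding sequential equality checks on everything else.
combined = xr.concat(parts, dim="x", data_vars="minimal", coords="minimal")
```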

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Integration with dask/distributed (xarray backend design) 142498006
255788269 https://github.com/pydata/xarray/issues/798#issuecomment-255788269 https://api.github.com/repos/pydata/xarray/issues/798 MDEyOklzc3VlQ29tbWVudDI1NTc4ODI2OQ== mrocklin 306380 2016-10-24T16:15:59Z 2016-10-24T16:15:59Z MEMBER

The futures_to_dask_arrays function has been deprecated at this point. The standard way to produce a distributed dask.array from custom functions is as follows:

  1. Use dask.delayed to construct many lazy numpy arrays individually.
  2. Wrap each of these into a single-chunk dask.array using da.from_delayed(lazy_value, shape=..., dtype=...).
  3. Use da.stack or da.concatenate to arrange these single-chunk dask.arrays into a larger dask.array.

The same approach could be used with XArray except that presumably we would need to do this for every relevant dataset within the NetCDF file.
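A sketch of that recipe (the loader below is a stand-in for a function that reads one slab from a NetCDF variable on a worker):

```python
import numpy as np
import dask
import dask.array as da

@dask.delayed
def load_slab(i):
    # Stand-in for reading one chunk of a variable from disk.
    return np.full((100, 100), i, dtype=float)

slabs = [da.from_delayed(load_slab(i), shape=(100, 100), dtype=float)
         for i in range(10)]
x = da.concatenate(slabs, axis=0)  # one 1000x100 dask.array
```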

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Integration with dask/distributed (xarray backend design) 142498006
255786548 https://github.com/pydata/xarray/issues/798#issuecomment-255786548 https://api.github.com/repos/pydata/xarray/issues/798 MDEyOklzc3VlQ29tbWVudDI1NTc4NjU0OA== shoyer 1217238 2016-10-24T16:10:14Z 2016-10-24T16:10:40Z MEMBER

I'm happy to help work out a plan here.

It seems like there are basically two steps we need to make this happen:

  1. Write the equivalent of futures_to_dask_arrays for xarray.Dataset, i.e., futures_to_xarray_datasets_of_dask_arrays.
  2. Integrate this into xarray's higher-level utility functions like open_mfdataset. This should be pretty easy after we have futures_to_xarray_datasets_of_dask_arrays.

It's an open question to what extent this needs to interact with xarray's internal backends.DataStore API, which handles the details of decoding files on disk to xarray.Dataset objects. I'm hopeful the answer is "not very much". The DataStore API is a bit cumbersome and overly complex, and could use a refactoring.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Integration with dask/distributed (xarray backend design) 142498006
255207705 https://github.com/pydata/xarray/issues/798#issuecomment-255207705 https://api.github.com/repos/pydata/xarray/issues/798 MDEyOklzc3VlQ29tbWVudDI1NTIwNzcwNQ== kynan 346079 2016-10-20T19:42:41Z 2016-10-20T19:42:41Z NONE

I'm probably not familiar enough with either the xarray or dask / distributed codebases to provide much input but would be happy to contribute if / where it makes sense. Would also be happy to be part of some real-time discussion if feasible (based in the UK, so wouldn't be able to attend the workshop).

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Integration with dask/distributed (xarray backend design) 142498006
255201276 https://github.com/pydata/xarray/issues/798#issuecomment-255201276 https://api.github.com/repos/pydata/xarray/issues/798 MDEyOklzc3VlQ29tbWVudDI1NTIwMTI3Ng== mrocklin 306380 2016-10-20T19:16:59Z 2016-10-20T19:16:59Z MEMBER

I agree that we should discuss it at the workshop. I also think it's possible that this could be accomplished by the right person (or combination of people) in a few hours. If so I think that we should come with it in hand as a capability that exists rather than a capability that should exist.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Integration with dask/distributed (xarray backend design) 142498006
255200677 https://github.com/pydata/xarray/issues/798#issuecomment-255200677 https://api.github.com/repos/pydata/xarray/issues/798 MDEyOklzc3VlQ29tbWVudDI1NTIwMDY3Nw== rabernat 1197350 2016-10-20T19:14:34Z 2016-10-20T19:14:34Z MEMBER

This is a really important idea that has the potential to accelerate xarray from "medium data" to "big data". It should be planned out thoughtfully.

My view is that we should implement a new DataStore class to handle distributed datasets. This could live in the xarray backend, or it could be a standalone package. Such a data store could be the foundation of a powerful platform for big-climate-data analysis. (Or maybe I am thinking too ambitiously.)

I think the upcoming aospy workshop will be an ideal opportunity to discuss this, since many of the people on this thread will be face-to-face.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Integration with dask/distributed (xarray backend design) 142498006
255194191 https://github.com/pydata/xarray/issues/798#issuecomment-255194191 https://api.github.com/repos/pydata/xarray/issues/798 MDEyOklzc3VlQ29tbWVudDI1NTE5NDE5MQ== mrocklin 306380 2016-10-20T18:48:53Z 2016-10-20T18:48:53Z MEMBER

I agree that this conversation needs expertise from a core xarray developer. I suspect that this change is more likely to happen in xarray than in dask.array. Happy to continue the conversation wherever. I do have a slight preference to switch to real-time at some point though. I suspect that we can hash this out in a moderate number of minutes.

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Integration with dask/distributed (xarray backend design) 142498006
255192875 https://github.com/pydata/xarray/issues/798#issuecomment-255192875 https://api.github.com/repos/pydata/xarray/issues/798 MDEyOklzc3VlQ29tbWVudDI1NTE5Mjg3NQ== pwolfram 4295853 2016-10-20T18:44:03Z 2016-10-20T18:44:03Z CONTRIBUTOR

@mrocklin, I would be happy to chat because I am interested in seeing this happen (e.g., eventually contributing code). The question is whether we need additional expertise from @shoyer, @jhamman, @rabernat etc who likely have a greater in-depth understanding of xarray than me. Perhaps this warrants an email to the wider list?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Integration with dask/distributed (xarray backend design) 142498006
255190556 https://github.com/pydata/xarray/issues/798#issuecomment-255190556 https://api.github.com/repos/pydata/xarray/issues/798 MDEyOklzc3VlQ29tbWVudDI1NTE5MDU1Ng== mrocklin 306380 2016-10-20T18:35:27Z 2016-10-20T18:35:27Z MEMBER

If XArray devs want to chat sometime I suspect we could hammer out a plan fairly quickly. My hope is that once a plan exists then a developer will arise to implement that plan. I'm free all of today and tomorrow.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Integration with dask/distributed (xarray backend design) 142498006
255190289 https://github.com/pydata/xarray/issues/798#issuecomment-255190289 https://api.github.com/repos/pydata/xarray/issues/798 MDEyOklzc3VlQ29tbWVudDI1NTE5MDI4OQ== mrocklin 306380 2016-10-20T18:34:35Z 2016-10-20T18:34:35Z MEMBER

Definitely happy to support from the Dask side.

I think that the LRU method described above is feasible.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Integration with dask/distributed (xarray backend design) 142498006
255188697 https://github.com/pydata/xarray/issues/798#issuecomment-255188697 https://api.github.com/repos/pydata/xarray/issues/798 MDEyOklzc3VlQ29tbWVudDI1NTE4ODY5Nw== pwolfram 4295853 2016-10-20T18:28:24Z 2016-10-20T18:28:24Z CONTRIBUTOR

@kynan, I'm still interested in this but have not had time to advance this further. Are you interested in contributing to this too?

I view this as a key component of future climate analysis workflows. This may also be something that is addressed at the upcoming hackathon at Columbia with @rabernat early next month.

Also, I suspect that both @mrocklin and @shoyer would be willing to continue to provide key advice because this appears to be aligned with their interests too (please correct me if I'm wrong in this assessment).

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Integration with dask/distributed (xarray backend design) 142498006
255187606 https://github.com/pydata/xarray/issues/798#issuecomment-255187606 https://api.github.com/repos/pydata/xarray/issues/798 MDEyOklzc3VlQ29tbWVudDI1NTE4NzYwNg== mrocklin 306380 2016-10-20T18:24:10Z 2016-10-20T18:24:10Z MEMBER

I haven't worked on this but agree that it is important.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Integration with dask/distributed (xarray backend design) 142498006
255184991 https://github.com/pydata/xarray/issues/798#issuecomment-255184991 https://api.github.com/repos/pydata/xarray/issues/798 MDEyOklzc3VlQ29tbWVudDI1NTE4NDk5MQ== kynan 346079 2016-10-20T18:14:38Z 2016-10-20T18:14:38Z NONE

Has this issue progressed since?

Being able to distribute loading of files to a dask cluster and composing an xarray Dataset from data on remote workers would be a great feature.

Is @mrocklin's blog post from Feb 2016 still the reference for remote data loading on a cluster? Adapting it to loading xarray Datasets rather than plain arrays is not straightforward since there is no way to combine futures representing Datasets out of the box.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Integration with dask/distributed (xarray backend design) 142498006
209005106 https://github.com/pydata/xarray/issues/798#issuecomment-209005106 https://api.github.com/repos/pydata/xarray/issues/798 MDEyOklzc3VlQ29tbWVudDIwOTAwNTEwNg== mrocklin 306380 2016-04-12T16:55:02Z 2016-04-12T16:55:02Z MEMBER

It's probably best to avoid futures within xarray; so far they're only in the distributed-memory scheduler. I think that ideally we create graphs that can be used robustly in either. I think that the memoized netCDF4_Dataset approach can probably do this just fine. Is there anything that is needed from me to help push this forward?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Integration with dask/distributed (xarray backend design) 142498006
205492861 https://github.com/pydata/xarray/issues/798#issuecomment-205492861 https://api.github.com/repos/pydata/xarray/issues/798 MDEyOklzc3VlQ29tbWVudDIwNTQ5Mjg2MQ== shoyer 1217238 2016-04-04T20:54:42Z 2016-04-04T20:54:42Z MEMBER

@shoyer, if we are happy to open all netCDF files and read out the metadata from a master process, that would imply that we would open a file, read the metadata, and then close it, correct?

Array access should then follow something like @mrocklin's netcdf_Dataset approach, right?

Yes, this is correct.

In principle, if we have a very large number of files containing many variables each, we might want to do the read in parallel using futures, and then use something like futures_to_dask_arrays to bring them together. That seems much trickier to integrate into our current backend approach.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Integration with dask/distributed (xarray backend design) 142498006
205484614 https://github.com/pydata/xarray/issues/798#issuecomment-205484614 https://api.github.com/repos/pydata/xarray/issues/798 MDEyOklzc3VlQ29tbWVudDIwNTQ4NDYxNA== shoyer 1217238 2016-04-04T20:40:58Z 2016-04-04T20:40:58Z MEMBER

@pwolfram I was referring to this comment for @mrocklin's netCDF4_Dataset.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Integration with dask/distributed (xarray backend design) 142498006
205481557 https://github.com/pydata/xarray/issues/798#issuecomment-205481557 https://api.github.com/repos/pydata/xarray/issues/798 MDEyOklzc3VlQ29tbWVudDIwNTQ4MTU1Nw== pwolfram 4295853 2016-04-04T20:32:23Z 2016-04-04T20:32:23Z CONTRIBUTOR

@shoyer, if we are happy to open all netCDF files and read out the metadata from a master process, that would imply that we would open a file, read the metadata, and then close it, correct?

Array access should then follow something like @mrocklin's netcdf_Dataset approach, right?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Integration with dask/distributed (xarray backend design) 142498006
205481269 https://github.com/pydata/xarray/issues/798#issuecomment-205481269 https://api.github.com/repos/pydata/xarray/issues/798 MDEyOklzc3VlQ29tbWVudDIwNTQ4MTI2OQ== pwolfram 4295853 2016-04-04T20:31:24Z 2016-04-04T20:31:24Z CONTRIBUTOR

@fmaussion, on your two questions:

  1. The LRU cache should be used serially for the read initially, but something more like @mrocklin's netcdf_Dataset appears to be needed, as @shoyer points out. I need to think about this more.
  2. I was thinking we would keep track of the file name outside the LRU and only use the filename to open up datasets inside the LRU if they aren't already open. Agreed that "if file in LRU" should designate whether the file is open.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Integration with dask/distributed (xarray backend design) 142498006
205478991 https://github.com/pydata/xarray/issues/798#issuecomment-205478991 https://api.github.com/repos/pydata/xarray/issues/798 MDEyOklzc3VlQ29tbWVudDIwNTQ3ODk5MQ== pwolfram 4295853 2016-04-04T20:24:41Z 2016-04-04T20:24:41Z CONTRIBUTOR

Just to be clear, we are talking about this https://github.com/mrocklin/hdf5lazy/blob/master/hdf5lazy/core.py#L83 for @mrocklin's netcdf_Dataset, right?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Integration with dask/distributed (xarray backend design) 142498006
205375803 https://github.com/pydata/xarray/issues/798#issuecomment-205375803 https://api.github.com/repos/pydata/xarray/issues/798 MDEyOklzc3VlQ29tbWVudDIwNTM3NTgwMw== shoyer 1217238 2016-04-04T16:25:03Z 2016-04-04T16:25:03Z MEMBER

I think the LRU dict has to be a global because the file restriction is an attribute of the system, correct?

Correct, the LRU dict should be global. I believe the file restriction is generally per-process, and creating a global dict should ensure that this works properly.

For each read from a file, ensure it hasn't been closed via a @ds.getter property method. If so, reopen it via the LRU cache. This is ok because for a read the file is essentially read-only. The LRU closes out stale entries to prevent the too many open file errors. Checking this should be fast.

The challenge is that we only call the .get_variables() method (and hence self.ds) once on a DataStore when a Dataset is opened from disk. I think we need to refactor NetCDF4ArrayWrapper to take a filename instead, and use something like @mrocklin's netcdf_Dataset.

My bigger concern was how to make use of a method like futures_to_dask_arrays. But it looks like that may actually not be necessary, at least if we are happy to open all netCDF files (and read out the metadata) from a master process.
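A hedged sketch of that refactor (the class below is illustrative, not xarray's actual NetCDF4ArrayWrapper; opener stands in for a memoized/cached netCDF4.Dataset like the one sketched earlier in this thread):

```python
class FilenameArrayWrapper:
    """Holds a filename plus variable name instead of a live netCDF4
    variable, reopening through a cached opener on every indexing call."""

    def __init__(self, filename, variable_name, opener):
        self.filename = filename
        self.variable_name = variable_name
        self.opener = opener  # e.g. memoize(netCDF4.Dataset, cache=...)

    def __getitem__(self, key):
        ds = self.opener(self.filename)  # cheap if the handle is still cached
        return ds.variables[self.variable_name][key]
```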

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Integration with dask/distributed (xarray backend design) 142498006
205370162 https://github.com/pydata/xarray/issues/798#issuecomment-205370162 https://api.github.com/repos/pydata/xarray/issues/798 MDEyOklzc3VlQ29tbWVudDIwNTM3MDE2Mg== fmaussion 10050469 2016-04-04T16:08:57Z 2016-04-04T16:08:57Z MEMBER

Sorry if I am just producing noise here (I am not a specialist), but I have two naive questions:

To 1: How will you handle concurrent access to the LRU cache if it's a global variable?

To 2: Once the file has been closed by the LRU, won't it also be erased from it? So a simple "if file in LRU" test could suffice to check whether the file has been closed?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Integration with dask/distributed (xarray backend design) 142498006
205133433 https://github.com/pydata/xarray/issues/798#issuecomment-205133433 https://api.github.com/repos/pydata/xarray/issues/798 MDEyOklzc3VlQ29tbWVudDIwNTEzMzQzMw== pwolfram 4295853 2016-04-04T04:35:09Z 2016-04-04T04:35:09Z CONTRIBUTOR

Thanks @mrocklin! This has been really helpful and was what I needed to get going.

A prelim design I'm seeing is to modify the NetCDF4DataStore class (https://github.com/pydata/xarray/blob/master/xarray/backends/netCDF4_.py#L170) to meet these requirements:

  1. At __init__, try to open the file via the LRU cache. I think the LRU dict has to be a global because the file restriction is an attribute of the system, correct?
  2. For each read from a file, ensure it hasn't been closed via a @ds.getter property method. If so, reopen it via the LRU cache. This is ok because for a read the file is essentially read-only. The LRU closes out stale entries to prevent the too-many-open-files errors. Checking this should be fast.
  3. sync is only needed for a write but seems like it should follow the above approach.

A clean way to do this is just to make sure that each time self.ds is called, it is re-validated via the LRU cache. This should be able to be implemented via property getter methods https://docs.python.org/2/library/functions.html#property.
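A minimal sketch of that property-getter idea, using functools.lru_cache as a stand-in for the global LRU (note lru_cache does not close handles on eviction, so a real version would need an eviction callback):

```python
from functools import lru_cache

import netCDF4

@lru_cache(maxsize=128)
def _cached_open(filename):
    return netCDF4.Dataset(filename, mode='r')

class CachedNetCDF4DataStore:
    """Illustrative stand-in for NetCDF4DataStore: every access to .ds
    goes back through the cache, so an evicted handle is reopened."""

    def __init__(self, filename):
        self._filename = filename

    @property
    def ds(self):
        return _cached_open(self._filename)
```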

Unless I'm missing something big, I don't think this change will require a large refactor, but it is quite possible I overlooked something important. @shoyer and @mrocklin, do you see any obvious pitfalls in this scope? If not, it shouldn't be too hard to implement.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Integration with dask/distributed (xarray backend design) 142498006
204813232 https://github.com/pydata/xarray/issues/798#issuecomment-204813232 https://api.github.com/repos/pydata/xarray/issues/798 MDEyOklzc3VlQ29tbWVudDIwNDgxMzIzMg== mrocklin 306380 2016-04-02T22:29:04Z 2016-04-02T22:29:04Z MEMBER

FWIW I've uploaded a tiny LRU dict implementation to a new zict project (which also has some other stuff):

http://zict.readthedocs.org/en/latest/

pip install zict

```python
from zict import LRU

d = LRU(100, dict())
```

There are a number of good alternatives out there though for LRU dictionaries.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Integration with dask/distributed (xarray backend design) 142498006
204770198 https://github.com/pydata/xarray/issues/798#issuecomment-204770198 https://api.github.com/repos/pydata/xarray/issues/798 MDEyOklzc3VlQ29tbWVudDIwNDc3MDE5OA== pwolfram 4295853 2016-04-02T18:25:18Z 2016-04-02T18:25:18Z CONTRIBUTOR

Another note in support of this PR, especially "robustly support HDF/NetCDF reads": I am having problems with NetCDF: HDF error as previously reported by @rabernat in https://github.com/pydata/xarray/issues/463. Thus, a solution here will save time and may arguably be on the critical path of some workflows because fewer jobs will fail and require baby-sitting/restarts, especially when dealing with running multiple jobs.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Integration with dask/distributed (xarray backend design) 142498006
202696169 https://github.com/pydata/xarray/issues/798#issuecomment-202696169 https://api.github.com/repos/pydata/xarray/issues/798 MDEyOklzc3VlQ29tbWVudDIwMjY5NjE2OQ== pwolfram 4295853 2016-03-29T03:49:11Z 2016-03-29T14:24:02Z CONTRIBUTOR

Thanks @shoyer. If you can provide some guidance on bounds for the reorganization that would be really great. I want your and @jhamman's feedback on this before I try a solution. The trick is just to make the time, as always, and I may have some time this coming weekend.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Integration with dask/distributed (xarray backend design) 142498006
201134785 https://github.com/pydata/xarray/issues/798#issuecomment-201134785 https://api.github.com/repos/pydata/xarray/issues/798 MDEyOklzc3VlQ29tbWVudDIwMTEzNDc4NQ== shoyer 1217238 2016-03-25T04:54:09Z 2016-03-25T04:54:09Z MEMBER

I agree with @mrocklin that the LRUCache for file-like objects should take care of things from the dask.array perspective. It should also solve https://github.com/pydata/xarray/issues/463 in a very clean way. We'll just need to reorganize things a bit to make use of it.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Integration with dask/distributed (xarray backend design) 142498006
200901275 https://github.com/pydata/xarray/issues/798#issuecomment-200901275 https://api.github.com/repos/pydata/xarray/issues/798 MDEyOklzc3VlQ29tbWVudDIwMDkwMTI3NQ== mrocklin 306380 2016-03-24T16:00:52Z 2016-03-24T16:00:52Z MEMBER

I believe that robustly supporting HDF/NetCDF reads with the mechanism mentioned above will resolve most problems from a dask.array perspective. I have no doubt that other things will arise though. Switching from shared to distributed memory always comes with (surmountable) obstacles.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Integration with dask/distributed (xarray backend design) 142498006
200878845 https://github.com/pydata/xarray/issues/798#issuecomment-200878845 https://api.github.com/repos/pydata/xarray/issues/798 MDEyOklzc3VlQ29tbWVudDIwMDg3ODg0NQ== pwolfram 4295853 2016-03-24T15:09:42Z 2016-03-24T15:13:18Z CONTRIBUTOR

This issue of connecting to dask/distributed may also be connected with https://github.com/pydata/xarray/issues/463, https://github.com/pydata/xarray/issues/591, and https://github.com/pydata/xarray/pull/524.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Integration with dask/distributed (xarray backend design) 142498006
200633312 https://github.com/pydata/xarray/issues/798#issuecomment-200633312 https://api.github.com/repos/pydata/xarray/issues/798 MDEyOklzc3VlQ29tbWVudDIwMDYzMzMxMg== pwolfram 4295853 2016-03-24T03:04:25Z 2016-03-24T03:04:25Z CONTRIBUTOR

Repeating @mrocklin:

Dask.array writes data to any object that supports numpy style setitem syntax like the following:

dataset[my_slice] = my_numpy_array

Objects like h5py.Dataset and netcdf objects support this syntax.

So dask.array would work today without modification if we had such an object that represented many netcdf files at once and supported numpy-style setitem syntax, placing the numpy array properly across the right files. This work could happen easily without deep knowledge of either project.

Alternatively, we could make the dask.array.store function optionally lazy so that users (or xarray) could call store many times before triggering execution.
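A hedged sketch of such an object (all names illustrative): it fans a numpy-style setitem out across per-file targets split along the first axis, so one large array can be stored into many files:

```python
class MultiFileSetitemTarget:
    """pieces: list of (start, stop, target), where target is any object
    supporting numpy-style setitem (e.g. a netCDF4 variable) and covers
    rows [start, stop) of the combined array."""

    def __init__(self, pieces):
        self.pieces = pieces

    def __setitem__(self, key, value):
        rows = key[0]  # assume a slice with explicit start/stop
        for start, stop, target in self.pieces:
            lo, hi = max(rows.start, start), min(rows.stop, stop)
            if lo < hi:  # this piece overlaps the region being written
                target[(slice(lo - start, hi - start),) + key[1:]] = \
                    value[lo - rows.start:hi - rows.start]
```

With something like this, dask.array.store(x, MultiFileSetitemTarget(pieces)) would place each chunk into the right file.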

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Integration with dask/distributed (xarray backend design) 142498006
200632521 https://github.com/pydata/xarray/issues/798#issuecomment-200632521 https://api.github.com/repos/pydata/xarray/issues/798 MDEyOklzc3VlQ29tbWVudDIwMDYzMjUyMQ== pwolfram 4295853 2016-03-24T03:02:19Z 2016-03-24T03:02:19Z CONTRIBUTOR

@shoyer and @mrocklin, I've updated the summary above in the PR description with a to do list. Do either of you see any obvious tasks I missed on the list in the PR description? If so, can you please update the to do list so that I can see what needs done to modify the backend for the dask/distributed integration?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Integration with dask/distributed (xarray backend design) 142498006
199547374 https://github.com/pydata/xarray/issues/798#issuecomment-199547374 https://api.github.com/repos/pydata/xarray/issues/798 MDEyOklzc3VlQ29tbWVudDE5OTU0NzM3NA== pwolfram 4295853 2016-03-22T00:01:55Z 2016-03-22T00:02:11Z CONTRIBUTOR

Here is an example of a use case for a nanmean over ensembles in collaboration with @mrocklin and following http://matthewrocklin.com/blog/work/2016/02/26/dask-distributed-part-3: https://gist.github.com/mrocklin/566a8d5c3f6721abf36f

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Integration with dask/distributed (xarray backend design) 142498006
199545836 https://github.com/pydata/xarray/issues/798#issuecomment-199545836 https://api.github.com/repos/pydata/xarray/issues/798 MDEyOklzc3VlQ29tbWVudDE5OTU0NTgzNg== mrocklin 306380 2016-03-21T23:59:18Z 2016-03-21T23:59:18Z MEMBER

Copying over a comment from that issue:

Yes, so the problem as I see it is that, for serialization and open-file reasons we want to use a function like the following:

```python
import netCDF4

def get_chunk_of_array(filename, datapath, slice):
    with netCDF4.Dataset(filename) as f:
        return f.variables[datapath][slice]
```

However, this opens and closes many files, which while robust, is slow. We can alleviate this by maintaining an LRU cache in a global variable so that it is created separately per process.

```python
from toolz import memoize
import netCDF4

# LRUDict is a to-be-supplied LRU mapping (see below) that closes files on eviction.
cache = LRUDict(size=100, on_eviction=lambda file: file.close())

netCDF4_Dataset = memoize(netCDF4.Dataset, cache=cache)

def get_chunk_of_array(filename, datapath, slice):
    f = netCDF4_Dataset(filename)
    return f.variables[datapath][slice]
```

I'm happy to supply the memoize function with toolz and an appropriate LRUDict object with other microprojects that I can publish if necessary.

We would then need to use such a function within the dask.array and xarray codebases.

Anyway, that's one approach. Thoughts welcome.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Integration with dask/distributed (xarray backend design) 142498006
199544731 https://github.com/pydata/xarray/issues/798#issuecomment-199544731 https://api.github.com/repos/pydata/xarray/issues/798 MDEyOklzc3VlQ29tbWVudDE5OTU0NDczMQ== pwolfram 4295853 2016-03-21T23:57:27Z 2016-03-21T23:57:27Z CONTRIBUTOR

See also https://github.com/dask/dask/issues/922

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Integration with dask/distributed (xarray backend design) 142498006
199532452 https://github.com/pydata/xarray/issues/798#issuecomment-199532452 https://api.github.com/repos/pydata/xarray/issues/798 MDEyOklzc3VlQ29tbWVudDE5OTUzMjQ1Mg== pwolfram 4295853 2016-03-21T23:21:07Z 2016-03-21T23:21:07Z CONTRIBUTOR

The full mailing list discussion is at https://groups.google.com/d/msgid/xarray/CAJ8oX-E7Xx6NT4F6J8B4__Q-kBazoob9_qe_oFLi5hany9-%3DKQ%40mail.gmail.com?utm_medium=email&utm_source=footer

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Integration with dask/distributed (xarray backend design) 142498006

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);