issue_comments


12 rows where author_association = "MEMBER", issue = 142498006 and user = 1217238 sorted by updated_at descending

id html_url issue_url node_id user created_at updated_at author_association body reactions performed_via_github_app issue
263431065 https://github.com/pydata/xarray/issues/798#issuecomment-263431065 https://api.github.com/repos/pydata/xarray/issues/798 MDEyOklzc3VlQ29tbWVudDI2MzQzMTA2NQ== shoyer 1217238 2016-11-28T23:42:54Z 2016-11-28T23:42:54Z MEMBER

@mrocklin Any thoughts on my thread safety concerns (https://github.com/pydata/xarray/issues/798#issuecomment-259202265) for the LRU cache? I suppose the simplest thing to do is to simply refuse to evict a file until the per-file lock is released, but I can see that strategy failing pretty badly in edge cases.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Integration with dask/distributed (xarray backend design) 142498006
259202265 https://github.com/pydata/xarray/issues/798#issuecomment-259202265 https://api.github.com/repos/pydata/xarray/issues/798 MDEyOklzc3VlQ29tbWVudDI1OTIwMjI2NQ== shoyer 1217238 2016-11-08T17:27:55Z 2016-11-08T22:19:11Z MEMBER

A few other thoughts on thread safety with the LRU approach:

1. We need a global lock to ensure internal consistency of the LRU cache, and so that we don't overwrite files without closing them. It probably makes sense to put this in the memoize function.
2. We need separate, per-file locks to ensure that we don't evict files while data is being read from or written to them (which would cause segfaults). As a stop-gap measure, we could simply refuse to evict files until we can acquire a lock, but more broadly this suggests that strict LRU is not quite right. Instead, we want to evict the least-recently-used unlocked item (a sketch follows below).
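As a minimal sketch of that last idea (evict the least-recently-used unlocked item), here is one way it could look; the class and method names are illustrative, not xarray's or zict's actual implementation:

# Illustrative only: an LRU cache that skips entries whose per-file lock is held.
import threading
from collections import OrderedDict

class LockAwareLRU:
    def __init__(self, maxsize):
        self.maxsize = maxsize
        self._global_lock = threading.Lock()   # protects the cache's own state
        self._items = OrderedDict()            # path -> (file handle, per-file lock)

    def get(self, path, opener):
        with self._global_lock:
            if path in self._items:
                self._items.move_to_end(path)  # mark as most recently used
            else:
                self._evict_one_if_full()
                self._items[path] = (opener(path), threading.Lock())
            return self._items[path]

    def _evict_one_if_full(self):
        if len(self._items) < self.maxsize:
            return
        # Close the least-recently-used file whose lock is free; locked files
        # are skipped so we never close a file mid-read or mid-write. If every
        # file is locked, we temporarily exceed maxsize (the stop-gap behavior).
        for path, (handle, lock) in self._items.items():
            if lock.acquire(blocking=False):
                try:
                    handle.close()
                    del self._items[path]
                finally:
                    lock.release()
                return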

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Integration with dask/distributed (xarray backend design) 142498006
259185165 https://github.com/pydata/xarray/issues/798#issuecomment-259185165 https://api.github.com/repos/pydata/xarray/issues/798 MDEyOklzc3VlQ29tbWVudDI1OTE4NTE2NQ== shoyer 1217238 2016-11-08T16:28:13Z 2016-11-08T16:28:13Z MEMBER

One slight subtlety is writes: we'll need to switch from 'w' to 'a' mode the second time we open a file (a sketch follows after the quoted reply below).

On Tue, Nov 8, 2016 at 8:17 AM, Matthew Rocklin <notifications@github.com> wrote:

FYI Dask is committed to maintaining this: https://github.com/dask/zict/blob/master/zict/lru.py

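Returning to the 'w' versus 'a' subtlety above, a minimal sketch of one way to handle it (the helper and the module-level set are hypothetical, not an xarray API):

# Illustrative only: open a file in 'w' mode the first time it is created and
# in 'a' mode afterwards, so reopening for writes does not truncate the file.
import netCDF4

_already_created = set()

def open_for_writing(path):
    mode = "a" if path in _already_created else "w"
    _already_created.add(path)
    return netCDF4.Dataset(path, mode=mode)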

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Integration with dask/distributed (xarray backend design) 142498006
259181526 https://github.com/pydata/xarray/issues/798#issuecomment-259181526 https://api.github.com/repos/pydata/xarray/issues/798 MDEyOklzc3VlQ29tbWVudDI1OTE4MTUyNg== shoyer 1217238 2016-11-08T16:16:15Z 2016-11-08T16:16:15Z MEMBER

We have something very hacky working with https://github.com/pydata/xarray/pull/1095

I'm also going to see if I can get something working with the LRU cache, since that seems closer to the solution we want eventually.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Integration with dask/distributed (xarray backend design) 142498006
257063608 https://github.com/pydata/xarray/issues/798#issuecomment-257063608 https://api.github.com/repos/pydata/xarray/issues/798 MDEyOklzc3VlQ29tbWVudDI1NzA2MzYwOA== shoyer 1217238 2016-10-29T01:45:09Z 2016-10-29T01:45:09Z MEMBER

Distributed Dask.array could possibly replace OpenDAP in some settings though

Yes, this sounds quite promising to me.

Using OpenDAP for communication is also possible, but if all we need to do is pass around serialized xarray.Dataset objects, using pickle or even bytes from netCDF files seems more promising.
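For illustration, both routes mentioned here are available: a Dataset can be pickled directly, and to_netcdf() with no target path returns the file contents as bytes (this example assumes scipy or h5netcdf is available for the in-memory write):

import pickle
import xarray as xr

ds = xr.Dataset({"t": ("x", [1.0, 2.0, 3.0])})
payload = pickle.dumps(ds)       # pickle the Dataset object directly
nc_bytes = ds.to_netcdf()        # or serialize it to in-memory netCDF bytes
roundtrip = xr.open_dataset(nc_bytes)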

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Integration with dask/distributed (xarray backend design) 142498006
255797423 https://github.com/pydata/xarray/issues/798#issuecomment-255797423 https://api.github.com/repos/pydata/xarray/issues/798 MDEyOklzc3VlQ29tbWVudDI1NTc5NzQyMw== shoyer 1217238 2016-10-24T16:50:15Z 2016-10-24T16:50:15Z MEMBER

We could possibly make an object that was API compatible with the subset of netCDF4.Dataset that you needed, but opened and closed the file whenever it actually pulled data. We would keep an LRU cache of open files around for efficiency as discussed earlier. In this case we could possibly optionally swap out the current netCDF4.Dataset object with this thing without much refactoring?

Yes, this could work for a proof of concept.

In the long term, it would be good to integrate this into xarray so we can support alternative backends (e.g., h5netcdf, scipy, pynio, and loaders for the custom file formats that @rabernat and @pwolfram work with) in a fully consistent fashion, without needing to make a separate wrapper for each.
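A rough sketch of the wrapper idea above; the class name and the tiny API subset it exposes are illustrative, and the LRU cache of open handles discussed earlier is left out for brevity:

# Illustrative only: an object exposing a small, read-only slice of the
# netCDF4.Dataset API that only holds the file open while actually reading.
import netCDF4

class ReopeningDataset:
    def __init__(self, path):
        self.path = path
        with netCDF4.Dataset(path) as ds:
            # Cache cheap metadata up front so it survives the file being closed.
            self.dimensions = {name: len(dim) for name, dim in ds.dimensions.items()}
            self.variable_names = list(ds.variables)

    def read(self, name, key=Ellipsis):
        # Open, slice, close; an LRU cache of open handles could sit here to
        # avoid paying the open/close cost on every access.
        with netCDF4.Dataset(self.path) as ds:
            return ds.variables[name][key]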

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Integration with dask/distributed (xarray backend design) 142498006
255794868 https://github.com/pydata/xarray/issues/798#issuecomment-255794868 https://api.github.com/repos/pydata/xarray/issues/798 MDEyOklzc3VlQ29tbWVudDI1NTc5NDg2OA== shoyer 1217238 2016-10-24T16:40:09Z 2016-10-24T16:40:09Z MEMBER

@mrocklin OK, that makes sense. In that case, we might indeed need to thread this through xarray's backends.

Currently, backends open a file (e.g., with netCDF4.Dataset) and create an OrderedDict of xarray.Variable objects with lazy arrays that load from the file on demand. To load this data with dask, pass these lazy arrays into dask.array.from_array.
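A minimal sketch of that pattern; LazyArray stands in for xarray's internal lazy array wrapper and is not its real implementation:

# dask.array.from_array can wrap any object with .shape, .dtype and
# numpy-style indexing, so lazy backend arrays slot in directly.
import numpy as np
import dask.array as da

class LazyArray:
    def __init__(self, shape, dtype):
        self.shape = shape
        self.dtype = np.dtype(dtype)
        self.ndim = len(shape)

    def __getitem__(self, key):
        # A real backend array would read just this slice from the file.
        return np.zeros(self.shape, self.dtype)[key]

lazy = LazyArray((1000, 1000), "float64")
chunked = da.from_array(lazy, chunks=(250, 250))   # lazy, chunked dask array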

This currently doesn't use dask.delayed for three reasons:

1. Historical: we wrote this system before dask existed.
2. Performance: our LazilyIndexedArray class is still more selective than dask.array for subsetting data from large chunks, which is essential for many interactive use cases. Despite getitem fusing, dask will sometimes load complete chunks. This is particularly true if we do some transformation of the array, of the sort that could be accomplished with dask's map_blocks; using LazilyIndexedArray ensures that such a transformation only gets applied to loaded data. There are also performance benefits to keeping files open when possible (discussed above).
3. Dependencies: dask is still an optional dependency for xarray, and I'd like to keep it that way if possible.

It seems like a version of xarray's backends that doesn't always open files immediately would make it suitable for use in dask.distributed. So indeed, we'll need to do some serious refactoring.

One other thing that will need to be tackled eventually: xarray.merge and xarray.concat (used in open_mfdataset) still have some steps (checking for equality between arrays) that are applied sequentially. This is going to be a performance bottleneck when we start working with very large arrays. This really should be refactored such that dask can do these evaluations in a single step, rather than once per object. For now, this can be avoided in concat by using the data_vars/coords options.
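For reference, the concat workaround mentioned in the last sentence looks like this, assuming datasets is a list of already-open xarray.Dataset objects sharing a time dimension:

import xarray as xr

# Only concatenate variables that actually have the concat dimension; other
# variables are taken from the first dataset rather than compared sequentially.
combined = xr.concat(datasets, dim="time", data_vars="minimal", coords="minimal")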

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Integration with dask/distributed (xarray backend design) 142498006
255786548 https://github.com/pydata/xarray/issues/798#issuecomment-255786548 https://api.github.com/repos/pydata/xarray/issues/798 MDEyOklzc3VlQ29tbWVudDI1NTc4NjU0OA== shoyer 1217238 2016-10-24T16:10:14Z 2016-10-24T16:10:40Z MEMBER

I'm happy to help work out a plan here.

It seems like there are basically two steps needed to make this happen:

1. Write the equivalent of futures_to_dask_arrays for xarray.Dataset, i.e., futures_to_xarray_datasets_of_dask_arrays.
2. Integrate this into xarray's higher-level utility functions like open_mfdataset. This should be pretty easy once we have futures_to_xarray_datasets_of_dask_arrays.

It's an open question to what extent this needs to interact with xarray's internal backends.DataStore API, which handles the details of decoding files on disk to xarray.Dataset objects. I'm hopeful the answer is "not very much". The DataStore API is a bit cumbersome and overly complex, and could use a refactoring.
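A hedged sketch of what step 1 might look like using plain dask.delayed, sidestepping the DataStore API entirely; the helper names and the variables argument are hypothetical:

# Build an xarray.Dataset of dask arrays for one file, deferring the actual
# reads with dask.delayed / dask.array.from_delayed.
import dask
import dask.array as da
import netCDF4
import xarray as xr

def _read_variable(path, name):
    # Runs on a worker: open the file, read one whole variable, close the file.
    with netCDF4.Dataset(path) as ds:
        return ds.variables[name][...]

def dataset_of_dask_arrays(path, variables):
    # `variables` maps name -> (dims, shape, dtype), assumed to come from a
    # cheap metadata read on the master process.
    data_vars = {}
    for name, (dims, shape, dtype) in variables.items():
        delayed = dask.delayed(_read_variable)(path, name)
        data_vars[name] = (dims, da.from_delayed(delayed, shape=shape, dtype=dtype))
    return xr.Dataset(data_vars)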

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Integration with dask/distributed (xarray backend design) 142498006
205492861 https://github.com/pydata/xarray/issues/798#issuecomment-205492861 https://api.github.com/repos/pydata/xarray/issues/798 MDEyOklzc3VlQ29tbWVudDIwNTQ5Mjg2MQ== shoyer 1217238 2016-04-04T20:54:42Z 2016-04-04T20:54:42Z MEMBER

@shoyer, if we are happy to open all netCDF files and read out the metadata from a master process, that would imply that we would open a file, read the metadata, and then close it, correct?

Array access should then follow something like @mrocklin's netcdf_Dataset approach, right?

Yes, this is correct.

In principle, if we have a very large number of files containing many variables each, we might want to do the read in parallel using futures, and then use something like futures_to_dask_arrays to bring them together. That seems much trickier to integrate into our current backend approach.
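For the parallel case, a minimal sketch with dask.distributed futures; read_metadata and the file paths are illustrative:

# Read lightweight metadata from many files in parallel on the cluster, then
# gather the results back to the master process.
from distributed import Client
import netCDF4

def read_metadata(path):
    with netCDF4.Dataset(path) as ds:
        return {name: var.dimensions for name, var in ds.variables.items()}

client = Client()                  # connect to (or start) a distributed scheduler
paths = ["file1.nc", "file2.nc"]   # illustrative paths
futures = [client.submit(read_metadata, p) for p in paths]
all_metadata = client.gather(futures)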

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Integration with dask/distributed (xarray backend design) 142498006
205484614 https://github.com/pydata/xarray/issues/798#issuecomment-205484614 https://api.github.com/repos/pydata/xarray/issues/798 MDEyOklzc3VlQ29tbWVudDIwNTQ4NDYxNA== shoyer 1217238 2016-04-04T20:40:58Z 2016-04-04T20:40:58Z MEMBER

@pwolfram I was referring to this comment for @mrocklin's netCDF4_Dataset.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Integration with dask/distributed (xarray backend design) 142498006
205375803 https://github.com/pydata/xarray/issues/798#issuecomment-205375803 https://api.github.com/repos/pydata/xarray/issues/798 MDEyOklzc3VlQ29tbWVudDIwNTM3NTgwMw== shoyer 1217238 2016-04-04T16:25:03Z 2016-04-04T16:25:03Z MEMBER

I think the LRU dict has to be global because the file restriction is an attribute of the system, correct?

Correct, the LRU dict should be global. I believe the file restriction is generally per-process, and creating a global dict should ensure that this works properly.

For each read from a file, ensure it hasn't been closed via a @ds.getter property method; if it has been closed, reopen it via the LRU cache. This is OK because, for a read, the file is essentially read-only. The LRU closes out stale entries to prevent 'too many open files' errors. Checking this should be fast.

The challenge is that we only call the .get_variables() method (and hence self.ds) once on a DataStore when a Dataset is opened from disk. I think we need to refactor NetCDF4ArrayWrapper to take a filename instead, and use something like @mrocklin's netcdf_Dataset.

My bigger concern was how to make use of a method like futures_to_dask_arrays. But it looks like that may actually not be necessary, at least if we are happy to open all netCDF files (and read out the metadata) from a master process.
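A rough sketch of those two pieces together: a global, per-process LRU of open handles plus an array wrapper keyed by filename whose ds property reopens through the cache. The cache and the class body are illustrative, not xarray's actual NetCDF4ArrayWrapper:

from collections import OrderedDict
import netCDF4

class FileLRU:
    def __init__(self, maxsize=128):
        self.maxsize = maxsize
        self._handles = OrderedDict()   # path -> open netCDF4.Dataset

    def get(self, path):
        if path in self._handles:
            self._handles.move_to_end(path)          # most recently used
        else:
            if len(self._handles) >= self.maxsize:
                _, stale = self._handles.popitem(last=False)
                stale.close()                        # evict the oldest handle
            self._handles[path] = netCDF4.Dataset(path, mode="r")
        return self._handles[path]

FILE_CACHE = FileLRU()   # global: the open-file limit is per-process

class NetCDF4ArrayWrapper:
    # Array-like view of one variable, keyed by filename rather than handle.
    def __init__(self, filename, variable_name):
        self.filename = filename
        self.variable_name = variable_name

    @property
    def ds(self):
        # Reopen through the LRU cache if the handle was evicted and closed.
        return FILE_CACHE.get(self.filename)

    def __getitem__(self, key):
        return self.ds.variables[self.variable_name][key]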

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Integration with dask/distributed (xarray backend design) 142498006
201134785 https://github.com/pydata/xarray/issues/798#issuecomment-201134785 https://api.github.com/repos/pydata/xarray/issues/798 MDEyOklzc3VlQ29tbWVudDIwMTEzNDc4NQ== shoyer 1217238 2016-03-25T04:54:09Z 2016-03-25T04:54:09Z MEMBER

I agree with @mrocklin that the LRUCache for file-like objects should take care of things from the dask.array perspective. It should also solve https://github.com/pydata/xarray/issues/463 in a very clean way. We'll just need to reorganize things a bit to make use of it.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Integration with dask/distributed (xarray backend design) 142498006

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);