issue_comments

8 comments where issue = 274797981 ("Switch our lazy array classes to use Dask instead?"), sorted by updated_at descending

Commenters: shoyer (4), rabernat (2), jhamman (1), benbovy (1); author_association: MEMBER for all 8 comments.

shoyer (MEMBER) · 2018-07-18T23:28:28Z · https://github.com/pydata/xarray/issues/1725#issuecomment-406105313

On a somewhat related note, I am now proposing extending xarray's "lazy array" functionality to include limited support for arithmetic, without necessarily using dask: https://github.com/pydata/xarray/issues/2298
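A toy sketch of the idea (hypothetical class for illustration only, not xarray's actual implementation or the design proposed in #2298): record scalar arithmetic lazily and apply it only after indexing, so it commutes with selection.

import numpy as np

class LazyScaledArray:
    # defers scale/offset arithmetic; applies it only to indexed selections
    def __init__(self, array, scale=1.0, offset=0.0):
        self.array, self.scale, self.offset = array, scale, offset

    def __mul__(self, k):
        return LazyScaledArray(self.array, self.scale * k, self.offset * k)

    def __add__(self, k):
        return LazyScaledArray(self.array, self.scale, self.offset + k)

    def __getitem__(self, key):
        # indexing hits the underlying store first; arithmetic runs last,
        # and only over the selected items
        return self.offset + self.scale * np.asarray(self.array[key])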

shoyer (MEMBER) · 2017-11-18T19:00:59Z · https://github.com/pydata/xarray/issues/1725#issuecomment-345463291

@rabernat actually in #1532 we switched to not displaying a preview of any lazily loaded data on disk -- even if it isn't loaded with Dask. (I was not sure about this change, but I was alone in my reservations.)

I do agree that our lazy arrays serve a useful purpose currently. I would only consider removing them if we can improve Dask so it works just as well for this use case.

rabernat (MEMBER) · 2017-11-18T10:47:48Z · https://github.com/pydata/xarray/issues/1725#issuecomment-345434065

Since #1532, the repr for dask-backed variables does not show any values (to avoid triggering computations). But numpy-backed lazily-masked-and-scaled data is treated differently: it is shown.

This highlights an important difference between how LazilyIndexedArray and dask arrays work: with dask, either you compute a whole chunk or you compute nothing. With LazilyIndexedArray, you can slice the array however you want and apply mask_and_scale to only the specific items you have selected. This small difference has big performance implications, especially for the "medium sized" datasets @benbovy refers to. If we changed decode_cf to use dask, you would have to compute a whole chunk just to see the repr.
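
A minimal sketch of that difference (illustrative only: the 0.5 * scaling stands in for mask_and_scale, the chunk size is arbitrary, and the second route simplifies what LazilyIndexedArray does internally):

import numpy as np
import dask.array as da

raw = np.arange(1_000_000, dtype="int16").reshape(1000, 1000)

# dask route: the scaling is baked into the task graph chunk by chunk, so
# even a tiny repr-style preview first materializes a full 500x500 scaled
# chunk (absent a rewrite like the one discussed elsewhere in this thread)
decoded = 0.5 * da.from_array(raw, chunks=500)
preview = decoded[:3, :3].compute()

# lazy-indexing route: slice first, scale after, so only the 3x3
# selection is ever touched
preview2 = 0.5 * raw[:3, :3]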

So on second thought, maybe the system we have now is better than using dask for "everything lazy."

benbovy (MEMBER) · 2017-11-17T23:39:33Z · https://github.com/pydata/xarray/issues/1725#issuecomment-345394914

I'm more of a numpy-xarray user than a dask-xarray user (since my data most often fits in memory), but I wouldn't mind at all having to install dask as a requirement!

> Potentially chunks=False in open_dataset could indicate that you're OK loading everything into memory with NumPy. We would then have to choose between making the default use Dask or NumPy.

Like other users who are used to lazy loading, I'm maybe a bit more concerned by this. I find it very handy to be able to open a medium-sized file instantly, quickly inspect its contents, and then work with only a small subset of the variables / data, all without worrying about chunks.

Assuming that numpy-loading is the default, new xarray users coming from netcdf4-python who don't know much about dask might find xarray a very inefficient tool when first trying to open a medium-sized netCDF file.

If chunks=False were the default, I can also imagine often forgetting to set it to True when loading a file that is a bit too big to fit in memory... That would be annoying.

By saying "making the default use Dask", do you mean that data from a file would be "loaded" as dask arrays by default? If so, new xarray users, who are probably less familiar with dask than with numpy, will have to learn 1-2 concepts from dask before using xarray. This might not be a big deal, though.

In summary, I'm also really not opposed to using dask to replace all the current lazy-loading machinery, but ideally the switch should be as transparent as possible with respect to the current "user experience".

rabernat (MEMBER) · 2017-11-17T21:46:14Z · https://github.com/pydata/xarray/issues/1725#issuecomment-345374271

I just had to confront and understand how lazy CF decoding works in order to move forward with #1528. In my initial implementation, I applied chunking to variables directly in ZarrStore. However, I learned that decode_cf_variable does not preserve variable chunks, so I moved the chunking to after the call to decode_cf.

My impression after this exercise is that having two different definitions of "lazy" within xarray leads to developer confusion! So I favor making dask more central to xarray's data model.
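
A minimal sketch of the reordering described above (the file name and chunk sizes are hypothetical; xr.decode_cf and Dataset.chunk are the relevant xarray APIs):

import xarray as xr

# open without decoding, since decode_cf_variable does not preserve chunks
ds = xr.open_dataset("example.nc", decode_cf=False)
ds = xr.decode_cf(ds)          # decode first...
ds = ds.chunk({"time": 100})   # ...then chunk, so the chunks survive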

shoyer (MEMBER) · 2017-11-17T19:39:35Z · https://github.com/pydata/xarray/issues/1725#issuecomment-345345971

Yeah, we could solve this by making dask a requirement only if you want to load netCDF files and/or load them lazily.

Potentially chunks=False in open_dataset could indicate that you're OK loading everything into memory with NumPy. We would then have to choose between making the default use Dask or NumPy.
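
Hypothetically, that API could look like the following; note that chunks=False with this meaning is only a proposal in this thread, not an existing open_dataset option:

import xarray as xr

# proposed semantics, not current behaviour
ds_eager = xr.open_dataset("data.nc", chunks=False)  # eager: plain NumPy arrays
ds_other = xr.open_dataset("data.nc")                # default: Dask or NumPy, to be decided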

jhamman (MEMBER) · 2017-11-17T18:58:08Z · https://github.com/pydata/xarray/issues/1725#issuecomment-345333385

> The downside of this switch is that lazy loading of data from disk would now require dask, which would be at least slightly annoying to some users. But it's probably worth the tradeoff from a maintainability perspective, and also to fix issues like #1372.

I'm really not opposed to this. I also think it would be a good/reasonable thing to do before 1.0. There may be some pure-numpy xarray users out there, but I suspect they 1) could handle having dask as a dependency and 2) wouldn't be all that affected by the change, since the in-memory workflow isn't particularly dependent on lazy evaluation.

shoyer (MEMBER) · 2017-11-17T16:55:38Z · https://github.com/pydata/xarray/issues/1725#issuecomment-345300165

This comment has the full context: https://github.com/pydata/xarray/issues/1372#issuecomment-293748654. To repeat myself:


You might ask why this separate lazy compute machinery exists. The answer is that dask fails to optimize element-wise operations like (scale * array)[subset] -> scale * array[subset], which is a critical optimization for lazy decoding of large datasets.

See https://github.com/dask/dask/issues/746 for discussion and links to PRs about this. jcrist had a solution that worked, but it slowed down every dask array operation by 20%, which wasn't a great trade-off.
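
The missed rewrite is easy to reproduce in a sketch (the shapes and chunk sizes here are arbitrary):

import dask.array as da

array = da.ones((20_000, 20_000), chunks=(5_000, 5_000))
scale = 0.5

# Without rewriting (scale * array)[subset] to scale * array[subset],
# computing this 10x10 corner first evaluates scale * chunk over an
# entire 5000x5000 chunk and only then slices it.
subset = (scale * array)[:10, :10]
result = subset.compute()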

I wonder if this is worth revisiting with a simpler, less general optimization pass that doesn't bother with broadcasting. See the subclasses of NDArrayMixin in xarray/conventions.py for examples of the sorts of functionality we need:

- Casting (e.g., array.astype(bool))
- Chained arithmetic with scalars (e.g., 0.5 + 0.5 * array)
- Custom element-wise operations (e.g., map_blocks(convert_to_datetime64, array, dtype=np.datetime64))
- Custom aggregations that drop a dimension (e.g., map_blocks(characters_to_string, array, drop_axis=-1))

If we could optimize all these operations (and ideally chain them), then we could drop all the lazy loading stuff from xarray in favor of dask, which would be a real win.
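
For reference, the dask.array spellings of those four operations look roughly like this (a sketch: convert_to_datetime64 and characters_to_string are the hypothetical helpers named in the list above, given toy definitions here):

import numpy as np
import dask.array as da

array = da.ones((10, 4), chunks=(5, 4), dtype="int64")

casted = array.astype(bool)   # casting
scaled = 0.5 + 0.5 * array    # chained arithmetic with scalars

def convert_to_datetime64(block):
    # toy element-wise conversion with a dtype change
    return block.astype("datetime64[ns]")

dates = da.map_blocks(convert_to_datetime64, array, dtype=np.datetime64)

def characters_to_string(block):
    # toy aggregation that collapses the last axis (axis -1 of this 2-D array)
    return block.sum(axis=-1)

joined = da.map_blocks(characters_to_string, array, drop_axis=1)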


The downside of this switch is that lazy loading of data from disk would now require dask, which would be at least slightly annoying to some users. But it's probably worth the tradeoff from a maintainability perspective, and also to fix issues like #1372.

Reactions: 👍 1


CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);