issue_comments

4 rows where issue = 221387277 (decode_cf() loads chunked arrays) and user = 1217238 (shoyer, MEMBER), sorted by updated_at descending

shoyer (MEMBER) · 2017-04-13T01:27:00Z
https://github.com/pydata/xarray/issues/1372#issuecomment-293751509

> But for this application in particular, presumably all of these tasks already fuse together into a single composite task, yes? Then we would need to commute certain operations in some order (preferring to push down slicing operations). Then we would need to merge all of the slices, yes?

Yes, that sounds right to me.
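
For a concrete illustration of the "merge all of the slices" step, here is a minimal sketch; `merge_slices` is a hypothetical helper for illustration (not xarray or dask API) and assumes positive-step slices only:

```
import numpy as np

def merge_slices(inner, outer, length):
    """Compose x[inner][outer] into one equivalent slice.

    Hypothetical helper; positive steps only.
    """
    start1, stop1, step1 = inner.indices(length)
    inner_len = max(0, -(-(stop1 - start1) // step1))  # ceil division
    start2, stop2, step2 = outer.indices(inner_len)
    return slice(start1 + start2 * step1,
                 start1 + stop2 * step1,
                 step1 * step2)

x = np.arange(100)
merged = merge_slices(slice(10, 90, 2), slice(2, None, 3), len(x))
assert (x[10:90:2][2::3] == x[merged]).all()
```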

shoyer (MEMBER) · created 2017-04-13T01:05:42Z · updated 2017-04-13T01:06:17Z
https://github.com/pydata/xarray/issues/1372#issuecomment-293748654

Ah, so here's the thing: `decode_cf` is actually lazy, but it uses some old xarray machinery for lazily decoding conventions instead of dask:

```
In [21]: xr.decode_cf(chunked).foo.variable._data
Out[21]: LazilyIndexedArray(array=dask.array<xarray-foo, shape=(100,), dtype=int64, chunksize=(100,)>, key=(slice(None, None, None),))
```

You could call `.chunk()` on it again to load it into dask, but even though the whole thing is lazy, the new dask arrays don't know they are holding dask arrays, which makes life difficult for dask.

You might ask why this separate lazy compute machinery exists. The answer is that dask fails to optimize element-wise operations like `(scale * array)[subset] -> scale * array[subset]`, which is a critical optimization for lazy decoding of large datasets (see the sketch below).
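
As a minimal sketch of the missed rewrite, using plain dask.array: with a single large chunk (how an on-disk variable typically appears), the naive order multiplies the whole chunk before slicing, while the pushed-down order touches only the requested elements:

```
import numpy as np
import dask.array as da

# Stand-in for a large lazily loaded variable: one chunk of 1,000,000 elements.
x = da.from_array(np.arange(1_000_000), chunks=-1)
scale = 0.5

# Naive order: the multiply task runs over the full chunk before slicing.
naive = (scale * x)[:10]

# Pushed-down order: slice first, so the multiply sees only 10 elements.
pushed = scale * x[:10]

np.testing.assert_array_equal(naive.compute(), pushed.compute())
```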

See https://github.com/dask/dask/issues/746 for discussion and links to PRs about this. @jcrist had a solution that worked, but it slowed down every dask array operation by 20%, which wasn't a good trade-off.

I wonder if this is worth revisiting with a simpler, less general optimization pass that doesn't bother with broadcasting. See the subclasses of NDArrayMixin in xarray/conventions.py for examples of the sorts of functionality we need (illustrated in the sketch below):

- Casting (e.g., `array.astype(bool)`).
- Chained arithmetic with scalars (e.g., `0.5 + 0.5 * array`).
- Custom element-wise operations (e.g., `map_blocks(convert_to_datetime64, array, dtype=np.datetime64)`).
- Custom aggregations that drop a dimension (e.g., `map_blocks(characters_to_string, array, drop_axis=-1)`).

If we could optimize all these operations (and ideally chain them), then we could drop all the lazy loading stuff from xarray in favor of dask, which would be a real win. @mrocklin any thoughts?
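
As a rough sketch, the four kinds of operations above expressed directly against dask.array (the lambdas stand in for the hypothetical convert_to_datetime64 and characters_to_string helpers named above):

```
import numpy as np
import dask.array as da

x = da.from_array(np.arange(10), chunks=5)

# Casting.
mask = x.astype(bool)

# Chained arithmetic with scalars (CF-style scale-and-offset decoding).
decoded = 0.5 + 0.5 * x

# Custom element-wise operation via map_blocks.
times = x.map_blocks(lambda b: b.astype('datetime64[s]'),
                     dtype=np.dtype('datetime64[s]'))

# Custom aggregation dropping a dimension: join characters into strings.
# Axis 1 is a single chunk, as map_blocks requires for a dropped axis.
chars = da.from_array(np.array([list('abc'), list('def')]), chunks=(1, 3))
strings = chars.map_blocks(lambda b: np.array([''.join(row) for row in b]),
                           drop_axis=1, dtype=np.dtype('U3'))

print(decoded.compute(), strings.compute())
```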

shoyer (MEMBER) · 2017-04-13T00:27:32Z
https://github.com/pydata/xarray/issues/1372#issuecomment-293743835

A simple test case:

```
In [16]: ds = xr.Dataset({'foo': ('x', np.arange(100))})

In [17]: chunked = ds.chunk()

In [18]: chunked.foo
Out[18]:
<xarray.DataArray 'foo' (x: 100)>
dask.array<xarray-foo, shape=(100,), dtype=int64, chunksize=(100,)>
Dimensions without coordinates: x

In [19]: xr.decode_cf(chunked).foo
Out[19]:
<xarray.DataArray 'foo' (x: 100)>
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
       34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50,
       51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67,
       68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84,
       85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99])
Dimensions without coordinates: x
```
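
As a minimal sketch of the `.chunk()` workaround mentioned in the comment above (assuming, per that comment, that re-chunking loads the lazily decoded data back into dask):

```
import numpy as np
import xarray as xr

ds = xr.Dataset({'foo': ('x', np.arange(100))})
chunked = ds.chunk()

# Re-chunk after decoding to get dask-backed variables again.
decoded = xr.decode_cf(chunked).chunk()
print(decoded.foo.data)  # a dask array; nothing computed yet
```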

shoyer (MEMBER) · 2017-04-12T22:28:17Z
https://github.com/pydata/xarray/issues/1372#issuecomment-293725434

decode_cf() should definitely work lazily on dask arrays. If not, I would consider that a bug.


The issue_comments table schema:
CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);