github: issue_comments: 7 rows where issue = 221387277 sorted by updated

7 rows where issue = 221387277 sorted by updated_at descending

Search:

descending

id	html_url	issue_url	node_id	user	created_at	updated_at ▲	author_association	body	reactions	issue
338019155	https://github.com/pydata/xarray/issues/1372#issuecomment-338019155	https://api.github.com/repos/pydata/xarray/issues/1372	MDEyOklzc3VlQ29tbWVudDMzODAxOTE1NQ==	rabernat 1197350	2017-10-19T19:56:12Z	2017-10-19T20:06:53Z	MEMBER	I just hit this issue. I tried to reproduce it with a synthetic dataset, as in @shoyer's example, but I couldn't. I can only reproduce it with data loaded from netcdf4 via open_mfdataset. I downloaded one year of air-temperature data from NARR: ftp://ftp.cdc.noaa.gov/Datasets/NARR/Dailies/pressure/ I load it this way (preprocessing is necessary to resolve conflict between `missing_value` and `_FillValue`): `python def preprocess_narr(ds): del ds.air.attrs['_FillValue'] ds = ds.set_coords(['lon', 'lat', 'Lambert_Conformal', 'time_bnds']) return ds ds = xr.open_mfdataset('air..nc', preprocess=preprocess_narr, decode_cf=False) print(ds) print(ds.chunks)` <xarray.Dataset> Dimensions: (level: 29, nbnds: 2, time: 365, x: 349, y: 277) Coordinates: level (level) float32 1000.0 975.0 950.0 925.0 900.0 875.0 ... lat (y, x) float32 1.0 1.10431 1.20829 1.31196 1.4153 ... lon (y, x) float32 -145.5 -145.315 -145.13 -144.943 ... * y (y) float32 0.0 32463.0 64926.0 97389.0 129852.0 ... * x (x) float32 0.0 32463.0 64926.0 97389.0 129852.0 ... Lambert_Conformal int32 -2147483647 * time (time) float64 1.666e+06 1.666e+06 1.666e+06 ... time_bnds (time, nbnds) float64 1.666e+06 1.666e+06 1.666e+06 ... Dimensions without coordinates: nbnds Data variables: air (time, level, y, x) float32 297.475 297.463 297.453 ... Frozen(SortedKeysDict({'y': (277,), 'x': (349,), 'time': (31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31), 'level': (29,), 'nbnds': (2,)})) If I try to decode cf, it returns instantly. `python ds_decode = xr.decode_cf(ds) print(ds_decode)` <xarray.Dataset> Dimensions: (level: 29, nbnds: 2, time: 365, x: 349, y: 277) Coordinates: * level (level) float32 1000.0 975.0 950.0 925.0 900.0 875.0 ... lat (y, x) float32 1.0 1.10431 1.20829 1.31196 1.4153 ... lon (y, x) float32 -145.5 -145.315 -145.13 -144.943 ... * y (y) float32 0.0 32463.0 64926.0 97389.0 129852.0 ... * x (x) float32 0.0 32463.0 64926.0 97389.0 129852.0 ... Lambert_Conformal int32 -2147483647 * time (time) datetime64[ns] 1990-01-01 1990-01-02 ... time_bnds (time, nbnds) float64 1.666e+06 1.666e+06 1.666e+06 ... Dimensions without coordinates: nbnds Data variables: air (time, level, y, x) float64 297.5 297.5 297.5 297.4 ... There are no more chunks: `ds_decode.air.chunks is None` but `ds_decode.air._in_memory is False`. This is already a weird situation, since the data is not in memory, but it is not a dask array either. If I try to do anything beyond this with the data, it triggers eager computation. Even if I just call `type(ds.air.data)` it computes. If I do any arithmetic, it computes. In my case, I could get around this problem if the preprocess function in `open_mfdataset` were applied before the cf_decoding of the store. But in general, we really need to make `decode_cf` fully dask compatible.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	decode_cf() loads chunked arrays 221387277
293751509	https://github.com/pydata/xarray/issues/1372#issuecomment-293751509	https://api.github.com/repos/pydata/xarray/issues/1372	MDEyOklzc3VlQ29tbWVudDI5Mzc1MTUwOQ==	shoyer 1217238	2017-04-13T01:27:00Z	2017-04-13T01:27:00Z	MEMBER	But for this application in particular presumably all of these already tasks fuse together into a single composite task, yes? Then we would need to commute certain operations in some order (preferring to push down slicing operations). Then we would need to merge all of the slices, yes? Yes, that sounds right to me.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	decode_cf() loads chunked arrays 221387277
293749896	https://github.com/pydata/xarray/issues/1372#issuecomment-293749896	https://api.github.com/repos/pydata/xarray/issues/1372	MDEyOklzc3VlQ29tbWVudDI5Mzc0OTg5Ng==	mrocklin 306380	2017-04-13T01:15:19Z	2017-04-13T01:15:19Z	MEMBER	Optimization has come up a bit recently. In other projects like dask-glm I've actually been thinking about just turning it off entirely. The spread of desires here is wide. I think it would be good to make it a bit more customizable so that different applications can more easily customize their optimization suite (see https://github.com/dask/dask/issues/2206) But for this application in particular presumably all of these already tasks `fuse` together into a single composite task, yes? Then we would need to commute certain operations in some order (preferring to push down slicing operations). Then we would need to merge all of the slices, yes? I think it would be interesting to see how fast one could make such an optimization. If the answer is "decently fast" then we might also want to look into doing inplace operations. If the answer is "only sorta-fast" then I would hesitate before adding this as a default to dask.array but would again stress that we should work to make different projects customize optimization to suit their own needs.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	decode_cf() loads chunked arrays 221387277
293748654	https://github.com/pydata/xarray/issues/1372#issuecomment-293748654	https://api.github.com/repos/pydata/xarray/issues/1372	MDEyOklzc3VlQ29tbWVudDI5Mzc0ODY1NA==	shoyer 1217238	2017-04-13T01:05:42Z	2017-04-13T01:06:17Z	MEMBER	Ah, so here's the thing: `decode_cf` is actually lazy, but it uses some old xarray machinery for lazily decoding conventions instead of dask: `In [21]: xr.decode_cf(chunked).foo.variable._data Out[21]: LazilyIndexedArray(array=dask.array<xarray-foo, shape=(100,), dtype=int64, chunksize=(100,)>, key=(slice(None, None, None),))` You could call `.chunk()` on it again to load it into dask, but even though the whole thing is lazy, the new dask arrays don't know they are holding dask arrays, which makes life difficult for dask. You might ask why this separate lazy compute machinery exists. The answer is that dask fails to optimize element-wise operations like `(scale * array)[subset]` -> `scale * array[subset]`, which is a critical optimization for lazy decoding of large datasets. See https://github.com/dask/dask/issues/746 for discussion and links to PRs about this. @jcrist had a solution that worked, but it slowed down every dask array operations by 20%, which wasn't a great win. I wonder if this is worth revisiting with a simpler, less general optimization pass that doesn't bother with broadcasting. See the subclasses of `NDArrayMixin` in xarray/conventions.py for examples of the sorts of functionality we need: - Casting (e.g., `array.astype(bool)`). - Chained arithmetic with scalars (e.g., `0.5 + 0.5 * array`). - Custom element-wise operations (e.g., `map_blocks(convert_to_datetime64, array, dtype=np.datetime64)`) - Custom aggregations that drop a dimension (e.g., `map_blocks(characters_to_string, array, drop_axis=-1)`) If we could optimize all these operations (and ideally chain them), then we could drop all the lazy loading stuff from xarray in favor of dask, which would be a real win. @mrocklin any thoughts?	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	decode_cf() loads chunked arrays 221387277
293743835	https://github.com/pydata/xarray/issues/1372#issuecomment-293743835	https://api.github.com/repos/pydata/xarray/issues/1372	MDEyOklzc3VlQ29tbWVudDI5Mzc0MzgzNQ==	shoyer 1217238	2017-04-13T00:27:32Z	2017-04-13T00:27:32Z	MEMBER	A simple test case: ``` In [16]: ds = xr.Dataset({'foo': ('x', np.arange(100))}) In [17]: chunked = ds.chunk() In [18]: chunked.foo Out[18]: <xarray.DataArray 'foo' (x: 100)> dask.array<xarray-foo, shape=(100,), dtype=int64, chunksize=(100,)> Dimensions without coordinates: x In [19]: xr.decode_cf(chunked).foo Out[19]: <xarray.DataArray 'foo' (x: 100)> array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99]) Dimensions without coordinates: x ```	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	decode_cf() loads chunked arrays 221387277
293740415	https://github.com/pydata/xarray/issues/1372#issuecomment-293740415	https://api.github.com/repos/pydata/xarray/issues/1372	MDEyOklzc3VlQ29tbWVudDI5Mzc0MDQxNQ==	spencerahill 6200806	2017-04-13T00:02:56Z	2017-04-13T00:02:56Z	CONTRIBUTOR	+1 we have come across this recently also in aospy	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	decode_cf() loads chunked arrays 221387277
293725434	https://github.com/pydata/xarray/issues/1372#issuecomment-293725434	https://api.github.com/repos/pydata/xarray/issues/1372	MDEyOklzc3VlQ29tbWVudDI5MzcyNTQzNA==	shoyer 1217238	2017-04-12T22:28:17Z	2017-04-12T22:28:17Z	MEMBER	`decode_cf()` should definitely work lazily on dask array. If not, I would consider that a bug.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	decode_cf() loads chunked arrays 221387277

Advanced export

JSON shape: default, array, newline-delimited, object

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);