issue_comments

7 rows where author_association = "MEMBER", issue = 224553135 ("slow performance with open_mfdataset"), and user = 1217238 (shoyer), sorted by updated_at descending

Columns: id, html_url, issue_url, node_id, user, created_at, updated_at (sort key), author_association, body, reactions, performed_via_github_app, issue
439454213 https://github.com/pydata/xarray/issues/1385#issuecomment-439454213 https://api.github.com/repos/pydata/xarray/issues/1385 MDEyOklzc3VlQ29tbWVudDQzOTQ1NDIxMw== shoyer 1217238 2018-11-16T16:46:55Z 2018-11-16T16:46:55Z MEMBER

Does it take 10 seconds even to open a single file? The big mystery is what that top line ("_operator.getitem") is, but my guess is it's netCDF4-python. h5netcdf might also give different results... On Fri, Nov 16, 2018 at 8:20 AM chuaxr notifications@github.com wrote:

Sorry, I think the speedup had to do with accessing a file that had previously been loaded, rather than with decode_cf. Here's the output of %prun using two different files of approximately the same size (~75 GB), run from a notebook without using distributed (which doesn't lead to any speedup):

Output of %prun ds = xr.open_mfdataset('/work/xrc/AM4_skc/atmos_level.1999010100-2000123123.sphum.nc', chunks={'lat':20,'time':50,'lon':12,'pfull':11})

      780980 function calls (780741 primitive calls) in 55.374 seconds

Ordered by: internal time

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     7   54.448    7.778   54.448    7.778 {built-in method _operator.getitem}
764838    0.473    0.000    0.473    0.000 core.py:169(<genexpr>)
     3    0.285    0.095    0.758    0.253 core.py:169(<listcomp>)
     2    0.041    0.020    0.041    0.020 {cftime._cftime.num2date}
     3    0.040    0.013    0.821    0.274 core.py:173(getem)
     1    0.027    0.027   55.374   55.374 <string>:1(<module>)

Output of %prun ds = xr.open_mfdataset('/work/xrc/AM4_skc/atmos_level.2001010100-2002123123.temp.nc', chunks={'lat':20,'time':50,'lon':12,'pfull':11}, decode_cf=False)

      772212 function calls (772026 primitive calls) in 56.000 seconds

Ordered by: internal time

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     5   55.213   11.043   55.214   11.043 {built-in method _operator.getitem}
764838    0.486    0.000    0.486    0.000 core.py:169(<genexpr>)
     3    0.185    0.062    0.671    0.224 core.py:169(<listcomp>)
     3    0.041    0.014    0.735    0.245 core.py:173(getem)
     1    0.027    0.027   56.001   56.001 <string>:1(<module>)

/work isn't a remote archive, so it surprises me that this should happen.

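A quick way to test the netCDF4-python guess from the reply above is to repeat the open through a different backend; a minimal sketch, reusing the path and chunks from the profiled call (the engine choice is the only variable under test):

    import xarray as xr

    # Same open as above, but through h5netcdf instead of netCDF4-python;
    # profiling this under %prun should show whether the _operator.getitem
    # time follows the backend.
    ds = xr.open_mfdataset(
        '/work/xrc/AM4_skc/atmos_level.1999010100-2000123123.sphum.nc',
        engine='h5netcdf',
        chunks={'lat': 20, 'time': 50, 'lon': 12, 'pfull': 11},
    )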

439263419 https://github.com/pydata/xarray/issues/1385#issuecomment-439263419 https://api.github.com/repos/pydata/xarray/issues/1385 MDEyOklzc3VlQ29tbWVudDQzOTI2MzQxOQ== shoyer 1217238 2018-11-16T02:45:05Z 2018-11-16T02:45:05Z MEMBER

@chuaxr What do you see from %prun when opening the dataset? That might point to the bottleneck.

One way to fix this would be to move our call to decode_cf() in open_dataset() to after applying chunking, i.e., to switch up the order of operations on these lines: https://github.com/pydata/xarray/blob/f547ed0b379ef70a3bda5e77f66de95ec2332ddf/xarray/backends/api.py#L270-L296

In practice, the difference is between using xarray's internal lazy array classes for decoding and using dask for decoding. I would expect to see small differences in performance between these approaches (especially when actually computing data), but for constructing the computation graph I would expect them to have similar performance. It is puzzling that dask is orders of magnitude faster -- that suggests that something else is going wrong in the normal code path for decode_cf(). It would certainly be good to understand this before trying to apply any fixes.
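From the user side, that reordering can be approximated today; a minimal sketch, assuming a local NetCDF file (path and chunks are illustrative): open with decoding disabled so chunking is applied to the raw variables, then decode the already-dask-backed dataset:

    import xarray as xr

    # Chunking happens first, so decode_cf() operates on dask arrays
    # rather than on xarray's internal lazy array classes.
    raw = xr.open_dataset('data.nc', decode_cf=False, chunks={'time': 50})
    ds = xr.decode_cf(raw)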

438873285 https://github.com/pydata/xarray/issues/1385#issuecomment-438873285 https://api.github.com/repos/pydata/xarray/issues/1385 MDEyOklzc3VlQ29tbWVudDQzODg3MzI4NQ== shoyer 1217238 2018-11-15T00:45:53Z 2018-11-15T00:45:53Z MEMBER

@chuaxr I assume you're testing this with xarray 0.11?

It would be good to do some profiling to figure out what is going wrong here.
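For instance (a sketch; the file pattern is illustrative), in IPython or Jupyter:

    import xarray as xr

    # Sort the profile by cumulative time to surface the dominant call:
    # %prun -s cumulative ds = xr.open_mfdataset('/path/to/files*.nc')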

437630511 https://github.com/pydata/xarray/issues/1385#issuecomment-437630511 https://api.github.com/repos/pydata/xarray/issues/1385 MDEyOklzc3VlQ29tbWVudDQzNzYzMDUxMQ== shoyer 1217238 2018-11-10T23:38:10Z 2018-11-10T23:38:10Z MEMBER

Was this fixed by https://github.com/pydata/xarray/pull/2047?

371933603 https://github.com/pydata/xarray/issues/1385#issuecomment-371933603 https://api.github.com/repos/pydata/xarray/issues/1385 MDEyOklzc3VlQ29tbWVudDM3MTkzMzYwMw== shoyer 1217238 2018-03-09T20:17:19Z 2018-03-09T20:17:19Z MEMBER

OK, so it seems that we need a change to disable wrapping dask arrays with LazilyIndexedArray. Dask arrays are already lazy!
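The shape of that change, as a hypothetical sketch (maybe_wrap_lazily is an illustrative name, not xarray's actual code, and LazilyIndexedArray's name and import location have moved between xarray versions):

    import dask.array as da
    from xarray.core.indexing import LazilyIndexedArray

    def maybe_wrap_lazily(array):
        # Dask arrays are already lazy; wrapping them again only adds an
        # indirection layer that slows down graph construction.
        if isinstance(array, da.Array):
            return array
        return LazilyIndexedArray(array)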

Reactions: +1 × 3
370092011 https://github.com/pydata/xarray/issues/1385#issuecomment-370092011 https://api.github.com/repos/pydata/xarray/issues/1385 MDEyOklzc3VlQ29tbWVudDM3MDA5MjAxMQ== shoyer 1217238 2018-03-02T23:58:26Z 2018-03-02T23:58:26Z MEMBER

@rabernat How does performance compare if you call xarray.decode_cf() on the opened dataset? The adjustments I recently made to lazy decoding should only help once the data is already loaded into dask.
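Concretely, the comparison being asked for looks something like this (a sketch; the file pattern is illustrative):

    import xarray as xr

    # Decode during open (the slow path being investigated) ...
    ds1 = xr.open_mfdataset('/path/to/files*.nc')

    # ... versus decoding only after the data is already wrapped in dask.
    ds2 = xr.decode_cf(xr.open_mfdataset('/path/to/files*.nc', decode_cf=False))

Timing the two opens (e.g. with %time) shows whether lazy decoding is the bottleneck.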

297539517 https://github.com/pydata/xarray/issues/1385#issuecomment-297539517 https://api.github.com/repos/pydata/xarray/issues/1385 MDEyOklzc3VlQ29tbWVudDI5NzUzOTUxNw== shoyer 1217238 2017-04-26T20:59:23Z 2017-04-26T20:59:23Z MEMBER

For example, can I give a hint to xarray that this reindex_variables step is not necessary

Yes, adding a boolean argument prealigned, which defaults to False, to concat seems like a very reasonable optimization here.

But more generally, I am a little surprised by how slow pandas.Index.get_indexer and pandas.Index.is_unique are. This suggests we should add a fast-path optimization to skip these steps in reindex_variables: https://github.com/pydata/xarray/blob/ab4ffee919d4abe9f6c0cf6399a5827c38b9eb5d/xarray/core/alignment.py#L302-L306

Basically, if index.equals(target), we should just set indexer = np.arange(target.size). Although, if we have duplicate values in the index, the operation should arguably fail for correctness.
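A sketch of that fast path (get_reindexer is an illustrative name, not the actual code in alignment.py):

    import numpy as np

    def get_reindexer(index, target):
        # Fast path: identical indexes need no pandas.Index.get_indexer call.
        if index.equals(target):
            if not index.is_unique:
                # Duplicate values make reindexing ill-defined.
                raise ValueError('cannot reindex with a non-unique index')
            return np.arange(target.size)
        return index.get_indexer(target)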


Table schema:
CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);