issue_comments

7 rows where author_association = "MEMBER", issue = 224553135 ("slow performance with open_mfdataset"), and user = 1217238 (shoyer), sorted by updated_at descending

Columns: id, html_url, issue_url, node_id, user, created_at, updated_at (sort key), author_association, body, reactions, performed_via_github_app, issue
439454213 https://github.com/pydata/xarray/issues/1385#issuecomment-439454213 https://api.github.com/repos/pydata/xarray/issues/1385 MDEyOklzc3VlQ29tbWVudDQzOTQ1NDIxMw== shoyer 1217238 2018-11-16T16:46:55Z 2018-11-16T16:46:55Z MEMBER

Does it take 10 seconds even to open a single file? The big mystery is what that top line ("_operator.getitem") is, but my guess is it's netCDF4-python. h5netcdf might also give different results... On Fri, Nov 16, 2018 at 8:20 AM chuaxr notifications@github.com wrote:

Sorry, I think the speedup had to do with accessing a file that had previously been loaded, rather than with decode_cf. Here's the output of %prun using two different files of approximately the same size (~75 GB), run from a notebook without using distributed (which doesn't lead to any speedup):

Output of %prun ds = xr.open_mfdataset('/work/xrc/AM4_skc/atmos_level.1999010100-2000123123.sphum.nc', chunks={'lat':20,'time':50,'lon':12,'pfull':11})

      780980 function calls (780741 primitive calls) in 55.374 seconds

Ordered by: internal time

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     7   54.448    7.778   54.448    7.778 {built-in method _operator.getitem}
764838    0.473    0.000    0.473    0.000 core.py:169(<genexpr>)
     3    0.285    0.095    0.758    0.253 core.py:169(<listcomp>)
     2    0.041    0.020    0.041    0.020 {cftime._cftime.num2date}
     3    0.040    0.013    0.821    0.274 core.py:173(getem)
     1    0.027    0.027   55.374   55.374 <string>:1(<module>)

Output of %prun ds = xr.open_mfdataset('/work/xrc/AM4_skc/atmos_level.2001010100-2002123123.temp.nc', chunks={'lat':20,'time':50,'lon':12,'pfull':11}, decode_cf=False)

      772212 function calls (772026 primitive calls) in 56.000 seconds

Ordered by: internal time

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     5   55.213   11.043   55.214   11.043 {built-in method _operator.getitem}
764838    0.486    0.000    0.486    0.000 core.py:169(<genexpr>)
     3    0.185    0.062    0.671    0.224 core.py:169(<listcomp>)
     3    0.041    0.014    0.735    0.245 core.py:173(getem)
     1    0.027    0.027   56.001   56.001 <string>:1(<module>)

/work isn't a remote archive, so it surprises me that this should happen.

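A quick way to test the netCDF4-python guess from the reply above is to repeat the open through a different backend; a minimal sketch, reusing the path and chunks from the profiled call (the engine choice is the only variable under test):

    import xarray as xr

    # Same open as above, but through h5netcdf instead of netCDF4-python;
    # profiling this under %prun should show whether the _operator.getitem
    # time follows the backend.
    ds = xr.open_mfdataset(
        '/work/xrc/AM4_skc/atmos_level.1999010100-2000123123.sphum.nc',
        engine='h5netcdf',
        chunks={'lat': 20, 'time': 50, 'lon': 12, 'pfull': 11},
    )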

439263419 https://github.com/pydata/xarray/issues/1385#issuecomment-439263419 https://api.github.com/repos/pydata/xarray/issues/1385 MDEyOklzc3VlQ29tbWVudDQzOTI2MzQxOQ== shoyer 1217238 2018-11-16T02:45:05Z 2018-11-16T02:45:05Z MEMBER

@chuaxr What do you see from %prun when opening the dataset? That might point to the bottleneck.

One way to fix this would be to move our call to decode_cf() in open_dataset() to after applying chunking, i.e., to switch up the order of operations on these lines: https://github.com/pydata/xarray/blob/f547ed0b379ef70a3bda5e77f66de95ec2332ddf/xarray/backends/api.py#L270-L296

In practice, the difference is between using xarray's internal lazy array classes for decoding and using dask for decoding. I would expect to see small differences in performance between these approaches (especially when actually computing data), but for constructing the computation graph I would expect them to have similar performance. It is puzzling that dask is orders of magnitude faster -- that suggests that something else is going wrong in the normal code path for decode_cf(). It would certainly be good to understand this before trying to apply any fixes.
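From the user side, that reordering can be approximated today; a minimal sketch, assuming a local NetCDF file (path and chunks are illustrative): open with decoding disabled so chunking is applied to the raw variables, then decode the already-dask-backed dataset:

    import xarray as xr

    # Chunking happens first, so decode_cf() operates on dask arrays
    # rather than on xarray's internal lazy array classes.
    raw = xr.open_dataset('data.nc', decode_cf=False, chunks={'time': 50})
    ds = xr.decode_cf(raw)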

438873285 https://github.com/pydata/xarray/issues/1385#issuecomment-438873285 https://api.github.com/repos/pydata/xarray/issues/1385 MDEyOklzc3VlQ29tbWVudDQzODg3MzI4NQ== shoyer 1217238 2018-11-15T00:45:53Z 2018-11-15T00:45:53Z MEMBER

@chuaxr I assume you're testing this with xarray 0.11?

It would be good to do some profiling to figure out what is going wrong here.
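For instance (a sketch; the file pattern is illustrative), in IPython or Jupyter:

    import xarray as xr

    # Sort the profile by cumulative time to surface the dominant call:
    # %prun -s cumulative ds = xr.open_mfdataset('/path/to/files*.nc')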

437630511 https://github.com/pydata/xarray/issues/1385#issuecomment-437630511 https://api.github.com/repos/pydata/xarray/issues/1385 MDEyOklzc3VlQ29tbWVudDQzNzYzMDUxMQ== shoyer 1217238 2018-11-10T23:38:10Z 2018-11-10T23:38:10Z MEMBER

Was this fixed by https://github.com/pydata/xarray/pull/2047?

371933603 https://github.com/pydata/xarray/issues/1385#issuecomment-371933603 https://api.github.com/repos/pydata/xarray/issues/1385 MDEyOklzc3VlQ29tbWVudDM3MTkzMzYwMw== shoyer 1217238 2018-03-09T20:17:19Z 2018-03-09T20:17:19Z MEMBER

OK, so it seems that we need a change to disable wrapping dask arrays with LazilyIndexedArray. Dask arrays are already lazy!
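The shape of that change, as a hypothetical sketch (maybe_wrap_lazily is an illustrative name, not xarray's actual code, and LazilyIndexedArray's name and import location have moved between xarray versions):

    import dask.array as da
    from xarray.core.indexing import LazilyIndexedArray

    def maybe_wrap_lazily(array):
        # Dask arrays are already lazy; wrapping them again only adds an
        # indirection layer that slows down graph construction.
        if isinstance(array, da.Array):
            return array
        return LazilyIndexedArray(array)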

Reactions: +1 × 3
370092011 https://github.com/pydata/xarray/issues/1385#issuecomment-370092011 https://api.github.com/repos/pydata/xarray/issues/1385 MDEyOklzc3VlQ29tbWVudDM3MDA5MjAxMQ== shoyer 1217238 2018-03-02T23:58:26Z 2018-03-02T23:58:26Z MEMBER

@rabernat How does performance compare if you call xarray.decode_cf() on the opened dataset? The adjustments I recently made to lazy decoding should only help once the data is already loaded into dask.
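Concretely, the comparison being asked for looks something like this (a sketch; the file pattern is illustrative):

    import xarray as xr

    # Decode during open (the slow path being investigated) ...
    ds1 = xr.open_mfdataset('/path/to/files*.nc')

    # ... versus decoding only after the data is already wrapped in dask.
    ds2 = xr.decode_cf(xr.open_mfdataset('/path/to/files*.nc', decode_cf=False))

Timing the two opens (e.g. with %time) shows whether lazy decoding is the bottleneck.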

297539517 https://github.com/pydata/xarray/issues/1385#issuecomment-297539517 https://api.github.com/repos/pydata/xarray/issues/1385 MDEyOklzc3VlQ29tbWVudDI5NzUzOTUxNw== shoyer 1217238 2017-04-26T20:59:23Z 2017-04-26T20:59:23Z MEMBER

For example, can I give a hint to xarray that this reindex_variables step is not necessary

Yes, adding a boolean argument prealigned, which defaults to False, to concat seems like a very reasonable optimization here.

But more generally, I am a little surprised by how slow pandas.Index.get_indexer and pandas.Index.is_unique are. This suggests we should add a fast-path optimization to skip these steps in reindex_variables: https://github.com/pydata/xarray/blob/ab4ffee919d4abe9f6c0cf6399a5827c38b9eb5d/xarray/core/alignment.py#L302-L306

Basically, if index.equals(target), we should just set indexer = np.arange(target.size). Although, if we have duplicate values in the index, the operation should arguably fail for correctness.
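A sketch of that fast path (get_reindexer is an illustrative name, not the actual code in alignment.py):

    import numpy as np

    def get_reindexer(index, target):
        # Fast path: identical indexes need no pandas.Index.get_indexer call.
        if index.equals(target):
            if not index.is_unique:
                # Duplicate values make reindexing ill-defined.
                raise ValueError('cannot reindex with a non-unique index')
            return np.arange(target.size)
        return index.get_indexer(target)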


Table schema:
CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);