issue_comments

10 rows where author_association = "MEMBER" and issue = 212561278 sorted by updated_at descending

user 3

  • rabernat 4
  • shoyer 3
  • jhamman 3

issue 1

  • open_mfdataset() significantly slower on 0.9.1 vs. 0.8.2 · 10

author_association 1

  • MEMBER · 10
id html_url issue_url node_id user created_at updated_at ▲ author_association body reactions performed_via_github_app issue
344437569 https://github.com/pydata/xarray/issues/1301#issuecomment-344437569 https://api.github.com/repos/pydata/xarray/issues/1301 MDEyOklzc3VlQ29tbWVudDM0NDQzNzU2OQ== jhamman 2443309 2017-11-14T23:41:57Z 2017-11-14T23:41:57Z MEMBER

@friedrichknuth, any chance you can take a look at this with the latest v0.10 release candidate?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset() significantly slower on 0.9.1 vs. 0.8.2 212561278
291516997 https://github.com/pydata/xarray/issues/1301#issuecomment-291516997 https://api.github.com/repos/pydata/xarray/issues/1301 MDEyOklzc3VlQ29tbWVudDI5MTUxNjk5Nw== rabernat 1197350 2017-04-04T14:27:18Z 2017-04-04T14:27:18Z MEMBER

My understanding is that you are concatenating across the variable obs, so no, it wouldn't make sense to have obs be the same in all the datasets.

My tests showed that it's not necessarily the concat step that is slowing this down. Your profiling suggests that it's a netCDF datetime decoding issue.

I wonder if @shoyer or @jhamman have any ideas about how to improve performance here.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset() significantly slower on 0.9.1 vs. 0.8.2 212561278
286220317 https://github.com/pydata/xarray/issues/1301#issuecomment-286220317 https://api.github.com/repos/pydata/xarray/issues/1301 MDEyOklzc3VlQ29tbWVudDI4NjIyMDMxNw== rabernat 1197350 2017-03-13T19:40:50Z 2017-03-13T19:40:50Z MEMBER

And the length of obs is different in each dataset.

```python
for myds in dsets:
    print(myds.dims)
Frozen(SortedKeysDict({u'obs': 7537613}))
Frozen(SortedKeysDict({u'obs': 7247697}))
Frozen(SortedKeysDict({u'obs': 7497680}))
Frozen(SortedKeysDict({u'obs': 7661468}))
Frozen(SortedKeysDict({u'obs': 5750197}))
```
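Since the files are joined end to end along obs, the combined dimension is simply the sum of the per-file lengths. A quick plain-Python check using the five sizes printed in this comment (no xarray involved; `obs_lengths`, `total_obs` and `offsets` are names made up for this sketch):

```python
from itertools import accumulate

# Per-file lengths of the `obs` dimension, copied from the printed dims.
obs_lengths = [7537613, 7247697, 7497680, 7661468, 5750197]

# Concatenating along `obs` stacks the files end to end, so the
# combined length is the sum of the individual lengths.
total_obs = sum(obs_lengths)

# Start offset of each file within the concatenated dimension.
offsets = [0] + list(accumulate(obs_lengths))[:-1]

print(total_obs)
print(offsets)
```

That works out to roughly 35.7 million observations in the combined dataset, which gives a sense of how much time data has to be decoded.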

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset() significantly slower on 0.9.1 vs. 0.8.2 212561278
286219858 https://github.com/pydata/xarray/issues/1301#issuecomment-286219858 https://api.github.com/repos/pydata/xarray/issues/1301 MDEyOklzc3VlQ29tbWVudDI4NjIxOTg1OA== rabernat 1197350 2017-03-13T19:39:15Z 2017-03-13T19:39:15Z MEMBER

There is definitely something funky with these datasets that is causing xarray to go very slow.

This is fast:

```python
%time dsets = [xr.open_dataset(fname) for fname in glob('*.nc')]
CPU times: user 1.1 s, sys: 664 ms, total: 1.76 s
Wall time: 1.78 s
```

But even just trying to print the repr is slow:

```python
%time print(dsets[0])
CPU times: user 3.66 s, sys: 3.49 s, total: 7.15 s
Wall time: 7.28 s
```

Maybe some of this has to do with the change in 0.9.0 that allows index-less dimensions (i.e. coordinates are optional). All of these datasets have such a dimension, e.g.

    <xarray.Dataset>
    Dimensions:  (obs: 7247697)
    Coordinates:
        lon      (obs) float64 -124.3 -124.3 ...
        lat      (obs) float64 44.64 44.64 ...
        time     (obs) datetime64[ns] 2014-11-10T00:00:00.011253 ...
    Dimensions without coordinates: obs
    Data variables:
        oxy_calphase  (obs) float64 3.293e+04 ...
        quality_flag  (obs) |S2 'ok' 'ok' 'ok' ...
        ctdbp_no_seawater_conductivity_qc_executed  (obs) uint8 29 29 29 29 29 ...
        ...
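The `%time` magic used in this comment only exists inside IPython. For anyone reproducing the measurements in a plain script, a rough stand-in can be built from the standard library. This is only a sketch: `report_time` is a hypothetical helper, and the workload below is a placeholder rather than the real `print(dsets[0])` call.

```python
import os
import time
from contextlib import contextmanager


@contextmanager
def report_time(label):
    """Print CPU and wall-clock time spent in the enclosed block."""
    timings = {}
    cpu_start = os.times()
    wall_start = time.perf_counter()
    try:
        yield timings
    finally:
        wall = time.perf_counter() - wall_start
        cpu_end = os.times()
        user = cpu_end.user - cpu_start.user
        system = cpu_end.system - cpu_start.system
        timings.update(user=user, sys=system, wall=wall)
        print(f"{label}: user {user:.2f} s, sys {system:.2f} s, "
              f"total {user + system:.2f} s, wall {wall:.2f} s")


# Placeholder workload; in a real session this would be print(dsets[0]).
with report_time("repr") as timings:
    text = repr(list(range(100_000)))
```

When total CPU time is close to wall time, as in the 7.15 s versus 7.28 s figures reported here, that suggests the time is spent computing rather than waiting on disk.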

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset() significantly slower on 0.9.1 vs. 0.8.2 212561278
285149350 https://github.com/pydata/xarray/issues/1301#issuecomment-285149350 https://api.github.com/repos/pydata/xarray/issues/1301 MDEyOklzc3VlQ29tbWVudDI4NTE0OTM1MA== rabernat 1197350 2017-03-08T19:52:11Z 2017-03-08T19:52:11Z MEMBER

I just tried this on a few different datasets. Comparing python 2.7, xarray 0.7.2, dask 0.7.1 (an old environment I had on hand) with python 2.7, xarray 0.9.1-28-g1cad803, dask 0.13.0 (my current "production" environment), I could not reproduce. The up-to-date stack was faster by a factor of < 2.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset() significantly slower on 0.9.1 vs. 0.8.2 212561278
285110824 https://github.com/pydata/xarray/issues/1301#issuecomment-285110824 https://api.github.com/repos/pydata/xarray/issues/1301 MDEyOklzc3VlQ29tbWVudDI4NTExMDgyNA== shoyer 1217238 2017-03-08T17:35:49Z 2017-03-08T17:35:49Z MEMBER

> One thing that helps get a better profile is setting dask backend to the non-parallel sync option which gives cleaner profiles.

Indeed, this is highly recommended, see http://dask.pydata.org/en/latest/faq.html
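To illustrate why a non-parallel run gives cleaner profiles, here is a standard-library analogy (it does not use dask, and the dask option for selecting the synchronous scheduler depends on the dask version, so it is not shown). `slow_decode` is a hypothetical stand-in for per-file work. When the work runs in a thread pool, the profile of the calling thread tends to be dominated by lock waits, much like the `acquire` line reported elsewhere in this thread; when the same work runs serially, the function doing the work shows up by name.

```python
import cProfile
import io
import pstats
import time
from concurrent.futures import ThreadPoolExecutor


def slow_decode(i):
    """Hypothetical stand-in for per-file work such as decoding times."""
    time.sleep(0.05)
    return i


def run_threaded():
    # Work happens in worker threads; the calling thread mostly waits.
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(slow_decode, range(4)))


def run_serial():
    # Same work, executed directly in the calling thread.
    return [slow_decode(i) for i in range(4)]


def profile_text(func):
    """Profile func() from the current thread and return the stats table."""
    profiler = cProfile.Profile()
    profiler.enable()
    func()
    profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("tottime").print_stats(10)
    return buffer.getvalue()


threaded_profile = profile_text(run_threaded)
serial_profile = profile_text(run_serial)
print(threaded_profile)
print(serial_profile)
```

Comparing the two tables side by side shows the difference: the serial one points at the code that is actually slow, while the threaded one mostly reports waiting.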

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset() significantly slower on 0.9.1 vs. 0.8.2 212561278
284915063 https://github.com/pydata/xarray/issues/1301#issuecomment-284915063 https://api.github.com/repos/pydata/xarray/issues/1301 MDEyOklzc3VlQ29tbWVudDI4NDkxNTA2Mw== shoyer 1217238 2017-03-08T01:16:58Z 2017-03-08T01:16:58Z MEMBER

Hmm. It might be interesting to try lock=threading.Lock() to revert to the old version of the thread lock as well.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset() significantly slower on 0.9.1 vs. 0.8.2 212561278
284914442 https://github.com/pydata/xarray/issues/1301#issuecomment-284914442 https://api.github.com/repos/pydata/xarray/issues/1301 MDEyOklzc3VlQ29tbWVudDI4NDkxNDQ0Mg== jhamman 2443309 2017-03-08T01:13:35Z 2017-03-08T01:13:35Z MEMBER

This is what I'm seeing for my %prun profiling:

    ncalls         tottime  percall  cumtime  percall  filename:lineno(function)
    204             19.783    0.097   19.783    0.097  {method 'acquire' of '_thread.lock' objects}
    89208/51003      2.524    0.000    5.553    0.000  indexing.py:361(shape)
    1                1.359    1.359   37.876   37.876  <string>:1(<module>)
    71379/53550      1.242    0.000    3.266    0.000  utils.py:412(shape)
    538295           0.929    0.000    1.317    0.000  {built-in method builtins.isinstance}
    24674/13920      0.836    0.000    4.139    0.000  _collections_abc.py:756(update)
    9                0.788    0.088    0.803    0.089  netCDF4_.py:178(_open_netcdf4_group)

Weren't there some recent changes to the thread lock related to dask distributed?
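For readers less familiar with this output format: `tottime` is time spent in the function itself, `cumtime` includes everything it calls, and an `ncalls` entry with a slash (such as 89208/51003) means the function recursed, with total calls first and primitive (non-recursive) calls second. The small sketch below reproduces that notation with a toy recursive function; `depth` is made up for illustration and has nothing to do with xarray's `shape` methods.

```python
import cProfile
import pstats


def depth(n):
    """Toy recursive function so the profile records recursive calls."""
    return 0 if n == 0 else 1 + depth(n - 1)


profiler = cProfile.Profile()
profiler.enable()
for _ in range(3):
    depth(10)  # 11 calls each time: n = 10 down to 0
profiler.disable()

stats = pstats.Stats(profiler)
depth_ncalls = None
# Stats.stats maps (file, line, name) to
# (primitive calls, total calls, tottime, cumtime, callers).
for (_filename, _lineno, name), entry in stats.stats.items():
    primitive, total, tottime, cumtime, _callers = entry
    if name == "depth":
        # Same rule the printed table uses: "total/primitive" if they differ.
        depth_ncalls = f"{total}/{primitive}" if total != primitive else str(total)
        print(depth_ncalls, f"tottime={tottime:.6f}", f"cumtime={cumtime:.6f}")
```

Read this way, the table in this comment says that about 19.8 s of the 37.9 s total was spent in 204 lock acquisitions, which is what motivates the question about the thread lock.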

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset() significantly slower on 0.9.1 vs. 0.8.2 212561278
284908153 https://github.com/pydata/xarray/issues/1301#issuecomment-284908153 https://api.github.com/repos/pydata/xarray/issues/1301 MDEyOklzc3VlQ29tbWVudDI4NDkwODE1Mw== shoyer 1217238 2017-03-08T00:38:55Z 2017-03-08T00:38:55Z MEMBER

Wow, that is pretty bad.

Try setting compat='broadcast_equals' in the open_mfdataset call, to restore the default value of that parameter prior to v0.9.

If that doesn't help, try downgrading dask to see if it's responsible. Profiling results from %prun in IPython would also be helpful for tracking down the culprit.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset() significantly slower on 0.9.1 vs. 0.8.2 212561278
284905152 https://github.com/pydata/xarray/issues/1301#issuecomment-284905152 https://api.github.com/repos/pydata/xarray/issues/1301 MDEyOklzc3VlQ29tbWVudDI4NDkwNTE1Mg== jhamman 2443309 2017-03-08T00:22:10Z 2017-03-08T00:22:10Z MEMBER

I've also noticed that we have a bottleneck here.

@shoyer - any idea what we changed that could impact this? Could this be coming from a change upstream in dask?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset() significantly slower on 0.9.1 vs. 0.8.2 212561278

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);
Powered by Datasette · Queries took 49.911ms · About: xarray-datasette