issue_comments

8 rows where issue = 214088387 sorted by updated_at descending

id html_url issue_url node_id user created_at updated_at ▲ author_association body reactions performed_via_github_app issue
286779750 https://github.com/pydata/xarray/issues/1308#issuecomment-286779750 https://api.github.com/repos/pydata/xarray/issues/1308 MDEyOklzc3VlQ29tbWVudDI4Njc3OTc1MA== JoyMonteiro 7300413 2017-03-15T15:32:33Z 2017-03-15T15:32:33Z NONE

Not sure if this helps, but I ran %%time on both versions. For daily climatology, the numbers are: CPU times: user 1h 21min 8s, sys: 6h 17min 39s, total: 7h 38min 47s Wall time: 20min 34s

For the 6 hourly thing, CPU times: user 5h 5min 6s, sys: 1d 2h 19min 45s, total: 1d 7h 24min 51s Wall time: 1h 31min 40s

It takes around 4x more time, which makes sense because there are 4x more groups. The ratio of user to system time is more or less constant, so nothing untoward seems to be happening in between the two runs.

I think it is good to remember that the run time scales roughly linearly with the number of groups. I guess this is what @shoyer was talking about when he mentioned that, since grouping is done within xarray, the dask graph grows, making things slower.
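To make the comparison concrete, here is a small sketch that converts the timings quoted above into seconds and computes the ratios. The helper `seconds` and the dictionary names are made up for this illustration; only the timing values come from the comment.

```python
def seconds(days=0, hours=0, minutes=0, secs=0):
    """Convert a days/hours/minutes/seconds duration to seconds."""
    return ((days * 24 + hours) * 60 + minutes) * 60 + secs

# Timings as reported above for the two runs.
daily = {
    "user": seconds(hours=1, minutes=21, secs=8),
    "sys": seconds(hours=6, minutes=17, secs=39),
    "wall": seconds(minutes=20, secs=34),
}
six_hourly = {
    "user": seconds(hours=5, minutes=5, secs=6),
    "sys": seconds(days=1, hours=2, minutes=19, secs=45),
    "wall": seconds(hours=1, minutes=31, secs=40),
}

# How much longer the 6-hourly run takes, per timing category.
wall_ratio = six_hourly["wall"] / daily["wall"]
user_ratio = six_hourly["user"] / daily["user"]
sys_ratio = six_hourly["sys"] / daily["sys"]

# System-to-user time ratio within each run.
daily_sys_over_user = daily["sys"] / daily["user"]
six_hourly_sys_over_user = six_hourly["sys"] / six_hourly["user"]

print(f"wall ratio: {wall_ratio:.2f}")
print(f"user ratio: {user_ratio:.2f}, sys ratio: {sys_ratio:.2f}")
print(f"sys/user: daily {daily_sys_over_user:.2f}, "
      f"6-hourly {six_hourly_sys_over_user:.2f}")
```

The wall-time ratio comes out close to 4.5, and the sys/user ratios of the two runs are within about 10% of each other, which is consistent with the rough "4x" reading above.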

Thanks again!

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Using groupby with custom index 214088387
286516988 https://github.com/pydata/xarray/issues/1308#issuecomment-286516988 https://api.github.com/repos/pydata/xarray/issues/1308 MDEyOklzc3VlQ29tbWVudDI4NjUxNjk4OA== shoyer 1217238 2017-03-14T18:29:55Z 2017-03-14T18:29:55Z MEMBER

I wonder if the fact that the data is highly compressed (short types converted to float64 with the scaled and offset attributes) can have an influence on dask performance and memory consumption? (especially the latter)

Memory consumption, yes; performance, not so much. Scale/offset (de)compression can be applied super fast, unlike zlib compression, which can be 10x slower than reading from disk.
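As a rough illustration of why scale/offset packing affects memory more than speed, the sketch below unpacks 16-bit integers into 64-bit floats with a single multiply and add per value. The attribute names `scale_factor` and `add_offset` follow the CF/netCDF convention; the numeric values and the sample data are invented for this example and are not taken from the dataset discussed here.

```python
from array import array

# Hypothetical packing attributes (CF-style names, made-up values).
scale_factor = 0.01
add_offset = 273.15

# Packed values as stored on disk: 16-bit signed integers.
packed = array("h", [-32000, 0, 1250, 32000])

# Unpacking is one multiply and one add per element.
unpacked = array("d", (v * scale_factor + add_offset for v in packed))

bytes_packed = packed.itemsize * len(packed)
bytes_unpacked = unpacked.itemsize * len(unpacked)

print(list(unpacked))
print(f"packed: {bytes_packed} bytes, unpacked: {bytes_unpacked} bytes")
print(f"memory growth: {bytes_unpacked // bytes_packed}x")
```

The arithmetic is cheap, but each value grows from 2 bytes to 8 bytes once decoded to float64, so the in-memory footprint is four times the packed size.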

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Using groupby with custom index 214088387
286511848 https://github.com/pydata/xarray/issues/1308#issuecomment-286511848 https://api.github.com/repos/pydata/xarray/issues/1308 MDEyOklzc3VlQ29tbWVudDI4NjUxMTg0OA== fmaussion 10050469 2017-03-14T18:13:18Z 2017-03-14T18:13:18Z MEMBER

I've had some troubles with 6-Hrly ERA-Interim data myself recently.

I wonder if the fact that the data is highly compressed (short types converted to float64 with the scaled and offset attributes) can have an influence on dask performance and memory consumption? (especially the latter)

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Using groupby with custom index 214088387
286509639 https://github.com/pydata/xarray/issues/1308#issuecomment-286509639 https://api.github.com/repos/pydata/xarray/issues/1308 MDEyOklzc3VlQ29tbWVudDI4NjUwOTYzOQ== JoyMonteiro 7300413 2017-03-14T18:05:54Z 2017-03-14T18:05:54Z NONE

@shoyer If I increase the size of the longitude chunk any more, it will be almost like using no chunking at all. I guess this dataset is a corner case. I will try doubling that value and see what happens. I hadn't realised that doing a groupby would also reduce the effective chunk size, thanks for pointing that out.

I'm using dask without distributed as of now. Is there still some way to do the benchmark? I would be more than happy to run it.

@rabernat I would definitely favour a cloud-based sandbox to try these things out. What would be the stumbling block towards actually setting it up? I have had some recent experience setting up JupyterHub, so I can help set that up so that notebooks can be used easily in such an environment.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Using groupby with custom index 214088387
286502400 https://github.com/pydata/xarray/issues/1308#issuecomment-286502400 https://api.github.com/repos/pydata/xarray/issues/1308 MDEyOklzc3VlQ29tbWVudDI4NjUwMjQwMA== shoyer 1217238 2017-03-14T17:43:13Z 2017-03-14T17:43:13Z MEMBER

We currently do all the groupby handling ourselves, which means that when you group over smaller units the dask graph gets bigger and each of the tasks gets smaller. Given that each chunk in the grouped data is only about 250,000 elements, it's not surprising that things get a bit slower -- that's near the point where Python overhead starts to get significant.
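The figure of roughly 250,000 elements can be reproduced from the shape and chunking reported elsewhere in this thread (level=60, lat=41, lon=480, longitude chunks of 100). The sketch below assumes that each grouped chunk ends up holding a single time step, which is an interpretation made for this example rather than something stated in the thread; the variable names are also illustrative.

```python
import math

# Dimensions and chunking as reported in the thread.
n_time = 4 * 365 * 7          # 6-hourly steps over 7 years (approximate)
n_level, n_lat, n_lon = 60, 41, 480
lon_chunk = 100

# Assumption: after grouping, a chunk covers one time step,
# all levels, all latitudes, and one longitude chunk.
elements_per_chunk = 1 * n_level * n_lat * lon_chunk

# Number of longitude chunks (the last one is smaller: 80 columns).
n_lon_chunks = math.ceil(n_lon / lon_chunk)

# Size of one such chunk once decoded to float64.
bytes_per_chunk = elements_per_chunk * 8

print(f"elements per grouped chunk: {elements_per_chunk:,}")
print(f"longitude chunks: {n_lon_chunks}")
print(f"chunk size as float64: {bytes_per_chunk / 1e6:.2f} MB")
print(f"small chunks overall: {n_time * n_lon_chunks:,}")
```

Under these assumptions each piece of work is about 2 MB of float64 data, and there are tens of thousands of such pieces, so per-task overhead can plausibly become a noticeable share of the run time.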

It would be useful to benchmark graph creation and execution separately (especially using dask-distributed's profiling tools) to understand where the slow-down is.

One thing that might help quite a bit in cases like this where the individual groups are small is to rewrite xarray's groupby to do some groupby operations inside dask, rather than in a loop outside of dask. That would allow executing tasks on bigger chunks of arrays at once, which could significantly reduce scheduler overhead.

{
    "total_count": 2,
    "+1": 2,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Using groupby with custom index 214088387
286499366 https://github.com/pydata/xarray/issues/1308#issuecomment-286499366 https://api.github.com/repos/pydata/xarray/issues/1308 MDEyOklzc3VlQ29tbWVudDI4NjQ5OTM2Ng== rabernat 1197350 2017-03-14T17:33:36Z 2017-03-14T17:33:36Z MEMBER

Slightly OT observation: Performance issues are increasingly being raised here (see also #1301). Wouldn't it be great if we had a shared space somewhere in the cloud to host these big-ish datasets and run performance benchmarks in a controlled environment?

{
    "total_count": 2,
    "+1": 2,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Using groupby with custom index 214088387
286497255 https://github.com/pydata/xarray/issues/1308#issuecomment-286497255 https://api.github.com/repos/pydata/xarray/issues/1308 MDEyOklzc3VlQ29tbWVudDI4NjQ5NzI1NQ== JoyMonteiro 7300413 2017-03-14T17:27:06Z 2017-03-14T17:31:32Z NONE

Hello Stephan,

The shape of the full data, if I read from within xarray, is (time, level, lat, lon), with level=60, lat=41, lon=480. time is 4*365*7 ~ 10000.

I am chunking only along longitude, using lon=100. I previously chunked along time, but that used too much memory (~45GB out of 128 GB) since the data is split into one file per month, and reading annual data would require reading many files into memory.

Superficially, I would think that both of the above would take similar amounts of time. In fact, calculating a daily climatology also requires grouping the four 6-hourly data points into a single day, which seems more complicated. However, it seems to run faster!
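The difference between the two climatologies can be seen by counting the groups each one produces. The sketch below builds 6-hourly timestamps for seven years and groups them by (month, day) for the daily case and by (month, day, hour) for the 6-hourly case. The start year is hypothetical, 29 February is skipped to match the 4*365*7 estimate above, and the grouping keys are an assumption about how the custom index is built, not the code used in this issue.

```python
from collections import Counter
from datetime import datetime, timedelta

def six_hourly_times(first_year, n_years):
    """Return 6-hourly timestamps, skipping 29 February for simplicity."""
    t = datetime(first_year, 1, 1)
    end = datetime(first_year + n_years, 1, 1)
    step = timedelta(hours=6)
    times = []
    while t < end:
        if not (t.month == 2 and t.day == 29):
            times.append(t)
        t += step
    return times

times = six_hourly_times(2001, 7)  # hypothetical start year

# Daily climatology: one group per calendar day.
daily_groups = Counter((t.month, t.day) for t in times)

# 6-hourly climatology: one group per calendar day and hour.
six_hourly_groups = Counter((t.month, t.day, t.hour) for t in times)

print(f"time steps: {len(times)}")
print(f"daily groups: {len(daily_groups)}, "
      f"samples per group: {set(daily_groups.values())}")
print(f"6-hourly groups: {len(six_hourly_groups)}, "
      f"samples per group: {set(six_hourly_groups.values())}")
```

With these assumptions the 6-hourly climatology has four times as many groups as the daily one, and each group holds a quarter as many samples, so the same data is split into more and smaller pieces.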

Thanks, Joy

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Using groupby with custom index 214088387
286482853 https://github.com/pydata/xarray/issues/1308#issuecomment-286482853 https://api.github.com/repos/pydata/xarray/issues/1308 MDEyOklzc3VlQ29tbWVudDI4NjQ4Mjg1Mw== shoyer 1217238 2017-03-14T16:43:27Z 2017-03-14T16:43:27Z MEMBER

Can you share the shape and dask chunking for the data, and also describe how the data is stored? That can make a big difference for performance.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Using groupby with custom index 214088387

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);