issue_comments
13 rows where issue = 288785270 and user = 306380 sorted by updated_at descending
id | html_url | issue_url | node_id | user | created_at | updated_at ▲ | author_association | body | reactions | performed_via_github_app | issue |
---|---|---|---|---|---|---|---|---|---|---|---|
363440952 | https://github.com/pydata/xarray/issues/1832#issuecomment-363440952 | https://api.github.com/repos/pydata/xarray/issues/1832 | MDEyOklzc3VlQ29tbWVudDM2MzQ0MDk1Mg== | mrocklin 306380 | 2018-02-06T14:36:55Z | 2018-02-06T14:36:55Z | MEMBER | Checking in here. Any luck? I noticed your comment in https://github.com/dask/distributed/issues/1736 but that seems to be a separate issue about file-based locks rather than about task scheduling priorities. Is the file-based locking stuff getting in the way of you checking for low-memory use? |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
groupby on dask objects doesn't handle chunks well 288785270 | |
362772686 | https://github.com/pydata/xarray/issues/1832#issuecomment-362772686 | https://api.github.com/repos/pydata/xarray/issues/1832 | MDEyOklzc3VlQ29tbWVudDM2Mjc3MjY4Ng== | mrocklin 306380 | 2018-02-03T03:08:48Z | 2018-02-03T03:08:48Z | MEMBER | @rabernat you shouldn't need the spill-to-disk change from the comment above, just the changes on the master branches. Ideally you would try your climatology computation again and see if memory use continues to exceed expectations. |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
groupby on dask objects doesn't handle chunks well 288785270 | |
362433734 | https://github.com/pydata/xarray/issues/1832#issuecomment-362433734 | https://api.github.com/repos/pydata/xarray/issues/1832 | MDEyOklzc3VlQ29tbWVudDM2MjQzMzczNA== | mrocklin 306380 | 2018-02-01T23:14:22Z | 2018-02-01T23:14:22Z | MEMBER | The relevant PRs have been merged into master on both repositories. |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
groupby on dask objects doesn't handle chunks well 288785270 | |
362115127 | https://github.com/pydata/xarray/issues/1832#issuecomment-362115127 | https://api.github.com/repos/pydata/xarray/issues/1832 | MDEyOklzc3VlQ29tbWVudDM2MjExNTEyNw== | mrocklin 306380 | 2018-02-01T00:16:14Z | 2018-02-01T00:16:14Z | MEMBER | @rabernat I recommend trying with a combination of these two PRs. These do well for me on the problem listed above. There is still some memory requirement, but it seems to be under better control. |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
groupby on dask objects doesn't handle chunks well 288785270 | |
362107147 | https://github.com/pydata/xarray/issues/1832#issuecomment-362107147 | https://api.github.com/repos/pydata/xarray/issues/1832 | MDEyOklzc3VlQ29tbWVudDM2MjEwNzE0Nw== | mrocklin 306380 | 2018-01-31T23:34:16Z | 2018-01-31T23:34:16Z | MEMBER | Or, this might work in conjunction with https://github.com/dask/dask/pull/3066 |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
groupby on dask objects doesn't handle chunks well 288785270 | |
358123564 | https://github.com/pydata/xarray/issues/1832#issuecomment-358123564 | https://api.github.com/repos/pydata/xarray/issues/1832 | MDEyOklzc3VlQ29tbWVudDM1ODEyMzU2NA== | mrocklin 306380 | 2018-01-16T22:07:11Z | 2018-01-31T17:31:49Z | MEMBER | Looking at the worker diagnostic page during execution is informative. It has a ton of work that it can do and a ton of communication that it can do (to share results with other workers to compute the reductions). In this example it's able to start new work much faster than it is able to communicate results to peers, leading to significant buildup. These two processes happen asynchronously without any back-pressure between them, leading to most of the input being produced before it can be reduced and processed. That's my current guess anyway. I could imagine pausing worker threads if there is a heavy communication buildup. I'm not sure how generally valuable this is though. |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
groupby on dask objects doesn't handle chunks well 288785270 | |
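The producer/consumer imbalance described in the comment above can be sketched with plain Python (hypothetical rates, not dask code): when work is produced faster than results can be communicated and there is no back-pressure between the two, the outgoing buffer grows linearly with the rate difference.

```python
from collections import deque

# Hypothetical sketch of the imbalance described above: a worker "produces"
# task results faster than it can "communicate" them to peers. With no
# back-pressure, the outgoing buffer grows with the rate difference.
produce_per_tick = 5   # tasks finished per scheduling tick (assumed)
send_per_tick = 2      # results shipped to peers per tick (assumed)

buffer = deque()
backlog_history = []
for tick in range(10):
    buffer.extend(f"task-{tick}-{i}" for i in range(produce_per_tick))
    for _ in range(min(send_per_tick, len(buffer))):
        buffer.popleft()
    backlog_history.append(len(buffer))

# The backlog grows by (5 - 2) retained results per tick.
print(backlog_history)
```

Pausing worker threads under heavy communication buildup, as suggested above, would amount to coupling `produce_per_tick` to the backlog size.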
362000310 | https://github.com/pydata/xarray/issues/1832#issuecomment-362000310 | https://api.github.com/repos/pydata/xarray/issues/1832 | MDEyOklzc3VlQ29tbWVudDM2MjAwMDMxMA== | mrocklin 306380 | 2018-01-31T17:05:13Z | 2018-01-31T17:05:13Z | MEMBER | @rabernat you might also consider turning off spill-to-disk. I suspect that by prioritizing the other mechanisms for slowing processing, you'll have a better experience. |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
groupby on dask objects doesn't handle chunks well 288785270 | |
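As a hedged sketch, in recent dask/distributed releases spill-to-disk can be disabled through the worker memory configuration; these configuration keys are from the current documentation and may not match the 2018-era settings discussed in this thread.

```python
import dask

# Disable the worker's early-spill mechanisms so that only the remaining
# throttles (e.g. pausing) slow processing. Keys are from recent
# distributed releases; older versions used different configuration.
dask.config.set({
    "distributed.worker.memory.target": False,  # don't start moving data early
    "distributed.worker.memory.spill": False,   # never spill to disk
})
```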
358123872 | https://github.com/pydata/xarray/issues/1832#issuecomment-358123872 | https://api.github.com/repos/pydata/xarray/issues/1832 | MDEyOklzc3VlQ29tbWVudDM1ODEyMzg3Mg== | mrocklin 306380 | 2018-01-16T22:08:22Z | 2018-01-16T22:08:22Z | MEMBER | I encourage you to look at the diagnostic page for one of your workers if you get a chance. This is typically served on port 8789 if that port is open. |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
groupby on dask objects doesn't handle chunks well 288785270 | |
358055204 | https://github.com/pydata/xarray/issues/1832#issuecomment-358055204 | https://api.github.com/repos/pydata/xarray/issues/1832 | MDEyOklzc3VlQ29tbWVudDM1ODA1NTIwNA== | mrocklin 306380 | 2018-01-16T18:15:03Z | 2018-01-16T18:15:35Z | MEMBER | This example is an interesting one that was adapted from something that @rabernat produced:

```python
import dask
import xarray as xr
import dask.array as da
import pandas as pd
from tornado import gen
from dask.distributed import Client

client = Client(processes=False)

# below I create a random dataset that is typical of high-res climate models
# size of example can be adjusted up and down by changing shape
dims = ('time', 'depth', 'lat', 'lon')
time = pd.date_range('1980-01-01', '1980-12-01', freq='1d')
shape = (len(time), 5, 1800, 360)

# what I consider to be a reasonable chunk size
chunks = (1, 1, 1800, 360)

ds = xr.Dataset({k: (dims, da.random.random(shape, chunks=chunks))
                 for k in ['u', 'v', 'w']},
                coords={'time': time})

# create seasonal climatology
ds_clim = ds.groupby('time.week').mean(dim='time')

# construct seasonal anomaly
ds_anom = ds.groupby('time.week') - ds_clim

# compute variance of seasonal anomaly
ds_anom_var = (ds_anom**2).mean(dim='time')

ds_anom_var.compute()
```

It works fine locally with |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
groupby on dask objects doesn't handle chunks well 288785270 | |
358053254 | https://github.com/pydata/xarray/issues/1832#issuecomment-358053254 | https://api.github.com/repos/pydata/xarray/issues/1832 | MDEyOklzc3VlQ29tbWVudDM1ODA1MzI1NA== | mrocklin 306380 | 2018-01-16T18:09:21Z | 2018-01-16T18:09:21Z | MEMBER | (not to sound too rosy though, these problems have had me stumped for a couple days) |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
groupby on dask objects doesn't handle chunks well 288785270 | |
358040339 | https://github.com/pydata/xarray/issues/1832#issuecomment-358040339 | https://api.github.com/repos/pydata/xarray/issues/1832 | MDEyOklzc3VlQ29tbWVudDM1ODA0MDMzOQ== | mrocklin 306380 | 2018-01-16T17:31:26Z | 2018-01-16T17:31:26Z | MEMBER |
That's not entirely true. I've said that delete-and-recompute is unlikely to be resolved in the near future. This is the solution proposed by @shoyer, but it is only one possible solution. The fact that your for-loop solution works well is evidence that delete-and-recompute is not necessary to solve this problem in your case. I'm actively working on this at https://github.com/dask/dask/pull/3066 (fortunately paid for by other groups). |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
groupby on dask objects doesn't handle chunks well 288785270 | |
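The "for-loop solution" referred to above (computing one group at a time, so that each group's intermediates can be released before the next group is touched) can be sketched with plain Python, independent of dask/xarray; the records and grouping key here are hypothetical stand-ins for chunked climate data.

```python
from collections import defaultdict

# Hypothetical monthly data: (month, value) pairs standing in for chunks.
records = [(1, 10.0), (1, 14.0), (2, 20.0), (2, 22.0), (3, 30.0)]

# Bucket the records by group, then reduce one group at a time. Each
# group's mean is finalized before the next group is processed, so only
# one group's intermediates are live at once -- no delete-and-recompute.
groups = defaultdict(list)
for month, value in records:
    groups[month].append(value)

climatology = {}
for month, values in sorted(groups.items()):
    climatology[month] = sum(values) / len(values)

print(climatology)  # {1: 12.0, 2: 21.0, 3: 30.0}
```

This trades scheduler-level cleverness for an explicit ordering imposed by the user, which is why it sidesteps the memory buildup discussed in this thread.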
358014639 | https://github.com/pydata/xarray/issues/1832#issuecomment-358014639 | https://api.github.com/repos/pydata/xarray/issues/1832 | MDEyOklzc3VlQ29tbWVudDM1ODAxNDYzOQ== | mrocklin 306380 | 2018-01-16T16:13:02Z | 2018-01-16T16:13:02Z | MEMBER | Teaching the scheduler to delete-and-recompute is possible but also expensive to implement. I would not expect it near term from me. |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
groupby on dask objects doesn't handle chunks well 288785270 | |
357996887 | https://github.com/pydata/xarray/issues/1832#issuecomment-357996887 | https://api.github.com/repos/pydata/xarray/issues/1832 | MDEyOklzc3VlQ29tbWVudDM1Nzk5Njg4Nw== | mrocklin 306380 | 2018-01-16T15:27:33Z | 2018-01-16T15:27:33Z | MEMBER | ```python
# monthly climatology
ds_mm = ds.groupby('time.month').mean(dim='time')

# anomaly
ds_anom = ds.groupby('time.month') - ds_mm
```

I would actually hope that this would be a little bit nicer than the case in the dask issue, especially if you are chunked by some dimension other than time. In the case that @shoyer points to, we're creating a global aggregation value and then applying that to all input data. In @rabernat's case we have at least twelve aggregation points, and possibly more if there are other chunked dimensions like ensemble (or lat/lon, if you choose to chunk those). |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
groupby on dask objects doesn't handle chunks well 288785270 |