
issue_comments


22 rows where author_association = "MEMBER" and issue = 288785270 sorted by updated_at descending


user: mrocklin (13) · rabernat (8) · shoyer (1)
issue: groupby on dask objects doesn't handle chunks well (288785270) · 22 comments
author_association: MEMBER · 22

rabernat (MEMBER) · 2019-11-27T16:45:14Z · https://github.com/pydata/xarray/issues/1832#issuecomment-559166277

I am trying a new approach to this problem using xarray's new map_blocks function. See this example: https://nbviewer.jupyter.org/gist/rabernat/30e7b747f0e3583b5b776e4093266114
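
A minimal sketch of what such a map_blocks approach might look like (hypothetical; the linked notebook has the actual example). It assumes a dataset `ds` whose `time` dimension is not chunked, so each block sees the full time series:

```python
import xarray as xr

def monthly_anomaly(block):
    # subtract the monthly climatology within one block
    gb = block.groupby('time.month')
    return gb - gb.mean(dim='time')

# applies the function to each chunk independently; in practice you may
# need to pass template= if xarray cannot infer the output structure
ds_anom = xr.map_blocks(monthly_anomaly, ds)
```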

rabernat (MEMBER) · 2019-06-06T20:01:40Z · https://github.com/pydata/xarray/issues/1832#issuecomment-499645413

In recent versions of xarray (0.12.1) and dask (0.12.1), this issue has been ameliorated significantly. I believe it can now be closed.

rabernat (MEMBER) · 2018-02-06T16:38:31Z · https://github.com/pydata/xarray/issues/1832#issuecomment-363482773

Short answer...no luck. With the latest masters (but without the suggested dask config), I am still getting the same basic performance limitations.

I can update you more when we talk in person later today.

mrocklin (MEMBER) · 2018-02-06T14:36:55Z · https://github.com/pydata/xarray/issues/1832#issuecomment-363440952

Checking in here. Any luck? I noticed your comment in https://github.com/dask/distributed/issues/1736 but that seems to be a separate issue about file-based locks rather than about task scheduling priorities. Is the file-based locking stuff getting in the way of you checking for low-memory use?
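
For context, a hedged sketch of the kind of file-based locking being discussed: at the time, xarray accepted a `lock` argument to serialize reads from netCDF/HDF5 files, and a cluster-wide lock could be used instead of a process-local one (the path here is made up):

```python
import xarray as xr
from dask.distributed import Lock

# serialize file access across workers with a cluster-wide lock
ds = xr.open_mfdataset('data/*.nc', lock=Lock('netcdf-read'))
```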

mrocklin (MEMBER) · 2018-02-03T03:08:48Z · https://github.com/pydata/xarray/issues/1832#issuecomment-362772686

@rabernat you shouldn't need the spill-to-disk settings from my earlier comment, just the master branches of both projects. Ideally you would try your climatology computation again and see if memory use continues to exceed expectations.

mrocklin (MEMBER) · 2018-02-01T23:14:22Z · https://github.com/pydata/xarray/issues/1832#issuecomment-362433734

The relevant PRs have been merged into master on both repositories.

rabernat (MEMBER) · 2018-02-01T03:59:28Z · https://github.com/pydata/xarray/issues/1832#issuecomment-362151486

@mrocklin thanks for the updates. I should have some time on Friday morning to give it a try on Cheyenne.

mrocklin (MEMBER) · 2018-02-01T00:16:14Z · https://github.com/pydata/xarray/issues/1832#issuecomment-362115127

@rabernat I recommend trying with a combination of these two PRs. These do well for me on the problem listed above.

  • https://github.com/dask/dask/pull/3066
  • https://github.com/dask/distributed/pull/1730

There is still some memory requirement, but it seems to be under better control.

mrocklin (MEMBER) · 2018-01-31T23:34:16Z · https://github.com/pydata/xarray/issues/1832#issuecomment-362107147

Or, this might work in conjunction with https://github.com/dask/dask/pull/3066:

```diff
diff --git a/distributed/worker.py b/distributed/worker.py
index a1b9f32..62b5f07 100644
--- a/distributed/worker.py
+++ b/distributed/worker.py
@@ -1227,8 +1227,8 @@ class Worker(WorkerBase):
     def add_task(self, key, function=None, args=None, kwargs=None, task=None,
                  who_has=None, nbytes=None, priority=None, duration=None,
                  resource_restrictions=None, **kwargs2):
-        if isinstance(priority, list):
-            priority.insert(1, self.priority_counter)
+        # if isinstance(priority, list):
+        #     priority.insert(1, self.priority_counter)
         try:
             if key in self.tasks:
                 state = self.task_state[key]
```

mrocklin (MEMBER) · 2018-01-16T22:07:11Z (edited 2018-01-31T17:31:49Z) · https://github.com/pydata/xarray/issues/1832#issuecomment-358123564

Looking at the worker diagnostic page during execution is informative. It has a ton of work that it can do and a ton of communication that it can do (to share results with other workers to compute the reductions). In this example it's able to start new work much faster than it is able to communicate results to peers, leading to significant buildup. These two processes happen asynchronously without any back-pressure between them, leading to most of the input being produced before it can be reduced and processed.

That's my current guess anyway. I could imagine pausing worker threads if there is a heavy communication buildup. I'm not sure how generally valuable this is though.
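
A toy sketch of the back-pressure idea (not dask code; all names here are invented). A bounded queue makes the fast producing side wait for the slow communication side instead of running arbitrarily far ahead:

```python
import asyncio

async def compute(queue):
    # fast side: producing results; put() blocks when the queue is full,
    # which is the back-pressure missing in the scenario described above
    for i in range(20):
        await queue.put(i * i)

async def communicate(queue):
    # slow side: shipping results to peers
    for _ in range(20):
        await queue.get()
        await asyncio.sleep(0.05)  # simulated network transfer

async def main():
    # with an unbounded queue, compute() finishes long before
    # communicate() does, and results pile up in memory
    queue = asyncio.Queue(maxsize=4)
    await asyncio.gather(compute(queue), communicate(queue))

asyncio.run(main())
```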

mrocklin (MEMBER) · 2018-01-31T17:05:13Z · https://github.com/pydata/xarray/issues/1832#issuecomment-362000310

@rabernat you might also consider turning off spill-to-disk. I suspect that by prioritizing the other mechanisms that slow processing, you'll have a better experience:

```yaml
worker-memory-target: False     # target fraction to stay below
worker-memory-spill: False      # fraction at which we spill to disk
worker-memory-pause: 0.80       # fraction at which we pause worker threads
worker-memory-terminate: 0.95   # fraction at which we terminate the worker
```
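
For readers on newer versions: my understanding is that these knobs later moved under `distributed.worker.memory.*`. A sketch of setting the equivalents from Python, assuming a recent dask (verify the key names against your installed version):

```python
import dask

# rough modern equivalents of the settings above
dask.config.set({
    "distributed.worker.memory.target": False,
    "distributed.worker.memory.spill": False,
    "distributed.worker.memory.pause": 0.80,
    "distributed.worker.memory.terminate": 0.95,
})
```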

mrocklin (MEMBER) · 2018-01-16T22:08:22Z · https://github.com/pydata/xarray/issues/1832#issuecomment-358123872

I encourage you to look at the diagnostic page for one of your workers if you get a chance. This is typically served on port 8789 if that port is open.
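
A short sketch of finding those pages from Python (assumes a recent distributed that exposes `dashboard_link`, and a hypothetical local session):

```python
from dask.distributed import Client

client = Client()                 # hypothetical local cluster
print(client.dashboard_link)      # scheduler dashboard

# each worker serves its own diagnostic page; their addresses are here
for addr in client.scheduler_info()['workers']:
    print(addr)
```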

rabernat (MEMBER) · 2018-01-16T21:10:37Z · https://github.com/pydata/xarray/issues/1832#issuecomment-358106768

Or maybe real data just gets in the way of the core dask issue?

rabernat (MEMBER) · 2018-01-16T21:10:12Z · https://github.com/pydata/xarray/issues/1832#issuecomment-358106627

I am developing a use case for this scenario using real data. I will put the data in cloud storage as soon as #1800 is merged. That should make it easier to debug.

mrocklin (MEMBER) · 2018-01-16T18:15:03Z (edited 2018-01-16T18:15:35Z) · https://github.com/pydata/xarray/issues/1832#issuecomment-358055204

This example is an interesting one, adapted from something that @rabernat produced:

```python
import dask
import xarray as xr
import dask.array as da
import pandas as pd
from tornado import gen

from dask.distributed import Client
client = Client(processes=False)

# below I create a random dataset that is typical of high-res climate models
# size of example can be adjusted up and down by changing shape
dims = ('time', 'depth', 'lat', 'lon')
time = pd.date_range('1980-01-01', '1980-12-01', freq='1d')
shape = (len(time), 5, 1800, 360)

# what I consider to be a reasonable chunk size
chunks = (1, 1, 1800, 360)
ds = xr.Dataset({k: (dims, da.random.random(shape, chunks=chunks))
                 for k in ['u', 'v', 'w']},
                coords={'time': time})

# create seasonal climatology
ds_clim = ds.groupby('time.week').mean(dim='time')

# construct seasonal anomaly
ds_anom = ds.groupby('time.week') - ds_clim

# compute variance of seasonal anomaly
ds_anom_var = (ds_anom**2).mean(dim='time')
ds_anom_var.compute()
```

It works fine locally with `processes=False` and poorly with `processes=True`. If anyone has time to help on this issue, I recommend investigating what is different in these two cases. If I had time, I would start by trying to improve our understanding with better visual diagnostics, although just poring over logs might also provide some insight.
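
The two configurations being compared, as a minimal sketch:

```python
from dask.distributed import Client

# threads within a single process: the example above works fine
client = Client(processes=False)

# separate worker processes: memory use blows up on the same example
# client = Client(processes=True)
```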

mrocklin (MEMBER) · 2018-01-16T18:09:21Z · https://github.com/pydata/xarray/issues/1832#issuecomment-358053254

(not to sound too rosy though, these problems have had me stumped for a couple days)

mrocklin (MEMBER) · 2018-01-16T17:31:26Z · https://github.com/pydata/xarray/issues/1832#issuecomment-358040339

> Since @mrocklin has made it pretty clear that dask will not automatically solve this for us any time soon, we need to brainstorm some creative ways to make this extremely common use case more friendly with out-of-core data.

That's not entirely true. I've said that delete-and-recompute is unlikely to be resolved in the near future. That is the solution proposed by @shoyer, but it is only one possible solution. The fact that your for-loop solution works well is evidence that delete-and-recompute is not necessary to solve this problem in your case. I'm actively working on this at https://github.com/dask/dask/pull/3066 (fortunately paid for by other groups).

rabernat (MEMBER) · 2018-01-16T16:21:37Z · https://github.com/pydata/xarray/issues/1832#issuecomment-358017392

Below is how I work around the issue in practice: writing a loop over each item in the groupby, and then looping over each variable, loading, and writing to disk.

```python
gb = ds.groupby('time.month')
for month, dsm in gb:
    dsm_anom2 = ((dsm - ds_mm.sel(month=month))**2).mean(dim='time')
    dsm_anom2 = dsm_anom2.rename({f: f + '2' for f in fields})
    dsm_anom2.coords['month'] = month
    for var in dsm_anom2.data_vars:
        filename = save_dir + '%02d.%s_%s.nc' % (month, prefix, var)
        print(filename)
        ds_out = dsm_anom2[[var]].load()
        ds_out.to_netcdf(filename)
```

Needless to say, this feels more like my pre-xarray/dask workflow.

Since @mrocklin has made it pretty clear that dask will not automatically solve this for us any time soon, we need to brainstorm some creative ways to make this extremely common use case more friendly with out-of-core data.

mrocklin (MEMBER) · 2018-01-16T16:13:02Z · https://github.com/pydata/xarray/issues/1832#issuecomment-358014639

Teaching the scheduler to delete-and-recompute is possible but also expensive to implement. I would not expect it near term from me.

rabernat (MEMBER) · 2018-01-16T15:41:25Z · https://github.com/pydata/xarray/issues/1832#issuecomment-358003205

The operation

```python
ds_anom = ds - ds.mean(dim='time')
```

is also extremely common. Both should work well by default.

mrocklin (MEMBER) · 2018-01-16T15:27:33Z · https://github.com/pydata/xarray/issues/1832#issuecomment-357996887

```python
# monthly climatology
ds_mm = ds.groupby('time.month').mean(dim='time')

# anomaly
ds_anom = ds.groupby('time.month') - ds_mm
```

I would actually hope that this would be a little bit nicer than the case in the dask issue, especially if you are chunked by some dimension other than time. In the case that @shoyer points to we're creating a global aggregation value and then applying that to all input data. In @rabernat's case we have at least twelve aggregation points and possibly more if there are other chunked dimensions like ensemble (or lat/lon if you choose to chunk those).
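
A hypothetical illustration of that chunking choice (the chunk sizes are invented): keeping `time` in a single chunk and splitting along space instead gives each monthly aggregation many independent pieces to work on:

```python
# one chunk along time, chunked along space instead
ds_space = ds.chunk({'time': -1, 'lat': 180, 'lon': 90})
ds_mm = ds_space.groupby('time.month').mean(dim='time')
```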

shoyer (MEMBER) · 2018-01-16T08:13:39Z · https://github.com/pydata/xarray/issues/1832#issuecomment-357883913

See also https://github.com/dask/dask/issues/874


```sql
CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);
```
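
Given this schema, a small example of pulling these 22 rows out of the underlying SQLite database directly (the database filename is an assumption):

```python
import sqlite3

conn = sqlite3.connect('github.db')  # hypothetical database file
rows = conn.execute(
    "SELECT user, created_at, body FROM issue_comments "
    "WHERE issue = 288785270 ORDER BY updated_at DESC"
).fetchall()
```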