issue_comments
10 rows where issue = 1277437106 sorted by updated_at descending
id | html_url | issue_url | node_id | user | created_at | updated_at | author_association | body | reactions | performed_via_github_app | issue |
---|---|---|---|---|---|---|---|---|---|---|---|
1165001097 | https://github.com/pydata/xarray/issues/6709#issuecomment-1165001097 | https://api.github.com/repos/pydata/xarray/issues/6709 | IC_kwDOAMm_X85FcIGJ | gjoseph92 3309802 | 2022-06-23T23:15:19Z | 2022-06-23T23:15:19Z | NONE | I took a little bit more of a look at this and I don't think root task overproduction is the (only) problem here. I also feel like, intuitively, this operation shouldn't require holding so many root tasks around at once. But the graph dask is making, or how it's ordering it, doesn't seem to work that way. We can see the ordering is pretty bad. When we actually run it (on https://github.com/dask/distributed/pull/6614 with overproduction fixed), you can see that dask requires keeping tons of the input chunks in memory, because they're going to be needed by a future task that isn't able to run yet (because not all of its inputs have been computed). I feel like it's possible that the order in which dask is executing the input tasks is bad? But it's more likely that I haven't thought about the problem enough, and there's an obvious reason why the graph is structured like this. |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
Means of zarr arrays cause a memory overload in dask workers 1277437106 | |
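A minimal sketch, not from the thread, of one way to reproduce the "ordering is pretty bad" observation above: color the graph by the priority dask's ordering pass assigns each task. The shapes below are scaled-down stand-ins for the MVCE further down the thread, and rendering needs graphviz (plus matplotlib for the colormap).

```python
# Hedged sketch: scaled-down stand-in for the MVCE below, not the real dataset.
import dask
import dask.array as da

u = da.random.random((100, 1, 98, 192), chunks=(10, 1, 98, 192))
v = da.random.random((100, 1, 98, 192), chunks=(10, 1, 98, 192))

uu = (u ** 2).mean(axis=0)
vv = (v ** 2).mean(axis=0)
uv = (u * v).mean(axis=0)

# color="order" paints each task with its scheduling priority; a bad ordering
# shows the root random_sample tasks being produced long before the reductions
# that consume them.
dask.visualize(uu, vv, uv, color="order", optimize_graph=True, filename="order.svg")
```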
1164690164 | https://github.com/pydata/xarray/issues/6709#issuecomment-1164690164 | https://api.github.com/repos/pydata/xarray/issues/6709 | IC_kwDOAMm_X85Fa8L0 | gjoseph92 3309802 | 2022-06-23T17:37:59Z | 2022-06-23T17:37:59Z | NONE | FYI @robin-cls I would be a bit surprised if there is anything you can do on your end to fix things here with off-the-shelf dask. What @dcherian mentioned in https://github.com/dask/distributed/issues/6360#issuecomment-1129484190 is probably the only thing that might work. Otherwise you'll need to run one of my experimental branches. |
{ "total_count": 1, "+1": 1, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
Means of zarr arrays cause a memory overload in dask workers 1277437106 | |
1164686511 | https://github.com/pydata/xarray/issues/6709#issuecomment-1164686511 | https://api.github.com/repos/pydata/xarray/issues/6709 | IC_kwDOAMm_X85Fa7Sv | robin-cls 74916839 | 2022-06-23T17:33:48Z | 2022-06-23T17:33:48Z | NONE | Thanks @gjoseph92 and @dcherian. I'll try the different approaches in the links you have provided to see if I can improve my current solution (I compute the fields separately, which means more IO and more operations). |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
Means of zarr arrays cause a memory overload in dask workers 1277437106 | |
1164672075 | https://github.com/pydata/xarray/issues/6709#issuecomment-1164672075 | https://api.github.com/repos/pydata/xarray/issues/6709 | IC_kwDOAMm_X85Fa3xL | dcherian 2448579 | 2022-06-23T17:18:05Z | 2022-06-23T17:18:05Z | MEMBER | Thanks @gjoseph92. I think there is a small but important increase in complexity here because we do [...]. IIUC the example in https://github.com/dask/distributed/issues/6571 is basically [...] |
{ "total_count": 1, "+1": 1, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
Means of zarr arrays cause a memory overload in dask workers 1277437106 | |
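A hedged sketch of the comparison being drawn here, assuming the example in https://github.com/dask/distributed/issues/6571 boils down to a single reduction of one array: the workload in this issue computes three means whose inputs overlap, so every anom_u / anom_v chunk has several downstream consumers.

```python
# Hedged sketch; shapes are arbitrary stand-ins.
import dask.array as da

u = da.random.random((100, 98, 192), chunks=(10, 98, 192))
v = da.random.random((100, 98, 192), chunks=(10, 98, 192))

# Roughly the dask/distributed#6571 shape of problem (assumption): one
# reduction, one consumer per input chunk.
simple = u.mean(axis=0)

# This issue: three reductions sharing inputs, so each u chunk feeds both
# u**2 and u*v, and each v chunk feeds both v**2 and u*v.
uu = (u ** 2).mean(axis=0)
vv = (v ** 2).mean(axis=0)
uv = (u * v).mean(axis=0)
```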
1164660225 | https://github.com/pydata/xarray/issues/6709#issuecomment-1164660225 | https://api.github.com/repos/pydata/xarray/issues/6709 | IC_kwDOAMm_X85Fa04B | gjoseph92 3309802 | 2022-06-23T17:05:12Z | 2022-06-23T17:05:12Z | NONE | Thanks @dcherian, yeah this is definitely root task overproduction. I think your case is somewhat similar to @TomNicholas's https://github.com/dask/distributed/issues/6571 (that one might even be a little simpler actually). There's some prototyping going on to address this, but FYI I'd say "soon" is probably on the couple-month timescale right now. https://github.com/dask/distributed/pull/6598 or https://github.com/dask/distributed/pull/6614 will probably make this work. I'm hopefully going to benchmark these against some real workloads in the next couple of days, so I'll probably add yours in. Thanks for the MVCE!
See https://github.com/dask/distributed/issues/6360#issuecomment-1129434333 and the linked issues for why this happens. |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
Means of zarr arrays cause a memory overload in dask workers 1277437106 | |
1164519484 | https://github.com/pydata/xarray/issues/6709#issuecomment-1164519484 | https://api.github.com/repos/pydata/xarray/issues/6709 | IC_kwDOAMm_X85FaSg8 | dcherian 2448579 | 2022-06-23T14:59:57Z | 2022-06-23T15:07:11Z | MEMBER | This looks like the classic distributed scheduling issue that might get addressed soon: https://github.com/dask/distributed/issues/6560. @gjoseph92, the MVCE in https://github.com/pydata/xarray/issues/6709#issuecomment-1164064469 might be a useful benchmark / regression test. |
{ "total_count": 1, "+1": 1, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
Means of zarr arrays cause a memory overload in dask workers 1277437106 | |
1164064469 | https://github.com/pydata/xarray/issues/6709#issuecomment-1164064469 | https://api.github.com/repos/pydata/xarray/issues/6709 | IC_kwDOAMm_X85FYjbV | robin-cls 74916839 | 2022-06-23T07:40:33Z | 2022-06-23T07:43:20Z | NONE | Thanks for the tips, I was investigating inline_array=True and still no luck. The graph seems OK though; I can attach it if you want, but I think zarr is not the culprit. Here is why: in the first case, where we build the array from scratch, the ones array is simple, and Dask seems to understand that it does not have to make many copies of it. So when replacing ones with random data, we observe the same behavior as opening the dataset from a zarr store (high memory usage on a worker):

```python
import dask.array as da
import numpy as np
import xarray as xr

ds = xr.Dataset(
    dict(
        anom_u=(["time", "face", "j", "i"], da.random.random((10311, 1, 987, 1920), chunks=(10, 1, 987, 1920))),
        anom_v=(["time", "face", "j", "i"], da.random.random((10311, 1, 987, 1920), chunks=(10, 1, 987, 1920))),
    )
)

ds["anom_uu_mean"] = (["face", "j", "i"], np.mean(ds.anom_u.data ** 2, axis=0))
ds["anom_vv_mean"] = (["face", "j", "i"], np.mean(ds.anom_v.data ** 2, axis=0))
ds["anom_uv_mean"] = (["face", "j", "i"], np.mean(ds.anom_u.data * ds.anom_v.data, axis=0))

ds[["anom_uu_mean", "anom_vv_mean", "anom_uv_mean"]].compute()
```

I think the question now is why Dask must load so much data when doing my operation. If we take the computation graph (I've put the non-optimized version), my understanding is that we could do the following:
- Load the first chunk of anom_u
- Load the first chunk of anom_v
- Do the multiplications anom_u * anom_v, anom_u ** 2 and anom_v ** 2
- Do the mean-chunk task
- Unload all the previous tasks
- Redo the same and combine the mean-chunk tasks

For information, one chunk is about 1.4G, so I expect to see peaks of 5 * 1.4 = 7G in memory (plus what's needed to store the mean_chunk), but I instead got 15G+ in my worker, most of it taken by the random-sample tasks. Is my understanding of the distributed mean wrong? Why are the random-sample tasks not flushed? |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
Means of zarr arrays cause a memory overload in dask workers 1277437106 | |
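A small sketch, meant to be run after the snippet above (it reuses that ds), of how one might check the per-chunk footprint behind the 5 * 1.4 = 7G estimate; float64 is assumed, matching da.random.random.

```python
# Hedged sketch: reuses the `ds` built in the snippet above.
import numpy as np

chunk_bytes = np.prod(ds.anom_u.data.chunksize) * ds.anom_u.data.dtype.itemsize
print(
    f"one anom_u chunk ≈ {chunk_bytes / 1e9:.2f} GB "
    f"({ds.anom_u.data.npartitions} chunks per variable)"
)
```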
1163255775 | https://github.com/pydata/xarray/issues/6709#issuecomment-1163255775 | https://api.github.com/repos/pydata/xarray/issues/6709 | IC_kwDOAMm_X85FVd_f | dcherian 2448579 | 2022-06-22T15:24:24Z | 2022-06-22T15:24:24Z | MEMBER | You can use dask.visualize on the underlying dask arrays to look at the graph without building a Delayed object. You could try passing inline_array=True when opening the zarr store. |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
Means of zarr arrays cause a memory overload in dask workers 1277437106 | |
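A hedged sketch of the two tips as they are picked up later in the thread: visualize straight from the underlying dask arrays (no Delayed needed) and pass inline_array=True when opening the store. The store path and the chunks argument are placeholders.

```python
# Hedged sketch; "anoms.zarr" is a placeholder path.
import dask
import xarray as xr

# inline_array=True inlines the array lookup into each getter task instead of
# keeping it as a separate root task in the graph.
ds = xr.open_dataset("anoms.zarr", engine="zarr", chunks={}, inline_array=True)

anom_uv_mean = (ds.anom_u * ds.anom_v).mean("time")

# Render the graph directly from the dask array, without going through Delayed.
dask.visualize(anom_uv_mean.data, optimize_graph=True, filename="graph.svg")
```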
1162833362 | https://github.com/pydata/xarray/issues/6709#issuecomment-1162833362 | https://api.github.com/repos/pydata/xarray/issues/6709 | IC_kwDOAMm_X85FT23S | robin-cls 74916839 | 2022-06-22T08:54:18Z | 2022-06-22T09:10:04Z | NONE | Hi @TomNicholas I've reduced the original dataset to 11 chunks over the time dimension so that we can see the graph properly. I also replaced the .compute operation with a to_zarr(compute=False), because I don't know how to visualize xarray operations without generating a Delayed object (comments are welcome on this point!). Anyway, here are the files: the first one is the graph where the means are built from dask.ones arrays; the second one is the graph where the means are built from the same arrays but opened from a zarr store. I am quite a newbie at debugging dask graphs, but everything seems OK in the second graph, apart from the open_dataset tasks that are linked to a parent task. Also, I noticed that Dask has fused the 'ones' operations in the first graph. Would it help if I generated other arrays with zeros instead? |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
Means of zarr arrays cause a memory overload in dask workers 1277437106 | |
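The visualization route described above, written out as a hedged sketch (paths are placeholders): to_zarr(compute=False) returns a dask Delayed wrapping the whole write, and its .visualize() renders the graph without computing anything.

```python
# Hedged sketch of the approach described above; paths are placeholders.
import xarray as xr

ds = xr.open_dataset("anoms.zarr", engine="zarr", chunks={})
ds["anom_uu_mean"] = (ds.anom_u ** 2).mean("time")
ds["anom_vv_mean"] = (ds.anom_v ** 2).mean("time")
ds["anom_uv_mean"] = (ds.anom_u * ds.anom_v).mean("time")

# compute=False gives back a Delayed for the full store-write...
delayed_write = ds[["anom_uu_mean", "anom_vv_mean", "anom_uv_mean"]].to_zarr(
    "means.zarr", compute=False
)
# ...which can be rendered without executing anything.
delayed_write.visualize(filename="write_graph.svg")
```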
1161987610 | https://github.com/pydata/xarray/issues/6709#issuecomment-1161987610 | https://api.github.com/repos/pydata/xarray/issues/6709 | IC_kwDOAMm_X85FQoYa | TomNicholas 35968931 | 2022-06-21T16:31:58Z | 2022-06-21T16:31:58Z | MEMBER | Hi @robin-cls - can we see the dask graphs (or a representative subset of them) for these two cases? |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
Means of zarr arrays cause a memory overload in dask workers 1277437106 |
CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue] ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user] ON [issue_comments] ([user]);