html_url,issue_url,id,node_id,user,created_at,updated_at,author_association,body,reactions,performed_via_github_app,issue
https://github.com/pydata/xarray/issues/6803#issuecomment-1280786780,https://api.github.com/repos/pydata/xarray/issues/6803,1280786780,IC_kwDOAMm_X85MV0Fc,33886395,2022-10-17T12:33:18Z,2022-10-17T12:33:18Z,NONE,"I will try that. I still find it odd that I need to wrap a numpy object in a dask/xarray object to be able to send it to workers, when `dask.scatter` exists for exactly that purpose. Thanks for opening that issue. I do feel there is a need to revisit scatter's functionality and role, particularly around dynamic clusters.

Having a better look at your initial comment, that may still work if you call the `Future.result()` method inside the applied function. In theory, that should retrieve the data associated with that Future, in this case ""Hello World"". However, in a dask_gateway setup that will fail","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1307523148
https://github.com/pydata/xarray/issues/6803#issuecomment-1280764797,https://api.github.com/repos/pydata/xarray/issues/6803,1280764797,IC_kwDOAMm_X85MVut9,6213168,2022-10-17T12:15:36Z,2022-10-17T12:20:02Z,MEMBER,"```python
new_data_future = xr.apply_ufunc(
    _copy_test,
    data,
    a_x,
    ...
)
```
*instead* of using kwargs. I've opened https://github.com/dask/distributed/issues/7140 to simplify this.
With it implemented, my snippet
```python
test = np.full((20,), 30)
a = da.from_array(test)
dsk = client.scatter(dict(a.dask), broadcast=True)
a = da.Array(dsk, name=a.name, chunks=a.chunks, dtype=a.dtype, meta=a._meta, shape=a.shape)
a_x = xarray.DataArray(a, dims=[""new_z""])
```
would become
```python
test = np.full((20,), 30)
a_x = xarray.DataArray(test, dims=[""new_z""]).chunk()
a_x = client.scatter(a_x)
```
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1307523148
https://github.com/pydata/xarray/issues/6803#issuecomment-1280759221,https://api.github.com/repos/pydata/xarray/issues/6803,1280759221,IC_kwDOAMm_X85MVtW1,33886395,2022-10-17T12:11:05Z,2022-10-17T12:11:05Z,NONE,"I'm not sure I understand the code above. In my case I have an array of approximately 300k elements that each and every function call needs access to. I can pass it as kwargs in its numpy form, but once I scale the calculation up across a large dataset (many large chunks), that array gets replicated for every task, pushing the scheduler out of memory. That is why I tried to send the dataset to the cluster beforehand using scatter, but I cannot resolve the Future at the workers","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1307523148
https://github.com/pydata/xarray/issues/6803#issuecomment-1280746923,https://api.github.com/repos/pydata/xarray/issues/6803,1280746923,IC_kwDOAMm_X85MVqWr,6213168,2022-10-17T12:01:17Z,2022-10-17T12:01:17Z,MEMBER,"Having said the above, your design is... contrived. There isn't, as of today, a straightforward way to scatter a local dask collection (`persist()` will push the whole thing through the scheduler and likely send it out of memory).
Workaround:
```python
test = np.full((20,), 30)
a = da.from_array(test)
dsk = client.scatter(dict(a.dask), broadcast=True)
a = da.Array(dsk, name=a.name, chunks=a.chunks, dtype=a.dtype, meta=a._meta, shape=a.shape)
a_x = xarray.DataArray(a, dims=[""new_z""])
```
Once you have `a_x`, you just pass it to the args (not kwargs) of `apply_ufunc`.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1307523148
https://github.com/pydata/xarray/issues/6803#issuecomment-1280743293,https://api.github.com/repos/pydata/xarray/issues/6803,1280743293,IC_kwDOAMm_X85MVpd9,33886395,2022-10-17T11:59:19Z,2022-10-17T11:59:19Z,NONE,"I can add that this problem is compounded in a dask_gateway system, where the task simply fails. With `apply_ufunc` I never received an error, but in a similar context I obtained something very similar to https://github.com/dask/dask-gateway/issues/404. My interpretation is that the Future is resolved at the worker (or, in the case of apply_ufunc, a thread of that worker) and embeds a reference to the Client object. The latter, however, uses a gateway connection that is not understood by the worker, as it is generally the scheduler that deals with those","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1307523148
https://github.com/pydata/xarray/issues/6803#issuecomment-1280729879,https://api.github.com/repos/pydata/xarray/issues/6803,1280729879,IC_kwDOAMm_X85MVmMX,6213168,2022-10-17T11:45:31Z,2022-10-17T11:45:31Z,MEMBER,"> This is still an issue. I noticed that the documentation of `map_blocks` states: **kwargs** ([mapping](https://docs.python.org/3/glossary.html#term-mapping)) – Passed verbatim to func after unpacking. xarray objects, if any, will not be subset to blocks. _Passing dask collections in kwargs is not allowed_.
>
> Is this the case for `apply_ufunc` as well?

test_future is not a dask collection.
It's a distributed.Future, which points to an arbitrary, opaque data blob that xarray has no means to know about. FWIW, I could reproduce the issue, where the future in the kwargs is not resolved to the data it points to, as one would expect. Minimal reproducer:
```python
import distributed
import xarray

client = distributed.Client(processes=False)
x = xarray.DataArray([1, 2]).chunk()
test_future = client.scatter(""Hello World"")

def f(d, test):
    print(test)
    return d

y = xarray.apply_ufunc(
    f,
    x,
    dask='parallelized',
    output_dtypes=""float64"",
    kwargs={'test': test_future},
)
y.compute()
```
Expected print output: `Hello World`
Actual print output: ` `","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1307523148
https://github.com/pydata/xarray/issues/6803#issuecomment-1264523142,https://api.github.com/repos/pydata/xarray/issues/6803,1264523142,IC_kwDOAMm_X85LXxeG,33886395,2022-10-02T01:29:35Z,2022-10-02T01:29:35Z,NONE,"I think I may have narrowed the problem down to a limitation of dask when using dask_gateway. If a Future is passed to a worker, the worker will try to unpickle that Future, and as part of that, unpickle the Client object passed when creating the Future. Unfortunately, in a dask_gateway context the client is behind a `gateway` connection that is not understood by the worker, which normally does not have to deal with a gateway at all.
In my case I do not get any error message, just the task failing and retrying over and over, but by fiddling around I managed to get the same error as this post (https://stackoverflow.com/questions/70775315/scattering-data-to-dask-cluster-workers-unknown-address-scheme-gateway)","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1307523148
https://github.com/pydata/xarray/issues/6803#issuecomment-1260319916,https://api.github.com/repos/pydata/xarray/issues/6803,1260319916,IC_kwDOAMm_X85LHvSs,33886395,2022-09-28T02:53:25Z,2022-09-28T02:53:25Z,NONE,"This is still an issue. I noticed that the documentation of `map_blocks` states:

__kwargs__ ([mapping](https://docs.python.org/3/glossary.html#term-mapping)) – Passed verbatim to func after unpacking. xarray objects, if any, will not be subset to blocks. _Passing dask collections in kwargs is not allowed_.

Is this the case for `apply_ufunc` as well? If yes, then it is not documented. Is there another recommended way to pass data to workers without clogging the scheduler for this application?","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1307523148