Comment on pydata/xarray#3668 (2020-01-08): https://github.com/pydata/xarray/issues/3668#issuecomment-572311400

@rabernat Each Dask worker is running on its own machine. The data I am trying to work with is distributed among the workers, but all of it is accessible from any individual worker via cross-mounted NFS shares, so it effectively acts as shared data storage. None of that data is available on the client.

For now, I'm trying to open just a single zarr store. I only mentioned `open_mfdataset` as an example because it has the `parallel` option, unlike `open_dataset` or `open_zarr`. This is really not about combining multiple datasets, but about working with data on a remote Dask cluster. Sorry if I haven't made that absolutely clear from the start.
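To make the setup concrete, the client side looks roughly like this (the scheduler address and the mount point below are placeholders, not my actual configuration):

```python
from dask.distributed import Client

# Hypothetical scheduler address; the workers run on the machines that
# cross-mount the NFS shares, while the client machine has no access to the data.
client = Client("tcp://scheduler-host:8786")

# A path like this resolves on every worker, but not on the client.
zarr_dataset_path = "/mnt/shared/data/dataset.zarr"
```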

@dcherian You mean this code?

```python
def modify(ds):
    # modify ds here
    return ds

# this is basically what open_mfdataset does
open_kwargs = dict(decode_cf=True, decode_times=False)
open_tasks = [dask.delayed(xr.open_dataset)(f, **open_kwargs) for f in file_names]
tasks = [dask.delayed(modify)(task) for task in open_tasks]
datasets = dask.compute(tasks)  # get a list of xarray.Datasets
combined = xr.combine_nested(datasets)  # or some combination of concat, merge
```

In the case of a single data source, I think it can be condensed into this:

```python
open_kwargs = dict(decode_cf=True, decode_times=False)
ds = dask.compute(dask.delayed(xr.open_dataset)(file_name, **open_kwargs))[0]
```

But it doesn't work quite as I expected, either with zarr or with NetCDF. First I'll have to explain what I get with `open_dataset` and a NetCDF file. The code above runs, but when I try to do calculations on the resulting dataset, for example

```python
ds['Temp'].mean().compute()
```

I get

```
FileNotFoundError: [Errno 2] No such file or directory
```

on the client. It only runs properly if I wrap it in `dask.delayed` again:

```python
dask.compute(dask.delayed(ds['Temp'].mean)())
```

So this approach is not fully equivalent to what `open_mfdataset` does, and unfortunately that doesn't work for me, because I would like to be able to use the xarray dataset transparently, without having to program Dask explicitly.

If I add `chunks={}` to `open_kwargs`, similar to this line in the `open_mfdataset` implementation (https://github.com/pydata/xarray/blob/v0.14.1/xarray/backends/api.py#L885), then it starts behaving exactly like `open_mfdataset` and I can use the dataset transparently. I don't quite understand what's going on there, but so far so good.
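For reference, here is a minimal sketch of the variant that does work for me (`file_name` is a placeholder for any NetCDF file visible from the workers):

```python
import dask
import xarray as xr

file_name = "/mnt/shared/data/file.nc"  # placeholder path on the NFS share

# chunks={} makes the variables dask-backed, mirroring what the
# open_mfdataset implementation passes to open_dataset internally
open_kwargs = dict(decode_cf=True, decode_times=False, chunks={})

# open the file on the cluster instead of on the client
ds = dask.compute(dask.delayed(xr.open_dataset)(file_name, **open_kwargs))[0]

# this now runs on the workers without an extra dask.delayed wrapper
print(ds['Temp'].mean().compute())
```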

Now, back to zarr:

```python
ds = dask.compute(dask.delayed(xr.open_zarr)(zarr_dataset_path, **open_kwargs))[0]
```

doesn't run at all, regardless of the `chunks` setting, giving me

```
ValueError: group not found at path ''
```

so I don't even get a dataset object. It seems that something is quite different in the zarr backend implementation. I haven't had a chance to look at the code carefully yet, but I will do so in the next few days.

Sorry for this long-winded explanation; I hope it clarifies what I'm trying to achieve here.
