Comment on pydata/xarray#3668 (2020-01-08): https://github.com/pydata/xarray/issues/3668#issuecomment-572311400

@rabernat Each Dask worker is running on its own machine. The data I am trying to work with is distributed among the workers, but all of it is accessible from any individual worker via cross-mounted NFS shares, so it effectively acts as shared data storage. None of that data is available on the client.

For now, I'm trying to open just a single zarr store. I only mentioned `open_mfdataset` as an example because it has the `parallel` option, unlike `open_dataset` or `open_zarr`. This is really not about combining multiple datasets, but about working with data on a remote Dask cluster. Sorry if I haven't made that absolutely clear from the start.
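To make the setup concrete, the client side looks roughly like this (the scheduler address and the mount point below are placeholders, not my actual configuration):

```python
from dask.distributed import Client

# Hypothetical scheduler address; the workers run on the machines that
# cross-mount the NFS shares, while the client machine has no access to the data.
client = Client("tcp://scheduler-host:8786")

# A path like this resolves on every worker, but not on the client.
zarr_dataset_path = "/mnt/shared/data/dataset.zarr"
```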

@dcherian You mean this code?

```python
def modify(ds):
    # modify ds here
    return ds

# this is basically what open_mfdataset does
open_kwargs = dict(decode_cf=True, decode_times=False)
open_tasks = [dask.delayed(xr.open_dataset)(f, **open_kwargs) for f in file_names]
tasks = [dask.delayed(modify)(task) for task in open_tasks]
datasets = dask.compute(tasks)  # get a list of xarray.Datasets
combined = xr.combine_nested(datasets)  # or some combination of concat, merge
```

In the case of a single data source, I think it can be condensed into this:

```python
open_kwargs = dict(decode_cf=True, decode_times=False)
ds = dask.compute(dask.delayed(xr.open_dataset)(file_name, **open_kwargs))[0]
```

But it doesn't work quite as I expected, either with zarr or with NetCDF. First I'll have to explain what I get with `open_dataset` and a NetCDF file. The code above runs, but when I try to do calculations on the resulting dataset, for example

```python
ds['Temp'].mean().compute()
```

I get

```
FileNotFoundError: [Errno 2] No such file or directory
```

on the client. It only runs properly if I wrap it in `dask.delayed` again:

```python
dask.compute(dask.delayed(ds['Temp'].mean)())
```

So this approach is not fully equivalent to what `open_mfdataset` does, and unfortunately that doesn't work for me, because I would like to be able to use the xarray dataset transparently, without having to program Dask explicitly.

If I add `chunks={}` to `open_kwargs`, similar to this line in the `open_mfdataset` implementation (https://github.com/pydata/xarray/blob/v0.14.1/xarray/backends/api.py#L885), then it starts behaving exactly like `open_mfdataset` and I can use the dataset transparently. I don't quite understand what's going on there, but so far so good.
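For reference, here is a minimal sketch of the variant that does work for me (`file_name` is a placeholder for any NetCDF file visible from the workers):

```python
import dask
import xarray as xr

file_name = "/mnt/shared/data/file.nc"  # placeholder path on the NFS share

# chunks={} makes the variables dask-backed, mirroring what the
# open_mfdataset implementation passes to open_dataset internally
open_kwargs = dict(decode_cf=True, decode_times=False, chunks={})

# open the file on the cluster instead of on the client
ds = dask.compute(dask.delayed(xr.open_dataset)(file_name, **open_kwargs))[0]

# this now runs on the workers without an extra dask.delayed wrapper
print(ds['Temp'].mean().compute())
```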

Now, back to zarr:

```python
ds = dask.compute(dask.delayed(xr.open_zarr)(zarr_dataset_path, **open_kwargs))[0]
```

doesn't run at all, regardless of the `chunks` setting, giving me

```
ValueError: group not found at path ''
```

so I don't even get a dataset object. It seems that something is quite different in the zarr backend implementation. I haven't had a chance to look at the code carefully yet, but I will do so in the next few days.

Sorry for this long-winded explanation; I hope it clarifies what I'm trying to achieve here.
