html_url,issue_url,id,node_id,user,created_at,updated_at,author_association,body,reactions,performed_via_github_app,issue
https://github.com/pydata/xarray/issues/4475#issuecomment-702307334,https://api.github.com/repos/pydata/xarray/issues/4475,702307334,MDEyOklzc3VlQ29tbWVudDcwMjMwNzMzNA==,2560426,2020-10-01T18:07:55Z,2020-10-01T18:07:55Z,NONE,"Sounds good, I'll do this in the meantime. Still quite interested in `save_mfdataset` dealing with these lower-level details, if possible. The ideal case would be loading with `open_mfdataset`, defining some ops lazily, then piping that directly to `save_mfdataset`.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,712189206
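A minimal sketch of that ideal workflow, assuming a placeholder per-group computation `lazy_op`, an invented input glob, and an invented grouping variable `"key"` (none of these names come from the thread):
```
import xarray as xr

ds = xr.open_mfdataset("input/*.nc")                     # lazy, dask-backed load
datasets = [lazy_op(g) for _, g in ds.groupby("key")]    # ops defined lazily
paths = [f"group_{i}.nc" for i in range(len(datasets))]  # one output file per group
xr.save_mfdataset(datasets, paths)                       # single parallel write step
```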
https://github.com/pydata/xarray/issues/4475#issuecomment-702276824,https://api.github.com/repos/pydata/xarray/issues/4475,702276824,MDEyOklzc3VlQ29tbWVudDcwMjI3NjgyNA==,2448579,2020-10-01T17:13:16Z,2020-10-01T17:13:16Z,MEMBER,"> doesn't that then require me to load everything into memory before writing?
I think so.
I would try multiple processes and see if that is fast enough for what you want to do. Or else, write to zarr. This will be parallelized and is a lot easier than dealing with HDF5.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,712189206
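A minimal sketch of the zarr route (the input glob and output path are assumptions; with a dask-backed dataset, `to_zarr` writes chunks in parallel, without HDF5's locking constraints):
```
import xarray as xr

ds = xr.open_mfdataset("input/*.nc")  # lazy, dask-backed dataset
ds.to_zarr("output.zarr")             # chunks are written in parallel
```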
https://github.com/pydata/xarray/issues/4475#issuecomment-702265883,https://api.github.com/repos/pydata/xarray/issues/4475,702265883,MDEyOklzc3VlQ29tbWVudDcwMjI2NTg4Mw==,2560426,2020-10-01T16:52:59Z,2020-10-01T16:52:59Z,NONE,"Multiple threads (the default), because it's recommended ""for numeric code that releases the GIL (like NumPy, Pandas, Scikit-Learn, Numba, …)"" according to the dask docs.
I guess I could do multi-threaded for the compute part (everything up to the definition of `ds`), then multi-process for the write part, but doesn't that then require me to load everything into memory before writing?","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,712189206
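For reference, dask exposes the scheduler as a per-call choice; a sketch, where `graph` stands in for any lazy collection or delayed result:
```
import dask

dask.compute(graph, scheduler="threads")    # default for dask.array; suits GIL-releasing numeric code
dask.compute(graph, scheduler="processes")  # sidesteps the GIL, but arguments and
                                            # results are pickled between worker processes
```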
https://github.com/pydata/xarray/issues/4475#issuecomment-702226256,https://api.github.com/repos/pydata/xarray/issues/4475,702226256,MDEyOklzc3VlQ29tbWVudDcwMjIyNjI1Ng==,2448579,2020-10-01T15:46:45Z,2020-10-01T15:46:45Z,MEMBER,Are you using multiple threads or multiple processes? IIUC you should be using multiple processes for max writing efficiency.,"{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,712189206
https://github.com/pydata/xarray/issues/4475#issuecomment-702178407,https://api.github.com/repos/pydata/xarray/issues/4475,702178407,MDEyOklzc3VlQ29tbWVudDcwMjE3ODQwNw==,2560426,2020-10-01T14:34:28Z,2020-10-01T14:34:28Z,NONE,"Thank you, this works for me. However, it's quite slow, and the runtime seems to grow worse than linearly as the length of `datasets` (the number of groups in the `groupby`) increases.
Could it be connected to https://github.com/pydata/xarray/issues/2912#issuecomment-485497398 where they suggest using `save_mfdataset` instead of `to_netcdf`? If so, there's a stronger case for supporting delayed objects in `save_mfdataset`, as you said.
Appreciate the help!","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,712189206
https://github.com/pydata/xarray/issues/4475#issuecomment-701694586,https://api.github.com/repos/pydata/xarray/issues/4475,701694586,MDEyOklzc3VlQ29tbWVudDcwMTY5NDU4Ng==,1217238,2020-09-30T23:13:33Z,2020-09-30T23:13:33Z,MEMBER,"I think we could support delayed objects in `save_mfdataset`, at least in principle. But if you're OK using delayed objects, you might as well write each netCDF file separately using `dask.delayed`, e.g.,
```
import dask

def write_dataset(dataset, path):
    # compute the per-group result and write it out within one task
    your_function(dataset).to_netcdf(path)

result = [dask.delayed(write_dataset)(ds, path) for ds, path in zip(datasets, paths)]
dask.compute(result)
```","{""total_count"": 3, ""+1"": 3, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,712189206
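Given the scheduler discussion above, one plausible variant: since each delayed task computes *and* writes its own group, the graph could be dispatched to the processes scheduler without first materializing everything in the parent process (a sketch, reusing `result` from the snippet above):
```
dask.compute(result, scheduler="processes")
```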
https://github.com/pydata/xarray/issues/4475#issuecomment-701688956,https://api.github.com/repos/pydata/xarray/issues/4475,701688956,MDEyOklzc3VlQ29tbWVudDcwMTY4ODk1Ng==,2448579,2020-09-30T22:55:28Z,2020-09-30T22:55:28Z,MEMBER,"You could write to netCDF in `your_function` and avoid `save_mfdataset` altogether...
I guess this is a good argument for adding a `preprocess` kwarg.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,712189206
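For illustration only: such a kwarg would presumably mirror `open_mfdataset`'s `preprocess`, something like the hypothetical call below (`save_mfdataset` does not accept this argument today):
```
xr.save_mfdataset(datasets, paths, preprocess=your_function)  # hypothetical kwarg, not implemented
```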
https://github.com/pydata/xarray/issues/4475#issuecomment-701676076,https://api.github.com/repos/pydata/xarray/issues/4475,701676076,MDEyOklzc3VlQ29tbWVudDcwMTY3NjA3Ng==,2560426,2020-09-30T22:17:24Z,2020-09-30T22:17:24Z,NONE,"Unfortunately that doesn't work:
`TypeError: save_mfdataset only supports writing Dataset objects, received type `","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,712189206
https://github.com/pydata/xarray/issues/4475#issuecomment-701577652,https://api.github.com/repos/pydata/xarray/issues/4475,701577652,MDEyOklzc3VlQ29tbWVudDcwMTU3NzY1Mg==,2448579,2020-09-30T18:51:25Z,2020-09-30T18:51:25Z,MEMBER,"you could use `dask.delayed` here
```
new_datasets = [dask.delayed(your_function)(dset) for dset in datasets]
xr.save_mfdataset(new_datasets, paths)
```
I *think* this will work, but I've never used `save_mfdataset`. This is how `preprocess` is implemented with `open_mfdataset` btw.","{""total_count"": 1, ""+1"": 1, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,712189206