id,node_id,number,title,user,state,locked,assignee,milestone,comments,created_at,updated_at,closed_at,author_association,active_lock_reason,draft,pull_request,body,reactions,performed_via_github_app,state_reason,repo,type
1391738128,I_kwDOAMm_X85S9D0Q,7109,Multiprocessing unable to pickle Dataset opened with open_mfdataset,18426352,closed,0,,,4,2022-09-30T02:43:43Z,2022-10-11T16:44:36Z,2022-10-11T16:44:35Z,CONTRIBUTOR,,,,"### What happened?

When passing a Dataset object opened using `open_mfdataset` to a function via Python's `multiprocessing.Pool`, I received the following error:

`AttributeError: Can't pickle local object 'open_mfdataset.<locals>.multi_file_closer'`

### What did you expect to happen?

I expected the Dataset to be handed off to the function via multiprocessing without error. I can avoid the error by subsetting variables or applying another reduction, such as `where`, so I don't understand why the original Dataset object returned from `open_mfdataset` cannot be used.

### Minimal Complete Verifiable Example

```Python
#!/usr/bin/env python

import xarray as xr
import numpy as np
import glob
import multiprocessing

# Create toy DataArrays
temperature = np.array([[273.15,220.2,255.5],[221.1,260.1,270.5]])
humidity = np.array([[70.2,85.4,29.6],[30.3,55.4,100.0]])
da1 = xr.DataArray(temperature,dims=['y0','x0'],coords={'y0':np.array([0,1]),'x0':np.array([0,1,2])})
da2 = xr.DataArray(humidity,dims=['y0','x0'],coords={'y0':np.array([0,1]),'x0':np.array([0,1,2])})

# Create a toy Dataset
ds = xr.Dataset({'TEMP_K':da1,'RELHUM':da2})

# Write the toy Dataset to disk
ds.to_netcdf('xarray_pickle_dataset.nc')

# Function to use with open_mfdataset
def preprocess(ds):
    ds = ds.rename({'TEMP_K':'temp_k'})
    return(ds)

# Function for using with multiprocessing
def calc_stats(ds,stat_name):
    if stat_name=='mean':
        return(ds.mean(dim=['y0']).to_dataframe())

# Get a pool of workers
mp = multiprocessing.Pool(5)

# Glob for the file
ncfiles = glob.glob('xarray*.nc')

# Can we call open_mfdataset() on a ds in memory?
#datasets = [xr.open_dataset(x) for x in ncfiles]
datasets = [xr.open_mfdataset([x],preprocess=preprocess) for x in ncfiles]

# TEST 1: ERROR
results = mp.starmap(calc_stats,[(ds,'mean') for ds in datasets])
print(results)

# TEST 2: PASS
#results = mp.starmap(calc_stats,[(ds[['temp_k','RELHUM']],'mean') for ds in datasets])
#print(results)

# TEST 3: ERROR
#results = mp.starmap(calc_stats,[(ds.isel(x0=0),'mean') for ds in datasets])
#print(results)

# TEST 4: PASS
#results = mp.starmap(calc_stats,[(ds.where(ds.RELHUM>80.0),'mean') for ds in datasets])
#print(results)

# TEST 5: ERROR
#results = mp.starmap(calc_stats,[(ds.sel(x0=slice(0,1,1)),'mean') for ds in datasets])
#print(results)
```

### MVCE confirmation

- [X] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
- [X] Complete example — the example is self-contained, including all data and the text of any traceback.
- [X] Verifiable example — the example copy & pastes into an IPython prompt or [Binder notebook](https://mybinder.org/v2/gh/pydata/xarray/main?urlpath=lab/tree/doc/examples/blank_template.ipynb), returning the result.
- [X] New issue — a search of GitHub Issues suggests this is not a duplicate.
### Relevant log output

```Python
Traceback (most recent call last):
  File ""/d1/git/xarray_pickle_dataset.py"", line 35, in <module>
    results = mp.starmap(calc_stats,[(ds,'mean') for ds in datasets])
  File ""/home/.conda/envs/icing/lib/python3.9/multiprocessing/pool.py"", line 372, in starmap
    return self._map_async(func, iterable, starmapstar, chunksize).get()
  File ""/home/.conda/envs/icing/lib/python3.9/multiprocessing/pool.py"", line 771, in get
    raise self._value
  File ""/home/.conda/envs/icing/lib/python3.9/multiprocessing/pool.py"", line 537, in _handle_tasks
    put(task)
  File ""/home/.conda/envs/icing/lib/python3.9/multiprocessing/connection.py"", line 211, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File ""/home/.conda/envs/icing/lib/python3.9/multiprocessing/reduction.py"", line 51, in dumps
    cls(buf, protocol).dump(obj)
AttributeError: Can't pickle local object 'open_mfdataset.<locals>.multi_file_closer'
```

### Anything else we need to know?

Not shown in the verifiable example is another variant that worked for me:

```
results = mp.starmap(calc_stats,[(ds.sel(x0=ds.xvalues,y0=ds.yvalues),'mean') for ds in datasets])
print(results)
```

I can only assume that, under the hood, passing `ds.xvalues` (a 1D DataArray within the Dataset) to `sel` transforms the Dataset enough to avoid the pickling error.

The error does NOT occur when using `open_dataset`, e.g. ```datasets = [xr.open_dataset(x) for x in ncfiles]``` will work. However, in my workflow I would prefer to use `open_mfdataset` so that I can apply some preprocessing via `preprocess`, even though I am only opening one Dataset at a time.

### Environment
xr.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.9.12 | packaged by conda-forge | (main, Mar 24 2022, 23:25:59) [GCC 10.3.0]
python-bits: 64
OS: Linux
OS-release: 4.19.0-21-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.1
libnetcdf: 4.8.1

xarray: 2022.3.0
pandas: 1.4.2
numpy: 1.22.3
scipy: 1.8.0
netCDF4: 1.5.8
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: 1.6.0
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2022.05.0
distributed: 2022.5.0
matplotlib: 3.5.1
cartopy: 0.20.2
seaborn: None
numbagg: None
fsspec: 2022.3.0
cupy: None
pint: 0.19.2
sparse: None
setuptools: 62.1.0
pip: 22.0.4
conda: None
pytest: None
IPython: None
sphinx: None
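
The error message points at a function defined inside `open_mfdataset`, and Python's `pickle` cannot serialize a function by reference when it is local to another function. Below is a minimal standard-library sketch of that behaviour and of one common workaround (a module-level function bound with `functools.partial`). All names in it (`open_with_local_closer`, `open_with_partial_closer`, `_multi_file_closer`, `noop`, `is_picklable`) are hypothetical stand-ins for illustration; this is an assumption about the mechanism, not xarray's actual code.

```python
import pickle
from functools import partial


def open_with_local_closer(closers):
    # Mimics the failing pattern: the closer is defined inside another
    # function, so its qualified name contains '<locals>' and pickle
    # cannot look it up by name.
    def multi_file_closer():
        for closer in closers:
            closer()

    return multi_file_closer


def _multi_file_closer(closers):
    # Module-level equivalent: pickle can reference it by qualified name.
    for closer in closers:
        closer()


def open_with_partial_closer(closers):
    # Bind the arguments with partial instead of capturing them in a closure.
    return partial(_multi_file_closer, closers)


def noop():
    # Hypothetical stand-in for a per-file close callback.
    pass


def is_picklable(obj):
    # The exception type for local objects differs between Python versions,
    # so both are caught here.
    try:
        pickle.dumps(obj)
        return True
    except (AttributeError, pickle.PicklingError):
        return False


local_closer = open_with_local_closer([noop])
partial_closer = open_with_partial_closer([noop])

print('local closure picklable:', is_picklable(local_closer))
print('partial of module-level function picklable:', is_picklable(partial_closer))
```

If this assumption holds, it would also fit the observation that `open_dataset` works: the failure depends on what is attached to the Dataset as its close callback, not on the data itself.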
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/7109/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue
1394947888,PR_kwDOAMm_X85AESPp,7116,Fix pickling of Datasets created using open_mfdataset,18426352,closed,0,,,2,2022-10-03T15:42:41Z,2022-10-11T16:44:35Z,2022-10-11T16:44:35Z,CONTRIBUTOR,,0,pydata/xarray/pulls/7116,"
- [x] Closes #7109
- [x] Tests added
- [x] User visible changes (including notable bug fixes) are documented in `whats-new.rst`
- [ ] New functions/methods are listed in `api.rst`
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/7116/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,pull