id,node_id,number,title,user,state,locked,assignee,milestone,comments,created_at,updated_at,closed_at,author_association,active_lock_reason,draft,pull_request,body,reactions,performed_via_github_app,state_reason,repo,type
1391738128,I_kwDOAMm_X85S9D0Q,7109,Multiprocessing unable to pickle Dataset opened with open_mfdataset,18426352,closed,0,,,4,2022-09-30T02:43:43Z,2022-10-11T16:44:36Z,2022-10-11T16:44:35Z,CONTRIBUTOR,,,,"### What happened?

When passing a Dataset object opened using `open_mfdataset` to a function via Python's multiprocessing.Pool module, I received the following error:

`AttributeError: Can't pickle local object 'open_mfdataset.<locals>.multi_file_closer'`

### What did you expect to happen?

I expected the Dataset to be handed off to the function via multiprocessing without error. I can remove the error by using variable subsetting or another reduction, e.g. via `where`, so I don't understand why the original Dataset object returned from `open_mfdataset` cannot be used.

### Minimal Complete Verifiable Example

```Python
#!/usr/bin/env python

import xarray as xr
import numpy as np
import glob
import multiprocessing

# Create toy DataArrays
temperature = np.array([[273.15, 220.2, 255.5], [221.1, 260.1, 270.5]])
humidity = np.array([[70.2, 85.4, 29.6], [30.3, 55.4, 100.0]])
da1 = xr.DataArray(temperature, dims=['y0', 'x0'], coords={'y0': np.array([0, 1]), 'x0': np.array([0, 1, 2])})
da2 = xr.DataArray(humidity, dims=['y0', 'x0'], coords={'y0': np.array([0, 1]), 'x0': np.array([0, 1, 2])})

# Create a toy Dataset
ds = xr.Dataset({'TEMP_K': da1, 'RELHUM': da2})

# Write the toy Dataset to disk
ds.to_netcdf('xarray_pickle_dataset.nc')

# Function to use with open_mfdataset
def preprocess(ds):
    ds = ds.rename({'TEMP_K': 'temp_k'})
    return ds

# Function for use with multiprocessing
def calc_stats(ds, stat_name):
    if stat_name == 'mean':
        return ds.mean(dim=['y0']).to_dataframe()

# Get a pool of workers
mp = multiprocessing.Pool(5)

# Glob for the file
ncfiles = glob.glob('xarray*.nc')

# Can we call open_mfdataset() on a ds in memory?
#datasets = [xr.open_dataset(x) for x in ncfiles]
datasets = [xr.open_mfdataset([x], preprocess=preprocess) for x in ncfiles]

# TEST 1: ERROR
results = mp.starmap(calc_stats, [(ds, 'mean') for ds in datasets])
print(results)

# TEST 2: PASS
#results = mp.starmap(calc_stats, [(ds[['temp_k', 'RELHUM']], 'mean') for ds in datasets])
#print(results)

# TEST 3: ERROR
#results = mp.starmap(calc_stats, [(ds.isel(x0=0), 'mean') for ds in datasets])
#print(results)

# TEST 4: PASS
#results = mp.starmap(calc_stats, [(ds.where(ds.RELHUM > 80.0), 'mean') for ds in datasets])
#print(results)

# TEST 5: ERROR
#results = mp.starmap(calc_stats, [(ds.sel(x0=slice(0, 1, 1)), 'mean') for ds in datasets])
#print(results)
```

### MVCE confirmation

- [X] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
- [X] Complete example — the example is self-contained, including all data and the text of any traceback.
- [X] Verifiable example — the example copy & pastes into an IPython prompt or [Binder notebook](https://mybinder.org/v2/gh/pydata/xarray/main?urlpath=lab/tree/doc/examples/blank_template.ipynb), returning the result.
- [X] New issue — a search of GitHub Issues suggests this is not a duplicate.
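For context on the error itself: the Dataset returned by `open_mfdataset` holds a reference to a closer function defined inside the `open_mfdataset` call (that is what `open_mfdataset.<locals>.multi_file_closer` in the message refers to), and Python's pickle cannot serialize locally defined functions. A stripped-down illustration of that general limitation, with hypothetical names that only stand in for the xarray internals:

```Python
import pickle

def outer():
    def inner():  # a local function, analogous to multi_file_closer
        pass
    return inner

# Raises: AttributeError: Can't pickle local object 'outer.<locals>.inner'
pickle.dumps(outer())
```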
### Relevant log output

```Python
Traceback (most recent call last):
  File ""/d1/git/xarray_pickle_dataset.py"", line 35, in <module>
    results = mp.starmap(calc_stats,[(ds,'mean') for ds in datasets])
  File ""/home/.conda/envs/icing/lib/python3.9/multiprocessing/pool.py"", line 372, in starmap
    return self._map_async(func, iterable, starmapstar, chunksize).get()
  File ""/home/.conda/envs/icing/lib/python3.9/multiprocessing/pool.py"", line 771, in get
    raise self._value
  File ""/home/.conda/envs/icing/lib/python3.9/multiprocessing/pool.py"", line 537, in _handle_tasks
    put(task)
  File ""/home/.conda/envs/icing/lib/python3.9/multiprocessing/connection.py"", line 211, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File ""/home/.conda/envs/icing/lib/python3.9/multiprocessing/reduction.py"", line 51, in dumps
    cls(buf, protocol).dump(obj)
AttributeError: Can't pickle local object 'open_mfdataset.<locals>.multi_file_closer'
```

### Anything else we need to know?

Not shown in the verifiable example was another way I was able to get it to work, which looked like this:

```
results = mp.starmap(calc_stats,[(ds.sel(x0=ds.xvalues,y0=ds.yvalues),'mean') for ds in datasets])
print(results)
```

I can only assume that, under the hood, passing `ds.xvalues` (a 1D DataArray within the Dataset) to `sel` transforms the Dataset enough to avoid the pickling error.

The error does NOT occur when using `open_dataset`; e.g., `datasets = [xr.open_dataset(x) for x in ncfiles]` works. However, in my workflow I would prefer to use `open_mfdataset` so that I can perform some preprocessing via its `preprocess` argument, even though I am only opening one Dataset at a time.
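A workaround I would expect to sidestep the problem entirely (a sketch, untested): send file paths to the workers and open the files inside each worker, so no Dataset object ever has to be pickled. Names are reused from the example above, and the helper must be defined before the pool is created:

```Python
# Sketch: open inside each worker so the Dataset never crosses the
# process boundary via pickle. Assumes preprocess and calc_stats are
# defined at module level as in the example above.
def calc_stats_from_path(path, stat_name):
    ds = xr.open_mfdataset([path], preprocess=preprocess)
    return calc_stats(ds, stat_name)

with multiprocessing.Pool(5) as pool:
    results = pool.starmap(calc_stats_from_path, [(x, 'mean') for x in ncfiles])
print(results)
```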
### Environment

xr.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.9.12 | packaged by conda-forge | (main, Mar 24 2022, 23:25:59) [GCC 10.3.0]
python-bits: 64
OS: Linux
OS-release: 4.19.0-21-amd64
machine: x86_64
processor: 
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.1
libnetcdf: 4.8.1

xarray: 2022.3.0
pandas: 1.4.2
numpy: 1.22.3
scipy: 1.8.0
netCDF4: 1.5.8
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: 1.6.0
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2022.05.0
distributed: 2022.5.0
matplotlib: 3.5.1
cartopy: 0.20.2
seaborn: None
numbagg: None
fsspec: 2022.3.0
cupy: None
pint: 0.19.2
sparse: None
setuptools: 62.1.0
pip: 22.0.4
conda: None
pytest: None
IPython: None
sphinx: None
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/7109/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue 774553196,MDU6SXNzdWU3NzQ1NTMxOTY=,4733,Xarray with cfgrib backend errors with .where() when drop=True,18426352,closed,0,,,4,2020-12-24T20:26:58Z,2021-01-02T08:17:37Z,2021-01-02T08:17:37Z,CONTRIBUTOR,,,," **What happened**: When loading a HRRR GRIBv2 file in this manner: `ds = xr.open_dataset(gribfile_path,engine='cfgrib',backend_kwargs={'filter_by_keys':{'typeOfLevel':'hybrid'}})` I have trouble using the `.where()` method when `drop=True`. If I set `drop=False`, it works fine. I am attempting to subset via latitude and longitude like this: `ds_sub = ds.where((ds.latitude>=40.0)&(ds.latitude<=50.0)&(ds.longitude>=252.0)&(ds.longitude<=280.0),drop=True)` However I receive the following errors: ``` Traceback (most recent call last): File ""read_interp_format_grib.py"", line 77, in test = ds.where(mask_lat & mask_lon,drop=True) File ""/d1/anaconda3/envs/era5/lib/python3.8/site-packages/xarray/core/common.py"", line 1268, in where return ops.where_method(self, cond, other) File ""/d1/anaconda3/envs/era5/lib/python3.8/site-packages/xarray/core/ops.py"", line 193, in where_method return apply_ufunc( File ""/d1/anaconda3/envs/era5/lib/python3.8/site-packages/xarray/core/computation.py"", line 1092, in apply_ufunc return apply_dataset_vfunc( File ""/d1/anaconda3/envs/era5/lib/python3.8/site-packages/xarray/core/computation.py"", line 410, in apply_dataset_vfunc result_vars = apply_dict_of_variables_vfunc( File ""/d1/anaconda3/envs/era5/lib/python3.8/site-packages/xarray/core/computation.py"", line 356, in apply_dict_of_variables_vfunc result_vars[name] = func(*variable_args) File ""/d1/anaconda3/envs/era5/lib/python3.8/site-packages/xarray/core/computation.py"", line 606, in apply_variable_ufunc input_data = [ File ""/d1/anaconda3/envs/era5/lib/python3.8/site-packages/xarray/core/computation.py"", line 607, in broadcast_compat_data(arg, broadcast_dims, core_dims) File ""/d1/anaconda3/envs/era5/lib/python3.8/site-packages/xarray/core/computation.py"", line 522, in broadcast_compat_data data = variable.data File ""/d1/anaconda3/envs/era5/lib/python3.8/site-packages/xarray/core/variable.py"", line 359, in data return self.values File ""/d1/anaconda3/envs/era5/lib/python3.8/site-packages/xarray/core/variable.py"", line 510, in values return _as_array_or_item(self._data) File ""/d1/anaconda3/envs/era5/lib/python3.8/site-packages/xarray/core/variable.py"", line 272, in _as_array_or_item data = np.asarray(data) File ""/d1/anaconda3/envs/era5/lib/python3.8/site-packages/numpy/core/_asarray.py"", line 83, in asarray return array(a, dtype, copy=False, order=order) File ""/d1/anaconda3/envs/era5/lib/python3.8/site-packages/xarray/core/indexing.py"", line 685, in __array__ self._ensure_cached() File ""/d1/anaconda3/envs/era5/lib/python3.8/site-packages/xarray/core/indexing.py"", line 682, in _ensure_cached self.array = NumpyIndexingAdapter(np.asarray(self.array)) File ""/d1/anaconda3/envs/era5/lib/python3.8/site-packages/numpy/core/_asarray.py"", line 83, in asarray return array(a, dtype, copy=False, order=order) File ""/d1/anaconda3/envs/era5/lib/python3.8/site-packages/xarray/core/indexing.py"", line 655, in __array__ return np.asarray(self.array, dtype=dtype) File ""/d1/anaconda3/envs/era5/lib/python3.8/site-packages/numpy/core/_asarray.py"", line 83, in asarray return 
    return array(a, dtype, copy=False, order=order)
  File ""/d1/anaconda3/envs/era5/lib/python3.8/site-packages/xarray/core/indexing.py"", line 560, in __array__
    return np.asarray(array[self.key], dtype=None)
  File ""/d1/anaconda3/envs/era5/lib/python3.8/site-packages/xarray/backends/cfgrib_.py"", line 23, in __getitem__
    return indexing.explicit_indexing_adapter(
  File ""/d1/anaconda3/envs/era5/lib/python3.8/site-packages/xarray/core/indexing.py"", line 848, in explicit_indexing_adapter
    result = NumpyIndexingAdapter(np.asarray(result))[numpy_indices]
  File ""/d1/anaconda3/envs/era5/lib/python3.8/site-packages/xarray/core/indexing.py"", line 1294, in __getitem__
    return array[key]
IndexError: too many indices for array: array is 2-dimensional, but 3 were indexed
```

**What you expected to happen**:

I expect the dataset to be reduced in the x and y (latitude/longitude) dimensions, dropping points where the `.where()` condition is False, when `drop=True`.

**Minimal Complete Verifiable Example**:

```python
# Put your MCVE code here
```

**Anything else we need to know?**:

I was able to confirm cfgrib is where the issue lies by doing the following:

```
ds = xr.open_dataset(gribfile_path,engine='cfgrib',backend_kwargs={'filter_by_keys':{'typeOfLevel':'hybrid'}})
ds.to_netcdf('test.nc')
dsnc = xr.open_dataset('test.nc')
ds_sub = dsnc.where((dsnc.latitude>=40.0)&(dsnc.latitude<=50.0)&(dsnc.longitude>=252.0)&(dsnc.longitude<=280.0),drop=True)
```

That correctly gives me:

`Dimensions: (hybrid: 50, x: 792, y: 414)`

Originally x=1799 and y=1059.
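A possible workaround on the reading side (a sketch, untested here): force the cfgrib-backed variables into memory with `.load()` before masking, which should bypass the lazy backend indexing path where the failure occurs:

```python
ds = xr.open_dataset(gribfile_path, engine='cfgrib',
                     backend_kwargs={'filter_by_keys': {'typeOfLevel': 'hybrid'}})
ds = ds.load()  # realize all variables as in-memory numpy arrays
ds_sub = ds.where((ds.latitude >= 40.0) & (ds.latitude <= 50.0) &
                  (ds.longitude >= 252.0) & (ds.longitude <= 280.0), drop=True)
```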
**Environment**:

Output of xr.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.8.6 | packaged by conda-forge | (default, Oct 7 2020, 19:08:05) [GCC 7.5.0]
python-bits: 64
OS: Linux
OS-release: 4.9.0-14-amd64
machine: x86_64
processor: 
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.6
libnetcdf: 4.7.4

xarray: 0.16.1
pandas: 1.1.3
numpy: 1.19.1
scipy: 1.5.2
netCDF4: 1.5.5.1
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: 1.2.1
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: 0.9.8.5
iris: None
bottleneck: None
dask: 2.30.0
distributed: 2.30.0
matplotlib: 3.3.2
cartopy: 0.17.0
seaborn: None
numbagg: None
pint: 0.16.1
setuptools: 49.6.0.post20200917
pip: 20.2.3
conda: None
pytest: None
IPython: None
sphinx: None
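One more alternative that avoids `.where(drop=True)` altogether (a sketch, assuming the dimensions are named `y` and `x` and the mask's dims come out in `(y, x)` order, consistent with the `Dimensions:` output above): compute the bounding box of the mask explicitly and slice with `isel`:

```python
import numpy as np

mask = ((ds.latitude >= 40.0) & (ds.latitude <= 50.0) &
        (ds.longitude >= 252.0) & (ds.longitude <= 280.0))
yidx, xidx = np.nonzero(mask.values)  # row/column indices of the True cells
ds_sub = ds.isel(y=slice(yidx.min(), yidx.max() + 1),
                 x=slice(xidx.min(), xidx.max() + 1))
```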
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/4733/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue