issues: 1391738128
This data as json
id | node_id | number | title | user | state | locked | assignee | milestone | comments | created_at | updated_at | closed_at | author_association | active_lock_reason | draft | pull_request | body | reactions | performed_via_github_app | state_reason | repo | type |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1391738128 | I_kwDOAMm_X85S9D0Q | 7109 | Multiprocessing unable to pickle Dataset opened with open_mfdataset | 18426352 | closed | 0 | 4 | 2022-09-30T02:43:43Z | 2022-10-11T16:44:36Z | 2022-10-11T16:44:35Z | CONTRIBUTOR |

### What happened?

When passing a Dataset object opened using `open_mfdataset` to a function via `multiprocessing`, the Dataset cannot be pickled and the call fails.

### What did you expect to happen?

I expected the Dataset to be handed off to the function via multiprocessing without error. I can remove the error by using variable subsetting or other reduction (see the PASS tests in the example below).

### Minimal Complete Verifiable Example

```python
#!/usr/bin/env python
import xarray as xr
import numpy as np
import glob
import multiprocessing

# Create toy DataArrays
temperature = np.array([[273.15, 220.2, 255.5], [221.1, 260.1, 270.5]])
humidity = np.array([[70.2, 85.4, 29.6], [30.3, 55.4, 100.0]])
da1 = xr.DataArray(temperature, dims=['y0', 'x0'],
                   coords={'y0': np.array([0, 1]), 'x0': np.array([0, 1, 2])})
da2 = xr.DataArray(humidity, dims=['y0', 'x0'],
                   coords={'y0': np.array([0, 1]), 'x0': np.array([0, 1, 2])})

# Create a toy Dataset
ds = xr.Dataset({'TEMP_K': da1, 'RELHUM': da2})

# Write the toy Dataset to disk
ds.to_netcdf('xarray_pickle_dataset.nc')

# Function to use with open_mfdataset
def preprocess(ds):
    ds = ds.rename({'TEMP_K': 'temp_k'})
    return ds

# Function for using with multiprocessing
def calc_stats(ds, stat_name):
    if stat_name == 'mean':
        return ds.mean(dim=['y0']).to_dataframe()

# Get a pool of workers
mp = multiprocessing.Pool(5)

# Glob for the file
ncfiles = glob.glob('xarray*.nc')

# Can we call open_mfdataset() on a ds in memory?
# datasets = [xr.open_dataset(x) for x in ncfiles]
datasets = [xr.open_mfdataset([x], preprocess=preprocess) for x in ncfiles]

# TEST 1: ERROR
results = mp.starmap(calc_stats, [(ds, 'mean') for ds in datasets])
print(results)

# TEST 2: PASS
results = mp.starmap(calc_stats, [(ds[['temp_k', 'RELHUM']], 'mean') for ds in datasets])
print(results)

# TEST 3: ERROR
results = mp.starmap(calc_stats, [(ds.isel(x0=0), 'mean') for ds in datasets])
print(results)

# TEST 4: PASS
results = mp.starmap(calc_stats, [(ds.where(ds.RELHUM > 80.0), 'mean') for ds in datasets])
print(results)

# TEST 5: ERROR
results = mp.starmap(calc_stats, [(ds.sel(x0=slice(0, 1, 1)), 'mean') for ds in datasets])
print(results)
```

### MVCE confirmation
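The failure mode can be illustrated without xarray: `Pool.starmap` pickles every argument it sends to a worker, and an object that holds an open file handle (as a lazily loaded dataset does) cannot be pickled until the data is pulled into memory and the handle dropped. A minimal stdlib sketch of that pattern, where `LazyReader` is a hypothetical stand-in class (not an xarray API), purely for illustration:

```python
import pickle
import tempfile

# Stand-in for a lazily loaded dataset: it keeps an open file handle
# instead of reading everything up front. Hypothetical class, for
# illustration only.
class LazyReader:
    def __init__(self, path):
        self.path = path
        self._fh = open(path)  # open handles cannot be pickled

    def load(self):
        # Pull the data into memory and drop the handle so the
        # object becomes picklable.
        self.data = self._fh.read()
        self._fh.close()
        self._fh = None
        return self

with tempfile.NamedTemporaryFile("w", suffix=".dat", delete=False) as f:
    f.write("273.15 220.2 255.5")
    path = f.name

lazy = LazyReader(path)
try:
    pickle.dumps(lazy)  # what Pool.starmap does to every argument
    pickled_before_load = True
except TypeError:
    pickled_before_load = False  # fails: open handle in __dict__

payload = pickle.dumps(lazy.load())  # succeeds once data is in memory
print(pickled_before_load, len(payload) > 0)
```

This mirrors why the PASS tests above work: operations that materialize the data (or copy it into a new in-memory object) leave nothing unpicklable behind.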
### Relevant log output
### Anything else we need to know?

Not shown in the verifiable example was another way I was able to get it to work.
The error does NOT occur when using `xr.open_dataset` (see the commented-out alternative in the MVCE).

### Environment
xr.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 3.9.12 | packaged by conda-forge | (main, Mar 24 2022, 23:25:59)
[GCC 10.3.0]
python-bits: 64
OS: Linux
OS-release: 4.19.0-21-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.1
libnetcdf: 4.8.1
xarray: 2022.3.0
pandas: 1.4.2
numpy: 1.22.3
scipy: 1.8.0
netCDF4: 1.5.8
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: 1.6.0
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2022.05.0
distributed: 2022.5.0
matplotlib: 3.5.1
cartopy: 0.20.2
seaborn: None
numbagg: None
fsspec: 2022.3.0
cupy: None
pint: 0.19.2
sparse: None
setuptools: 62.1.0
pip: 22.0.4
conda: None
pytest: None
IPython: None
sphinx: None
|
{ "url": "https://api.github.com/repos/pydata/xarray/issues/7109/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
completed | 13221727 | issue |