
Multiprocessing unable to pickle Dataset opened with open_mfdataset

  • issue: #7109 (id 1391738128, node_id I_kwDOAMm_X85S9D0Q)
  • repo: pydata/xarray
  • user: 18426352
  • state: closed (state_reason: completed)
  • author_association: CONTRIBUTOR
  • comments: 4
  • created_at: 2022-09-30T02:43:43Z
  • updated_at: 2022-10-11T16:44:36Z
  • closed_at: 2022-10-11T16:44:35Z

What happened?

When passing a Dataset object opened with open_mfdataset to a function via Python's multiprocessing.Pool, I received the following error: AttributeError: Can't pickle local object 'open_mfdataset.<locals>.multi_file_closer'
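As a quick check (a sketch, not part of the original script): multiprocessing serializes worker arguments with pickle, so the same failure can be reproduced without any Pool by pickling the Dataset directly, assuming the toy file written by the MVCE below already exists.

```Python
# Sketch (assumes xarray_pickle_dataset.nc from the MVCE below has been written):
# multiprocessing pickles the arguments it sends to workers, so pickling the
# Dataset directly reproduces the same error without involving a Pool.
import pickle

import xarray as xr

ds = xr.open_mfdataset(['xarray_pickle_dataset.nc'])
try:
    pickle.dumps(ds)
except AttributeError as err:
    print(err)  # Can't pickle local object 'open_mfdataset.<locals>.multi_file_closer'
```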

What did you expect to happen?

I expected the Dataset to be handed off to the function via multiprocessing without error. I can avoid the error by subsetting variables or applying another reduction (for example via where), so I don't understand why the original Dataset object returned by open_mfdataset cannot be used.

Minimal Complete Verifiable Example

```Python
#!/usr/bin/env python

import xarray as xr
import numpy as np
import glob
import multiprocessing

# Create toy DataArrays
temperature = np.array([[273.15,220.2,255.5],[221.1,260.1,270.5]])
humidity = np.array([[70.2,85.4,29.6],[30.3,55.4,100.0]])
da1 = xr.DataArray(temperature,dims=['y0','x0'],coords={'y0':np.array([0,1]),'x0':np.array([0,1,2])})
da2 = xr.DataArray(humidity,dims=['y0','x0'],coords={'y0':np.array([0,1]),'x0':np.array([0,1,2])})

# Create a toy Dataset
ds = xr.Dataset({'TEMP_K':da1,'RELHUM':da2})

# Write the toy Dataset to disk
ds.to_netcdf('xarray_pickle_dataset.nc')

# Function to use with open_mfdataset
def preprocess(ds):
    ds = ds.rename({'TEMP_K':'temp_k'})
    return(ds)

# Function for use with multiprocessing
def calc_stats(ds,stat_name):
    if stat_name=='mean':
        return(ds.mean(dim=['y0']).to_dataframe())

# Get a pool of workers
mp = multiprocessing.Pool(5)

# Glob for the file
ncfiles = glob.glob('xarray*.nc')

# Can we call open_mfdataset() on a ds in memory?
# datasets = [xr.open_dataset(x) for x in ncfiles]
datasets = [xr.open_mfdataset([x],preprocess=preprocess) for x in ncfiles]

# TEST 1: ERROR
results = mp.starmap(calc_stats,[(ds,'mean') for ds in datasets])
print(results)

# TEST 2: PASS
results = mp.starmap(calc_stats,[(ds[['temp_k','RELHUM']],'mean') for ds in datasets])
print(results)

# TEST 3: ERROR
results = mp.starmap(calc_stats,[(ds.isel(x0=0),'mean') for ds in datasets])
print(results)

# TEST 4: PASS
results = mp.starmap(calc_stats,[(ds.where(ds.RELHUM>80.0),'mean') for ds in datasets])
print(results)

# TEST 5: ERROR
results = mp.starmap(calc_stats,[(ds.sel(x0=slice(0,1,1)),'mean') for ds in datasets])
print(results)
```

MVCE confirmation

  • [X] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • [X] Complete example — the example is self-contained, including all data and the text of any traceback.
  • [X] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • [X] New issue — a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

Traceback (most recent call last):
  File "/d1/git/xarray_pickle_dataset.py", line 35, in <module>
    results = mp.starmap(calc_stats,[(ds,'mean') for ds in datasets])
  File "/home/.conda/envs/icing/lib/python3.9/multiprocessing/pool.py", line 372, in starmap
    return self._map_async(func, iterable, starmapstar, chunksize).get()
  File "/home/.conda/envs/icing/lib/python3.9/multiprocessing/pool.py", line 771, in get
    raise self._value
  File "/home/.conda/envs/icing/lib/python3.9/multiprocessing/pool.py", line 537, in _handle_tasks
    put(task)
  File "/home/.conda/envs/icing/lib/python3.9/multiprocessing/connection.py", line 211, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "/home/.conda/envs/icing/lib/python3.9/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
AttributeError: Can't pickle local object 'open_mfdataset.<locals>.multi_file_closer'
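The final frame points at the mechanism: multiprocessing hands each task to _ForkingPickler, and pickle can only serialize functions by an importable module-level name, which a function defined inside another function (such as open_mfdataset's multi_file_closer) does not have. A minimal sketch with hypothetical names (make_container / local_closer), independent of xarray, hits the same wall.

```Python
# Minimal sketch, no xarray involved: pickling any object that holds a
# reference to a nested (local) function fails the same way, because pickle
# stores functions by their importable name and nested functions have none.
import pickle


def make_container():
    def local_closer():  # stands in for open_mfdataset.<locals>.multi_file_closer
        pass

    return {"close_callback": local_closer}


obj = make_container()
try:
    pickle.dumps(obj)
except AttributeError as err:
    print(err)  # Can't pickle local object 'make_container.<locals>.local_closer'
```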

Anything else we need to know?

Not shown in the verifiable example was another way I was able to get it to work, which looked like this:

results = mp.starmap(calc_stats,[(ds.sel(x0=ds.xvalues,y0=ds.yvalues),'mean') for ds in datasets])
print(results)

I can only assume that, under the hood, passing ds.xvalues (a 1D DataArray within the Dataset) to sel transforms the Dataset enough to avoid the pickling error.

The error does NOT occur when using open_dataset, e.g. datasets = [xr.open_dataset(x) for x in ncfiles] works. However, in my workflow I would prefer to use open_mfdataset so that I can apply some preprocessing via preprocess, even though I am only opening one Dataset at a time.
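Not part of the report above, but one way to sidestep the constraint entirely is to send only file paths to the workers and call open_mfdataset inside each worker, so the Dataset itself never has to be pickled. A sketch follows, reusing the names from the MVCE plus a hypothetical calc_stats_from_path helper.

```Python
# Workaround sketch (assumes xarray_pickle_dataset.nc from the MVCE exists):
# only the path string crosses the process boundary, so nothing created by
# open_mfdataset needs to be pickled.
import glob
import multiprocessing

import xarray as xr


def preprocess(ds):
    return ds.rename({'TEMP_K': 'temp_k'})


def calc_stats_from_path(path, stat_name):
    # Open (and close) the dataset inside the worker process.
    with xr.open_mfdataset([path], preprocess=preprocess) as ds:
        if stat_name == 'mean':
            return ds.mean(dim=['y0']).to_dataframe()


if __name__ == '__main__':
    ncfiles = glob.glob('xarray*.nc')
    with multiprocessing.Pool(5) as pool:
        results = pool.starmap(calc_stats_from_path,
                               [(path, 'mean') for path in ncfiles])
    print(results)
```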

Environment

xr.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.9.12 | packaged by conda-forge | (main, Mar 24 2022, 23:25:59) [GCC 10.3.0]
python-bits: 64
OS: Linux
OS-release: 4.19.0-21-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.1
libnetcdf: 4.8.1
xarray: 2022.3.0
pandas: 1.4.2
numpy: 1.22.3
scipy: 1.8.0
netCDF4: 1.5.8
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: 1.6.0
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2022.05.0
distributed: 2022.5.0
matplotlib: 3.5.1
cartopy: 0.20.2
seaborn: None
numbagg: None
fsspec: 2022.3.0
cupy: None
pint: 0.19.2
sparse: None
setuptools: 62.1.0
pip: 22.0.4
conda: None
pytest: None
IPython: None
sphinx: None
