issues

2 rows where type = "issue" and user = 18426352 sorted by updated_at descending

Issue #7109: Multiprocessing unable to pickle Dataset opened with open_mfdataset
id: 1391738128 · node_id: I_kwDOAMm_X85S9D0Q · user: DanielAdriaansen (18426352) · state: closed · locked: 0 · comments: 4 · created_at: 2022-09-30T02:43:43Z · updated_at: 2022-10-11T16:44:36Z · closed_at: 2022-10-11T16:44:35Z · author_association: CONTRIBUTOR

What happened?

When passing a Dataset object opened using open_mfdataset to a function via Python's multiprocessing.Pool, I received the following error: AttributeError: Can't pickle local object 'open_mfdataset.<locals>.multi_file_closer'

What did you expect to happen?

I expected the Dataset to be handed off to the function via multiprocessing without error. I can remove the error by using variable subsetting or other reduction, like via where, so I don't understand why the original Dataset object returned from open_mfdataset cannot be used.
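One plausible reading of the error message, sketched below without xarray, is that the object returned by open_mfdataset holds a reference to a function defined inside another function, and that some derived objects no longer carry that reference. The names open_many, closer, and the dictionary layout are hypothetical stand-ins, not xarray's internals; the sketch only shows the general pickle behaviour that would produce this pattern.

```python
import pickle


def open_many(paths):
    """Hypothetical stand-in for an opener that attaches a cleanup callback."""

    def closer():
        # A function defined inside another function ("local object").
        pass

    # The returned container keeps a reference to the local function.
    return {"paths": list(paths), "close": closer}


def subset(handle, n):
    """Hypothetical derived object that does NOT carry the callback over."""
    return {"paths": handle["paths"][:n]}


def can_pickle(obj):
    """Return True if pickle.dumps succeeds on obj."""
    try:
        pickle.dumps(obj)
        return True
    except (AttributeError, pickle.PicklingError):
        # Local functions cannot be pickled by reference.
        return False


original = open_many(["a.nc", "b.nc"])
derived = subset(original, 1)

original_ok = can_pickle(original)  # container still references closer
derived_ok = can_pickle(derived)    # plain data only

print(original_ok, derived_ok)
```

If xarray behaves similarly, operations that build a fresh Dataset without copying the close callback would pickle cleanly, while operations that preserve it would fail. That is an assumption about the cause, not something the report confirms.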

Minimal Complete Verifiable Example

```Python
#!/usr/bin/env python

import xarray as xr
import numpy as np
import glob
import multiprocessing

# Create toy DataArrays
temperature = np.array([[273.15,220.2,255.5],[221.1,260.1,270.5]])
humidity = np.array([[70.2,85.4,29.6],[30.3,55.4,100.0]])
da1 = xr.DataArray(temperature,dims=['y0','x0'],coords={'y0':np.array([0,1]),'x0':np.array([0,1,2])})
da2 = xr.DataArray(humidity,dims=['y0','x0'],coords={'y0':np.array([0,1]),'x0':np.array([0,1,2])})

# Create a toy Dataset
ds = xr.Dataset({'TEMP_K':da1,'RELHUM':da2})

# Write the toy Dataset to disk
ds.to_netcdf('xarray_pickle_dataset.nc')

# Function to use with open_mfdataset
def preprocess(ds):
    ds = ds.rename({'TEMP_K':'temp_k'})
    return(ds)

# Function for using with multiprocessing
def calc_stats(ds,stat_name):
    if stat_name=='mean':
        return(ds.mean(dim=['y0']).to_dataframe())

# Get a pool of workers
mp = multiprocessing.Pool(5)

# Glob for the file
ncfiles = glob.glob('xarray*.nc')

# Can we call open_mfdataset() on a ds in memory?
#datasets = [xr.open_dataset(x) for x in ncfiles]
datasets = [xr.open_mfdataset([x],preprocess=preprocess) for x in ncfiles]

# TEST 1: ERROR
results = mp.starmap(calc_stats,[(ds,'mean') for ds in datasets])
print(results)

# TEST 2: PASS
#results = mp.starmap(calc_stats,[(ds[['temp_k','RELHUM']],'mean') for ds in datasets])
#print(results)

# TEST 3: ERROR
#results = mp.starmap(calc_stats,[(ds.isel(x0=0),'mean') for ds in datasets])
#print(results)

# TEST 4: PASS
#results = mp.starmap(calc_stats,[(ds.where(ds.RELHUM>80.0),'mean') for ds in datasets])
#print(results)

# TEST 5: ERROR
#results = mp.starmap(calc_stats,[(ds.sel(x0=slice(0,1,1)),'mean') for ds in datasets])
#print(results)
```

MVCE confirmation

  • [X] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • [X] Complete example — the example is self-contained, including all data and the text of any traceback.
  • [X] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • [X] New issue — a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

Traceback (most recent call last):
  File "/d1/git/xarray_pickle_dataset.py", line 35, in <module>
    results = mp.starmap(calc_stats,[(ds,'mean') for ds in datasets])
  File "/home/.conda/envs/icing/lib/python3.9/multiprocessing/pool.py", line 372, in starmap
    return self._map_async(func, iterable, starmapstar, chunksize).get()
  File "/home/.conda/envs/icing/lib/python3.9/multiprocessing/pool.py", line 771, in get
    raise self._value
  File "/home/.conda/envs/icing/lib/python3.9/multiprocessing/pool.py", line 537, in _handle_tasks
    put(task)
  File "/home/.conda/envs/icing/lib/python3.9/multiprocessing/connection.py", line 211, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "/home/.conda/envs/icing/lib/python3.9/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
AttributeError: Can't pickle local object 'open_mfdataset.<locals>.multi_file_closer'
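The traceback shows the failure happening when the pool serializes each task with ForkingPickler.dumps before sending it to a worker. A small pre-flight check along these lines can show which task arguments are the problem before starmap is called. The helper name unpicklable_tasks and the toy task list are hypothetical; only ForkingPickler from multiprocessing.reduction is taken from the traceback.

```python
import pickle
from multiprocessing.reduction import ForkingPickler


def unpicklable_tasks(tasks):
    """Return (index, error message) for every task that fails to serialize."""
    failures = []
    for index, task in enumerate(tasks):
        try:
            # Same serializer the pool uses when it sends a task to a worker.
            ForkingPickler.dumps(task)
        except (AttributeError, TypeError, pickle.PicklingError) as err:
            failures.append((index, str(err)))
    return failures


def make_local():
    """Build a function defined inside another function (not picklable)."""

    def local_closer():
        pass

    return local_closer


# Toy task list: the second task carries a local function, the first does not.
tasks = [
    ({"name": "plain data"}, "mean"),
    ({"name": "with closer", "close": make_local()}, "mean"),
]

failures = unpicklable_tasks(tasks)
for index, message in failures:
    print(f"task {index} cannot be pickled: {message}")
```

In the reported script, the same check could be run on the list of (ds, 'mean') tuples to see which Datasets fail before the pool raises.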

Anything else we need to know?

Not shown in the verifiable example is another approach that also worked:

    results = mp.starmap(calc_stats,[(ds.sel(x0=ds.xvalues,y0=ds.yvalues),'mean') for ds in datasets])
    print(results)

I can only assume that, under the hood, passing ds.xvalues (a 1D DataArray within the Dataset) to sel transforms the Dataset enough to avoid the pickling error.

The error does NOT occur when using open_dataset; e.g., datasets = [xr.open_dataset(x) for x in ncfiles] will work. However, in my workflow I would prefer to use open_mfdataset so that I can apply preprocessing via its preprocess argument, even though I am only opening one Dataset at a time.

Environment

xr.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.9.12 | packaged by conda-forge | (main, Mar 24 2022, 23:25:59) [GCC 10.3.0]
python-bits: 64
OS: Linux
OS-release: 4.19.0-21-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.1
libnetcdf: 4.8.1

xarray: 2022.3.0
pandas: 1.4.2
numpy: 1.22.3
scipy: 1.8.0
netCDF4: 1.5.8
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: 1.6.0
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2022.05.0
distributed: 2022.5.0
matplotlib: 3.5.1
cartopy: 0.20.2
seaborn: None
numbagg: None
fsspec: 2022.3.0
cupy: None
pint: 0.19.2
sparse: None
setuptools: 62.1.0
pip: 22.0.4
conda: None
pytest: None
IPython: None
sphinx: None
reactions: 0 · state_reason: completed · repo: xarray (13221727) · type: issue

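Issue #4733: Xarray with cfgrib backend errors with .where() when drop=True
id: 774553196 · node_id: MDU6SXNzdWU3NzQ1NTMxOTY= · user: DanielAdriaansen (18426352) · state: closed · locked: 0 · comments: 4 · created_at: 2020-12-24T20:26:58Z · updated_at: 2021-01-02T08:17:37Z · closed_at: 2021-01-02T08:17:37Z · author_association: CONTRIBUTOR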

What happened: When loading a HRRR GRIBv2 file in this manner:

ds = xr.open_dataset(gribfile_path,engine='cfgrib',backend_kwargs={'filter_by_keys':{'typeOfLevel':'hybrid'}})

I have trouble using the .where() method when drop=True. If I set drop=False, it works fine. I am attempting to subset via latitude and longitude like this:

ds_sub = ds.where((ds.latitude>=40.0)&(ds.latitude<=50.0)&(ds.longitude>=252.0)&(ds.longitude<=280.0),drop=True)

However, I receive the following errors:

Traceback (most recent call last):
  File "read_interp_format_grib.py", line 77, in <module>
    test = ds.where(mask_lat & mask_lon,drop=True)
  File "/d1/anaconda3/envs/era5/lib/python3.8/site-packages/xarray/core/common.py", line 1268, in where
    return ops.where_method(self, cond, other)
  File "/d1/anaconda3/envs/era5/lib/python3.8/site-packages/xarray/core/ops.py", line 193, in where_method
    return apply_ufunc(
  File "/d1/anaconda3/envs/era5/lib/python3.8/site-packages/xarray/core/computation.py", line 1092, in apply_ufunc
    return apply_dataset_vfunc(
  File "/d1/anaconda3/envs/era5/lib/python3.8/site-packages/xarray/core/computation.py", line 410, in apply_dataset_vfunc
    result_vars = apply_dict_of_variables_vfunc(
  File "/d1/anaconda3/envs/era5/lib/python3.8/site-packages/xarray/core/computation.py", line 356, in apply_dict_of_variables_vfunc
    result_vars[name] = func(*variable_args)
  File "/d1/anaconda3/envs/era5/lib/python3.8/site-packages/xarray/core/computation.py", line 606, in apply_variable_ufunc
    input_data = [
  File "/d1/anaconda3/envs/era5/lib/python3.8/site-packages/xarray/core/computation.py", line 607, in <listcomp>
    broadcast_compat_data(arg, broadcast_dims, core_dims)
  File "/d1/anaconda3/envs/era5/lib/python3.8/site-packages/xarray/core/computation.py", line 522, in broadcast_compat_data
    data = variable.data
  File "/d1/anaconda3/envs/era5/lib/python3.8/site-packages/xarray/core/variable.py", line 359, in data
    return self.values
  File "/d1/anaconda3/envs/era5/lib/python3.8/site-packages/xarray/core/variable.py", line 510, in values
    return _as_array_or_item(self._data)
  File "/d1/anaconda3/envs/era5/lib/python3.8/site-packages/xarray/core/variable.py", line 272, in _as_array_or_item
    data = np.asarray(data)
  File "/d1/anaconda3/envs/era5/lib/python3.8/site-packages/numpy/core/_asarray.py", line 83, in asarray
    return array(a, dtype, copy=False, order=order)
  File "/d1/anaconda3/envs/era5/lib/python3.8/site-packages/xarray/core/indexing.py", line 685, in __array__
    self._ensure_cached()
  File "/d1/anaconda3/envs/era5/lib/python3.8/site-packages/xarray/core/indexing.py", line 682, in _ensure_cached
    self.array = NumpyIndexingAdapter(np.asarray(self.array))
  File "/d1/anaconda3/envs/era5/lib/python3.8/site-packages/numpy/core/_asarray.py", line 83, in asarray
    return array(a, dtype, copy=False, order=order)
  File "/d1/anaconda3/envs/era5/lib/python3.8/site-packages/xarray/core/indexing.py", line 655, in __array__
    return np.asarray(self.array, dtype=dtype)
  File "/d1/anaconda3/envs/era5/lib/python3.8/site-packages/numpy/core/_asarray.py", line 83, in asarray
    return array(a, dtype, copy=False, order=order)
  File "/d1/anaconda3/envs/era5/lib/python3.8/site-packages/xarray/core/indexing.py", line 560, in __array__
    return np.asarray(array[self.key], dtype=None)
  File "/d1/anaconda3/envs/era5/lib/python3.8/site-packages/xarray/backends/cfgrib_.py", line 23, in __getitem__
    return indexing.explicit_indexing_adapter(
  File "/d1/anaconda3/envs/era5/lib/python3.8/site-packages/xarray/core/indexing.py", line 848, in explicit_indexing_adapter
    result = NumpyIndexingAdapter(np.asarray(result))[numpy_indices]
  File "/d1/anaconda3/envs/era5/lib/python3.8/site-packages/xarray/core/indexing.py", line 1294, in __getitem__
    return array[key]
IndexError: too many indices for array: array is 2-dimensional, but 3 were indexed

What you expected to happen: I expect the dataset to be reduced in the x and y (latitude/longitude) dimensions where cond=False from .where() when drop=True.

Minimal Complete Verifiable Example:

```python

# Put your MCVE code here

```

Anything else we need to know?: I was able to confirm cfgrib is where the issue lies by doing the following:

ds = xr.open_dataset(gribfile_path,engine='cfgrib',backend_kwargs={'filter_by_keys':{'typeOfLevel':'hybrid'}})
ds.to_netcdf('test.nc')
dsnc = xr.open_dataset('test.nc')
ds_sub = dsnc.where((dsnc.latitude>=40.0)&(dsnc.latitude<=50.0)&(dsnc.longitude>=252.0)&(dsnc.longitude<=280.0),drop=True)

That correctly gives me: Dimensions: (hybrid: 50, x: 792, y: 414)

Originally x=1799 and y=1059.
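The expected reduction from (x=1799, y=1059) to (x=792, y=414) follows from how drop=True is documented: coordinate labels that only correspond to False values of the condition are dropped. The plain-Python sketch below imitates that rule on a tiny 4x4 grid with no xarray dependency. The helper where_drop and the toy latitude, longitude, and data values are made up for illustration; it is not xarray's implementation.

```python
import math


def where_drop(values, cond):
    """Mask values where cond is False, then drop rows/columns that are all False."""
    keep_rows = [i for i, row in enumerate(cond) if any(row)]
    keep_cols = [j for j in range(len(cond[0])) if any(row[j] for row in cond)]
    return [
        [values[i][j] if cond[i][j] else math.nan for j in keep_cols]
        for i in keep_rows
    ]


# Toy 2D latitude/longitude grids (4 rows by 4 columns), hypothetical values.
lat_rows = [38.0, 42.0, 46.0, 52.0]
lon_cols = [250.0, 255.0, 270.0, 285.0]
lat = [[lat_rows[i]] * 4 for i in range(4)]
lon = [list(lon_cols) for _ in range(4)]

# Toy data values: 0..15 laid out row by row.
values = [[r * 4 + c for c in range(4)] for r in range(4)]

# Same bounding-box condition as in the report.
cond = [
    [40.0 <= lat[i][j] <= 50.0 and 252.0 <= lon[i][j] <= 280.0 for j in range(4)]
    for i in range(4)
]

subset = where_drop(values, cond)
shape = (len(subset), len(subset[0]))
print(shape, subset)
```

Only the two middle rows and two middle columns contain a True value, so the 4x4 grid shrinks to 2x2, which is the same kind of reduction the netCDF round trip produced.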

Environment:

Output of xr.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.8.6 | packaged by conda-forge | (default, Oct 7 2020, 19:08:05) [GCC 7.5.0]
python-bits: 64
OS: Linux
OS-release: 4.9.0-14-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.6
libnetcdf: 4.7.4

xarray: 0.16.1
pandas: 1.1.3
numpy: 1.19.1
scipy: 1.5.2
netCDF4: 1.5.5.1
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: 1.2.1
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: 0.9.8.5
iris: None
bottleneck: None
dask: 2.30.0
distributed: 2.30.0
matplotlib: 3.3.2
cartopy: 0.17.0
seaborn: None
numbagg: None
pint: 0.16.1
setuptools: 49.6.0.post20200917
pip: 20.2.3
conda: None
pytest: None
IPython: None
sphinx: None
reactions: 0 · state_reason: completed · repo: xarray (13221727) · type: issue

CREATE TABLE [issues] (
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [number] INTEGER,
   [title] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [state] TEXT,
   [locked] INTEGER,
   [assignee] INTEGER REFERENCES [users]([id]),
   [milestone] INTEGER REFERENCES [milestones]([id]),
   [comments] INTEGER,
   [created_at] TEXT,
   [updated_at] TEXT,
   [closed_at] TEXT,
   [author_association] TEXT,
   [active_lock_reason] TEXT,
   [draft] INTEGER,
   [pull_request] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [state_reason] TEXT,
   [repo] INTEGER REFERENCES [repos]([id]),
   [type] TEXT
);
CREATE INDEX [idx_issues_repo]
    ON [issues] ([repo]);
CREATE INDEX [idx_issues_milestone]
    ON [issues] ([milestone]);
CREATE INDEX [idx_issues_assignee]
    ON [issues] ([assignee]);
CREATE INDEX [idx_issues_user]
    ON [issues] ([user]);
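
The filter described at the top of this page (type = "issue", user = 18426352, sorted by updated_at descending) can be reproduced against this schema with Python's built-in sqlite3 module. The sketch below uses an in-memory database and a trimmed copy of the table with only the columns the query needs; the third row is a hypothetical placeholder added to show that the type filter excludes it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Trimmed version of the issues table: only the columns used by the query.
conn.execute(
    """
    CREATE TABLE [issues] (
       [id] INTEGER PRIMARY KEY,
       [number] INTEGER,
       [title] TEXT,
       [user] INTEGER,
       [state] TEXT,
       [updated_at] TEXT,
       [type] TEXT
    )
    """
)

conn.executemany(
    "INSERT INTO [issues] VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        (1391738128, 7109,
         "Multiprocessing unable to pickle Dataset opened with open_mfdataset",
         18426352, "closed", "2022-10-11T16:44:36Z", "issue"),
        (774553196, 4733,
         "Xarray with cfgrib backend errors with .where() when drop=True",
         18426352, "closed", "2021-01-02T08:17:37Z", "issue"),
        # Hypothetical placeholder row, not taken from the page.
        (1, 1, "Placeholder pull request", 18426352, "open",
         "2023-01-01T00:00:00Z", "pull"),
    ],
)

# ISO 8601 timestamps stored as TEXT sort correctly as plain strings.
rows = conn.execute(
    "SELECT [number], [updated_at] FROM [issues] "
    "WHERE [type] = ? AND [user] = ? ORDER BY [updated_at] DESC",
    ("issue", 18426352),
).fetchall()

numbers = [number for number, _ in rows]
print(numbers)
```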
Powered by Datasette · Queries took 1043.547ms · About: xarray-datasette