id,node_id,number,title,user,state,locked,assignee,milestone,comments,created_at,updated_at,closed_at,author_association,active_lock_reason,draft,pull_request,body,reactions,performed_via_github_app,state_reason,repo,type 1379372915,I_kwDOAMm_X85SN49z,7059,pandas.errors.InvalidIndexError raised when running computation in parallel using dask,691772,open,0,,,8,2022-09-20T12:52:16Z,2024-03-02T16:43:15Z,,CONTRIBUTOR,,,,"### What happened? I'm doing a computation using chunks and `map_blocks()` to run things in parallel. At some point a `pandas.errors.InvalidIndexError` is raised. When using dask's synchronous scheduler, everything works fine. I think `pandas.core.indexes.base.Index` is not thread-safe. At least this seems to be the place of the race condition. See further tests below. (This issue was initially discussed in #6816, but the ticket was closed, because I couldn't reproduce the problem any longer. Now it seems to be reproducible in every run, so it is time for a proper bug report, which is this ticket here.) ### What did you expect to happen? Dask schedulers `single-threaded` and `threads` should have the same result. ### Minimal Complete Verifiable Example 1 *Edit:* I've managed to reduce the verifiable example, see example 2 below. ```Python # I wasn't able to reproduce the issue with a smaller code example, so I provide all my code and my test data. This should make it possible to reproduce the issue in less than a minute. # Requirements: # - git # - mamba, see https://github.com/mamba-org/mamba git clone https://github.com/lumbric/reproduce_invalidindexerror.git cd reproduce_invalidindexerror mamba env create -f env.yml # alternatively run the following, will install latest versions from conda-forge: # conda create -n reproduce_invalidindexerror # conda activate reproduce_invalidindexerror # mamba install -c conda-forge python=3.8 matplotlib pytest-cov dask openpyxl pytest pip xarray netcdf4 jupyter pandas scipy flake8 dvc pre-commit pyarrow statsmodels rasterio scikit-learn pytest-watch pdbpp black seaborn conda activate reproduce_invalidindexerror dvc repro checks_simulation ``` ### Minimal Complete Verifiable Example 2 ```Python import numpy as np import pandas as pd import xarray as xr from multiprocessing import Lock from dask.diagnostics import ProgressBar # Workaround for xarray#6816: Parallel execution causes often an InvalidIndexError # https://github.com/pydata/xarray/issues/6816#issuecomment-1243864752 # import dask # dask.config.set(scheduler=""single-threaded"") def generate_netcdf_files(): fnames = [f""{i:02d}.nc"" for i in range(21)] for i, fname in enumerate(fnames): xr.DataArray( np.ones((3879, 48)), dims=(""locations"", ""time""), coords={ ""time"": pd.date_range(f""{2000 + i}-01-01"", periods=48, freq=""D""), ""locations"": np.arange(3879), }, ).to_netcdf(fname) return fnames def compute(locations, data): def resample_annually(data): return data.sortby(""time"").resample(time=""1A"", label=""left"", loffset=""1D"").mean(dim=""time"") def worker(data): locations_chunk = locations.sel(locations=data.locations) out_raw = data * locations_chunk out = resample_annually(out_raw) return out template = resample_annually(data) out = xr.map_blocks( lambda data: worker(data).compute().chunk({""time"": None}), data, template=template, ) return out def main(): fnames = generate_netcdf_files() locations = xr.DataArray( np.ones(3879), dims=""locations"", coords={""locations"": np.arange(3879)}, ) data = xr.open_mfdataset( fnames, combine=""by_coords"", chunks={""locations"": 4000, ""time"": None}, # suggested as solution in # lock=Lock(), ).__xarray_dataarray_variable__ out = compute(locations, data) with ProgressBar(): out = out.compute() if __name__ == ""__main__"": main() ``` ### MVCE confirmation - [X] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray. - [x] Complete example — the example is self-contained, including all data and the text of any traceback. - [x] Verifiable example — the example copy & pastes into an IPython prompt or [Binder notebook](https://mybinder.org/v2/gh/pydata/xarray/main?urlpath=lab/tree/doc/examples/blank_template.ipynb), returning the result. - [X] New issue — a search of GitHub Issues suggests this is not a duplicate. ### Relevant log output This is the traceback of ""Minimal Complete Verifiable Example 1"". ```Python Traceback (most recent call last): File ""scripts/calc_p_out_model.py"", line 61, in main() File ""scripts/calc_p_out_model.py"", line 31, in main calc_power(name=""p_out_model"", compute_func=compute_func) File ""/tmp/reproduce_invalidindexerror/src/wind_power.py"", line 136, in calc_power power = power.compute() File ""/opt/miniconda3/envs/reproduce_invalidindexerror/lib/python3.8/site-packages/xarray/core/dataarray.py"", line 993, in compute return new.load(**kwargs) File ""/opt/miniconda3/envs/reproduce_invalidindexerror/lib/python3.8/site-packages/xarray/core/dataarray.py"", line 967, in load ds = self._to_temp_dataset().load(**kwargs) File ""/opt/miniconda3/envs/reproduce_invalidindexerror/lib/python3.8/site-packages/xarray/core/dataset.py"", line 733, in load evaluated_data = da.compute(*lazy_data.values(), **kwargs) File ""/opt/miniconda3/envs/reproduce_invalidindexerror/lib/python3.8/site-packages/dask/base.py"", line 600, in compute results = schedule(dsk, keys, **kwargs) File ""/opt/miniconda3/envs/reproduce_invalidindexerror/lib/python3.8/site-packages/dask/threaded.py"", line 89, in get results = get_async( File ""/opt/miniconda3/envs/reproduce_invalidindexerror/lib/python3.8/site-packages/dask/local.py"", line 511, in get_async raise_exception(exc, tb) File ""/opt/miniconda3/envs/reproduce_invalidindexerror/lib/python3.8/site-packages/dask/local.py"", line 319, in reraise raise exc File ""/opt/miniconda3/envs/reproduce_invalidindexerror/lib/python3.8/site-packages/dask/local.py"", line 224, in execute_task result = _execute_task(task, data) File ""/opt/miniconda3/envs/reproduce_invalidindexerror/lib/python3.8/site-packages/dask/core.py"", line 119, in _execute_task return func(*(_execute_task(a, cache) for a in args)) File ""/opt/miniconda3/envs/reproduce_invalidindexerror/lib/python3.8/site-packages/dask/core.py"", line 119, in return func(*(_execute_task(a, cache) for a in args)) File ""/opt/miniconda3/envs/reproduce_invalidindexerror/lib/python3.8/site-packages/dask/core.py"", line 119, in _execute_task return func(*(_execute_task(a, cache) for a in args)) File ""/opt/miniconda3/envs/reproduce_invalidindexerror/lib/python3.8/site-packages/xarray/core/parallel.py"", line 285, in _wrapper result = func(*converted_args, **kwargs) File ""/tmp/reproduce_invalidindexerror/src/wind_power.py"", line 100, in lambda wind_speeds: worker(wind_speeds).compute().chunk({""time"": None}), File ""/tmp/reproduce_invalidindexerror/src/wind_power.py"", line 50, in worker specific_power_chunk = specific_power.sel(turbines=wind_speeds.turbines) File ""/opt/miniconda3/envs/reproduce_invalidindexerror/lib/python3.8/site-packages/xarray/core/dataarray.py"", line 1420, in sel ds = self._to_temp_dataset().sel( File ""/opt/miniconda3/envs/reproduce_invalidindexerror/lib/python3.8/site-packages/xarray/core/dataset.py"", line 2533, in sel query_results = map_index_queries( File ""/opt/miniconda3/envs/reproduce_invalidindexerror/lib/python3.8/site-packages/xarray/core/indexing.py"", line 183, in map_index_queries results.append(index.sel(labels, **options)) # type: ignore[call-arg] File ""/opt/miniconda3/envs/reproduce_invalidindexerror/lib/python3.8/site-packages/xarray/core/indexes.py"", line 418, in sel indexer = get_indexer_nd(self.index, label_array, method, tolerance) File ""/opt/miniconda3/envs/reproduce_invalidindexerror/lib/python3.8/site-packages/xarray/core/indexes.py"", line 212, in get_indexer_nd flat_indexer = index.get_indexer(flat_labels, method=method, tolerance=tolerance) File ""/opt/miniconda3/envs/reproduce_invalidindexerror/lib/python3.8/site-packages/pandas/core/indexes/base.py"", line 3729, in get_indexer raise InvalidIndexError(self._requires_unique_msg) pandas.errors.InvalidIndexError: Reindexing only valid with uniquely valued Index objects ``` ### Anything else we need to know? ### Workaround: Use synchronous dask scheduler The issue does not occur if I use the synchronous dask scheduler by adding at the very beginning of my script: `dask.config.set(scheduler='single-threaded')` ### Additional debugging print If I add the following debugging print to the pandas code: ``` --- /tmp/base.py 2022-09-12 16:35:53.739971953 +0200 +++ /opt/miniconda3/envs/reproduce_invalidindexerror/lib/python3.8/site-packages/pandas/core/indexes/base.py 2022-09-12 16:35:58.864144801 +0200 @@ -3718,7 +3718,6 @@ self._check_indexing_method(method, limit, tolerance) if not self._index_as_unique: + print(""Original: "", len(self), "", length of set:"", len(set(self))) raise InvalidIndexError(self._requires_unique_msg) if len(target) == 0 ``` ...I get the following output: ``` Original: 3879 , length of set: 3879 ``` So the index seems to be unique, but `self.is_unique` is `False` for some reason (note that `not self._index_as_unique` and `self.is_unique` is the same in this case). ### Proof of race condtion: addd sleep 1s To confirm that the race condition is at this point we wait for 1s and then check again for uniqueness: ``` --- /tmp/base.py 2022-09-12 16:35:53.739971953 +0200 +++ /opt/miniconda3/envs/reproduce_invalidindexerror/lib/python3.8/site-packages/pandas/core/indexes/base.py 2022-09-12 16:35:58.864144801 +0200 @@ -3718,7 +3718,10 @@ self._check_indexing_method(method, limit, tolerance) if not self._index_as_unique: + if not self.is_unique: + import time + time.sleep(1) + print(""now unique?"", self.is_unique) raise InvalidIndexError(self._requires_unique_msg) ``` This outputs: ``` now unique? True ``` ### Environment
INSTALLED VERSIONS ------------------ commit: None python: 3.8.10 (default, Jun 22 2022, 20:18:18) [GCC 9.4.0] python-bits: 64 OS: Linux OS-release: 5.4.0-125-generic machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: en_US.UTF-8 libhdf5: 1.10.4 libnetcdf: 4.7.3 xarray: 0.15.0 pandas: 0.25.3 numpy: 1.17.4 scipy: 1.3.3 netCDF4: 1.5.3 pydap: None h5netcdf: 0.7.1 h5py: 2.10.0 Nio: None zarr: 2.4.0+ds cftime: 1.1.0 nc_time_axis: None PseudoNetCDF: None rasterio: 1.1.3 cfgrib: None iris: None bottleneck: 1.2.1 dask: 2.8.1+dfsg distributed: None matplotlib: 3.1.2 cartopy: None seaborn: 0.10.0 numbagg: None setuptools: 45.2.0 pip3: None conda: None pytest: 4.6.9 IPython: 7.13.0 sphinx: 1.8.5
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/7059/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,issue