id,node_id,number,title,user,state,locked,assignee,milestone,comments,created_at,updated_at,closed_at,author_association,active_lock_reason,draft,pull_request,body,reactions,performed_via_github_app,state_reason,repo,type 1379372915,I_kwDOAMm_X85SN49z,7059,pandas.errors.InvalidIndexError raised when running computation in parallel using dask,691772,open,0,,,8,2022-09-20T12:52:16Z,2024-03-02T16:43:15Z,,CONTRIBUTOR,,,,"### What happened? I'm doing a computation using chunks and `map_blocks()` to run things in parallel. At some point a `pandas.errors.InvalidIndexError` is raised. When using dask's synchronous scheduler, everything works fine. I think `pandas.core.indexes.base.Index` is not thread-safe. At least this seems to be the place of the race condition. See further tests below. (This issue was initially discussed in #6816, but the ticket was closed, because I couldn't reproduce the problem any longer. Now it seems to be reproducible in every run, so it is time for a proper bug report, which is this ticket here.) ### What did you expect to happen? Dask schedulers `single-threaded` and `threads` should have the same result. ### Minimal Complete Verifiable Example 1 *Edit:* I've managed to reduce the verifiable example, see example 2 below. ```Python # I wasn't able to reproduce the issue with a smaller code example, so I provide all my code and my test data. This should make it possible to reproduce the issue in less than a minute. 
# Requirements: # - git # - mamba, see https://github.com/mamba-org/mamba git clone https://github.com/lumbric/reproduce_invalidindexerror.git cd reproduce_invalidindexerror mamba env create -f env.yml # alternatively run the following, will install latest versions from conda-forge: # conda create -n reproduce_invalidindexerror # conda activate reproduce_invalidindexerror # mamba install -c conda-forge python=3.8 matplotlib pytest-cov dask openpyxl pytest pip xarray netcdf4 jupyter pandas scipy flake8 dvc pre-commit pyarrow statsmodels rasterio scikit-learn pytest-watch pdbpp black seaborn conda activate reproduce_invalidindexerror dvc repro checks_simulation ``` ### Minimal Complete Verifiable Example 2 ```Python import numpy as np import pandas as pd import xarray as xr from multiprocessing import Lock from dask.diagnostics import ProgressBar # Workaround for xarray#6816: Parallel execution causes often an InvalidIndexError # https://github.com/pydata/xarray/issues/6816#issuecomment-1243864752 # import dask # dask.config.set(scheduler=""single-threaded"") def generate_netcdf_files(): fnames = [f""{i:02d}.nc"" for i in range(21)] for i, fname in enumerate(fnames): xr.DataArray( np.ones((3879, 48)), dims=(""locations"", ""time""), coords={ ""time"": pd.date_range(f""{2000 + i}-01-01"", periods=48, freq=""D""), ""locations"": np.arange(3879), }, ).to_netcdf(fname) return fnames def compute(locations, data): def resample_annually(data): return data.sortby(""time"").resample(time=""1A"", label=""left"", loffset=""1D"").mean(dim=""time"") def worker(data): locations_chunk = locations.sel(locations=data.locations) out_raw = data * locations_chunk out = resample_annually(out_raw) return out template = resample_annually(data) out = xr.map_blocks( lambda data: worker(data).compute().chunk({""time"": None}), data, template=template, ) return out def main(): fnames = generate_netcdf_files() locations = xr.DataArray( np.ones(3879), dims=""locations"", coords={""locations"": 
np.arange(3879)}, ) data = xr.open_mfdataset( fnames, combine=""by_coords"", chunks={""locations"": 4000, ""time"": None}, # suggested as solution in # lock=Lock(), ).__xarray_dataarray_variable__ out = compute(locations, data) with ProgressBar(): out = out.compute() if __name__ == ""__main__"": main() ``` ### MVCE confirmation - [X] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray. - [x] Complete example — the example is self-contained, including all data and the text of any traceback. - [x] Verifiable example — the example copy & pastes into an IPython prompt or [Binder notebook](https://mybinder.org/v2/gh/pydata/xarray/main?urlpath=lab/tree/doc/examples/blank_template.ipynb), returning the result. - [X] New issue — a search of GitHub Issues suggests this is not a duplicate. ### Relevant log output This is the traceback of ""Minimal Complete Verifiable Example 1"". ```Python Traceback (most recent call last): File ""scripts/calc_p_out_model.py"", line 61, in main() File ""scripts/calc_p_out_model.py"", line 31, in main calc_power(name=""p_out_model"", compute_func=compute_func) File ""/tmp/reproduce_invalidindexerror/src/wind_power.py"", line 136, in calc_power power = power.compute() File ""/opt/miniconda3/envs/reproduce_invalidindexerror/lib/python3.8/site-packages/xarray/core/dataarray.py"", line 993, in compute return new.load(**kwargs) File ""/opt/miniconda3/envs/reproduce_invalidindexerror/lib/python3.8/site-packages/xarray/core/dataarray.py"", line 967, in load ds = self._to_temp_dataset().load(**kwargs) File ""/opt/miniconda3/envs/reproduce_invalidindexerror/lib/python3.8/site-packages/xarray/core/dataset.py"", line 733, in load evaluated_data = da.compute(*lazy_data.values(), **kwargs) File ""/opt/miniconda3/envs/reproduce_invalidindexerror/lib/python3.8/site-packages/dask/base.py"", line 600, in compute results = schedule(dsk, keys, **kwargs) File 
""/opt/miniconda3/envs/reproduce_invalidindexerror/lib/python3.8/site-packages/dask/threaded.py"", line 89, in get results = get_async( File ""/opt/miniconda3/envs/reproduce_invalidindexerror/lib/python3.8/site-packages/dask/local.py"", line 511, in get_async raise_exception(exc, tb) File ""/opt/miniconda3/envs/reproduce_invalidindexerror/lib/python3.8/site-packages/dask/local.py"", line 319, in reraise raise exc File ""/opt/miniconda3/envs/reproduce_invalidindexerror/lib/python3.8/site-packages/dask/local.py"", line 224, in execute_task result = _execute_task(task, data) File ""/opt/miniconda3/envs/reproduce_invalidindexerror/lib/python3.8/site-packages/dask/core.py"", line 119, in _execute_task return func(*(_execute_task(a, cache) for a in args)) File ""/opt/miniconda3/envs/reproduce_invalidindexerror/lib/python3.8/site-packages/dask/core.py"", line 119, in return func(*(_execute_task(a, cache) for a in args)) File ""/opt/miniconda3/envs/reproduce_invalidindexerror/lib/python3.8/site-packages/dask/core.py"", line 119, in _execute_task return func(*(_execute_task(a, cache) for a in args)) File ""/opt/miniconda3/envs/reproduce_invalidindexerror/lib/python3.8/site-packages/xarray/core/parallel.py"", line 285, in _wrapper result = func(*converted_args, **kwargs) File ""/tmp/reproduce_invalidindexerror/src/wind_power.py"", line 100, in lambda wind_speeds: worker(wind_speeds).compute().chunk({""time"": None}), File ""/tmp/reproduce_invalidindexerror/src/wind_power.py"", line 50, in worker specific_power_chunk = specific_power.sel(turbines=wind_speeds.turbines) File ""/opt/miniconda3/envs/reproduce_invalidindexerror/lib/python3.8/site-packages/xarray/core/dataarray.py"", line 1420, in sel ds = self._to_temp_dataset().sel( File ""/opt/miniconda3/envs/reproduce_invalidindexerror/lib/python3.8/site-packages/xarray/core/dataset.py"", line 2533, in sel query_results = map_index_queries( File 
""/opt/miniconda3/envs/reproduce_invalidindexerror/lib/python3.8/site-packages/xarray/core/indexing.py"", line 183, in map_index_queries results.append(index.sel(labels, **options)) # type: ignore[call-arg] File ""/opt/miniconda3/envs/reproduce_invalidindexerror/lib/python3.8/site-packages/xarray/core/indexes.py"", line 418, in sel indexer = get_indexer_nd(self.index, label_array, method, tolerance) File ""/opt/miniconda3/envs/reproduce_invalidindexerror/lib/python3.8/site-packages/xarray/core/indexes.py"", line 212, in get_indexer_nd flat_indexer = index.get_indexer(flat_labels, method=method, tolerance=tolerance) File ""/opt/miniconda3/envs/reproduce_invalidindexerror/lib/python3.8/site-packages/pandas/core/indexes/base.py"", line 3729, in get_indexer raise InvalidIndexError(self._requires_unique_msg) pandas.errors.InvalidIndexError: Reindexing only valid with uniquely valued Index objects ``` ### Anything else we need to know? ### Workaround: Use synchronous dask scheduler The issue does not occur if I use the synchronous dask scheduler by adding at the very beginning of my script: `dask.config.set(scheduler='single-threaded')` ### Additional debugging print If I add the following debugging print to the pandas code: ``` --- /tmp/base.py 2022-09-12 16:35:53.739971953 +0200 +++ /opt/miniconda3/envs/reproduce_invalidindexerror/lib/python3.8/site-packages/pandas/core/indexes/base.py 2022-09-12 16:35:58.864144801 +0200 @@ -3718,7 +3718,6 @@ self._check_indexing_method(method, limit, tolerance) if not self._index_as_unique: + print(""Original: "", len(self), "", length of set:"", len(set(self))) raise InvalidIndexError(self._requires_unique_msg) if len(target) == 0 ``` ...I get the following output: ``` Original: 3879 , length of set: 3879 ``` So the index seems to be unique, but `self.is_unique` is `False` for some reason (note that `not self._index_as_unique` and `self.is_unique` is the same in this case). 
### Proof of race condition: add a 1s sleep To confirm that the race condition is at this point, we wait for 1s and then check again for uniqueness: ``` --- /tmp/base.py 2022-09-12 16:35:53.739971953 +0200 +++ /opt/miniconda3/envs/reproduce_invalidindexerror/lib/python3.8/site-packages/pandas/core/indexes/base.py 2022-09-12 16:35:58.864144801 +0200 @@ -3718,7 +3718,11 @@ self._check_indexing_method(method, limit, tolerance) if not self._index_as_unique: + if not self.is_unique: + import time + time.sleep(1) + print(""now unique?"", self.is_unique) raise InvalidIndexError(self._requires_unique_msg) ``` This outputs: ``` now unique? True ``` ### Environment
INSTALLED VERSIONS ------------------ commit: None python: 3.8.10 (default, Jun 22 2022, 20:18:18) [GCC 9.4.0] python-bits: 64 OS: Linux OS-release: 5.4.0-125-generic machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: en_US.UTF-8 libhdf5: 1.10.4 libnetcdf: 4.7.3 xarray: 0.15.0 pandas: 0.25.3 numpy: 1.17.4 scipy: 1.3.3 netCDF4: 1.5.3 pydap: None h5netcdf: 0.7.1 h5py: 2.10.0 Nio: None zarr: 2.4.0+ds cftime: 1.1.0 nc_time_axis: None PseudoNetCDF: None rasterio: 1.1.3 cfgrib: None iris: None bottleneck: 1.2.1 dask: 2.8.1+dfsg distributed: None matplotlib: 3.1.2 cartopy: None seaborn: 0.10.0 numbagg: None setuptools: 45.2.0 pip3: None conda: None pytest: 4.6.9 IPython: 7.13.0 sphinx: 1.8.5
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/7059/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,issue 1315111684,I_kwDOAMm_X85OYwME,6816,pandas.errors.InvalidIndexError is raised in some runs when using chunks and map_blocks(),691772,closed,0,,,5,2022-07-22T14:56:41Z,2022-09-13T09:39:48Z,2022-08-19T14:06:09Z,CONTRIBUTOR,,,,"### What is your issue? I'm doing a lengthy computation, which involves hundreds of GB of data using chunks and map_blocks() so that things fit into RAM and can be done in parallel. From time to time, the following error is raised: `pandas.errors.InvalidIndexError: Reindexing only valid with uniquely valued Index objects` The line where this takes place looks pretty harmless: x = a * b.sel(c=d.c) It's a line inside the function `func` which is passed to a `map_blocks()` call. In this case `a` and `b` are `xr.DataArray` or `xr.DataSet` objects shadowed from outer scope and `d` is the parameter `obj` for `map_blocks()`. That means, the line below in the traceback looks like this: xr.map_blocks( lambda d: worker(d).compute().chunk({""time"": None}), d, template=template) I guess it's some kind of race condition, since it's not 100% reproducible, but I have no idea how to further investigate the issue to create a proper bug report or fix my code. Do you have any hint how I could continue building a minimal example or so in such a case? 
What does the error message want to tell me?","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/6816/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue 467494277,MDExOlB1bGxSZXF1ZXN0Mjk3MTM2MjEz,3104,Fix minor typos in documentation,691772,closed,0,,,2,2019-07-12T16:13:15Z,2019-07-12T16:53:28Z,2019-07-12T16:51:54Z,CONTRIBUTOR,,0,pydata/xarray/pulls/3104,,"{""url"": ""https://api.github.com/repos/pydata/xarray/issues/3104/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,pull 467482848,MDExOlB1bGxSZXF1ZXN0Mjk3MTI2ODgw,3103,Add missing assert to unit test,691772,closed,0,,,1,2019-07-12T15:46:20Z,2019-07-12T16:35:16Z,2019-07-12T16:35:16Z,CONTRIBUTOR,,0,pydata/xarray/pulls/3103,Stumbled upon a unit test which didn't test anything.,"{""url"": ""https://api.github.com/repos/pydata/xarray/issues/3103/reactions"", ""total_count"": 1, ""+1"": 1, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,pull 438389323,MDU6SXNzdWU0MzgzODkzMjM=,2928,"Dask outputs warning: ""The da.atop function has moved to da.blockwise""",691772,closed,0,,,4,2019-04-29T15:59:31Z,2019-07-12T15:56:29Z,2019-07-12T15:56:28Z,CONTRIBUTOR,,,,"#### Problem description [dask 1.1.0](https://github.com/dask/dask/pull/4348) moved `atop()` to `blockwise()` and introduced a warning when `atop()` is used. 
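Until xarray itself switches to `blockwise()`, a caller can silence just this one message using the standard library `warnings` filters. This is only a sketch of a stopgap (the helper name `quiet_atop` is made up here, not something xarray or dask provides):

```python
# Possible stopgap: run a function while ignoring only the specific dask
# relocation warning, so that any other warnings still surface normally.
import warnings

def quiet_atop(func, *args, **kwargs):
    with warnings.catch_warnings():
        warnings.filterwarnings(
            'ignore', message='The da.atop function has moved to da.blockwise'
        )
        return func(*args, **kwargs)
```

For example: `quiet_atop(xr.apply_ufunc, lambda x: 42 * x, d, dask='parallelized', output_dtypes=[np.float64])`.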
#### Related * upstream ticket and PR of the dask change: dask/dask#4348 dask/dask#4035 * the warning shows up in the [dask documentation](https://examples.dask.org/xarray.html#Custom-workflows-and-automatic-parallelization) in an xarray example, probably not on purpose * warnings have already been discussed in #2727, but not fixed there * same issue in a different project: pytroll/satpy#608 #### Code Sample ```python import numpy as np import xarray as xr d = xr.DataArray(np.ones(1000)) d.to_netcdf('/tmp/ones.nc') d = xr.open_dataarray('/tmp/ones.nc', chunks=10) xr.apply_ufunc(lambda x: 42 * x, d, dask='parallelized', output_dtypes=[np.float64]) ``` This outputs the warning: ``` ...lib/python3.7/site-packages/dask/array/blockwise.py:204: UserWarning: The da.atop function has moved to da.blockwise warnings.warn(""The da.atop function has moved to da.blockwise"") ``` #### Expected Output No warning. As a user of recent versions of dask and xarray, there shouldn't be any warnings if everything is done right. The warning should be tackled inside xarray somehow. #### Solution Not sure; can xarray drop compatibility with dask <1.1.0 in some future version? Otherwise I guess there needs to be some legacy code in xarray which calls the right function. #### Output of ``xr.show_versions()``
INSTALLED VERSIONS ------------------ commit: None python: 3.7.3 (default, Mar 27 2019, 22:11:17) [GCC 7.3.0] python-bits: 64 OS: Linux OS-release: 4.18.0-17-generic machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: en_US.UTF-8 libhdf5: 1.10.4 libnetcdf: 4.6.1 xarray: 0.12.1 pandas: 0.24.2 numpy: 1.16.3 scipy: 1.2.1 netCDF4: 1.4.2 pydap: None h5netcdf: None h5py: None Nio: None zarr: None cftime: 1.0.3.4 nc_time_axis: None PseudonetCDF: None rasterio: None cfgrib: None iris: None bottleneck: None dask: 1.2.0 distributed: 1.27.0 matplotlib: 3.0.3 cartopy: None seaborn: 0.9.0 setuptools: 41.0.0 pip: 19.1 conda: None pytest: 4.4.1 IPython: 7.5.0 sphinx: None
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/2928/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue 434444058,MDExOlB1bGxSZXF1ZXN0MjcxNDM1NjU4,2904,Minor improvement of docstring for Dataset,691772,closed,0,,,6,2019-04-17T19:16:50Z,2019-04-17T20:09:26Z,2019-04-17T20:08:46Z,CONTRIBUTOR,,0,pydata/xarray/pulls/2904,"This might help to avoid confusion. data_vars is always a mapping, not a mapping, a variable or a tuple. Passing just a tuple, does not work of course. But for xarray newbies, this might be less obvious and the error message is also not easy to interpret: ``` >>> xr.Dataset(('dim1', np.ones(5))) ... TypeError: unhashable type: 'numpy.ndarray' ``` The correct version of the example above should be: ``` >>> xr.Dataset({'myvar': ('dim1', np.ones(5))}) Dimensions: (dim1: 5) Dimensions without coordinates: dim1 Data variables: myvar (dim1) float64 1.0 1.0 1.0 1.0 1.0 ```","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/2904/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,pull 434439562,MDExOlB1bGxSZXF1ZXN0MjcxNDMyMTc5,2903,Fix minor typos in docstrings,691772,closed,0,,,1,2019-04-17T19:05:47Z,2019-04-17T19:15:10Z,2019-04-17T19:15:10Z,CONTRIBUTOR,,0,pydata/xarray/pulls/2903,"See also pull-request #2860 - the same typo was at many places. 
Sorry, I have missed the other places when sending the first PR.","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/2903/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,pull 427604384,MDExOlB1bGxSZXF1ZXN0MjY2MTYyNTQw,2860,Fix minor typo in docstring,691772,closed,0,,,1,2019-04-01T09:35:02Z,2019-04-01T11:18:40Z,2019-04-01T11:18:29Z,CONTRIBUTOR,,0,pydata/xarray/pulls/2860,,"{""url"": ""https://api.github.com/repos/pydata/xarray/issues/2860/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,pull 389685381,MDExOlB1bGxSZXF1ZXN0MjM3NjE4MzYx,2598,Fix wrong error message in interp(),691772,closed,0,,,2,2018-12-11T10:09:53Z,2018-12-11T19:29:03Z,2018-12-11T19:29:03Z,CONTRIBUTOR,,0,pydata/xarray/pulls/2598,"This is just a minor fix of a wrong error message. Please let me know if you think that this is worth testing in unit tests. Before: ``` >>> import xarray as xr >>> d = xr.DataArray([1,2,3]) >>> d.interp(1) ... ValueError: the first argument to .rename must be a dictionary ``` After: ``` >>> import xarray as xr >>> d = xr.DataArray([1,2,3]) >>> d.interp(1) ... ValueError: the first argument to .interp must be a dictionary ```","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/2598/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,pull