issues
3 rows where repo = 13221727, type = "issue" and user = 691772 sorted by updated_at descending
id | node_id | number | title | user | state | locked | assignee | milestone | comments | created_at | updated_at ▲ | closed_at | author_association | active_lock_reason | draft | pull_request | body | reactions | performed_via_github_app | state_reason | repo | type |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1379372915 | I_kwDOAMm_X85SN49z | 7059 | pandas.errors.InvalidIndexError raised when running computation in parallel using dask | lumbric 691772 | open | 0 | 8 | 2022-09-20T12:52:16Z | 2024-03-02T16:43:15Z | CONTRIBUTOR |
**What happened?**

I'm doing a computation using chunks and `map_blocks()`. (This issue was initially discussed in #6816, but that ticket was closed because I couldn't reproduce the problem any longer. Now it seems to be reproducible in every run, so it is time for a proper bug report, which is this ticket.)

**What did you expect to happen?**

Dask schedulers should work in parallel without raising an `InvalidIndexError`, just as the synchronous scheduler does.

**Minimal Complete Verifiable Example 1**

Edit: I've managed to reduce the verifiable example, see example 2 below.

I wasn't able to reproduce the issue with a smaller code example, so I provide all my code and my test data. This should make it possible to reproduce the issue in less than a minute.

Requirements:
- git
- mamba, see https://github.com/mamba-org/mamba

```bash
git clone https://github.com/lumbric/reproduce_invalidindexerror.git
cd reproduce_invalidindexerror

mamba env create -f env.yml

# alternatively run the following, will install latest versions from conda-forge:
conda create -n reproduce_invalidindexerror
conda activate reproduce_invalidindexerror
mamba install -c conda-forge python=3.8 matplotlib pytest-cov dask openpyxl pytest pip xarray netcdf4 jupyter pandas scipy flake8 dvc pre-commit pyarrow statsmodels rasterio scikit-learn pytest-watch pdbpp black seaborn

conda activate reproduce_invalidindexerror
dvc repro checks_simulation
```

**Minimal Complete Verifiable Example 2**

```python
import numpy as np
import pandas as pd
import xarray as xr
from multiprocessing import Lock
from dask.diagnostics import ProgressBar

# Workaround for xarray#6816: parallel execution often causes an InvalidIndexError
# https://github.com/pydata/xarray/issues/6816#issuecomment-1243864752
# import dask
# dask.config.set(scheduler="single-threaded")


def generate_netcdf_files():
    fnames = [f"{i:02d}.nc" for i in range(21)]
    for i, fname in enumerate(fnames):
        xr.DataArray(
            np.ones((3879, 48)),
            dims=("locations", "time"),
            coords={
                "time": pd.date_range(f"{2000 + i}-01-01", periods=48, freq="D"),
                "locations": np.arange(3879),
            },
        ).to_netcdf(fname)
    return fnames


def compute(locations, data):
    def resample_annually(data):
        return data.sortby("time").resample(time="1A", label="left", loffset="1D").mean(dim="time")


def main():
    fnames = generate_netcdf_files()


if __name__ == "__main__":
    main()
```

**MVCE confirmation**
**Relevant log output**

This is the traceback of "Minimal Complete Verifiable Example 1".
**Anything else we need to know?**

**Workaround: Use synchronous dask scheduler**

The issue does not occur if I use the synchronous dask scheduler by adding the following at the very beginning of my script:
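```python
import dask

# the same setting as the commented-out workaround in Example 2 above
dask.config.set(scheduler="single-threaded")
```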
**Additional debugging print**

If I add the following debugging print to the pandas code:

```diff
--- /tmp/base.py	2022-09-12 16:35:53.739971953 +0200
+++ /opt/miniconda3/envs/reproduce_invalidindexerror/lib/python3.8/site-packages/pandas/core/indexes/base.py	2022-09-12 16:35:58.864144801 +0200
@@ -3718,7 +3718,6 @@
         self._check_indexing_method(method, limit, tolerance)
```
So the index seems to be unique, but the `InvalidIndexError` is raised anyway.

**Proof of race condition: add sleep 1s**

To confirm that the race condition is at this point, we wait for 1s and then check again for uniqueness:

```diff
--- /tmp/base.py	2022-09-12 16:35:53.739971953 +0200
+++ /opt/miniconda3/envs/reproduce_invalidindexerror/lib/python3.8/site-packages/pandas/core/indexes/base.py	2022-09-12 16:35:58.864144801 +0200
@@ -3718,7 +3718,10 @@
         self._check_indexing_method(method, limit, tolerance)
```
This outputs:
Environment
INSTALLED VERSIONS
------------------
commit: None
python: 3.8.10 (default, Jun 22 2022, 20:18:18)
[GCC 9.4.0]
python-bits: 64
OS: Linux
OS-release: 5.4.0-125-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.4
libnetcdf: 4.7.3
xarray: 0.15.0
pandas: 0.25.3
numpy: 1.17.4
scipy: 1.3.3
netCDF4: 1.5.3
pydap: None
h5netcdf: 0.7.1
h5py: 2.10.0
Nio: None
zarr: 2.4.0+ds
cftime: 1.1.0
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.1.3
cfgrib: None
iris: None
bottleneck: 1.2.1
dask: 2.8.1+dfsg
distributed: None
matplotlib: 3.1.2
cartopy: None
seaborn: 0.10.0
numbagg: None
setuptools: 45.2.0
pip3: None
conda: None
pytest: 4.6.9
IPython: 7.13.0
sphinx: 1.8.5
|
{ "url": "https://api.github.com/repos/pydata/xarray/issues/7059/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
xarray 13221727 | issue | ||||||||
1315111684 | I_kwDOAMm_X85OYwME | 6816 | pandas.errors.InvalidIndexError is raised in some runs when using chunks and map_blocks() | lumbric 691772 | closed | 0 | 5 | 2022-07-22T14:56:41Z | 2022-09-13T09:39:48Z | 2022-08-19T14:06:09Z | CONTRIBUTOR |
**What is your issue?**

I'm doing a lengthy computation which involves hundreds of GB of data, using chunks and map_blocks() so that things fit into RAM and can be done in parallel. From time to time, the following error is raised:
The line where this takes place looks pretty harmless:
It's a line inside the function I pass to `map_blocks()`. That means the line below in the traceback looks like this:
I guess it's some kind of race condition, since it's not 100% reproducible, but I have no idea how to further investigate the issue to create a proper bug report or fix my code. Do you have any hint on how I could continue building a minimal example in such a case? What does the error message want to tell me? |
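For illustration, a minimal sketch of the kind of setup described above: a chunked DataArray with a user function applied per chunk via `map_blocks()`. The dimensions, sizes, and per-block function are made up and are not the code from the failing pipeline.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Chunked input along "locations"; the sizes are arbitrary placeholders.
da = xr.DataArray(
    np.random.rand(100, 365),
    dims=("locations", "time"),
    coords={
        "locations": np.arange(100),
        "time": pd.date_range("2000-01-01", periods=365, freq="D"),
    },
).chunk({"locations": 10})


def per_block(block):
    # A harmless-looking per-block operation; in the real pipeline the line
    # the traceback points at lives in a function like this one.
    return block - block.mean("time")


result = da.map_blocks(per_block).compute()
```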
{ "url": "https://api.github.com/repos/pydata/xarray/issues/6816/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
completed | xarray 13221727 | issue | ||||||
438389323 | MDU6SXNzdWU0MzgzODkzMjM= | 2928 | Dask outputs warning: "The da.atop function has moved to da.blockwise" | lumbric 691772 | closed | 0 | 4 | 2019-04-29T15:59:31Z | 2019-07-12T15:56:29Z | 2019-07-12T15:56:28Z | CONTRIBUTOR |
**Problem description**

dask 1.1.0 moved `da.atop` to `da.blockwise`.

**Related**
**Code Sample**

```python
import numpy as np
import xarray as xr

xr.DataArray(np.ones(1000))
d = xr.DataArray(np.ones(1000))
d.to_netcdf('/tmp/ones.nc')
d = xr.open_dataarray('/tmp/ones.nc', chunks=10)

xr.apply_ufunc(lambda x: 42 * x, d, dask='parallelized', output_dtypes=[np.float64])
```

This outputs the warning:
**Expected Output**

No warning. As a user of recent versions of dask and xarray, there shouldn't be any warnings if everything is done right. The warning should be tackled inside xarray somehow.

**Solution**

Not sure, can xarray break compatibility with dask <1.1.0 in some future version? Otherwise I guess there needs to be some legacy code in xarray which calls the right function.

**Output of `xr.show_versions()`**
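A sketch of the kind of legacy shim hinted at above, assuming the new name is preferred when available; this is hypothetical code, not xarray's actual fix:

```python
# Hypothetical compatibility shim (not xarray's actual implementation):
# prefer dask.array.blockwise on dask >= 1.1.0 and fall back to atop otherwise.
try:
    from dask.array import blockwise
except ImportError:
    from dask.array import atop as blockwise
```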
|
{ "url": "https://api.github.com/repos/pydata/xarray/issues/2928/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
completed | xarray 13221727 | issue |
```sql
CREATE TABLE [issues] (
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [number] INTEGER,
   [title] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [state] TEXT,
   [locked] INTEGER,
   [assignee] INTEGER REFERENCES [users]([id]),
   [milestone] INTEGER REFERENCES [milestones]([id]),
   [comments] INTEGER,
   [created_at] TEXT,
   [updated_at] TEXT,
   [closed_at] TEXT,
   [author_association] TEXT,
   [active_lock_reason] TEXT,
   [draft] INTEGER,
   [pull_request] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [state_reason] TEXT,
   [repo] INTEGER REFERENCES [repos]([id]),
   [type] TEXT
);
CREATE INDEX [idx_issues_repo] ON [issues] ([repo]);
CREATE INDEX [idx_issues_milestone] ON [issues] ([milestone]);
CREATE INDEX [idx_issues_assignee] ON [issues] ([assignee]);
CREATE INDEX [idx_issues_user] ON [issues] ([user]);
```
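For reference, the listing above ("3 rows where repo = 13221727, type = "issue" and user = 691772 sorted by updated_at descending") can be reproduced against this schema with Python's sqlite3 module; the database filename github.db is an assumption.

```python
import sqlite3

# "github.db" is an assumed filename for the SQLite database holding this table.
conn = sqlite3.connect("github.db")
rows = conn.execute(
    """
    SELECT id, number, title, state, comments, updated_at
    FROM issues
    WHERE repo = 13221727 AND type = 'issue' AND [user] = 691772
    ORDER BY updated_at DESC
    """
).fetchall()
for row in rows:
    print(row)
```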