issues


9 rows where repo = 13221727 and user = 691772 sorted by updated_at descending


id node_id number title user state locked assignee milestone comments created_at updated_at ▲ closed_at author_association active_lock_reason draft pull_request body reactions performed_via_github_app state_reason repo type
1379372915 I_kwDOAMm_X85SN49z 7059 pandas.errors.InvalidIndexError raised when running computation in parallel using dask lumbric 691772 open 0     8 2022-09-20T12:52:16Z 2024-03-02T16:43:15Z   CONTRIBUTOR      

What happened?

I'm doing a computation using chunks and map_blocks() to run things in parallel. At some point a pandas.errors.InvalidIndexError is raised. When using dask's synchronous scheduler, everything works fine. I think pandas.core.indexes.base.Index is not thread-safe. At least this seems to be the place of the race condition. See further tests below.

(This issue was initially discussed in #6816, but that ticket was closed because I couldn't reproduce the problem any longer. Now it seems to be reproducible in every run, so it is time for a proper bug report, which is this ticket.)

What did you expect to happen?

The dask schedulers single-threaded and threads should produce the same result.
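This expectation can be illustrated with a stdlib-only toy model (not dask itself; the worker and chunking here are invented for illustration): a pure, thread-safe worker mapped over chunks should give identical results whether it runs serially or on a thread pool.

```python
from concurrent.futures import ThreadPoolExecutor

def worker(chunk):
    # pure function with no shared mutable state: safe under any scheduler
    return sum(chunk)

chunks = [list(range(i, i + 10)) for i in range(0, 100, 10)]

serial = [worker(c) for c in chunks]            # like dask's "single-threaded" scheduler
with ThreadPoolExecutor(max_workers=4) as ex:   # like dask's "threads" scheduler
    threaded = list(ex.map(worker, chunks))

print(serial == threaded)  # -> True
```

The bug report below shows this equivalence breaking for xarray's `sel()`, which suggests shared mutable state somewhere in the indexing path.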

Minimal Complete Verifiable Example 1

Edit: I've managed to reduce the verifiable example, see example 2 below.

I wasn't able to reproduce the issue with a smaller code example, so I provide all my code and my test data. This should make it possible to reproduce the issue in less than a minute.

Requirements:

- git
- mamba, see https://github.com/mamba-org/mamba

```bash
git clone https://github.com/lumbric/reproduce_invalidindexerror.git
cd reproduce_invalidindexerror

mamba env create -f env.yml

# alternatively run the following, which will install the latest versions from conda-forge:
# conda create -n reproduce_invalidindexerror
# conda activate reproduce_invalidindexerror
# mamba install -c conda-forge python=3.8 matplotlib pytest-cov dask openpyxl pytest pip xarray netcdf4 jupyter pandas scipy flake8 dvc pre-commit pyarrow statsmodels rasterio scikit-learn pytest-watch pdbpp black seaborn

conda activate reproduce_invalidindexerror
dvc repro checks_simulation
```

Minimal Complete Verifiable Example 2

```python
import numpy as np
import pandas as pd
import xarray as xr

from multiprocessing import Lock
from dask.diagnostics import ProgressBar

# Workaround for xarray#6816: parallel execution often causes an InvalidIndexError
# https://github.com/pydata/xarray/issues/6816#issuecomment-1243864752
# import dask
# dask.config.set(scheduler="single-threaded")


def generate_netcdf_files():
    fnames = [f"{i:02d}.nc" for i in range(21)]
    for i, fname in enumerate(fnames):
        xr.DataArray(
            np.ones((3879, 48)),
            dims=("locations", "time"),
            coords={
                "time": pd.date_range(f"{2000 + i}-01-01", periods=48, freq="D"),
                "locations": np.arange(3879),
            },
        ).to_netcdf(fname)
    return fnames


def compute(locations, data):
    def resample_annually(data):
        return (
            data.sortby("time")
            .resample(time="1A", label="left", loffset="1D")
            .mean(dim="time")
        )

    def worker(data):
        locations_chunk = locations.sel(locations=data.locations)
        out_raw = data * locations_chunk
        out = resample_annually(out_raw)
        return out

    template = resample_annually(data)

    out = xr.map_blocks(
        lambda data: worker(data).compute().chunk({"time": None}),
        data,
        template=template,
    )

    return out


def main():
    fnames = generate_netcdf_files()

    locations = xr.DataArray(
        np.ones(3879),
        dims="locations",
        coords={"locations": np.arange(3879)},
    )

    data = xr.open_mfdataset(
        fnames,
        combine="by_coords",
        chunks={"locations": 4000, "time": None},
        # suggested as solution in
        # lock=Lock(),
    ).__xarray_dataarray_variable__

    out = compute(locations, data)

    with ProgressBar():
        out = out.compute()


if __name__ == "__main__":
    main()
```

MVCE confirmation

  • [x] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • [x] Complete example — the example is self-contained, including all data and the text of any traceback.
  • [x] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • [x] New issue — a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

This is the traceback of "Minimal Complete Verifiable Example 1".

```python
Traceback (most recent call last):
  File "scripts/calc_p_out_model.py", line 61, in <module>
    main()
  File "scripts/calc_p_out_model.py", line 31, in main
    calc_power(name="p_out_model", compute_func=compute_func)
  File "/tmp/reproduce_invalidindexerror/src/wind_power.py", line 136, in calc_power
    power = power.compute()
  File "/opt/miniconda3/envs/reproduce_invalidindexerror/lib/python3.8/site-packages/xarray/core/dataarray.py", line 993, in compute
    return new.load(**kwargs)
  File "/opt/miniconda3/envs/reproduce_invalidindexerror/lib/python3.8/site-packages/xarray/core/dataarray.py", line 967, in load
    ds = self._to_temp_dataset().load(**kwargs)
  File "/opt/miniconda3/envs/reproduce_invalidindexerror/lib/python3.8/site-packages/xarray/core/dataset.py", line 733, in load
    evaluated_data = da.compute(*lazy_data.values(), **kwargs)
  File "/opt/miniconda3/envs/reproduce_invalidindexerror/lib/python3.8/site-packages/dask/base.py", line 600, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/opt/miniconda3/envs/reproduce_invalidindexerror/lib/python3.8/site-packages/dask/threaded.py", line 89, in get
    results = get_async(
  File "/opt/miniconda3/envs/reproduce_invalidindexerror/lib/python3.8/site-packages/dask/local.py", line 511, in get_async
    raise_exception(exc, tb)
  File "/opt/miniconda3/envs/reproduce_invalidindexerror/lib/python3.8/site-packages/dask/local.py", line 319, in reraise
    raise exc
  File "/opt/miniconda3/envs/reproduce_invalidindexerror/lib/python3.8/site-packages/dask/local.py", line 224, in execute_task
    result = _execute_task(task, data)
  File "/opt/miniconda3/envs/reproduce_invalidindexerror/lib/python3.8/site-packages/dask/core.py", line 119, in _execute_task
    return func(*(_execute_task(a, cache) for a in args))
  File "/opt/miniconda3/envs/reproduce_invalidindexerror/lib/python3.8/site-packages/dask/core.py", line 119, in <genexpr>
    return func(*(_execute_task(a, cache) for a in args))
  File "/opt/miniconda3/envs/reproduce_invalidindexerror/lib/python3.8/site-packages/dask/core.py", line 119, in _execute_task
    return func(*(_execute_task(a, cache) for a in args))
  File "/opt/miniconda3/envs/reproduce_invalidindexerror/lib/python3.8/site-packages/xarray/core/parallel.py", line 285, in _wrapper
    result = func(*converted_args, **kwargs)
  File "/tmp/reproduce_invalidindexerror/src/wind_power.py", line 100, in <lambda>
    lambda wind_speeds: worker(wind_speeds).compute().chunk({"time": None}),
  File "/tmp/reproduce_invalidindexerror/src/wind_power.py", line 50, in worker
    specific_power_chunk = specific_power.sel(turbines=wind_speeds.turbines)
  File "/opt/miniconda3/envs/reproduce_invalidindexerror/lib/python3.8/site-packages/xarray/core/dataarray.py", line 1420, in sel
    ds = self._to_temp_dataset().sel(
  File "/opt/miniconda3/envs/reproduce_invalidindexerror/lib/python3.8/site-packages/xarray/core/dataset.py", line 2533, in sel
    query_results = map_index_queries(
  File "/opt/miniconda3/envs/reproduce_invalidindexerror/lib/python3.8/site-packages/xarray/core/indexing.py", line 183, in map_index_queries
    results.append(index.sel(labels, **options))  # type: ignore[call-arg]
  File "/opt/miniconda3/envs/reproduce_invalidindexerror/lib/python3.8/site-packages/xarray/core/indexes.py", line 418, in sel
    indexer = get_indexer_nd(self.index, label_array, method, tolerance)
  File "/opt/miniconda3/envs/reproduce_invalidindexerror/lib/python3.8/site-packages/xarray/core/indexes.py", line 212, in get_indexer_nd
    flat_indexer = index.get_indexer(flat_labels, method=method, tolerance=tolerance)
  File "/opt/miniconda3/envs/reproduce_invalidindexerror/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3729, in get_indexer
    raise InvalidIndexError(self._requires_unique_msg)
pandas.errors.InvalidIndexError: Reindexing only valid with uniquely valued Index objects
```

Anything else we need to know?

Workaround: Use synchronous dask scheduler

The issue does not occur if I use the synchronous dask scheduler by adding the following at the very beginning of my script:

```python
import dask
dask.config.set(scheduler='single-threaded')
```

Additional debugging print

If I add the following debugging print to the pandas code:

```diff
--- /tmp/base.py	2022-09-12 16:35:53.739971953 +0200
+++ /opt/miniconda3/envs/reproduce_invalidindexerror/lib/python3.8/site-packages/pandas/core/indexes/base.py	2022-09-12 16:35:58.864144801 +0200
@@ -3718,7 +3718,6 @@
         self._check_indexing_method(method, limit, tolerance)

         if not self._index_as_unique:
+            print("Original: ", len(self), ", length of set:", len(set(self)))
             raise InvalidIndexError(self._requires_unique_msg)

         if len(target) == 0:
```

...I get the following output:

Original: 3879 , length of set: 3879

So the index seems to be unique, but self.is_unique is False for some reason (note that not self._index_as_unique and not self.is_unique are equivalent in this case).
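The suspected mechanism — a lazily cached uniqueness flag read and written without synchronization — can be sketched with a stdlib-only toy model (this illustrates the hypothesis only; it is not pandas code, and the class and names are invented):

```python
import threading

class ToyIndex:
    """Toy stand-in for an index whose is_unique flag is computed lazily."""

    def __init__(self, values):
        self.values = values
        self._cache = {}  # unsynchronized cache, shared between threads

    @property
    def is_unique(self):
        # classic check-then-act without a lock: another thread may run
        # between the membership test and the cache write
        if "is_unique" not in self._cache:
            self._cache["is_unique"] = len(set(self.values)) == len(self.values)
        return self._cache["is_unique"]

idx = ToyIndex(list(range(3879)))
results = []

def reader():
    results.append(idx.is_unique)

threads = [threading.Thread(target=reader) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(all(results), len(results))  # -> True 8
```

In this toy the computed value is always True, so the race is harmless; the point is only that an unsynchronized check-then-write on shared cached state is the kind of pattern the 1s-sleep experiment probes.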

Proof of race condition: add a 1s sleep

To confirm that the race condition occurs at this point, we wait for 1s and then check uniqueness again:

```diff
--- /tmp/base.py	2022-09-12 16:35:53.739971953 +0200
+++ /opt/miniconda3/envs/reproduce_invalidindexerror/lib/python3.8/site-packages/pandas/core/indexes/base.py	2022-09-12 16:35:58.864144801 +0200
@@ -3718,7 +3718,10 @@
         self._check_indexing_method(method, limit, tolerance)

         if not self._index_as_unique:
+            if not self.is_unique:
+                import time
+                time.sleep(1)
+                print("now unique?", self.is_unique)
             raise InvalidIndexError(self._requires_unique_msg)
```

This outputs:

now unique? True

Environment

```
INSTALLED VERSIONS
------------------
commit: None
python: 3.8.10 (default, Jun 22 2022, 20:18:18) [GCC 9.4.0]
python-bits: 64
OS: Linux
OS-release: 5.4.0-125-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.4
libnetcdf: 4.7.3

xarray: 0.15.0
pandas: 0.25.3
numpy: 1.17.4
scipy: 1.3.3
netCDF4: 1.5.3
pydap: None
h5netcdf: 0.7.1
h5py: 2.10.0
Nio: None
zarr: 2.4.0+ds
cftime: 1.1.0
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.1.3
cfgrib: None
iris: None
bottleneck: 1.2.1
dask: 2.8.1+dfsg
distributed: None
matplotlib: 3.1.2
cartopy: None
seaborn: 0.10.0
numbagg: None
setuptools: 45.2.0
pip3: None
conda: None
pytest: 4.6.9
IPython: 7.13.0
sphinx: 1.8.5
```
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/7059/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
1315111684 I_kwDOAMm_X85OYwME 6816 pandas.errors.InvalidIndexError is raised in some runs when using chunks and map_blocks() lumbric 691772 closed 0     5 2022-07-22T14:56:41Z 2022-09-13T09:39:48Z 2022-08-19T14:06:09Z CONTRIBUTOR      

What is your issue?

I'm doing a lengthy computation, which involves hundreds of GB of data using chunks and map_blocks() so that things fit into RAM and can be done in parallel. From time to time, the following error is raised:

pandas.errors.InvalidIndexError: Reindexing only valid with uniquely valued Index objects

The line where this takes place looks pretty harmless:

```python
x = a * b.sel(c=d.c)
```

It's a line inside the function func which is passed to a map_blocks() call. In this case a and b are xr.DataArray or xr.Dataset objects shadowed from outer scope, and d is the obj parameter of map_blocks().

That means, the line below in the traceback looks like this:

xr.map_blocks(
    lambda d: worker(d).compute().chunk({"time": None}),
    d,
    template=template)

I guess it's some kind of race condition, since it's not 100% reproducible, but I have no idea how to further investigate the issue to create a proper bug report or fix my code.

Do you have any hint how I could continue building a minimal example or so in such a case? What does the error message want to tell me?

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/6816/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
467494277 MDExOlB1bGxSZXF1ZXN0Mjk3MTM2MjEz 3104 Fix minor typos in documentation lumbric 691772 closed 0     2 2019-07-12T16:13:15Z 2019-07-12T16:53:28Z 2019-07-12T16:51:54Z CONTRIBUTOR   0 pydata/xarray/pulls/3104
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/3104/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
467482848 MDExOlB1bGxSZXF1ZXN0Mjk3MTI2ODgw 3103 Add missing assert to unit test lumbric 691772 closed 0     1 2019-07-12T15:46:20Z 2019-07-12T16:35:16Z 2019-07-12T16:35:16Z CONTRIBUTOR   0 pydata/xarray/pulls/3103

Stumbled upon a unit test which didn't test anything.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/3103/reactions",
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
438389323 MDU6SXNzdWU0MzgzODkzMjM= 2928 Dask outputs warning: "The da.atop function has moved to da.blockwise" lumbric 691772 closed 0     4 2019-04-29T15:59:31Z 2019-07-12T15:56:29Z 2019-07-12T15:56:28Z CONTRIBUTOR      

Problem description

dask 1.1.0 moved atop() to blockwise() and introduced a warning when atop() is used.

Related

  • upstream ticket and PR of dask change: dask/dask#4348 dask/dask#4035
  • the warning appears in the dask documentation in an xarray example, probably not on purpose
  • warnings have been already discussed in #2727, but not fixed there
  • same issue in a different project: pytroll/satpy#608

Code Sample

```python
import numpy as np
import xarray as xr

d = xr.DataArray(np.ones(1000))
d.to_netcdf('/tmp/ones.nc')
d = xr.open_dataarray('/tmp/ones.nc', chunks=10)
xr.apply_ufunc(lambda x: 42 * x, d, dask='parallelized', output_dtypes=[np.float64])
```

This outputs the warning:

```
...lib/python3.7/site-packages/dask/array/blockwise.py:204: UserWarning: The da.atop function has moved to da.blockwise
  warnings.warn("The da.atop function has moved to da.blockwise")
```

Expected Output

No warning. As a user of recent versions of dask and xarray, there shouldn't be any warnings if everything is done right. The warning should be addressed inside xarray somehow.

Solution

Not sure. Can xarray break compatibility with dask <1.1.0 in some future version? Otherwise I guess there needs to be some compatibility code in xarray which calls the right function.
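One way to keep compatibility with dask <1.1.0 while avoiding the warning on newer versions is a small dispatch helper that prefers the new name and falls back to the old one. A minimal sketch (the helper name get_blockwise is made up, and SimpleNamespace objects stand in for the dask.array module before and after the rename):

```python
from types import SimpleNamespace

def get_blockwise(da):
    """Return da.blockwise if present (dask >= 1.1.0), else fall back to da.atop."""
    fn = getattr(da, "blockwise", None)
    return fn if fn is not None else da.atop

# Stand-ins for the dask.array module after and before the rename:
new_da = SimpleNamespace(blockwise=lambda *a, **kw: "called blockwise")
old_da = SimpleNamespace(atop=lambda *a, **kw: "called atop")

print(get_blockwise(new_da)())  # -> called blockwise
print(get_blockwise(old_da)())  # -> called atop
```

Dispatching on the presence of the attribute rather than on a version string avoids parsing dask's version number.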

Output of xr.show_versions()

```
INSTALLED VERSIONS
------------------
commit: None
python: 3.7.3 (default, Mar 27 2019, 22:11:17) [GCC 7.3.0]
python-bits: 64
OS: Linux
OS-release: 4.18.0-17-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.4
libnetcdf: 4.6.1

xarray: 0.12.1
pandas: 0.24.2
numpy: 1.16.3
scipy: 1.2.1
netCDF4: 1.4.2
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: 1.0.3.4
nc_time_axis: None
PseudonetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 1.2.0
distributed: 1.27.0
matplotlib: 3.0.3
cartopy: None
seaborn: 0.9.0
setuptools: 41.0.0
pip: 19.1
conda: None
pytest: 4.4.1
IPython: 7.5.0
sphinx: None
```
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2928/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
434444058 MDExOlB1bGxSZXF1ZXN0MjcxNDM1NjU4 2904 Minor improvement of docstring for Dataset lumbric 691772 closed 0     6 2019-04-17T19:16:50Z 2019-04-17T20:09:26Z 2019-04-17T20:08:46Z CONTRIBUTOR   0 pydata/xarray/pulls/2904

This might help to avoid confusion: data_vars is always a mapping, never a bare variable or tuple.

Passing just a tuple does not work, of course. But for xarray newbies this might be less obvious, and the error message is also not easy to interpret:

```python
>>> xr.Dataset(('dim1', np.ones(5)))
...
TypeError: unhashable type: 'numpy.ndarray'
```

The correct version of the example above should be:

```python
>>> xr.Dataset({'myvar': ('dim1', np.ones(5))})
<xarray.Dataset>
Dimensions:  (dim1: 5)
Dimensions without coordinates: dim1
Data variables:
    myvar    (dim1) float64 1.0 1.0 1.0 1.0 1.0
```

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2904/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
434439562 MDExOlB1bGxSZXF1ZXN0MjcxNDMyMTc5 2903 Fix minor typos in docstrings lumbric 691772 closed 0     1 2019-04-17T19:05:47Z 2019-04-17T19:15:10Z 2019-04-17T19:15:10Z CONTRIBUTOR   0 pydata/xarray/pulls/2903

See also pull request #2860; the same typo appeared in many places. Sorry, I missed the other places when sending the first PR.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2903/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
427604384 MDExOlB1bGxSZXF1ZXN0MjY2MTYyNTQw 2860 Fix minor typo in docstring lumbric 691772 closed 0     1 2019-04-01T09:35:02Z 2019-04-01T11:18:40Z 2019-04-01T11:18:29Z CONTRIBUTOR   0 pydata/xarray/pulls/2860
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2860/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
389685381 MDExOlB1bGxSZXF1ZXN0MjM3NjE4MzYx 2598 Fix wrong error message in interp() lumbric 691772 closed 0     2 2018-12-11T10:09:53Z 2018-12-11T19:29:03Z 2018-12-11T19:29:03Z CONTRIBUTOR   0 pydata/xarray/pulls/2598

This is just a minor fix of a wrong error message. Please let me know if you think that this is worth testing in unit tests.

Before:

```python
>>> import xarray as xr
>>> d = xr.DataArray([1,2,3])
>>> d.interp(1)
...
ValueError: the first argument to .rename must be a dictionary
```

After:

```python
>>> import xarray as xr
>>> d = xr.DataArray([1,2,3])
>>> d.interp(1)
...
ValueError: the first argument to .interp must be a dictionary
```

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2598/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull


CREATE TABLE [issues] (
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [number] INTEGER,
   [title] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [state] TEXT,
   [locked] INTEGER,
   [assignee] INTEGER REFERENCES [users]([id]),
   [milestone] INTEGER REFERENCES [milestones]([id]),
   [comments] INTEGER,
   [created_at] TEXT,
   [updated_at] TEXT,
   [closed_at] TEXT,
   [author_association] TEXT,
   [active_lock_reason] TEXT,
   [draft] INTEGER,
   [pull_request] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [state_reason] TEXT,
   [repo] INTEGER REFERENCES [repos]([id]),
   [type] TEXT
);
CREATE INDEX [idx_issues_repo]
    ON [issues] ([repo]);
CREATE INDEX [idx_issues_milestone]
    ON [issues] ([milestone]);
CREATE INDEX [idx_issues_assignee]
    ON [issues] ([assignee]);
CREATE INDEX [idx_issues_user]
    ON [issues] ([user]);
Powered by Datasette · Queries took 44.996ms · About: xarray-datasette