issues


9 rows where repo = 13221727 and user = 691772 sorted by updated_at descending


id node_id number title user state locked assignee milestone comments created_at updated_at ▲ closed_at author_association active_lock_reason draft pull_request body reactions performed_via_github_app state_reason repo type
1379372915 I_kwDOAMm_X85SN49z 7059 pandas.errors.InvalidIndexError raised when running computation in parallel using dask lumbric 691772 open 0     8 2022-09-20T12:52:16Z 2024-03-02T16:43:15Z   CONTRIBUTOR      

What happened?

I'm doing a computation using chunks and map_blocks() to run things in parallel. At some point a pandas.errors.InvalidIndexError is raised. When using dask's synchronous scheduler, everything works fine. I think pandas.core.indexes.base.Index is not thread-safe. At least this seems to be the place of the race condition. See further tests below.

(This issue was initially discussed in #6816, but that ticket was closed because I couldn't reproduce the problem any longer. Now it seems to be reproducible in every run, so it is time for a proper bug report, which is this ticket.)

What did you expect to happen?

The dask schedulers single-threaded and threads should produce the same result.
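This expectation can be illustrated with a stdlib-only toy model (not dask itself; the worker and chunking here are invented for illustration): a pure, thread-safe worker mapped over chunks should give identical results whether it runs serially or on a thread pool.

```python
from concurrent.futures import ThreadPoolExecutor

def worker(chunk):
    # pure function with no shared mutable state: safe under any scheduler
    return sum(chunk)

chunks = [list(range(i, i + 10)) for i in range(0, 100, 10)]

serial = [worker(c) for c in chunks]            # like dask's "single-threaded" scheduler
with ThreadPoolExecutor(max_workers=4) as ex:   # like dask's "threads" scheduler
    threaded = list(ex.map(worker, chunks))

print(serial == threaded)  # -> True
```

The bug report below shows this equivalence breaking for xarray's `sel()`, which suggests shared mutable state somewhere in the indexing path.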

Minimal Complete Verifiable Example 1

Edit: I've managed to reduce the verifiable example, see example 2 below.

I wasn't able to reproduce the issue with a smaller code example, so I provide all my code and my test data. This should make it possible to reproduce the issue in less than a minute.

Requirements:

- git
- mamba, see https://github.com/mamba-org/mamba

```bash
git clone https://github.com/lumbric/reproduce_invalidindexerror.git
cd reproduce_invalidindexerror

mamba env create -f env.yml

# alternatively run the following, which will install the latest versions from conda-forge:
# conda create -n reproduce_invalidindexerror
# conda activate reproduce_invalidindexerror
# mamba install -c conda-forge python=3.8 matplotlib pytest-cov dask openpyxl pytest pip xarray netcdf4 jupyter pandas scipy flake8 dvc pre-commit pyarrow statsmodels rasterio scikit-learn pytest-watch pdbpp black seaborn

conda activate reproduce_invalidindexerror
dvc repro checks_simulation
```

Minimal Complete Verifiable Example 2

```python
import numpy as np
import pandas as pd
import xarray as xr

from multiprocessing import Lock
from dask.diagnostics import ProgressBar

# Workaround for xarray#6816: parallel execution often causes an InvalidIndexError
# https://github.com/pydata/xarray/issues/6816#issuecomment-1243864752
# import dask
# dask.config.set(scheduler="single-threaded")


def generate_netcdf_files():
    fnames = [f"{i:02d}.nc" for i in range(21)]
    for i, fname in enumerate(fnames):
        xr.DataArray(
            np.ones((3879, 48)),
            dims=("locations", "time"),
            coords={
                "time": pd.date_range(f"{2000 + i}-01-01", periods=48, freq="D"),
                "locations": np.arange(3879),
            },
        ).to_netcdf(fname)
    return fnames


def compute(locations, data):
    def resample_annually(data):
        return (
            data.sortby("time")
            .resample(time="1A", label="left", loffset="1D")
            .mean(dim="time")
        )

    def worker(data):
        locations_chunk = locations.sel(locations=data.locations)
        out_raw = data * locations_chunk
        out = resample_annually(out_raw)
        return out

    template = resample_annually(data)

    out = xr.map_blocks(
        lambda data: worker(data).compute().chunk({"time": None}),
        data,
        template=template,
    )

    return out


def main():
    fnames = generate_netcdf_files()

    locations = xr.DataArray(
        np.ones(3879),
        dims="locations",
        coords={"locations": np.arange(3879)},
    )

    data = xr.open_mfdataset(
        fnames,
        combine="by_coords",
        chunks={"locations": 4000, "time": None},
        # suggested as solution in
        # lock=Lock(),
    ).__xarray_dataarray_variable__

    out = compute(locations, data)

    with ProgressBar():
        out = out.compute()


if __name__ == "__main__":
    main()
```

MVCE confirmation

  • [x] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • [x] Complete example — the example is self-contained, including all data and the text of any traceback.
  • [x] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • [x] New issue — a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

This is the traceback of "Minimal Complete Verifiable Example 1".

```python
Traceback (most recent call last):
  File "scripts/calc_p_out_model.py", line 61, in <module>
    main()
  File "scripts/calc_p_out_model.py", line 31, in main
    calc_power(name="p_out_model", compute_func=compute_func)
  File "/tmp/reproduce_invalidindexerror/src/wind_power.py", line 136, in calc_power
    power = power.compute()
  File "/opt/miniconda3/envs/reproduce_invalidindexerror/lib/python3.8/site-packages/xarray/core/dataarray.py", line 993, in compute
    return new.load(**kwargs)
  File "/opt/miniconda3/envs/reproduce_invalidindexerror/lib/python3.8/site-packages/xarray/core/dataarray.py", line 967, in load
    ds = self._to_temp_dataset().load(**kwargs)
  File "/opt/miniconda3/envs/reproduce_invalidindexerror/lib/python3.8/site-packages/xarray/core/dataset.py", line 733, in load
    evaluated_data = da.compute(*lazy_data.values(), **kwargs)
  File "/opt/miniconda3/envs/reproduce_invalidindexerror/lib/python3.8/site-packages/dask/base.py", line 600, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/opt/miniconda3/envs/reproduce_invalidindexerror/lib/python3.8/site-packages/dask/threaded.py", line 89, in get
    results = get_async(
  File "/opt/miniconda3/envs/reproduce_invalidindexerror/lib/python3.8/site-packages/dask/local.py", line 511, in get_async
    raise_exception(exc, tb)
  File "/opt/miniconda3/envs/reproduce_invalidindexerror/lib/python3.8/site-packages/dask/local.py", line 319, in reraise
    raise exc
  File "/opt/miniconda3/envs/reproduce_invalidindexerror/lib/python3.8/site-packages/dask/local.py", line 224, in execute_task
    result = _execute_task(task, data)
  File "/opt/miniconda3/envs/reproduce_invalidindexerror/lib/python3.8/site-packages/dask/core.py", line 119, in _execute_task
    return func(*(_execute_task(a, cache) for a in args))
  File "/opt/miniconda3/envs/reproduce_invalidindexerror/lib/python3.8/site-packages/dask/core.py", line 119, in <genexpr>
    return func(*(_execute_task(a, cache) for a in args))
  File "/opt/miniconda3/envs/reproduce_invalidindexerror/lib/python3.8/site-packages/dask/core.py", line 119, in _execute_task
    return func(*(_execute_task(a, cache) for a in args))
  File "/opt/miniconda3/envs/reproduce_invalidindexerror/lib/python3.8/site-packages/xarray/core/parallel.py", line 285, in _wrapper
    result = func(*converted_args, **kwargs)
  File "/tmp/reproduce_invalidindexerror/src/wind_power.py", line 100, in <lambda>
    lambda wind_speeds: worker(wind_speeds).compute().chunk({"time": None}),
  File "/tmp/reproduce_invalidindexerror/src/wind_power.py", line 50, in worker
    specific_power_chunk = specific_power.sel(turbines=wind_speeds.turbines)
  File "/opt/miniconda3/envs/reproduce_invalidindexerror/lib/python3.8/site-packages/xarray/core/dataarray.py", line 1420, in sel
    ds = self._to_temp_dataset().sel(
  File "/opt/miniconda3/envs/reproduce_invalidindexerror/lib/python3.8/site-packages/xarray/core/dataset.py", line 2533, in sel
    query_results = map_index_queries(
  File "/opt/miniconda3/envs/reproduce_invalidindexerror/lib/python3.8/site-packages/xarray/core/indexing.py", line 183, in map_index_queries
    results.append(index.sel(labels, **options))  # type: ignore[call-arg]
  File "/opt/miniconda3/envs/reproduce_invalidindexerror/lib/python3.8/site-packages/xarray/core/indexes.py", line 418, in sel
    indexer = get_indexer_nd(self.index, label_array, method, tolerance)
  File "/opt/miniconda3/envs/reproduce_invalidindexerror/lib/python3.8/site-packages/xarray/core/indexes.py", line 212, in get_indexer_nd
    flat_indexer = index.get_indexer(flat_labels, method=method, tolerance=tolerance)
  File "/opt/miniconda3/envs/reproduce_invalidindexerror/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3729, in get_indexer
    raise InvalidIndexError(self._requires_unique_msg)
pandas.errors.InvalidIndexError: Reindexing only valid with uniquely valued Index objects
```

Anything else we need to know?

Workaround: Use synchronous dask scheduler

The issue does not occur if I use the synchronous dask scheduler by adding the following at the very beginning of my script:

```python
import dask
dask.config.set(scheduler='single-threaded')
```

Additional debugging print

If I add the following debugging print to the pandas code:

```diff
--- /tmp/base.py	2022-09-12 16:35:53.739971953 +0200
+++ /opt/miniconda3/envs/reproduce_invalidindexerror/lib/python3.8/site-packages/pandas/core/indexes/base.py	2022-09-12 16:35:58.864144801 +0200
@@ -3718,7 +3718,6 @@
         self._check_indexing_method(method, limit, tolerance)

         if not self._index_as_unique:
+            print("Original: ", len(self), ", length of set:", len(set(self)))
             raise InvalidIndexError(self._requires_unique_msg)

         if len(target) == 0:
```

...I get the following output:

Original: 3879 , length of set: 3879

So the index seems to be unique, but self.is_unique is False for some reason (note that not self._index_as_unique and not self.is_unique are equivalent in this case).
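The suspected mechanism — a lazily cached uniqueness flag read and written without synchronization — can be sketched with a stdlib-only toy model (this illustrates the hypothesis only; it is not pandas code, and the class and names are invented):

```python
import threading

class ToyIndex:
    """Toy stand-in for an index whose is_unique flag is computed lazily."""

    def __init__(self, values):
        self.values = values
        self._cache = {}  # unsynchronized cache, shared between threads

    @property
    def is_unique(self):
        # classic check-then-act without a lock: another thread may run
        # between the membership test and the cache write
        if "is_unique" not in self._cache:
            self._cache["is_unique"] = len(set(self.values)) == len(self.values)
        return self._cache["is_unique"]

idx = ToyIndex(list(range(3879)))
results = []

def reader():
    results.append(idx.is_unique)

threads = [threading.Thread(target=reader) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(all(results), len(results))  # -> True 8
```

In this toy the computed value is always True, so the race is harmless; the point is only that an unsynchronized check-then-write on shared cached state is the kind of pattern the 1s-sleep experiment probes.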

Proof of race condition: add a 1s sleep

To confirm that the race condition occurs at this point, we wait for 1s and then check uniqueness again:

```diff
--- /tmp/base.py	2022-09-12 16:35:53.739971953 +0200
+++ /opt/miniconda3/envs/reproduce_invalidindexerror/lib/python3.8/site-packages/pandas/core/indexes/base.py	2022-09-12 16:35:58.864144801 +0200
@@ -3718,7 +3718,10 @@
         self._check_indexing_method(method, limit, tolerance)

         if not self._index_as_unique:
+            if not self.is_unique:
+                import time
+                time.sleep(1)
+                print("now unique?", self.is_unique)
             raise InvalidIndexError(self._requires_unique_msg)
```

This outputs:

now unique? True

Environment

```
INSTALLED VERSIONS
------------------
commit: None
python: 3.8.10 (default, Jun 22 2022, 20:18:18) [GCC 9.4.0]
python-bits: 64
OS: Linux
OS-release: 5.4.0-125-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.4
libnetcdf: 4.7.3

xarray: 0.15.0
pandas: 0.25.3
numpy: 1.17.4
scipy: 1.3.3
netCDF4: 1.5.3
pydap: None
h5netcdf: 0.7.1
h5py: 2.10.0
Nio: None
zarr: 2.4.0+ds
cftime: 1.1.0
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.1.3
cfgrib: None
iris: None
bottleneck: 1.2.1
dask: 2.8.1+dfsg
distributed: None
matplotlib: 3.1.2
cartopy: None
seaborn: 0.10.0
numbagg: None
setuptools: 45.2.0
pip3: None
conda: None
pytest: 4.6.9
IPython: 7.13.0
sphinx: 1.8.5
```
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/7059/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
1315111684 I_kwDOAMm_X85OYwME 6816 pandas.errors.InvalidIndexError is raised in some runs when using chunks and map_blocks() lumbric 691772 closed 0     5 2022-07-22T14:56:41Z 2022-09-13T09:39:48Z 2022-08-19T14:06:09Z CONTRIBUTOR      

What is your issue?

I'm doing a lengthy computation, which involves hundreds of GB of data using chunks and map_blocks() so that things fit into RAM and can be done in parallel. From time to time, the following error is raised:

pandas.errors.InvalidIndexError: Reindexing only valid with uniquely valued Index objects

The line where this takes place looks pretty harmless:

```python
x = a * b.sel(c=d.c)
```

It's a line inside the function func which is passed to a map_blocks() call. In this case a and b are xr.DataArray or xr.Dataset objects shadowed from outer scope, and d is the obj parameter of map_blocks().

That means, the line below in the traceback looks like this:

xr.map_blocks(
    lambda d: worker(d).compute().chunk({"time": None}),
    d,
    template=template)

I guess it's some kind of race condition, since it's not 100% reproducible, but I have no idea how to further investigate the issue to create a proper bug report or fix my code.

Do you have any hint how I could continue building a minimal example or so in such a case? What does the error message want to tell me?

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/6816/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
467494277 MDExOlB1bGxSZXF1ZXN0Mjk3MTM2MjEz 3104 Fix minor typos in documentation lumbric 691772 closed 0     2 2019-07-12T16:13:15Z 2019-07-12T16:53:28Z 2019-07-12T16:51:54Z CONTRIBUTOR   0 pydata/xarray/pulls/3104
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/3104/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
467482848 MDExOlB1bGxSZXF1ZXN0Mjk3MTI2ODgw 3103 Add missing assert to unit test lumbric 691772 closed 0     1 2019-07-12T15:46:20Z 2019-07-12T16:35:16Z 2019-07-12T16:35:16Z CONTRIBUTOR   0 pydata/xarray/pulls/3103

Stumbled upon a unit test which didn't test anything.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/3103/reactions",
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
438389323 MDU6SXNzdWU0MzgzODkzMjM= 2928 Dask outputs warning: "The da.atop function has moved to da.blockwise" lumbric 691772 closed 0     4 2019-04-29T15:59:31Z 2019-07-12T15:56:29Z 2019-07-12T15:56:28Z CONTRIBUTOR      

Problem description

dask 1.1.0 moved atop() to blockwise() and introduced a warning when atop() is used.

Related

  • upstream ticket and PR of dask change: dask/dask#4348 dask/dask#4035
  • the warning appears in the dask documentation in an xarray example, probably not on purpose
  • warnings have been already discussed in #2727, but not fixed there
  • same issue in a different project: pytroll/satpy#608

Code Sample

```python
import numpy as np
import xarray as xr

d = xr.DataArray(np.ones(1000))
d.to_netcdf('/tmp/ones.nc')
d = xr.open_dataarray('/tmp/ones.nc', chunks=10)
xr.apply_ufunc(lambda x: 42 * x, d, dask='parallelized', output_dtypes=[np.float64])
```

This outputs the warning:

```
...lib/python3.7/site-packages/dask/array/blockwise.py:204: UserWarning: The da.atop function has moved to da.blockwise
  warnings.warn("The da.atop function has moved to da.blockwise")
```

Expected Output

No warning. As a user of recent versions of dask and xarray, there shouldn't be any warnings if everything is done right. The warning should be addressed inside xarray somehow.

Solution

Not sure. Can xarray break compatibility with dask <1.1.0 in some future version? Otherwise I guess there needs to be some compatibility code in xarray which calls the right function.
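One way to keep compatibility with dask <1.1.0 while avoiding the warning on newer versions is a small dispatch helper that prefers the new name and falls back to the old one. A minimal sketch (the helper name get_blockwise is made up, and SimpleNamespace objects stand in for the dask.array module before and after the rename):

```python
from types import SimpleNamespace

def get_blockwise(da):
    """Return da.blockwise if present (dask >= 1.1.0), else fall back to da.atop."""
    fn = getattr(da, "blockwise", None)
    return fn if fn is not None else da.atop

# Stand-ins for the dask.array module after and before the rename:
new_da = SimpleNamespace(blockwise=lambda *a, **kw: "called blockwise")
old_da = SimpleNamespace(atop=lambda *a, **kw: "called atop")

print(get_blockwise(new_da)())  # -> called blockwise
print(get_blockwise(old_da)())  # -> called atop
```

Dispatching on the presence of the attribute rather than on a version string avoids parsing dask's version number.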

Output of xr.show_versions()

```
INSTALLED VERSIONS
------------------
commit: None
python: 3.7.3 (default, Mar 27 2019, 22:11:17) [GCC 7.3.0]
python-bits: 64
OS: Linux
OS-release: 4.18.0-17-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.4
libnetcdf: 4.6.1

xarray: 0.12.1
pandas: 0.24.2
numpy: 1.16.3
scipy: 1.2.1
netCDF4: 1.4.2
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: 1.0.3.4
nc_time_axis: None
PseudonetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 1.2.0
distributed: 1.27.0
matplotlib: 3.0.3
cartopy: None
seaborn: 0.9.0
setuptools: 41.0.0
pip: 19.1
conda: None
pytest: 4.4.1
IPython: 7.5.0
sphinx: None
```
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2928/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
434444058 MDExOlB1bGxSZXF1ZXN0MjcxNDM1NjU4 2904 Minor improvement of docstring for Dataset lumbric 691772 closed 0     6 2019-04-17T19:16:50Z 2019-04-17T20:09:26Z 2019-04-17T20:08:46Z CONTRIBUTOR   0 pydata/xarray/pulls/2904

This might help to avoid confusion: data_vars is always a mapping, never a bare variable or tuple.

Passing just a tuple does not work, of course. But for xarray newbies this might be less obvious, and the error message is also not easy to interpret:

```python
>>> xr.Dataset(('dim1', np.ones(5)))
...
TypeError: unhashable type: 'numpy.ndarray'
```

The correct version of the example above should be:

```python
>>> xr.Dataset({'myvar': ('dim1', np.ones(5))})
<xarray.Dataset>
Dimensions:  (dim1: 5)
Dimensions without coordinates: dim1
Data variables:
    myvar    (dim1) float64 1.0 1.0 1.0 1.0 1.0
```

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2904/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
434439562 MDExOlB1bGxSZXF1ZXN0MjcxNDMyMTc5 2903 Fix minor typos in docstrings lumbric 691772 closed 0     1 2019-04-17T19:05:47Z 2019-04-17T19:15:10Z 2019-04-17T19:15:10Z CONTRIBUTOR   0 pydata/xarray/pulls/2903

See also pull request #2860; the same typo appeared in many places. Sorry, I missed the other places when sending the first PR.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2903/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
427604384 MDExOlB1bGxSZXF1ZXN0MjY2MTYyNTQw 2860 Fix minor typo in docstring lumbric 691772 closed 0     1 2019-04-01T09:35:02Z 2019-04-01T11:18:40Z 2019-04-01T11:18:29Z CONTRIBUTOR   0 pydata/xarray/pulls/2860
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2860/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
389685381 MDExOlB1bGxSZXF1ZXN0MjM3NjE4MzYx 2598 Fix wrong error message in interp() lumbric 691772 closed 0     2 2018-12-11T10:09:53Z 2018-12-11T19:29:03Z 2018-12-11T19:29:03Z CONTRIBUTOR   0 pydata/xarray/pulls/2598

This is just a minor fix of a wrong error message. Please let me know if you think that this is worth testing in unit tests.

Before:

```python
>>> import xarray as xr
>>> d = xr.DataArray([1,2,3])
>>> d.interp(1)
...
ValueError: the first argument to .rename must be a dictionary
```

After:

```python
>>> import xarray as xr
>>> d = xr.DataArray([1,2,3])
>>> d.interp(1)
...
ValueError: the first argument to .interp must be a dictionary
```

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2598/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull


CREATE TABLE [issues] (
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [number] INTEGER,
   [title] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [state] TEXT,
   [locked] INTEGER,
   [assignee] INTEGER REFERENCES [users]([id]),
   [milestone] INTEGER REFERENCES [milestones]([id]),
   [comments] INTEGER,
   [created_at] TEXT,
   [updated_at] TEXT,
   [closed_at] TEXT,
   [author_association] TEXT,
   [active_lock_reason] TEXT,
   [draft] INTEGER,
   [pull_request] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [state_reason] TEXT,
   [repo] INTEGER REFERENCES [repos]([id]),
   [type] TEXT
);
CREATE INDEX [idx_issues_repo]
    ON [issues] ([repo]);
CREATE INDEX [idx_issues_milestone]
    ON [issues] ([milestone]);
CREATE INDEX [idx_issues_assignee]
    ON [issues] ([assignee]);
CREATE INDEX [idx_issues_user]
    ON [issues] ([user]);
Powered by Datasette · Queries took 44.996ms · About: xarray-datasette