issues
3 rows where repo = 13221727, type = "issue" and user = 691772 sorted by updated_at descending
id | node_id | number | title | user | state | locked | assignee | milestone | comments | created_at | updated_at ▲ | closed_at | author_association | active_lock_reason | draft | pull_request | body | reactions | performed_via_github_app | state_reason | repo | type |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1379372915 | I_kwDOAMm_X85SN49z | 7059 | pandas.errors.InvalidIndexError raised when running computation in parallel using dask | lumbric 691772 | open | 0 | 8 | 2022-09-20T12:52:16Z | 2024-03-02T16:43:15Z | CONTRIBUTOR |
**What happened?**

I'm doing a computation using chunks and `map_blocks()`. (This issue was initially discussed in #6816, but that ticket was closed because I couldn't reproduce the problem any longer. Now it seems to be reproducible in every run, so it is time for a proper bug report, which is this ticket.)

**What did you expect to happen?**

Dask schedulers should work in parallel without raising an `InvalidIndexError`, just as the synchronous scheduler does.

**Minimal Complete Verifiable Example 1**

Edit: I've managed to reduce the verifiable example, see example 2 below.

I wasn't able to reproduce the issue with a smaller code example, so I provide all my code and my test data. This should make it possible to reproduce the issue in less than a minute.

Requirements:
- git
- mamba, see https://github.com/mamba-org/mamba

```bash
git clone https://github.com/lumbric/reproduce_invalidindexerror.git
cd reproduce_invalidindexerror

mamba env create -f env.yml

# alternatively run the following, will install latest versions from conda-forge:
conda create -n reproduce_invalidindexerror
conda activate reproduce_invalidindexerror
mamba install -c conda-forge python=3.8 matplotlib pytest-cov dask openpyxl pytest pip xarray netcdf4 jupyter pandas scipy flake8 dvc pre-commit pyarrow statsmodels rasterio scikit-learn pytest-watch pdbpp black seaborn

conda activate reproduce_invalidindexerror
dvc repro checks_simulation
```

**Minimal Complete Verifiable Example 2**

```python
import numpy as np
import pandas as pd
import xarray as xr
from multiprocessing import Lock
from dask.diagnostics import ProgressBar

# Workaround for xarray#6816: parallel execution often causes an InvalidIndexError
# https://github.com/pydata/xarray/issues/6816#issuecomment-1243864752
# import dask
# dask.config.set(scheduler="single-threaded")


def generate_netcdf_files():
    fnames = [f"{i:02d}.nc" for i in range(21)]
    for i, fname in enumerate(fnames):
        xr.DataArray(
            np.ones((3879, 48)),
            dims=("locations", "time"),
            coords={
                "time": pd.date_range(f"{2000 + i}-01-01", periods=48, freq="D"),
                "locations": np.arange(3879),
            },
        ).to_netcdf(fname)
    return fnames


def compute(locations, data):
    def resample_annually(data):
        return data.sortby("time").resample(time="1A", label="left", loffset="1D").mean(dim="time")


def main():
    fnames = generate_netcdf_files()


if __name__ == "__main__":
    main()
```

**MVCE confirmation**
**Relevant log output**

This is the traceback of "Minimal Complete Verifiable Example 1".
**Anything else we need to know?**

**Workaround: Use synchronous dask scheduler**

The issue does not occur if I use the synchronous dask scheduler by adding the following at the very beginning of my script:
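```python
import dask

# the same setting as the commented-out workaround in Example 2 above
dask.config.set(scheduler="single-threaded")
```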
**Additional debugging print**

If I add the following debugging print to the pandas code:

```diff
--- /tmp/base.py	2022-09-12 16:35:53.739971953 +0200
+++ /opt/miniconda3/envs/reproduce_invalidindexerror/lib/python3.8/site-packages/pandas/core/indexes/base.py	2022-09-12 16:35:58.864144801 +0200
@@ -3718,7 +3718,6 @@
         self._check_indexing_method(method, limit, tolerance)
```
So the index seems to be unique, but the `InvalidIndexError` is raised anyway.

**Proof of race condition: add sleep 1s**

To confirm that the race condition is at this point, we wait for 1s and then check again for uniqueness:

```diff
--- /tmp/base.py	2022-09-12 16:35:53.739971953 +0200
+++ /opt/miniconda3/envs/reproduce_invalidindexerror/lib/python3.8/site-packages/pandas/core/indexes/base.py	2022-09-12 16:35:58.864144801 +0200
@@ -3718,7 +3718,10 @@
         self._check_indexing_method(method, limit, tolerance)
```
This outputs:
Environment
INSTALLED VERSIONS
------------------
commit: None
python: 3.8.10 (default, Jun 22 2022, 20:18:18)
[GCC 9.4.0]
python-bits: 64
OS: Linux
OS-release: 5.4.0-125-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.4
libnetcdf: 4.7.3
xarray: 0.15.0
pandas: 0.25.3
numpy: 1.17.4
scipy: 1.3.3
netCDF4: 1.5.3
pydap: None
h5netcdf: 0.7.1
h5py: 2.10.0
Nio: None
zarr: 2.4.0+ds
cftime: 1.1.0
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.1.3
cfgrib: None
iris: None
bottleneck: 1.2.1
dask: 2.8.1+dfsg
distributed: None
matplotlib: 3.1.2
cartopy: None
seaborn: 0.10.0
numbagg: None
setuptools: 45.2.0
pip3: None
conda: None
pytest: 4.6.9
IPython: 7.13.0
sphinx: 1.8.5
|
{ "url": "https://api.github.com/repos/pydata/xarray/issues/7059/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
xarray 13221727 | issue | ||||||||
1315111684 | I_kwDOAMm_X85OYwME | 6816 | pandas.errors.InvalidIndexError is raised in some runs when using chunks and map_blocks() | lumbric 691772 | closed | 0 | 5 | 2022-07-22T14:56:41Z | 2022-09-13T09:39:48Z | 2022-08-19T14:06:09Z | CONTRIBUTOR |
**What is your issue?**

I'm doing a lengthy computation which involves hundreds of GB of data, using chunks and map_blocks() so that things fit into RAM and can be done in parallel. From time to time, the following error is raised:
The line where this takes place looks pretty harmless:
It's a line inside the function I pass to `map_blocks()`. That means the line below in the traceback looks like this:
I guess it's some kind of race condition, since it's not 100% reproducible, but I have no idea how to further investigate the issue to create a proper bug report or fix my code. Do you have any hint on how I could continue building a minimal example in such a case? What does the error message want to tell me? |
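For illustration, a minimal sketch of the kind of setup described above: a chunked DataArray with a user function applied per chunk via `map_blocks()`. The dimensions, sizes, and per-block function are made up and are not the code from the failing pipeline.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Chunked input along "locations"; the sizes are arbitrary placeholders.
da = xr.DataArray(
    np.random.rand(100, 365),
    dims=("locations", "time"),
    coords={
        "locations": np.arange(100),
        "time": pd.date_range("2000-01-01", periods=365, freq="D"),
    },
).chunk({"locations": 10})


def per_block(block):
    # A harmless-looking per-block operation; in the real pipeline the line
    # the traceback points at lives in a function like this one.
    return block - block.mean("time")


result = da.map_blocks(per_block).compute()
```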
{ "url": "https://api.github.com/repos/pydata/xarray/issues/6816/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
completed | xarray 13221727 | issue | ||||||
438389323 | MDU6SXNzdWU0MzgzODkzMjM= | 2928 | Dask outputs warning: "The da.atop function has moved to da.blockwise" | lumbric 691772 | closed | 0 | 4 | 2019-04-29T15:59:31Z | 2019-07-12T15:56:29Z | 2019-07-12T15:56:28Z | CONTRIBUTOR |
**Problem description**

dask 1.1.0 moved `da.atop` to `da.blockwise`.

**Related**
**Code Sample**

```python
import numpy as np
import xarray as xr

xr.DataArray(np.ones(1000))
d = xr.DataArray(np.ones(1000))
d.to_netcdf('/tmp/ones.nc')
d = xr.open_dataarray('/tmp/ones.nc', chunks=10)

xr.apply_ufunc(lambda x: 42 * x, d, dask='parallelized', output_dtypes=[np.float64])
```

This outputs the warning:
**Expected Output**

No warning. As a user of recent versions of dask and xarray, there shouldn't be any warnings if everything is done right. The warning should be tackled inside xarray somehow.

**Solution**

Not sure, can xarray break compatibility with dask <1.1.0 in some future version? Otherwise I guess there needs to be some legacy code in xarray which calls the right function.

**Output of `xr.show_versions()`**
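A sketch of the kind of legacy shim hinted at above, assuming the new name is preferred when available; this is hypothetical code, not xarray's actual fix:

```python
# Hypothetical compatibility shim (not xarray's actual implementation):
# prefer dask.array.blockwise on dask >= 1.1.0 and fall back to atop otherwise.
try:
    from dask.array import blockwise
except ImportError:
    from dask.array import atop as blockwise
```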
|
{ "url": "https://api.github.com/repos/pydata/xarray/issues/2928/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
completed | xarray 13221727 | issue |
```sql
CREATE TABLE [issues] (
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [number] INTEGER,
   [title] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [state] TEXT,
   [locked] INTEGER,
   [assignee] INTEGER REFERENCES [users]([id]),
   [milestone] INTEGER REFERENCES [milestones]([id]),
   [comments] INTEGER,
   [created_at] TEXT,
   [updated_at] TEXT,
   [closed_at] TEXT,
   [author_association] TEXT,
   [active_lock_reason] TEXT,
   [draft] INTEGER,
   [pull_request] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [state_reason] TEXT,
   [repo] INTEGER REFERENCES [repos]([id]),
   [type] TEXT
);
CREATE INDEX [idx_issues_repo] ON [issues] ([repo]);
CREATE INDEX [idx_issues_milestone] ON [issues] ([milestone]);
CREATE INDEX [idx_issues_assignee] ON [issues] ([assignee]);
CREATE INDEX [idx_issues_user] ON [issues] ([user]);
```
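For reference, the listing above ("3 rows where repo = 13221727, type = "issue" and user = 691772 sorted by updated_at descending") can be reproduced against this schema with Python's sqlite3 module; the database filename github.db is an assumption.

```python
import sqlite3

# "github.db" is an assumed filename for the SQLite database holding this table.
conn = sqlite3.connect("github.db")
rows = conn.execute(
    """
    SELECT id, number, title, state, comments, updated_at
    FROM issues
    WHERE repo = 13221727 AND type = 'issue' AND [user] = 691772
    ORDER BY updated_at DESC
    """
).fetchall()
for row in rows:
    print(row)
```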