
issue_comments


3 rows where issue = 1379372915 ("pandas.errors.InvalidIndexError raised when running computation in parallel using dask") and user = 691772 (lumbric), sorted by updated_at descending




lumbric (691772) · CONTRIBUTOR · created 2022-10-05T07:02:23Z · updated 2022-10-05T07:02:48Z
https://github.com/pydata/xarray/issues/7059#issuecomment-1268031159

I agree with just passing all args explicitly.

> Does it work otherwise with "processes"?

What do you mean by that?
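(The quoted question presumably refers to dask's multiprocessing scheduler. As a minimal sketch with a toy array rather than the issue's data, this is how one would rerun a computation under the "processes" scheduler to see whether the error reproduces there:)

```python
import dask
import dask.array as da

# Toy computation standing in for the issue's pipeline.
x = da.random.random((1_000, 1_000), chunks=(100, 100))

# Select the multiprocessing scheduler for this block only;
# outside the context manager the default scheduler is restored.
with dask.config.set(scheduler="processes"):
    total = x.sum().compute()
```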

> 1. Why are you chunking inside the mapped function?

Uhm yes, you are right, this should be removed; I'm not sure how this happened. Removing `.chunk({"time": None})` in the lambda function does not change the behavior of the example with respect to this issue.

> 2. If you `conda install flox`, the resample operation should be quite efficient, without the need to use `map_blocks`.

Oh wow, thanks! Haven't seen flox before.
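(For context: once installed, flox is picked up by xarray's groupby/resample machinery automatically, so the resampling from the issue can be written without `map_blocks` at all. A minimal sketch with made-up data; the dimension names mirror the issue's example, and `loffset` matches the issue's code although it is deprecated in newer xarray versions:)

```python
import numpy as np
import pandas as pd
import xarray as xr

# Made-up daily data chunked along time, mirroring the issue's layout.
time = pd.date_range("2000-01-01", periods=3 * 365, freq="D")
data = xr.DataArray(
    np.random.rand(time.size, 10),
    dims=("time", "locations"),
    coords={"time": time},
).chunk({"time": 365})

# With flox installed, this grouped reduction is dispatched to flox's
# optimized implementation automatically; no map_blocks needed.
annual = data.sortby("time").resample(time="1A", label="left", loffset="1D").mean(dim="time")
print(annual.compute())
```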

lumbric (691772) · CONTRIBUTOR · created 2022-09-22T11:09:16Z
https://github.com/pydata/xarray/issues/7059#issuecomment-1254873700

I have managed to reduce the reproducing example (see "Minimal Complete Verifiable Example 2" above) and also to find a proper solution to this issue. I am still not sure whether this is a bug or intended behavior, so I won't close the issue for now.

Basically, the issue occurs when a chunked NetCDF file is loaded from disk, passed to `xarray.map_blocks()`, and then used inside the worker as a parameter to `.sel()` to subset some other xarray object that is not itself passed to the worker `func()`. I think the proper solution is to use the `args` parameter of `map_blocks()` instead of `.sel()`:

```diff
--- run-broken.py   2022-09-22 13:00:41.095555961 +0200
+++ run.py          2022-09-22 13:01:14.452696511 +0200
@@ -30,17 +30,17 @@
 def resample_annually(data):
     return data.sortby("time").resample(time="1A", label="left", loffset="1D").mean(dim="time")
 
-def worker(data):
-    locations_chunk = locations.sel(locations=data.locations)
-    out_raw = data * locations_chunk
+def worker(data, locations):
+    out_raw = data * locations
 
     out = resample_annually(out_raw)
     return out
 
 template = resample_annually(data)
 
 out = xr.map_blocks(
-    lambda data: worker(data).compute().chunk({"time": None}),
+    lambda data, locations: worker(data, locations).compute().chunk({"time": None}),
     data,
+    (locations,),
     template=template,
 )
```

This seems to fix the issue and appears to be the proper solution anyway.
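(As a self-contained illustration of the same pattern, with toy data and hypothetical sizes rather than the issue's actual dataset: passing the auxiliary object through `args` lets `map_blocks` subset it per block, so the worker never has to call `.sel()` on a closed-over object.)

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical stand-ins for the issue's `data` and `locations` objects.
time = pd.date_range("2000-01-01", periods=730, freq="D")
coords = {"time": time, "locations": np.arange(4)}
data = xr.DataArray(
    np.random.rand(time.size, 4), dims=("time", "locations"), coords=coords
).chunk({"locations": 2})
locations = xr.DataArray(
    np.random.rand(4), dims=("locations",), coords={"locations": np.arange(4)}
)

def worker(data, locations):
    # `locations` arrives already subset to this block's coordinates,
    # because map_blocks aligns and splits xarray objects passed via `args`.
    return data * locations

# The output has the same shape and chunks as `data`,
# so `data` itself can serve as the template.
out = xr.map_blocks(worker, data, (locations,), template=data)
print(out.compute())
```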

I still don't see why I am not allowed to use `.sel()` on shadowed objects in the worker `func()`. Is this on purpose? If so, should we add something to the documentation? Is this a specific behavior of `map_blocks()`? Is it related to #6904?

lumbric (691772) · CONTRIBUTOR · created 2022-09-20T15:54:48Z
https://github.com/pydata/xarray/issues/7059#issuecomment-1252561840

@benbovy thanks for the hint! I tried passing an explicit lock to `xr.open_mfdataset()` as suggested, but it didn't change anything; I still get the same exception. I will double-check that I did it the right way; I might be missing something.
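(For reference, a minimal sketch of what "passing an explicit lock" might look like. The file pattern and engine are assumptions, and whether the `lock` keyword is forwarded depends on the backend and xarray version:)

```python
import xarray as xr
from dask.utils import SerializableLock

# One shared, serializable lock so that concurrent reads of the NetCDF
# files from dask workers are funneled through a single lock instance.
lock = SerializableLock()
ds = xr.open_mfdataset(
    "data/*.nc",      # hypothetical file pattern
    engine="netcdf4",
    lock=lock,
)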



Table schema:

```sql
CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);
```