issue_comments


14 rows where user = 691772 sorted by updated_at descending


id html_url issue_url node_id user created_at updated_at ▲ author_association body reactions performed_via_github_app issue
1268031159 https://github.com/pydata/xarray/issues/7059#issuecomment-1268031159 https://api.github.com/repos/pydata/xarray/issues/7059 IC_kwDOAMm_X85LlJ63 lumbric 691772 2022-10-05T07:02:23Z 2022-10-05T07:02:48Z CONTRIBUTOR

I agree with just passing all args explicitly.

> Does it work otherwise with "processes"?

What do you mean by that?

> 1. Why are you chunking inside the mapped function?

Uhm yes, you are right, this should be removed; I'm not sure how this happened. Removing `.chunk({"time": None})` in the lambda function does not change the behavior of the example regarding this issue.

> 2. If you conda install flox, the resample operation should be quite efficient, without the need to use map_blocks

Oh wow, thanks! Haven't seen flox before.
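(A hedged note for reference: flox is an optional dependency; after installing it, e.g. with `conda install -c conda-forge flox`, recent xarray versions pick it up automatically for groupby/resample reductions, so no flox-specific code is needed. The sketch below is illustrative and assumes dask and flox are installed.)

```python
import numpy as np
import pandas as pd
import xarray as xr

# A plain chunked resample; xarray accelerates this with flox automatically
# when flox is available -- no map_blocks needed.
da = xr.DataArray(
    np.random.rand(365),
    dims="time",
    coords={"time": pd.date_range("2000-01-01", periods=365)},
).chunk({"time": 100})

annual_mean = da.resample(time="1A").mean()
print(annual_mean.compute())
```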

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  pandas.errors.InvalidIndexError raised when running computation in parallel using dask 1379372915
1254873700 https://github.com/pydata/xarray/issues/7059#issuecomment-1254873700 https://api.github.com/repos/pydata/xarray/issues/7059 IC_kwDOAMm_X85Ky9pk lumbric 691772 2022-09-22T11:09:16Z 2022-09-22T11:09:16Z CONTRIBUTOR

I have managed to reduce the reproduction example (see "Minimal Complete Verifiable Example 2" above) and then also find a proper solution to fix this issue. I am still not sure whether this is a bug or intended behavior, so I won't close the issue for now.

Basically, the issue occurs when a chunked NetCDF file is loaded from disk, passed to `xarray.map_blocks()`, and then used as a parameter in `.sel()` to subset some other xarray object that is not passed to the worker `func()`. I think the proper solution is to use the `args` parameter of `map_blocks()` instead of `.sel()`:

```diff
--- run-broken.py	2022-09-22 13:00:41.095555961 +0200
+++ run.py	2022-09-22 13:01:14.452696511 +0200
@@ -30,17 +30,17 @@
 def resample_annually(data):
     return data.sortby("time").resample(time="1A", label="left", loffset="1D").mean(dim="time")
 
-def worker(data):
-    locations_chunk = locations.sel(locations=data.locations)
-    out_raw = data * locations_chunk
+def worker(data, locations):
+    out_raw = data * locations
     out = resample_annually(out_raw)
     return out
 
 template = resample_annually(data)
 
 out = xr.map_blocks(
-    lambda data: worker(data).compute().chunk({"time": None}),
+    lambda data, locations: worker(data, locations).compute().chunk({"time": None}),
     data,
+    (locations,),
     template=template,
 )
```

This fixes the issue and seems to be the proper solution anyway.
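(For reference, a minimal self-contained sketch of this pattern; the names, shapes, and coordinates are illustrative, and it assumes dask is installed.)

```python
import numpy as np
import xarray as xr

coords = {"locations": np.arange(4)}
data = xr.DataArray(
    np.ones((4, 6)), dims=("locations", "time"), coords=coords
).chunk({"locations": 2})
locations = xr.DataArray(np.arange(4), dims="locations", coords=coords)

def worker(data, locations):
    # map_blocks subsets every xarray object in ``args`` to the matching
    # block, so no .sel() on a captured (shadowed) object is needed.
    return data * locations

out = xr.map_blocks(worker, data, (locations,), template=data)
print(out.compute())
```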

I still don't see why I am not allowed to use `.sel()` on shadowed objects in the worker `func()`. Is this on purpose? If yes, should we add something to the documentation? Is this a specific behavior of `map_blocks()`? Is it related to #6904?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  pandas.errors.InvalidIndexError raised when running computation in parallel using dask 1379372915
1252561840 https://github.com/pydata/xarray/issues/7059#issuecomment-1252561840 https://api.github.com/repos/pydata/xarray/issues/7059 IC_kwDOAMm_X85KqJOw lumbric 691772 2022-09-20T15:54:48Z 2022-09-20T15:54:48Z CONTRIBUTOR

@benbovy thanks for the hint! I tried passing an explicit lock to `xr.open_mfdataset()` as suggested, but it didn't change anything; I still get the same exception. I will double-check that I did it the right way, I might be missing something.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  pandas.errors.InvalidIndexError raised when running computation in parallel using dask 1379372915
1243864752 https://github.com/pydata/xarray/issues/6816#issuecomment-1243864752 https://api.github.com/repos/pydata/xarray/issues/6816 IC_kwDOAMm_X85KI96w lumbric 691772 2022-09-12T14:55:06Z 2022-09-13T09:39:48Z CONTRIBUTOR

Not sure what changed, but now I also get the same error with my small, synthetic test data. This way I was able to debug a bit further. I am pretty sure this is a bug in xarray or pandas.

I think something in `pandas.core.indexes.base.Index` is not thread-safe. At least this seems to be the place of the race condition.

I can create a new ticket if you prefer, but since I am not sure which project it belongs to, I will continue to collect information here. Unfortunately, I have not yet managed to create a minimal example, as this is quite tricky with timing issues.

Additional debugging print and proof of race condition

If I add the following debugging print to the pandas code:

```diff
--- /tmp/base.py	2022-09-12 16:35:53.739971953 +0200
+++ /home/lumbric/.conda/envs/my_project/lib/python3.8/site-packages/pandas/core/indexes/base.py	2022-09-12 16:35:58.864144801 +0200
@@ -3718,7 +3718,6 @@
         self._check_indexing_method(method, limit, tolerance)
 
         if not self._index_as_unique:
+            print("Original: ", len(self), ", length of set:", len(set(self)))
             raise InvalidIndexError(self._requires_unique_msg)
 
         if len(target) == 0:
```

...I get the following output:

```
Original: 3879 , length of set: 3879
```

So the index seems to be unique, but `self.is_unique` is False for some reason (note that `not self._index_as_unique` and `self.is_unique` are the same check in this case).

To confirm that the race condition is at this point, we wait 1s and then check again for uniqueness:

```diff
--- /tmp/base.py	2022-09-12 16:35:53.739971953 +0200
+++ /home/lumbric/.conda/envs/my_project/lib/python3.8/site-packages/pandas/core/indexes/base.py	2022-09-12 16:35:58.864144801 +0200
@@ -3718,7 +3718,10 @@
         self._check_indexing_method(method, limit, tolerance)
 
         if not self._index_as_unique:
+            if not self.is_unique:
+                import time
+                time.sleep(1)
+                print("now unique?", self.is_unique)
             raise InvalidIndexError(self._requires_unique_msg)
```

This outputs:

```
now unique? True
```

Traceback

```
Traceback (most recent call last):
  File "scripts/my_script.py", line 57, in <module>
    main()
  File "scripts/my_script.py", line 48, in main
    my_function(
  File "/home/lumbric/my_project/src/calculations.py", line 136, in my_function
    result = result.compute()
  File "/home/lumbric/.conda/envs/my_project/lib/python3.8/site-packages/xarray/core/dataarray.py", line 947, in compute
    return new.load(**kwargs)
  File "/home/lumbric/.conda/envs/my_project/lib/python3.8/site-packages/xarray/core/dataarray.py", line 921, in load
    ds = self._to_temp_dataset().load(**kwargs)
  File "/home/lumbric/.conda/envs/my_project/lib/python3.8/site-packages/xarray/core/dataset.py", line 861, in load
    evaluated_data = da.compute(*lazy_data.values(), **kwargs)
  File "/home/lumbric/.conda/envs/my_project/lib/python3.8/site-packages/dask/base.py", line 600, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/home/lumbric/.conda/envs/my_project/lib/python3.8/site-packages/dask/threaded.py", line 81, in get
    results = get_async(
  File "/home/lumbric/.conda/envs/my_project/lib/python3.8/site-packages/dask/local.py", line 508, in get_async
    raise_exception(exc, tb)
  File "/home/lumbric/.conda/envs/my_project/lib/python3.8/site-packages/dask/local.py", line 316, in reraise
    raise exc
  File "/home/lumbric/.conda/envs/my_project/lib/python3.8/site-packages/dask/local.py", line 221, in execute_task
    result = _execute_task(task, data)
  File "/home/lumbric/.conda/envs/my_project/lib/python3.8/site-packages/dask/core.py", line 119, in _execute_task
    return func(*(_execute_task(a, cache) for a in args))
  File "/home/lumbric/.conda/envs/my_project/lib/python3.8/site-packages/dask/core.py", line 119, in <genexpr>
    return func(*(_execute_task(a, cache) for a in args))
  File "/home/lumbric/.conda/envs/my_project/lib/python3.8/site-packages/dask/core.py", line 119, in _execute_task
    return func(*(_execute_task(a, cache) for a in args))
  File "/home/lumbric/.conda/envs/my_project/lib/python3.8/site-packages/xarray/core/parallel.py", line 285, in _wrapper
    result = func(*converted_args, **kwargs)
  File "/home/lumbric/some_project/src/calculations.py", line 100, in <lambda>
    lambda input_data: worker(input_data).compute().chunk({"time": None}),
  File "/home/lumbric/some_project/src/calculations.py", line 69, in worker
    raise e
  File "/home/lumbric/some_project/src/calculations.py", line 60, in worker
    out = some_data * some_other_data.sel(some_dimension=input_data.some_dimension)
  File "/home/lumbric/.conda/envs/my_project/lib/python3.8/site-packages/xarray/core/dataarray.py", line 1329, in sel
    ds = self._to_temp_dataset().sel(
  File "/home/lumbric/.conda/envs/my_project/lib/python3.8/site-packages/xarray/core/dataset.py", line 2502, in sel
    pos_indexers, new_indexes = remap_label_indexers(
  File "/home/lumbric/.conda/envs/my_project/lib/python3.8/site-packages/xarray/core/coordinates.py", line 421, in remap_label_indexers
    pos_indexers, new_indexes = indexing.remap_label_indexers(
  File "/home/lumbric/.conda/envs/my_project/lib/python3.8/site-packages/xarray/core/indexing.py", line 121, in remap_label_indexers
    idxr, new_idx = index.query(labels, method=method, tolerance=tolerance)
  File "/home/lumbric/.conda/envs/my_project/lib/python3.8/site-packages/xarray/core/indexes.py", line 245, in query
    indexer = get_indexer_nd(self.index, label, method, tolerance)
  File "/home/lumbric/.conda/envs/my_project/lib/python3.8/site-packages/xarray/core/indexes.py", line 142, in get_indexer_nd
    flat_indexer = index.get_indexer(flat_labels, method=method, tolerance=tolerance)
  File "/home/lumbric/.conda/envs/my_project/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3722, in get_indexer
    raise InvalidIndexError(self._requires_unique_msg)
pandas.errors.InvalidIndexError: Reindexing only valid with uniquely valued Index objects
```

Workaround

The issue does not occur if I use the synchronous dask scheduler by adding this at the very beginning of my script:

```python
dask.config.set(scheduler='single-threaded')
```
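(A hedged usage note: the same setting can also be scoped to a single computation via dask's context-manager form instead of being applied globally. The tiny array below is just a stand-in for the real workload.)

```python
import dask
import numpy as np
import xarray as xr

# A tiny dask-backed array standing in for the real workload:
da = xr.DataArray(np.arange(10), dims="x").chunk({"x": 5})

# Scope the scheduler choice to one compute() call instead of the whole script:
with dask.config.set(scheduler="single-threaded"):
    computed = da.compute()
```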

Environment

```
INSTALLED VERSIONS
------------------
commit: None
python: 3.8.13 | packaged by conda-forge | (default, Mar 25 2022, 06:04:10) [GCC 10.3.0]
python-bits: 64
OS: Linux
OS-release: 5.4.0-124-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.1
libnetcdf: 4.8.1
xarray: 2022.3.0
pandas: 1.4.2
numpy: 1.22.4
scipy: 1.8.1
netCDF4: 1.5.8
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: 1.6.0
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.2.10
cfgrib: None
iris: None
bottleneck: None
dask: 2022.05.2
distributed: 2022.5.2
matplotlib: 3.5.2
cartopy: None
seaborn: 0.11.2
numbagg: None
fsspec: 2022.5.0
cupy: None
pint: None
sparse: None
setuptools: 62.3.2
pip: 22.1.2
conda: 4.12.0
pytest: 7.1.2
IPython: 8.4.0
sphinx: None
```
{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  pandas.errors.InvalidIndexError is raised in some runs when using chunks and map_blocks() 1315111684
1243882465 https://github.com/pydata/xarray/issues/6816#issuecomment-1243882465 https://api.github.com/repos/pydata/xarray/issues/6816 IC_kwDOAMm_X85KJCPh lumbric 691772 2022-09-12T15:07:45Z 2022-09-12T15:07:45Z CONTRIBUTOR

I think these are the values of the index; the values seem to be unique and monotonic.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  pandas.errors.InvalidIndexError is raised in some runs when using chunks and map_blocks() 1315111684
1220519740 https://github.com/pydata/xarray/issues/6816#issuecomment-1220519740 https://api.github.com/repos/pydata/xarray/issues/6816 IC_kwDOAMm_X85Iv6c8 lumbric 691772 2022-08-19T10:33:59Z 2022-08-19T10:33:59Z CONTRIBUTOR

Thanks a lot for your quick reply and your helpful hints!

In the meantime I have verified that `d.c` is unique, i.e. `np.unique(d.c).size == d.c.size`.

Unfortunately, I have not lately been able to reproduce the error often enough to test it with the synchronous scheduler, nor to create a smaller synthetic example that reproduces the problem. One run takes about an hour until the exception occurs (or not), which makes things hard to debug. But I will keep trying and keep this ticket updated.

Any further suggestions very welcome :) Thanks a lot!

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  pandas.errors.InvalidIndexError is raised in some runs when using chunks and map_blocks() 1315111684
1046665303 https://github.com/pydata/xarray/issues/2186#issuecomment-1046665303 https://api.github.com/repos/pydata/xarray/issues/2186 IC_kwDOAMm_X84-YthX lumbric 691772 2022-02-21T09:41:00Z 2022-02-21T09:41:00Z CONTRIBUTOR

I just stumbled across the same issue and created a minimal example similar to @lkilcher's. I am using `xr.open_dataarray()` with chunks and do some simple computation. After that, 800 MB of RAM is used, no matter whether I close the file explicitly, delete the xarray objects, or invoke the Python garbage collector.

What seems to work: do not use the threaded dask scheduler. The issue does not seem to occur with the single-threaded or processes scheduler. Also, setting `MALLOC_MMAP_MAX_=40960` seems to solve the issue, as suggested above (disclaimer: I don't fully understand the details here).

If I understand things correctly, this indicates that the issue is a consequence of dask/dask#3530. I am not sure whether there is anything to be fixed on the xarray side or what the best workaround would be. I will try to use the processes scheduler, as sketched below.

I can create a new (xarray) ticket with all details about the minimal example, if anyone thinks that this might be helpful (to collect workarounds or discuss fixes on the xarray side).
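(For reference, a short sketch of the workarounds described above; it is illustrative and assumes dask is installed.)

```python
import dask
import dask.array as darr

# Route computations away from the threaded scheduler, where the leak shows up;
# "processes" and "single-threaded" both appear to avoid it.
dask.config.set(scheduler="processes")

x = darr.ones((1000, 1000), chunks=(250, 250))
print(x.mean().compute())

# Alternative (glibc malloc tuning, as suggested above), set before starting Python:
#   export MALLOC_MMAP_MAX_=40960
```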

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Memory leak while looping through a Dataset 326533369
510939525 https://github.com/pydata/xarray/issues/2928#issuecomment-510939525 https://api.github.com/repos/pydata/xarray/issues/2928 MDEyOklzc3VlQ29tbWVudDUxMDkzOTUyNQ== lumbric 691772 2019-07-12T15:56:28Z 2019-07-12T15:56:28Z CONTRIBUTOR

Fixed in 714ae8661a829d.

(Sorry for the delay... I actually prepared a PR but never finished it completely, even though it was such a simple thing.)

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Dask outputs warning: "The da.atop function has moved to da.blockwise" 438389323
487759590 https://github.com/pydata/xarray/issues/2928#issuecomment-487759590 https://api.github.com/repos/pydata/xarray/issues/2928 MDEyOklzc3VlQ29tbWVudDQ4Nzc1OTU5MA== lumbric 691772 2019-04-29T22:00:58Z 2019-04-29T22:00:58Z CONTRIBUTOR

> Any interest in putting together a PR?

Yes, I can do so. When writing the report, I actually thought that a PR might be easier to write and to read than the ticket... :) In this case fixing it really shouldn't be a big deal.

Maybe a bit off-topic, but this is the thing I don't really understand and why I wanted to ask first: is there a clear paradigm about compatibility in the pydata universe? Despite its 0.x version number, I guess xarray tries to stay backward compatible regarding its public interface, right? And when are the minimum versions of dependencies increased? Simply when new features of one of the dependent libraries are needed?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Dask outputs warning: "The da.atop function has moved to da.blockwise" 438389323
484239080 https://github.com/pydata/xarray/pull/2904#issuecomment-484239080 https://api.github.com/repos/pydata/xarray/issues/2904 MDEyOklzc3VlQ29tbWVudDQ4NDIzOTA4MA== lumbric 691772 2019-04-17T20:00:22Z 2019-04-17T20:00:22Z CONTRIBUTOR

Ah yes, true! I confused something here: `dict()` accepts mappings, but not everything `dict()` accepts is a mapping. `xr.Dataset()` actually accepts only mappings. That actually makes things a bit easier and much clearer.
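(A tiny illustration of the distinction, in plain Python; the `pairs` variable is just an example value.)

```python
from collections.abc import Mapping

# dict() also accepts an iterable of key/value pairs, which is not a Mapping:
pairs = [("a", 1), ("b", 2)]
print(dict(pairs))                 # {'a': 1, 'b': 2} -- dict() is happy
print(isinstance(pairs, Mapping))  # False -- a stricter Mapping check rejects it
```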

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Minor improvement of docstring for Dataset 434444058
484232306 https://github.com/pydata/xarray/pull/2904#issuecomment-484232306 https://api.github.com/repos/pydata/xarray/issues/2904 MDEyOklzc3VlQ29tbWVudDQ4NDIzMjMwNg== lumbric 691772 2019-04-17T19:39:42Z 2019-04-17T19:39:42Z CONTRIBUTOR

Hm yes, good error messages would be great, but I feel it is widely accepted that error messages in the scientific Python ecosystem are quite often hard to read. Maybe this is a downside of duck typing? I mentioned this only to explain why I was so confused after running `xr.Dataset` for the first time.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Minor improvement of docstring for Dataset 434444058
464338041 https://github.com/pydata/xarray/issues/1346#issuecomment-464338041 https://api.github.com/repos/pydata/xarray/issues/1346 MDEyOklzc3VlQ29tbWVudDQ2NDMzODA0MQ== lumbric 691772 2019-02-16T11:20:20Z 2019-02-16T11:20:20Z CONTRIBUTOR

Oh yes, of course! I've underestimated the low precision of float32 values above 2**24. Thanks for the hint.
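(A minimal sketch of that precision limit, in plain NumPy.)

```python
import numpy as np

# float32 has a 24-bit mantissa, so above 2**24 it cannot represent every
# integer; adding 1.0 to an accumulator that has reached 2**24 changes nothing:
acc = np.float32(2**24)
print(acc + np.float32(1) == acc)  # True

# A naive left-to-right float32 sum of 2**25 ones therefore saturates at 2**24,
# and 2**24 / 2**25 == 0.5, which matches the wrong mean discussed here.
```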

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  bottleneck : Wrong mean for float32 array 218459353
463324373 https://github.com/pydata/xarray/issues/1346#issuecomment-463324373 https://api.github.com/repos/pydata/xarray/issues/1346 MDEyOklzc3VlQ29tbWVudDQ2MzMyNDM3Mw== lumbric 691772 2019-02-13T19:02:52Z 2019-02-16T10:53:51Z CONTRIBUTOR

I think (!) xarray is no longer affected, but pandas is. Bisecting the git history leads to commit 0b9ab2d1, which means that xarray >= v0.10.9 should not be affected. Uninstalling bottleneck is also a valid workaround.

<s>Bottleneck's documentation explicitly mentions that no error is raised in case of an overflow. But it seems to be very evil behavior, so it might be worth reporting upstream.</s> What do you think? (I think kwgoodman/bottleneck#164 is something different, isn't it?) Edit: this is not an overflow; it's a numerical error caused by not applying pairwise summation.

A couple of minimal examples:

```python
>>> import numpy as np
>>> import pandas as pd
>>> import xarray as xr
>>> import bottleneck as bn
>>> bn.nanmean(np.ones(2**25, dtype=np.float32))
0.5
>>> pd.Series(np.ones(2**25, dtype=np.float32)).mean()
0.5
>>> xr.DataArray(np.ones(2**25, dtype=np.float32)).mean()  # not affected in this version
<xarray.DataArray ()>
array(1., dtype=float32)
```

Done with the following versions:

```bash
$ pip3 freeze
Bottleneck==1.2.1
numpy==1.16.1
pandas==0.24.1
xarray==0.11.3
...
```

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  bottleneck : Wrong mean for float32 array 218459353
464016154 https://github.com/pydata/xarray/issues/1346#issuecomment-464016154 https://api.github.com/repos/pydata/xarray/issues/1346 MDEyOklzc3VlQ29tbWVudDQ2NDAxNjE1NA== lumbric 691772 2019-02-15T11:41:36Z 2019-02-15T11:41:36Z CONTRIBUTOR

Oh hm, I think I didn't really understand what happens in `bottleneck.nanmean()`. I understand that integers can overflow and that float32 has varying absolute precision. The float32 maximum of 3.4E+38 is not hit here. So how can the mean of a list of ones be 0.5?

Isn't this what bottleneck is doing? Summing up a bunch of float32 values and then dividing by the length?

```python
>>> d = np.ones(2**25, dtype=np.float32)
>>> d.sum() / np.float32(len(d))
1.0
```

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  bottleneck : Wrong mean for float32 array 218459353


CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);