html_url,issue_url,id,node_id,user,created_at,updated_at,author_association,body,reactions,performed_via_github_app,issue
https://github.com/pydata/xarray/issues/7059#issuecomment-1268031159,https://api.github.com/repos/pydata/xarray/issues/7059,1268031159,IC_kwDOAMm_X85LlJ63,691772,2022-10-05T07:02:23Z,2022-10-05T07:02:48Z,CONTRIBUTOR,"> I agree with just passing all args explicitly. Does it work otherwise with `""processes""`?
What do you mean by that?
> 1. Why are you chunking inside the mapped function?
Uhm yes, you are right, this should be removed; I'm not sure how it got there. Removing `.chunk({""time"": None})` from the lambda function does not change the behavior of the example with respect to this issue.
> 2. If you `conda install flox`, the resample operation should be quite efficient, without the need to use `map_blocks`
Oh wow, thanks! Haven't seen flox before.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1379372915
https://github.com/pydata/xarray/issues/7059#issuecomment-1254873700,https://api.github.com/repos/pydata/xarray/issues/7059,1254873700,IC_kwDOAMm_X85Ky9pk,691772,2022-09-22T11:09:16Z,2022-09-22T11:09:16Z,CONTRIBUTOR,"I have managed to reduce the reproducing example (see ""Minimal Complete Verifiable Example 2"" above) and have also found a proper solution to fix this issue. I am still not sure whether this is a bug or intended behavior, so I won't close the issue for now.
Basically the issue occurs when a chunked NetCDF file is loaded from disk, passed to `xarray.map_blocks()`, and then used in `.sel()` as a parameter to subset some other xarray object that is not passed to the worker `func()`. I think the proper solution is to use the `args` parameter of `map_blocks()` instead of `.sel()`:
```diff
--- run-broken.py 2022-09-22 13:00:41.095555961 +0200
+++ run.py 2022-09-22 13:01:14.452696511 +0200
@@ -30,17 +30,17 @@
     def resample_annually(data):
         return data.sortby(""time"").resample(time=""1A"", label=""left"", loffset=""1D"").mean(dim=""time"")
 
-    def worker(data):
-        locations_chunk = locations.sel(locations=data.locations)
-        out_raw = data * locations_chunk
+    def worker(data, locations):
+        out_raw = data * locations
         out = resample_annually(out_raw)
         return out
 
     template = resample_annually(data)
 
     out = xr.map_blocks(
-        lambda data: worker(data).compute().chunk({""time"": None}),
+        lambda data, locations: worker(data, locations).compute().chunk({""time"": None}),
         data,
+        (locations,),
         template=template,
     )
```
This fixes the issue and seems to be the proper solution anyway.
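For illustration, here is a minimal, self-contained sketch of this pattern (requires dask; the data and names are made up, not from the original script): the second object is passed through the `args` parameter, so `map_blocks()` hands each invocation of `func()` the matching subset, instead of `func()` selecting from a captured variable.
```python
import numpy as np
import xarray as xr

# Toy inputs: 'data' is chunked along 'locations'; 'weights' is a small
# unchunked array sharing that dimension.
data = xr.DataArray(
    np.ones((4, 3)),
    dims=('time', 'locations'),
    coords={'locations': ['a', 'b', 'c']},
).chunk({'locations': 1})
weights = xr.DataArray(
    np.arange(3.0),
    dims='locations',
    coords={'locations': ['a', 'b', 'c']},
)

def worker(data, weights):
    # 'weights' arrives here already subset to this block's locations,
    # so no .sel() on a captured object is needed.
    return data * weights

# xarray subsets every xarray object in 'args' to the current block.
out = xr.map_blocks(worker, data, (weights,), template=data)
print(out.compute())
```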
I still don't see why I am not allowed to use `.sel()` on shadowed objects in the worker `func()`. Is this on purpose? If yes, should we add something to the documentation? Is this a specific behavior of `map_blocks()`? Is it related to #6904?","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1379372915
https://github.com/pydata/xarray/issues/7059#issuecomment-1252561840,https://api.github.com/repos/pydata/xarray/issues/7059,1252561840,IC_kwDOAMm_X85KqJOw,691772,2022-09-20T15:54:48Z,2022-09-20T15:54:48Z,CONTRIBUTOR,"@benbovy thanks for the hint! I tried passing an explicit lock to `xr.open_mfdataset()` [as suggested](https://github.com/pydata/xarray/issues/6904#issuecomment-1210233503), but it didn't change anything; I still get the same exception. I will double-check that I did it the right way, I might be missing something.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1379372915
https://github.com/pydata/xarray/issues/6816#issuecomment-1243864752,https://api.github.com/repos/pydata/xarray/issues/6816,1243864752,IC_kwDOAMm_X85KI96w,691772,2022-09-12T14:55:06Z,2022-09-13T09:39:48Z,CONTRIBUTOR,"Not sure what changed, but now I get the same error with my small, synthetic test data as well. This allowed me to debug a bit further. I am pretty sure this is a bug in xarray or pandas.
I think something in **`pandas.core.indexes.base.Index` is not thread-safe**. At least this seems to be where the race condition occurs.
I can create a new ticket if you prefer, but since I am not sure which project it belongs to, I will continue collecting information here. Unfortunately, I have not yet managed to create a minimal example, as this is quite tricky with timing issues.
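To illustrate the class of bug I suspect, here is a generic sketch (an assumption about the mechanism, not pandas' actual code): if a lazily computed cache is published before it is fully built, a second thread can briefly observe a non-unique answer for an index that is in fact unique.
```python
class ToyIndex:
    # Generic sketch of an unsynchronized, lazily built cache.
    def __init__(self, values):
        self._values = values
        self._seen = None  # built on first access, without any lock

    @property
    def is_unique(self):
        if self._seen is None:
            seen = set()
            self._seen = seen  # published before it is fully built!
            for v in self._values:
                seen.add(v)
        # A second thread can reach this line while the first one is
        # still filling the set and wrongly conclude that the values are
        # not unique (the 'unique, but is_unique is False' symptom).
        return len(self._seen) == len(self._values)
```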
### Additional debugging print and proof of race condition
If I add the following debugging print to the pandas code:
```diff
--- /tmp/base.py 2022-09-12 16:35:53.739971953 +0200
+++ /home/lumbric/.conda/envs/my_project/lib/python3.8/site-packages/pandas/core/indexes/base.py 2022-09-12 16:35:58.864144801 +0200
@@ -3718,7 +3718,8 @@
         self._check_indexing_method(method, limit, tolerance)
 
         if not self._index_as_unique:
+            print(""Original: "", len(self), "", length of set:"", len(set(self)))
             raise InvalidIndexError(self._requires_unique_msg)
 
         if len(target) == 0:
             return np.array([], dtype=np.intp)
```
...I get the following output:
```
Original: 3879 , length of set: 3879
```
So the index seems to be unique, but `self.is_unique` is `False` for some reason (note that `self._index_as_unique` and `self.is_unique` are the same in this case).
To confirm that the race condition is at this point, we wait for one second and then check for uniqueness again:
```diff
--- /tmp/base.py 2022-09-12 16:35:53.739971953 +0200
+++ /home/lumbric/.conda/envs/my_project/lib/python3.8/site-packages/pandas/core/indexes/base.py 2022-09-12 16:35:58.864144801 +0200
@@ -3718,7 +3718,11 @@
         self._check_indexing_method(method, limit, tolerance)
 
         if not self._index_as_unique:
+            if not self.is_unique:
+                import time
+                time.sleep(1)
+                print(""now unique?"", self.is_unique)
             raise InvalidIndexError(self._requires_unique_msg)
 
         if len(target) == 0:
             return np.array([], dtype=np.intp)
```
This outputs:
```
now unique? True
```
### Traceback
```
Traceback (most recent call last):
  File ""scripts/my_script.py"", line 57, in <module>
    main()
  File ""scripts/my_script.py"", line 48, in main
    my_function(
  File ""/home/lumbric/my_project/src/calculations.py"", line 136, in my_function
    result = result.compute()
  File ""/home/lumbric/.conda/envs/my_project/lib/python3.8/site-packages/xarray/core/dataarray.py"", line 947, in compute
    return new.load(**kwargs)
  File ""/home/lumbric/.conda/envs/my_project/lib/python3.8/site-packages/xarray/core/dataarray.py"", line 921, in load
    ds = self._to_temp_dataset().load(**kwargs)
  File ""/home/lumbric/.conda/envs/my_project/lib/python3.8/site-packages/xarray/core/dataset.py"", line 861, in load
    evaluated_data = da.compute(*lazy_data.values(), **kwargs)
  File ""/home/lumbric/.conda/envs/my_project/lib/python3.8/site-packages/dask/base.py"", line 600, in compute
    results = schedule(dsk, keys, **kwargs)
  File ""/home/lumbric/.conda/envs/my_project/lib/python3.8/site-packages/dask/threaded.py"", line 81, in get
    results = get_async(
  File ""/home/lumbric/.conda/envs/my_project/lib/python3.8/site-packages/dask/local.py"", line 508, in get_async
    raise_exception(exc, tb)
  File ""/home/lumbric/.conda/envs/my_project/lib/python3.8/site-packages/dask/local.py"", line 316, in reraise
    raise exc
  File ""/home/lumbric/.conda/envs/my_project/lib/python3.8/site-packages/dask/local.py"", line 221, in execute_task
    result = _execute_task(task, data)
  File ""/home/lumbric/.conda/envs/my_project/lib/python3.8/site-packages/dask/core.py"", line 119, in _execute_task
    return func(*(_execute_task(a, cache) for a in args))
  File ""/home/lumbric/.conda/envs/my_project/lib/python3.8/site-packages/dask/core.py"", line 119, in <genexpr>
    return func(*(_execute_task(a, cache) for a in args))
  File ""/home/lumbric/.conda/envs/my_project/lib/python3.8/site-packages/dask/core.py"", line 119, in _execute_task
    return func(*(_execute_task(a, cache) for a in args))
  File ""/home/lumbric/.conda/envs/my_project/lib/python3.8/site-packages/xarray/core/parallel.py"", line 285, in _wrapper
    result = func(*converted_args, **kwargs)
  File ""/home/lumbric/some_project/src/calculations.py"", line 100, in <lambda>
    lambda input_data: worker(input_data).compute().chunk({""time"": None}),
  File ""/home/lumbric/some_project/src/calculations.py"", line 69, in worker
    raise e
  File ""/home/lumbric/some_project/src/calculations.py"", line 60, in worker
    out = some_data * some_other_data.sel(some_dimension=input_data.some_dimension)
  File ""/home/lumbric/.conda/envs/my_project/lib/python3.8/site-packages/xarray/core/dataarray.py"", line 1329, in sel
    ds = self._to_temp_dataset().sel(
  File ""/home/lumbric/.conda/envs/my_project/lib/python3.8/site-packages/xarray/core/dataset.py"", line 2502, in sel
    pos_indexers, new_indexes = remap_label_indexers(
  File ""/home/lumbric/.conda/envs/my_project/lib/python3.8/site-packages/xarray/core/coordinates.py"", line 421, in remap_label_indexers
    pos_indexers, new_indexes = indexing.remap_label_indexers(
  File ""/home/lumbric/.conda/envs/my_project/lib/python3.8/site-packages/xarray/core/indexing.py"", line 121, in remap_label_indexers
    idxr, new_idx = index.query(labels, method=method, tolerance=tolerance)
  File ""/home/lumbric/.conda/envs/my_project/lib/python3.8/site-packages/xarray/core/indexes.py"", line 245, in query
    indexer = get_indexer_nd(self.index, label, method, tolerance)
  File ""/home/lumbric/.conda/envs/my_project/lib/python3.8/site-packages/xarray/core/indexes.py"", line 142, in get_indexer_nd
    flat_indexer = index.get_indexer(flat_labels, method=method, tolerance=tolerance)
  File ""/home/lumbric/.conda/envs/my_project/lib/python3.8/site-packages/pandas/core/indexes/base.py"", line 3722, in get_indexer
    raise InvalidIndexError(self._requires_unique_msg)
pandas.errors.InvalidIndexError: Reindexing only valid with uniquely valued Index objects
```
### Workaround
The issue does not occur if I use the synchronous dask scheduler by adding the following at the very beginning of my script:
```python
import dask

dask.config.set(scheduler='single-threaded')
```
### Environment
INSTALLED VERSIONS
------------------
commit: None
python: 3.8.13 | packaged by conda-forge | (default, Mar 25 2022, 06:04:10)
[GCC 10.3.0]
python-bits: 64
OS: Linux
OS-release: 5.4.0-124-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.1
libnetcdf: 4.8.1
xarray: 2022.3.0
pandas: 1.4.2
numpy: 1.22.4
scipy: 1.8.1
netCDF4: 1.5.8
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: 1.6.0
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.2.10
cfgrib: None
iris: None
bottleneck: None
dask: 2022.05.2
distributed: 2022.5.2
matplotlib: 3.5.2
cartopy: None
seaborn: 0.11.2
numbagg: None
fsspec: 2022.5.0
cupy: None
pint: None
sparse: None
setuptools: 62.3.2
pip: 22.1.2
conda: 4.12.0
pytest: 7.1.2
IPython: 8.4.0
sphinx: None
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1315111684
https://github.com/pydata/xarray/issues/6816#issuecomment-1243882465,https://api.github.com/repos/pydata/xarray/issues/6816,1243882465,IC_kwDOAMm_X85KJCPh,691772,2022-09-12T15:07:45Z,2022-09-12T15:07:45Z,CONTRIBUTOR,"I think [these are the values](https://gist.github.com/lumbric/c100299d7ba4470c4d21bdabdd6a689f) of the index; they seem to be unique and monotonic.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1315111684
https://github.com/pydata/xarray/issues/6816#issuecomment-1220519740,https://api.github.com/repos/pydata/xarray/issues/6816,1220519740,IC_kwDOAMm_X85Iv6c8,691772,2022-08-19T10:33:59Z,2022-08-19T10:33:59Z,CONTRIBUTOR,"Thanks a lot for your quick reply and your helpful hints!
In the meantime I have verified that `d.c` is unique, i.e. `np.unique(d.c).size == d.c.size`.
Unfortunately, I have not been able to reproduce the error often enough lately to test it with the synchronous scheduler or to create a smaller synthetic example that reproduces the problem. One run takes about an hour until the exception occurs (or doesn't), which makes things hard to debug. But I will keep trying and keep this ticket updated.
Any further suggestions are very welcome :) Thanks a lot!","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1315111684
https://github.com/pydata/xarray/issues/2186#issuecomment-1046665303,https://api.github.com/repos/pydata/xarray/issues/2186,1046665303,IC_kwDOAMm_X84-YthX,691772,2022-02-21T09:41:00Z,2022-02-21T09:41:00Z,CONTRIBUTOR,"I just stumbled across the same issue and created a minimal example similar to @lkilcher's. I am using `xr.open_dataarray()` with chunks and do some simple computation. After that, 800 MB of RAM remains in use, no matter whether I close the file explicitly, delete the xarray objects, or invoke the Python garbage collector.
What seems to work: do not use the `threading` Dask scheduler. The issue does not seem to occur with the single-threaded or processes scheduler. Setting `MALLOC_MMAP_MAX_=40960` also seems to solve the issue, as suggested above (disclaimer: I don't fully understand the details here).
If I understand things correctly, this indicates that the issue is a consequence of dask/dask#3530. I am not sure whether there is anything to be fixed on the xarray side or what the best workaround would be. I will try to use the processes scheduler; a sketch of both workarounds is below.
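For reference, a minimal sketch of the two workarounds (assumptions: the scheduler names are the standard dask ones, and `MALLOC_MMAP_MAX_` is the glibc malloc tunable suggested above, which glibc only reads at process startup):
```python
import dask

# Workaround 1: avoid the threaded scheduler.
dask.config.set(scheduler='processes')  # or scheduler='single-threaded'

# Workaround 2: tune glibc malloc. The environment variable is read at
# process startup, so it has to be set before Python launches, e.g.:
#   $ MALLOC_MMAP_MAX_=40960 python my_script.py
```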
I can create a new (xarray) ticket with all the details of the minimal example if anyone thinks that might be helpful (to collect workarounds or discuss fixes on the xarray side).","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,326533369
https://github.com/pydata/xarray/issues/2928#issuecomment-510939525,https://api.github.com/repos/pydata/xarray/issues/2928,510939525,MDEyOklzc3VlQ29tbWVudDUxMDkzOTUyNQ==,691772,2019-07-12T15:56:28Z,2019-07-12T15:56:28Z,CONTRIBUTOR,"Fixed in 714ae8661a829d.
(Sorry for the delay... I actually prepared a PR but never quite finished it, even though it was such a simple thing.)","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,438389323
https://github.com/pydata/xarray/issues/2928#issuecomment-487759590,https://api.github.com/repos/pydata/xarray/issues/2928,487759590,MDEyOklzc3VlQ29tbWVudDQ4Nzc1OTU5MA==,691772,2019-04-29T22:00:58Z,2019-04-29T22:00:58Z,CONTRIBUTOR,"> Any interest in putting together a PR?
Yes, I can do so. When writing the report, I actually thought a PR might be easier to write and to read than the ticket... :) In this case it really shouldn't be a big deal to fix.
Maybe a bit off-topic: the thing I don't really understand, and why I wanted to ask first: is there a clear paradigm about compatibility in the pydata universe? Despite its 0.x version number, I guess xarray tries to stay backward compatible regarding its public interface, right? And when are the minimum versions of dependencies increased? Simply when new features of a dependency are needed?","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,438389323
https://github.com/pydata/xarray/pull/2904#issuecomment-484239080,https://api.github.com/repos/pydata/xarray/issues/2904,484239080,MDEyOklzc3VlQ29tbWVudDQ4NDIzOTA4MA==,691772,2019-04-17T20:00:22Z,2019-04-17T20:00:22Z,CONTRIBUTOR,"Ah yes, true! I've confused something here. `dict()` accepts mappings, but not everything `dict()` accepts is a mapping. `xr.Dataset()` actually accepts only mappings. That actually makes things a bit easier and much clearer.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,434444058
https://github.com/pydata/xarray/pull/2904#issuecomment-484232306,https://api.github.com/repos/pydata/xarray/issues/2904,484232306,MDEyOklzc3VlQ29tbWVudDQ4NDIzMjMwNg==,691772,2019-04-17T19:39:42Z,2019-04-17T19:39:42Z,CONTRIBUTOR,"Hm yes, good error messages would be great, but I feel like it is widely accepted that error messages in the scientific Python ecosystem are quite often hard to read. Maybe this is the downside of duck typing? I mentioned this only to explain why I was so confused after running `xr.Dataset` for the first time.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,434444058
https://github.com/pydata/xarray/issues/1346#issuecomment-464338041,https://api.github.com/repos/pydata/xarray/issues/1346,464338041,MDEyOklzc3VlQ29tbWVudDQ2NDMzODA0MQ==,691772,2019-02-16T11:20:20Z,2019-02-16T11:20:20Z,CONTRIBUTOR,"Oh yes, of course! I had underestimated how low the precision of float32 is above 2**24. Thanks for the hint.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,218459353
https://github.com/pydata/xarray/issues/1346#issuecomment-463324373,https://api.github.com/repos/pydata/xarray/issues/1346,463324373,MDEyOklzc3VlQ29tbWVudDQ2MzMyNDM3Mw==,691772,2019-02-13T19:02:52Z,2019-02-16T10:53:51Z,CONTRIBUTOR,"I think (!) xarray is no longer affected, but pandas is. Bisecting the Git history leads to commit 0b9ab2d1, which means that xarray >= v0.10.9 should not be affected. Uninstalling bottleneck is also a valid workaround.
Bottleneck's documentation explicitly mentions that [no error is raised in case of an overflow](https://kwgoodman.github.io/bottleneck-doc/reference.html?highlight=overflow#bottleneck.nanmean). But it seems to be very evil behavior, so it might be worth reporting upstream. What do you think? (I think kwgoodman/bottleneck#164 is about something different, isn't it?)
**Edit:** this is not an overflow. It's a numerical error caused by summing naively instead of using [pairwise summation](https://en.wikipedia.org/wiki/Pairwise_summation).
A couple of minimal examples:
```python
>>> import numpy as np
>>> import pandas as pd
>>> import xarray as xr
>>> import bottleneck as bn
>>> bn.nanmean(np.ones(2**25, dtype=np.float32))
0.5
>>> pd.Series(np.ones(2**25, dtype=np.float32)).mean()
0.5
>>> xr.DataArray(np.ones(2**25, dtype=np.float32)).mean() # not affected for this version
array(1., dtype=float32)
```
Done with the following versions:
```bash
$ pip3 freeze
Bottleneck==1.2.1
numpy==1.16.1
pandas==0.24.1
xarray==0.11.3
...
```","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,218459353
https://github.com/pydata/xarray/issues/1346#issuecomment-464016154,https://api.github.com/repos/pydata/xarray/issues/1346,464016154,MDEyOklzc3VlQ29tbWVudDQ2NDAxNjE1NA==,691772,2019-02-15T11:41:36Z,2019-02-15T11:41:36Z,CONTRIBUTOR,"Oh hm, I think I didn't really understand what happens in `bottleneck.nanmean()`. I understand that integers can overflow and that float32 has varying absolute precision. The float32 maximum of about 3.4E+38 is not hit here. So how can the mean of a list of ones be 0.5?
Isn't this what bottleneck is doing? Summing up a bunch of float32 values and then dividing by the length?
```
>>> d = np.ones(2**25, dtype=np.float32)
>>> d.sum()/np.float32(len(d))
1.0
```","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,218459353