html_url,issue_url,id,node_id,user,created_at,updated_at,author_association,body,reactions,performed_via_github_app,issue
https://github.com/pydata/xarray/issues/7059#issuecomment-1268031159,https://api.github.com/repos/pydata/xarray/issues/7059,1268031159,IC_kwDOAMm_X85LlJ63,691772,2022-10-05T07:02:23Z,2022-10-05T07:02:48Z,CONTRIBUTOR,"> I agree with just passing all args explicitly. Does it work otherwise with `""processes""`? What do you mean by that? > 1. Why are you chunking inside the mapped function? Uhm yes, you are right, this should be removed; not sure how this happened. Removing `.chunk({""time"": None})` in the lambda function does not change the behavior of the example regarding this issue. > 2. If you `conda install flox`, the resample operation should be quite efficient, without the need to use `map_blocks` Oh wow, thanks! Haven't seen flox before.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1379372915
https://github.com/pydata/xarray/issues/7059#issuecomment-1254873700,https://api.github.com/repos/pydata/xarray/issues/7059,1254873700,IC_kwDOAMm_X85Ky9pk,691772,2022-09-22T11:09:16Z,2022-09-22T11:09:16Z,CONTRIBUTOR,"I have managed to reduce the reproducing example (see ""Minimal Complete Verifiable Example 2"" above) and then also to find a proper solution to this issue. I am still not sure whether this is a bug or intended behavior, so I won't close the issue for now. Basically the issue occurs when a chunked NetCDF file is loaded from disk, passed to `xarray.map_blocks()` and is then used in `.sel()` as a parameter to get a subset of some other xarray object which is not passed to the worker `func()`. 
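To make the failing pattern easier to see, here is a rough, hypothetical sketch of its shape. The names and the in-memory stand-in data are made up (my real data comes from NetCDF files on disk), so this snippet alone does not reproduce the crash:

```python
import numpy as np
import xarray as xr

coords = {'locations': np.arange(4)}
# Stand-ins for the chunked NetCDF data and the outer xarray object.
data = xr.DataArray(np.ones((4, 6)), dims=('locations', 'time'),
                    coords=coords).chunk({'locations': 2})
locations = xr.DataArray(np.arange(4), dims='locations', coords=coords)

def worker(block):
    # `locations` is captured from the enclosing scope and subset with
    # .sel() inside the worker -- this is the pattern that fails for me
    # when the data actually comes from disk.
    return block * locations.sel(locations=block.locations)

out = xr.map_blocks(worker, data).compute()
print(out.shape)  # (4, 6)
```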
I think the proper solution is to use the `args` parameter of `map_blocks()` instead of `.sel()`: ``` --- run-broken.py 2022-09-22 13:00:41.095555961 +0200 +++ run.py 2022-09-22 13:01:14.452696511 +0200 @@ -30,17 +30,17 @@ def resample_annually(data): return data.sortby(""time"").resample(time=""1A"", label=""left"", loffset=""1D"").mean(dim=""time"") - def worker(data): - locations_chunk = locations.sel(locations=data.locations) - out_raw = data * locations_chunk + def worker(data, locations): + out_raw = data * locations out = resample_annually(out_raw) return out template = resample_annually(data) out = xr.map_blocks( - lambda data: worker(data).compute().chunk({""time"": None}), + lambda data, locations: worker(data, locations).compute().chunk({""time"": None}), data, + (locations,), template=template, ) ``` This seems to fix the issue and is probably the proper solution anyway. I still don't see why I am not allowed to use `.sel()` on shadowed objects in the worker `func()`. Is this on purpose? If yes, should we add something to the documentation? Is this a specific behavior of `map_blocks()`? Is it related to #6904?","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1379372915
https://github.com/pydata/xarray/issues/7059#issuecomment-1252561840,https://api.github.com/repos/pydata/xarray/issues/7059,1252561840,IC_kwDOAMm_X85KqJOw,691772,2022-09-20T15:54:48Z,2022-09-20T15:54:48Z,CONTRIBUTOR,"@benbovy thanks for the hint! I tried passing an explicit lock to `xr.open_mfdataset()` [as suggested](https://github.com/pydata/xarray/issues/6904#issuecomment-1210233503), but it didn't change anything, still the same exception. 
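For reference, what I tried looks roughly like the following sketch. The file pattern is made up, and whether `lock=` is still the right keyword may depend on the xarray version, so treat this as an assumption rather than a recipe:

```python
import pickle
from dask.utils import SerializableLock

# A single explicit lock, as suggested in issue 6904. SerializableLock
# survives pickling (unlike a plain threading.Lock), so the same lock
# can be shipped to dask workers together with the tasks.
lock = SerializableLock()
# Hypothetical call -- file pattern and keyword are assumptions:
# ds = xr.open_mfdataset('input_*.nc', lock=lock)

restored = pickle.loads(pickle.dumps(lock))
print(type(restored).__name__)  # SerializableLock
```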
I will double check whether I did it the right way; I might be missing something.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1379372915
https://github.com/pydata/xarray/issues/6816#issuecomment-1243864752,https://api.github.com/repos/pydata/xarray/issues/6816,1243864752,IC_kwDOAMm_X85KI96w,691772,2022-09-12T14:55:06Z,2022-09-13T09:39:48Z,CONTRIBUTOR,"Not sure what changed, but now I get the same error with my small, synthetic test data as well. This way I was able to debug a bit further. I am pretty sure this is a bug in xarray or pandas. I think something in **`pandas.core.indexes.base.Index` is not thread-safe**. At least this seems to be the place of the race condition. I can create a new ticket if you prefer, but since I am not sure in which project, I will continue to collect information here. Unfortunately I have not yet managed to create a minimal example, as this is quite tricky with timing issues. ### Additional debugging print and proof of race condition If I add the following debugging print to the pandas code: ``` --- /tmp/base.py 2022-09-12 16:35:53.739971953 +0200 +++ /home/lumbric/.conda/envs/my_project/lib/python3.8/site-packages/pandas/core/indexes/base.py 2022-09-12 16:35:58.864144801 +0200 @@ -3718,7 +3718,6 @@ self._check_indexing_method(method, limit, tolerance) if not self._index_as_unique: + print(""Original: "", len(self), "", length of set:"", len(set(self))) raise InvalidIndexError(self._requires_unique_msg) if len(target) == 0 ``` ...I get the following output: ``` Original: 3879 , length of set: 3879 ``` So the index seems to be unique, but `self.is_unique` is `False` for some reason (note that `not self._index_as_unique` and `self.is_unique` are the same in this case). 
To confirm that the race condition is at this point, we wait for 1s and then check again for uniqueness: ``` --- /tmp/base.py 2022-09-12 16:35:53.739971953 +0200 +++ /home/lumbric/.conda/envs/my_project/lib/python3.8/site-packages/pandas/core/indexes/base.py 2022-09-12 16:35:58.864144801 +0200 @@ -3718,7 +3718,10 @@ self._check_indexing_method(method, limit, tolerance) if not self._index_as_unique: + if not self.is_unique: + import time + time.sleep(1) + print(""now unique?"", self.is_unique) raise InvalidIndexError(self._requires_unique_msg) ``` This outputs: ``` now unique? True ``` ### Traceback ``` Traceback (most recent call last): File ""scripts/my_script.py"", line 57, in <module> main() File ""scripts/my_script.py"", line 48, in main my_function( File ""/home/lumbric/my_project/src/calculations.py"", line 136, in my_function result = result.compute() File ""/home/lumbric/.conda/envs/my_project/lib/python3.8/site-packages/xarray/core/dataarray.py"", line 947, in compute return new.load(**kwargs) File ""/home/lumbric/.conda/envs/my_project/lib/python3.8/site-packages/xarray/core/dataarray.py"", line 921, in load ds = self._to_temp_dataset().load(**kwargs) File ""/home/lumbric/.conda/envs/my_project/lib/python3.8/site-packages/xarray/core/dataset.py"", line 861, in load evaluated_data = da.compute(*lazy_data.values(), **kwargs) File ""/home/lumbric/.conda/envs/my_project/lib/python3.8/site-packages/dask/base.py"", line 600, in compute results = schedule(dsk, keys, **kwargs) File ""/home/lumbric/.conda/envs/my_project/lib/python3.8/site-packages/dask/threaded.py"", line 81, in get results = get_async( File ""/home/lumbric/.conda/envs/my_project/lib/python3.8/site-packages/dask/local.py"", line 508, in get_async raise_exception(exc, tb) File ""/home/lumbric/.conda/envs/my_project/lib/python3.8/site-packages/dask/local.py"", line 316, in reraise raise exc File ""/home/lumbric/.conda/envs/my_project/lib/python3.8/site-packages/dask/local.py"", line 221, in execute_task 
result = _execute_task(task, data) File ""/home/lumbric/.conda/envs/my_project/lib/python3.8/site-packages/dask/core.py"", line 119, in _execute_task return func(*(_execute_task(a, cache) for a in args)) File ""/home/lumbric/.conda/envs/my_project/lib/python3.8/site-packages/dask/core.py"", line 119, in <genexpr> return func(*(_execute_task(a, cache) for a in args)) File ""/home/lumbric/.conda/envs/my_project/lib/python3.8/site-packages/dask/core.py"", line 119, in _execute_task return func(*(_execute_task(a, cache) for a in args)) File ""/home/lumbric/.conda/envs/my_project/lib/python3.8/site-packages/xarray/core/parallel.py"", line 285, in _wrapper result = func(*converted_args, **kwargs) File ""/home/lumbric/some_project/src/calculations.py"", line 100, in <lambda> lambda input_data: worker(input_data).compute().chunk({""time"": None}), File ""/home/lumbric/some_project/src/calculations.py"", line 69, in worker raise e File ""/home/lumbric/some_project/src/calculations.py"", line 60, in worker out = some_data * some_other_data.sel(some_dimension=input_data.some_dimension) File ""/home/lumbric/.conda/envs/my_project/lib/python3.8/site-packages/xarray/core/dataarray.py"", line 1329, in sel ds = self._to_temp_dataset().sel( File ""/home/lumbric/.conda/envs/my_project/lib/python3.8/site-packages/xarray/core/dataset.py"", line 2502, in sel pos_indexers, new_indexes = remap_label_indexers( File ""/home/lumbric/.conda/envs/my_project/lib/python3.8/site-packages/xarray/core/coordinates.py"", line 421, in remap_label_indexers pos_indexers, new_indexes = indexing.remap_label_indexers( File ""/home/lumbric/.conda/envs/my_project/lib/python3.8/site-packages/xarray/core/indexing.py"", line 121, in remap_label_indexers idxr, new_idx = index.query(labels, method=method, tolerance=tolerance) File ""/home/lumbric/.conda/envs/my_project/lib/python3.8/site-packages/xarray/core/indexes.py"", line 245, in query indexer = get_indexer_nd(self.index, label, method, tolerance) File 
""/home/lumbric/.conda/envs/my_project/lib/python3.8/site-packages/xarray/core/indexes.py"", line 142, in get_indexer_nd flat_indexer = index.get_indexer(flat_labels, method=method, tolerance=tolerance) File ""/home/lumbric/.conda/envs/my_project/lib/python3.8/site-packages/pandas/core/indexes/base.py"", line 3722, in get_indexer raise InvalidIndexError(self._requires_unique_msg) pandas.errors.InvalidIndexError: Reindexing only valid with uniquely valued Index objects ``` ### Workaround The issue does not occur if I use the synchronous dask scheduler by adding at the very beginning of my script: ```dask.config.set(scheduler='single-threaded')``` ### Environment
INSTALLED VERSIONS ------------------ commit: None python: 3.8.13 | packaged by conda-forge | (default, Mar 25 2022, 06:04:10) [GCC 10.3.0] python-bits: 64 OS: Linux OS-release: 5.4.0-124-generic machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: ('en_US', 'UTF-8') libhdf5: 1.12.1 libnetcdf: 4.8.1 xarray: 2022.3.0 pandas: 1.4.2 numpy: 1.22.4 scipy: 1.8.1 netCDF4: 1.5.8 pydap: None h5netcdf: None h5py: None Nio: None zarr: None cftime: 1.6.0 nc_time_axis: None PseudoNetCDF: None rasterio: 1.2.10 cfgrib: None iris: None bottleneck: None dask: 2022.05.2 distributed: 2022.5.2 matplotlib: 3.5.2 cartopy: None seaborn: 0.11.2 numbagg: None fsspec: 2022.5.0 cupy: None pint: None sparse: None setuptools: 62.3.2 pip: 22.1.2 conda: 4.12.0 pytest: 7.1.2 IPython: 8.4.0 sphinx: None
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1315111684 https://github.com/pydata/xarray/issues/6816#issuecomment-1243882465,https://api.github.com/repos/pydata/xarray/issues/6816,1243882465,IC_kwDOAMm_X85KJCPh,691772,2022-09-12T15:07:45Z,2022-09-12T15:07:45Z,CONTRIBUTOR,"I think [these are the values](https://gist.github.com/lumbric/c100299d7ba4470c4d21bdabdd6a689f) of the index, the values seem to be unique and monotonic.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1315111684 https://github.com/pydata/xarray/issues/6816#issuecomment-1220519740,https://api.github.com/repos/pydata/xarray/issues/6816,1220519740,IC_kwDOAMm_X85Iv6c8,691772,2022-08-19T10:33:59Z,2022-08-19T10:33:59Z,CONTRIBUTOR,"Thanks a lot for your quick reply and your helpful hints! In the meantime I have verified that: `d.c` is unique, i.e. `np.unqiue(d.c).size == d.c.size` Unfortunately I was not able to reproduce the error often enough lately to test it with the synchronous scheduler nor to create a smaller synthetic example which reproduces the problem. One run takes about an hour until the exception occurs (or not), which makes things hard to debug. But I will continue trying and keep this ticket updated. Any further suggestions very welcome :) Thanks a lot!","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1315111684 https://github.com/pydata/xarray/issues/2186#issuecomment-1046665303,https://api.github.com/repos/pydata/xarray/issues/2186,1046665303,IC_kwDOAMm_X84-YthX,691772,2022-02-21T09:41:00Z,2022-02-21T09:41:00Z,CONTRIBUTOR,"I just stumbled across the same issue and created a minimal example similar to @lkilcher. I am using `xr.open_dataarray()` with chunks and do some simple computation. 
After that, 800 MB of RAM is used, no matter whether I close the file explicitly, delete the xarray objects or invoke the Python garbage collector. What seems to work: do not use the `threading` Dask scheduler. The issue does not seem to occur with the single-threaded or processes scheduler. Also setting `MALLOC_MMAP_MAX_=40960` seems to solve the issue as suggested above (disclaimer: I don't fully understand the details here). If I understand things correctly, this indicates that the issue is a consequence of dask/dask#3530. Not sure if there is anything to be fixed on the xarray side or what would be the best workaround. I will try to use the processes scheduler. I can create a new (xarray) ticket with all details about the minimal example, if anyone thinks that this might be helpful (to collect workarounds or discuss fixes on the xarray side).","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,326533369
https://github.com/pydata/xarray/issues/2928#issuecomment-510939525,https://api.github.com/repos/pydata/xarray/issues/2928,510939525,MDEyOklzc3VlQ29tbWVudDUxMDkzOTUyNQ==,691772,2019-07-12T15:56:28Z,2019-07-12T15:56:28Z,CONTRIBUTOR,"Fixed in 714ae8661a829d. (Sorry for the delay... I actually prepared a PR but never finished it completely, even though it was such a simple thing.)","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,438389323
https://github.com/pydata/xarray/issues/2928#issuecomment-487759590,https://api.github.com/repos/pydata/xarray/issues/2928,487759590,MDEyOklzc3VlQ29tbWVudDQ4Nzc1OTU5MA==,691772,2019-04-29T22:00:58Z,2019-04-29T22:00:58Z,CONTRIBUTOR,"> Any interest in putting together a PR? Yes, I can do so. When writing the report, I actually thought that preparing a PR might be easier to write and to read than the ticket... :) In this case it really shouldn't be a big deal to fix it. 
Maybe a bit off-topic: the thing I don't really understand and why I wanted to ask first: is there a clear paradigm about compatibility in the pydata universe? Despite its 0.x version number, I guess xarray tries to stay backward compatible regarding its public interface, right? When are the versions of dependencies increased? Simply motivated by the need for new features in one of the dependent libraries?","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,438389323
https://github.com/pydata/xarray/pull/2904#issuecomment-484239080,https://api.github.com/repos/pydata/xarray/issues/2904,484239080,MDEyOklzc3VlQ29tbWVudDQ4NDIzOTA4MA==,691772,2019-04-17T20:00:22Z,2019-04-17T20:00:22Z,CONTRIBUTOR,"Ah yes, true! I've confused something here. `dict()` accepts mappings, but not everything `dict()` accepts is a mapping. `xr.Dataset()` actually accepts only mappings. That actually makes things a bit easier and much clearer.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,434444058
https://github.com/pydata/xarray/pull/2904#issuecomment-484232306,https://api.github.com/repos/pydata/xarray/issues/2904,484232306,MDEyOklzc3VlQ29tbWVudDQ4NDIzMjMwNg==,691772,2019-04-17T19:39:42Z,2019-04-17T19:39:42Z,CONTRIBUTOR,"Hm yes, good error messages would be great, but I feel it is widely accepted that error messages in the scientific Python ecosystem are often hard to read. Maybe this is the downside of duck typing? 
I've mentioned this only as an explanation of why I was so confused after running `xr.Dataset` for the first time.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,434444058
https://github.com/pydata/xarray/issues/1346#issuecomment-464338041,https://api.github.com/repos/pydata/xarray/issues/1346,464338041,MDEyOklzc3VlQ29tbWVudDQ2NDMzODA0MQ==,691772,2019-02-16T11:20:20Z,2019-02-16T11:20:20Z,CONTRIBUTOR,"Oh yes, of course! I've underestimated the low precision of float32 values above 2**24. Thanks for the hint.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,218459353
https://github.com/pydata/xarray/issues/1346#issuecomment-463324373,https://api.github.com/repos/pydata/xarray/issues/1346,463324373,MDEyOklzc3VlQ29tbWVudDQ2MzMyNDM3Mw==,691772,2019-02-13T19:02:52Z,2019-02-16T10:53:51Z,CONTRIBUTOR,"I think (!) xarray is no longer affected, but pandas is. Bisecting the Git history leads to commit 0b9ab2d1, which means that xarray >= v0.10.9 should not be affected. Uninstalling bottleneck is also a valid workaround. Bottleneck's documentation explicitly mentions that [no error is raised in case of an overflow](https://kwgoodman.github.io/bottleneck-doc/reference.html?highlight=overflow#bottleneck.nanmean). But it seems to be very evil behavior, so it might be worth reporting upstream. What do you think? (I think kwgoodman/bottleneck#164 is something different, isn't it?) **Edit:** this is not an overflow. It's a numerical error caused by not applying [pairwise summation](https://en.wikipedia.org/wiki/Pairwise_summation). 
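A quick demonstration of the effect, assuming only NumPy: `cumsum` accumulates strictly left to right, while `sum` reduces pairwise.

```python
import numpy as np

# At magnitude 2**24 the spacing between adjacent float32 values is 2,
# so a running float32 sum of ones stalls there: adding 1.0 rounds away.
ones = np.ones(2**25, dtype=np.float32)
naive = ones.cumsum()[-1]    # sequential accumulation
pairwise = ones.sum()        # NumPy uses pairwise summation
print(naive, pairwise)       # 16777216.0 33554432.0
# 16777216 / 2**25 == 0.5 -- exactly the wrong mean shown below.
```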
A couple of minimal examples: ```python >>> import numpy as np >>> import pandas as pd >>> import xarray as xr >>> import bottleneck as bn >>> bn.nanmean(np.ones(2**25, dtype=np.float32)) 0.5 >>> pd.Series(np.ones(2**25, dtype=np.float32)).mean() 0.5 >>> xr.DataArray(np.ones(2**25, dtype=np.float32)).mean() # not affected for this version array(1., dtype=float32) ``` Done with the following versions: ```bash $ pip3 freeze Bottleneck==1.2.1 numpy==1.16.1 pandas==0.24.1 xarray==0.11.3 ... ```","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,218459353
https://github.com/pydata/xarray/issues/1346#issuecomment-464016154,https://api.github.com/repos/pydata/xarray/issues/1346,464016154,MDEyOklzc3VlQ29tbWVudDQ2NDAxNjE1NA==,691772,2019-02-15T11:41:36Z,2019-02-15T11:41:36Z,CONTRIBUTOR,"Oh hm, I think I didn't really understand what happens in `bottleneck.nanmean()`. I understand that integers can overflow and that float32 values have varying absolute precision. The float32 maximum of 3.4E+38 is not hit here. So how can the mean of a list of ones be 0.5? Isn't this what bottleneck is doing? Summing up a bunch of float32 values and then dividing by the length? ``` >>> d = np.ones(2**25, dtype=np.float32) >>> d.sum()/np.float32(len(d)) 1.0 ```","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,218459353