issue_comments


14 rows where user = 691772 sorted by updated_at descending


id html_url issue_url node_id user created_at updated_at ▲ author_association body reactions performed_via_github_app issue
1268031159 https://github.com/pydata/xarray/issues/7059#issuecomment-1268031159 https://api.github.com/repos/pydata/xarray/issues/7059 IC_kwDOAMm_X85LlJ63 lumbric 691772 2022-10-05T07:02:23Z 2022-10-05T07:02:48Z CONTRIBUTOR

I agree with just passing all args explicitly.

> Does it work otherwise with "processes"?

What do you mean by that?

> 1. Why are you chunking inside the mapped function?

Uhm yes, you are right, this should be removed; I'm not sure how this happened. Removing `.chunk({"time": None})` in the lambda function does not change the behavior of the example regarding this issue.

> 2. If you conda install flox, the resample operation should be quite efficient, without the need to use map_blocks

Oh wow, thanks! Haven't seen flox before.
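(A hedged note for reference: flox is an optional dependency; after installing it, e.g. with `conda install -c conda-forge flox`, recent xarray versions pick it up automatically for groupby/resample reductions, so no flox-specific code is needed. The sketch below is illustrative and assumes dask and flox are installed.)

```python
import numpy as np
import pandas as pd
import xarray as xr

# A plain chunked resample; xarray accelerates this with flox automatically
# when flox is available -- no map_blocks needed.
da = xr.DataArray(
    np.random.rand(365),
    dims="time",
    coords={"time": pd.date_range("2000-01-01", periods=365)},
).chunk({"time": 100})

annual_mean = da.resample(time="1A").mean()
print(annual_mean.compute())
```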

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  pandas.errors.InvalidIndexError raised when running computation in parallel using dask 1379372915
1254873700 https://github.com/pydata/xarray/issues/7059#issuecomment-1254873700 https://api.github.com/repos/pydata/xarray/issues/7059 IC_kwDOAMm_X85Ky9pk lumbric 691772 2022-09-22T11:09:16Z 2022-09-22T11:09:16Z CONTRIBUTOR

I have managed to reduce the reproduction example (see "Minimal Complete Verifiable Example 2" above) and then also find a proper solution to fix this issue. I am still not sure whether this is a bug or intended behavior, so I won't close the issue for now.

Basically, the issue occurs when a chunked NetCDF file is loaded from disk, passed to `xarray.map_blocks()`, and then used as a parameter in `.sel()` to subset some other xarray object that is not passed to the worker `func()`. I think the proper solution is to use the `args` parameter of `map_blocks()` instead of `.sel()`:

```diff
--- run-broken.py	2022-09-22 13:00:41.095555961 +0200
+++ run.py	2022-09-22 13:01:14.452696511 +0200
@@ -30,17 +30,17 @@
 def resample_annually(data):
     return data.sortby("time").resample(time="1A", label="left", loffset="1D").mean(dim="time")
 
-def worker(data):
-    locations_chunk = locations.sel(locations=data.locations)
-    out_raw = data * locations_chunk
+def worker(data, locations):
+    out_raw = data * locations
     out = resample_annually(out_raw)
     return out
 
 template = resample_annually(data)
 
 out = xr.map_blocks(
-    lambda data: worker(data).compute().chunk({"time": None}),
+    lambda data, locations: worker(data, locations).compute().chunk({"time": None}),
     data,
+    (locations,),
     template=template,
 )
```

This fixes the issue and seems to be the proper solution anyway.
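(For reference, a minimal self-contained sketch of this pattern; the names, shapes, and coordinates are illustrative, and it assumes dask is installed.)

```python
import numpy as np
import xarray as xr

coords = {"locations": np.arange(4)}
data = xr.DataArray(
    np.ones((4, 6)), dims=("locations", "time"), coords=coords
).chunk({"locations": 2})
locations = xr.DataArray(np.arange(4), dims="locations", coords=coords)

def worker(data, locations):
    # map_blocks subsets every xarray object in ``args`` to the matching
    # block, so no .sel() on a captured (shadowed) object is needed.
    return data * locations

out = xr.map_blocks(worker, data, (locations,), template=data)
print(out.compute())
```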

I still don't see why I am not allowed to use `.sel()` on shadowed objects in the worker `func()`. Is this on purpose? If yes, should we add something to the documentation? Is this a specific behavior of `map_blocks()`? Is it related to #6904?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  pandas.errors.InvalidIndexError raised when running computation in parallel using dask 1379372915
1252561840 https://github.com/pydata/xarray/issues/7059#issuecomment-1252561840 https://api.github.com/repos/pydata/xarray/issues/7059 IC_kwDOAMm_X85KqJOw lumbric 691772 2022-09-20T15:54:48Z 2022-09-20T15:54:48Z CONTRIBUTOR

@benbovy thanks for the hint! I tried passing an explicit lock to `xr.open_mfdataset()` as suggested, but it didn't change anything; I still get the same exception. I will double-check that I did it the right way, I might be missing something.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  pandas.errors.InvalidIndexError raised when running computation in parallel using dask 1379372915
1243864752 https://github.com/pydata/xarray/issues/6816#issuecomment-1243864752 https://api.github.com/repos/pydata/xarray/issues/6816 IC_kwDOAMm_X85KI96w lumbric 691772 2022-09-12T14:55:06Z 2022-09-13T09:39:48Z CONTRIBUTOR

Not sure what changed, but now I also get the same error with my small, synthetic test data. This way I was able to debug a bit further. I am pretty sure this is a bug in xarray or pandas.

I think something in `pandas.core.indexes.base.Index` is not thread-safe. At least this seems to be the place of the race condition.

I can create a new ticket if you prefer, but since I am not sure which project it belongs to, I will continue to collect information here. Unfortunately, I have not yet managed to create a minimal example, as this is quite tricky with timing issues.

Additional debugging print and proof of race condition

If I add the following debugging print to the pandas code:

```diff
--- /tmp/base.py	2022-09-12 16:35:53.739971953 +0200
+++ /home/lumbric/.conda/envs/my_project/lib/python3.8/site-packages/pandas/core/indexes/base.py	2022-09-12 16:35:58.864144801 +0200
@@ -3718,7 +3718,6 @@
         self._check_indexing_method(method, limit, tolerance)
 
         if not self._index_as_unique:
+            print("Original: ", len(self), ", length of set:", len(set(self)))
             raise InvalidIndexError(self._requires_unique_msg)
 
         if len(target) == 0:
```

...I get the following output:

```
Original: 3879 , length of set: 3879
```

So the index seems to be unique, but `self.is_unique` is False for some reason (note that `not self._index_as_unique` and `self.is_unique` are the same check in this case).

To confirm that the race condition is at this point, we wait 1s and then check again for uniqueness:

```diff
--- /tmp/base.py	2022-09-12 16:35:53.739971953 +0200
+++ /home/lumbric/.conda/envs/my_project/lib/python3.8/site-packages/pandas/core/indexes/base.py	2022-09-12 16:35:58.864144801 +0200
@@ -3718,7 +3718,10 @@
         self._check_indexing_method(method, limit, tolerance)
 
         if not self._index_as_unique:
+            if not self.is_unique:
+                import time
+                time.sleep(1)
+                print("now unique?", self.is_unique)
             raise InvalidIndexError(self._requires_unique_msg)
```

This outputs:

```
now unique? True
```

Traceback

```
Traceback (most recent call last):
  File "scripts/my_script.py", line 57, in <module>
    main()
  File "scripts/my_script.py", line 48, in main
    my_function(
  File "/home/lumbric/my_project/src/calculations.py", line 136, in my_function
    result = result.compute()
  File "/home/lumbric/.conda/envs/my_project/lib/python3.8/site-packages/xarray/core/dataarray.py", line 947, in compute
    return new.load(**kwargs)
  File "/home/lumbric/.conda/envs/my_project/lib/python3.8/site-packages/xarray/core/dataarray.py", line 921, in load
    ds = self._to_temp_dataset().load(**kwargs)
  File "/home/lumbric/.conda/envs/my_project/lib/python3.8/site-packages/xarray/core/dataset.py", line 861, in load
    evaluated_data = da.compute(*lazy_data.values(), **kwargs)
  File "/home/lumbric/.conda/envs/my_project/lib/python3.8/site-packages/dask/base.py", line 600, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/home/lumbric/.conda/envs/my_project/lib/python3.8/site-packages/dask/threaded.py", line 81, in get
    results = get_async(
  File "/home/lumbric/.conda/envs/my_project/lib/python3.8/site-packages/dask/local.py", line 508, in get_async
    raise_exception(exc, tb)
  File "/home/lumbric/.conda/envs/my_project/lib/python3.8/site-packages/dask/local.py", line 316, in reraise
    raise exc
  File "/home/lumbric/.conda/envs/my_project/lib/python3.8/site-packages/dask/local.py", line 221, in execute_task
    result = _execute_task(task, data)
  File "/home/lumbric/.conda/envs/my_project/lib/python3.8/site-packages/dask/core.py", line 119, in _execute_task
    return func(*(_execute_task(a, cache) for a in args))
  File "/home/lumbric/.conda/envs/my_project/lib/python3.8/site-packages/dask/core.py", line 119, in <genexpr>
    return func(*(_execute_task(a, cache) for a in args))
  File "/home/lumbric/.conda/envs/my_project/lib/python3.8/site-packages/dask/core.py", line 119, in _execute_task
    return func(*(_execute_task(a, cache) for a in args))
  File "/home/lumbric/.conda/envs/my_project/lib/python3.8/site-packages/xarray/core/parallel.py", line 285, in _wrapper
    result = func(*converted_args, **kwargs)
  File "/home/lumbric/some_project/src/calculations.py", line 100, in <lambda>
    lambda input_data: worker(input_data).compute().chunk({"time": None}),
  File "/home/lumbric/some_project/src/calculations.py", line 69, in worker
    raise e
  File "/home/lumbric/some_project/src/calculations.py", line 60, in worker
    out = some_data * some_other_data.sel(some_dimension=input_data.some_dimension)
  File "/home/lumbric/.conda/envs/my_project/lib/python3.8/site-packages/xarray/core/dataarray.py", line 1329, in sel
    ds = self._to_temp_dataset().sel(
  File "/home/lumbric/.conda/envs/my_project/lib/python3.8/site-packages/xarray/core/dataset.py", line 2502, in sel
    pos_indexers, new_indexes = remap_label_indexers(
  File "/home/lumbric/.conda/envs/my_project/lib/python3.8/site-packages/xarray/core/coordinates.py", line 421, in remap_label_indexers
    pos_indexers, new_indexes = indexing.remap_label_indexers(
  File "/home/lumbric/.conda/envs/my_project/lib/python3.8/site-packages/xarray/core/indexing.py", line 121, in remap_label_indexers
    idxr, new_idx = index.query(labels, method=method, tolerance=tolerance)
  File "/home/lumbric/.conda/envs/my_project/lib/python3.8/site-packages/xarray/core/indexes.py", line 245, in query
    indexer = get_indexer_nd(self.index, label, method, tolerance)
  File "/home/lumbric/.conda/envs/my_project/lib/python3.8/site-packages/xarray/core/indexes.py", line 142, in get_indexer_nd
    flat_indexer = index.get_indexer(flat_labels, method=method, tolerance=tolerance)
  File "/home/lumbric/.conda/envs/my_project/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3722, in get_indexer
    raise InvalidIndexError(self._requires_unique_msg)
pandas.errors.InvalidIndexError: Reindexing only valid with uniquely valued Index objects
```

Workaround

The issue does not occur if I use the synchronous dask scheduler by adding this at the very beginning of my script:

```python
dask.config.set(scheduler='single-threaded')
```
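(A hedged usage note: the same setting can also be scoped to a single computation via dask's context-manager form instead of being applied globally. The tiny array below is just a stand-in for the real workload.)

```python
import dask
import numpy as np
import xarray as xr

# A tiny dask-backed array standing in for the real workload:
da = xr.DataArray(np.arange(10), dims="x").chunk({"x": 5})

# Scope the scheduler choice to one compute() call instead of the whole script:
with dask.config.set(scheduler="single-threaded"):
    computed = da.compute()
```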

Environment

```
INSTALLED VERSIONS
------------------
commit: None
python: 3.8.13 | packaged by conda-forge | (default, Mar 25 2022, 06:04:10) [GCC 10.3.0]
python-bits: 64
OS: Linux
OS-release: 5.4.0-124-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.1
libnetcdf: 4.8.1
xarray: 2022.3.0
pandas: 1.4.2
numpy: 1.22.4
scipy: 1.8.1
netCDF4: 1.5.8
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: 1.6.0
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.2.10
cfgrib: None
iris: None
bottleneck: None
dask: 2022.05.2
distributed: 2022.5.2
matplotlib: 3.5.2
cartopy: None
seaborn: 0.11.2
numbagg: None
fsspec: 2022.5.0
cupy: None
pint: None
sparse: None
setuptools: 62.3.2
pip: 22.1.2
conda: 4.12.0
pytest: 7.1.2
IPython: 8.4.0
sphinx: None
```
{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  pandas.errors.InvalidIndexError is raised in some runs when using chunks and map_blocks() 1315111684
1243882465 https://github.com/pydata/xarray/issues/6816#issuecomment-1243882465 https://api.github.com/repos/pydata/xarray/issues/6816 IC_kwDOAMm_X85KJCPh lumbric 691772 2022-09-12T15:07:45Z 2022-09-12T15:07:45Z CONTRIBUTOR

I think these are the values of the index; the values seem to be unique and monotonic.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  pandas.errors.InvalidIndexError is raised in some runs when using chunks and map_blocks() 1315111684
1220519740 https://github.com/pydata/xarray/issues/6816#issuecomment-1220519740 https://api.github.com/repos/pydata/xarray/issues/6816 IC_kwDOAMm_X85Iv6c8 lumbric 691772 2022-08-19T10:33:59Z 2022-08-19T10:33:59Z CONTRIBUTOR

Thanks a lot for your quick reply and your helpful hints!

In the meantime I have verified that `d.c` is unique, i.e. `np.unique(d.c).size == d.c.size`.

Unfortunately, I have not lately been able to reproduce the error often enough to test it with the synchronous scheduler, nor to create a smaller synthetic example that reproduces the problem. One run takes about an hour until the exception occurs (or not), which makes things hard to debug. But I will keep trying and keep this ticket updated.

Any further suggestions very welcome :) Thanks a lot!

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  pandas.errors.InvalidIndexError is raised in some runs when using chunks and map_blocks() 1315111684
1046665303 https://github.com/pydata/xarray/issues/2186#issuecomment-1046665303 https://api.github.com/repos/pydata/xarray/issues/2186 IC_kwDOAMm_X84-YthX lumbric 691772 2022-02-21T09:41:00Z 2022-02-21T09:41:00Z CONTRIBUTOR

I just stumbled across the same issue and created a minimal example similar to @lkilcher's. I am using `xr.open_dataarray()` with chunks and do some simple computation. After that, 800 MB of RAM is used, no matter whether I close the file explicitly, delete the xarray objects, or invoke the Python garbage collector.

What seems to work: do not use the threaded dask scheduler. The issue does not seem to occur with the single-threaded or processes scheduler. Also, setting `MALLOC_MMAP_MAX_=40960` seems to solve the issue, as suggested above (disclaimer: I don't fully understand the details here).

If I understand things correctly, this indicates that the issue is a consequence of dask/dask#3530. I am not sure whether there is anything to be fixed on the xarray side or what the best workaround would be. I will try to use the processes scheduler, as sketched below.

I can create a new (xarray) ticket with all details about the minimal example, if anyone thinks that this might be helpful (to collect workarounds or discuss fixes on the xarray side).
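(For reference, a short sketch of the workarounds described above; it is illustrative and assumes dask is installed.)

```python
import dask
import dask.array as darr

# Route computations away from the threaded scheduler, where the leak shows up;
# "processes" and "single-threaded" both appear to avoid it.
dask.config.set(scheduler="processes")

x = darr.ones((1000, 1000), chunks=(250, 250))
print(x.mean().compute())

# Alternative (glibc malloc tuning, as suggested above), set before starting Python:
#   export MALLOC_MMAP_MAX_=40960
```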

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Memory leak while looping through a Dataset 326533369
510939525 https://github.com/pydata/xarray/issues/2928#issuecomment-510939525 https://api.github.com/repos/pydata/xarray/issues/2928 MDEyOklzc3VlQ29tbWVudDUxMDkzOTUyNQ== lumbric 691772 2019-07-12T15:56:28Z 2019-07-12T15:56:28Z CONTRIBUTOR

Fixed in 714ae8661a829d.

(Sorry for the delay... I actually prepared a PR but never finished it completely, even though it was such a simple thing.)

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Dask outputs warning: "The da.atop function has moved to da.blockwise" 438389323
487759590 https://github.com/pydata/xarray/issues/2928#issuecomment-487759590 https://api.github.com/repos/pydata/xarray/issues/2928 MDEyOklzc3VlQ29tbWVudDQ4Nzc1OTU5MA== lumbric 691772 2019-04-29T22:00:58Z 2019-04-29T22:00:58Z CONTRIBUTOR

> Any interest in putting together a PR?

Yes, I can do so. When writing the report, I actually thought that a PR might be easier to write and to read than the ticket... :) In this case fixing it really shouldn't be a big deal.

Maybe a bit off-topic, but this is the thing I don't really understand and why I wanted to ask first: is there a clear paradigm about compatibility in the pydata universe? Despite its 0.x version number, I guess xarray tries to stay backward compatible regarding its public interface, right? And when are the minimum versions of dependencies increased? Simply when new features of one of the dependent libraries are needed?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Dask outputs warning: "The da.atop function has moved to da.blockwise" 438389323
484239080 https://github.com/pydata/xarray/pull/2904#issuecomment-484239080 https://api.github.com/repos/pydata/xarray/issues/2904 MDEyOklzc3VlQ29tbWVudDQ4NDIzOTA4MA== lumbric 691772 2019-04-17T20:00:22Z 2019-04-17T20:00:22Z CONTRIBUTOR

Ah yes, true! I confused something here: `dict()` accepts mappings, but not everything `dict()` accepts is a mapping. `xr.Dataset()` actually accepts only mappings. That actually makes things a bit easier and much clearer.
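(A tiny illustration of the distinction, in plain Python; the `pairs` variable is just an example value.)

```python
from collections.abc import Mapping

# dict() also accepts an iterable of key/value pairs, which is not a Mapping:
pairs = [("a", 1), ("b", 2)]
print(dict(pairs))                 # {'a': 1, 'b': 2} -- dict() is happy
print(isinstance(pairs, Mapping))  # False -- a stricter Mapping check rejects it
```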

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Minor improvement of docstring for Dataset 434444058
484232306 https://github.com/pydata/xarray/pull/2904#issuecomment-484232306 https://api.github.com/repos/pydata/xarray/issues/2904 MDEyOklzc3VlQ29tbWVudDQ4NDIzMjMwNg== lumbric 691772 2019-04-17T19:39:42Z 2019-04-17T19:39:42Z CONTRIBUTOR

Hm yes, good error messages would be great, but I feel it is widely accepted that error messages in the scientific Python ecosystem are quite often hard to read. Maybe this is a downside of duck typing? I mentioned this only to explain why I was so confused after running `xr.Dataset` for the first time.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Minor improvement of docstring for Dataset 434444058
464338041 https://github.com/pydata/xarray/issues/1346#issuecomment-464338041 https://api.github.com/repos/pydata/xarray/issues/1346 MDEyOklzc3VlQ29tbWVudDQ2NDMzODA0MQ== lumbric 691772 2019-02-16T11:20:20Z 2019-02-16T11:20:20Z CONTRIBUTOR

Oh yes, of course! I've underestimated the low precision of float32 values above 2**24. Thanks for the hint.
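(A minimal sketch of that precision limit, in plain NumPy.)

```python
import numpy as np

# float32 has a 24-bit mantissa, so above 2**24 it cannot represent every
# integer; adding 1.0 to an accumulator that has reached 2**24 changes nothing:
acc = np.float32(2**24)
print(acc + np.float32(1) == acc)  # True

# A naive left-to-right float32 sum of 2**25 ones therefore saturates at 2**24,
# and 2**24 / 2**25 == 0.5, which matches the wrong mean discussed here.
```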

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  bottleneck : Wrong mean for float32 array 218459353
463324373 https://github.com/pydata/xarray/issues/1346#issuecomment-463324373 https://api.github.com/repos/pydata/xarray/issues/1346 MDEyOklzc3VlQ29tbWVudDQ2MzMyNDM3Mw== lumbric 691772 2019-02-13T19:02:52Z 2019-02-16T10:53:51Z CONTRIBUTOR

I think (!) xarray is no longer affected, but pandas is. Bisecting the git history leads to commit 0b9ab2d1, which means that xarray >= v0.10.9 should not be affected. Uninstalling bottleneck is also a valid workaround.

<s>Bottleneck's documentation explicitly mentions that no error is raised in case of an overflow. But it seems to be very evil behavior, so it might be worth reporting upstream.</s> What do you think? (I think kwgoodman/bottleneck#164 is something different, isn't it?) Edit: this is not an overflow; it's a numerical error caused by not applying pairwise summation.

A couple of minimal examples:

```python
>>> import numpy as np
>>> import pandas as pd
>>> import xarray as xr
>>> import bottleneck as bn
>>> bn.nanmean(np.ones(2**25, dtype=np.float32))
0.5
>>> pd.Series(np.ones(2**25, dtype=np.float32)).mean()
0.5
>>> xr.DataArray(np.ones(2**25, dtype=np.float32)).mean()  # not affected in this version
<xarray.DataArray ()>
array(1., dtype=float32)
```

Done with the following versions:

```bash
$ pip3 freeze
Bottleneck==1.2.1
numpy==1.16.1
pandas==0.24.1
xarray==0.11.3
...
```

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  bottleneck : Wrong mean for float32 array 218459353
464016154 https://github.com/pydata/xarray/issues/1346#issuecomment-464016154 https://api.github.com/repos/pydata/xarray/issues/1346 MDEyOklzc3VlQ29tbWVudDQ2NDAxNjE1NA== lumbric 691772 2019-02-15T11:41:36Z 2019-02-15T11:41:36Z CONTRIBUTOR

Oh hm, I think I didn't really understand what happens in `bottleneck.nanmean()`. I understand that integers can overflow and that float32 has varying absolute precision. The float32 maximum of 3.4E+38 is not hit here. So how can the mean of a list of ones be 0.5?

Isn't this what bottleneck is doing? Summing up a bunch of float32 values and then dividing by the length?

```python
>>> d = np.ones(2**25, dtype=np.float32)
>>> d.sum() / np.float32(len(d))
1.0
```

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  bottleneck : Wrong mean for float32 array 218459353


CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);