issues: 1173497454
id | node_id | number | title | user | state | locked | comments | created_at | updated_at | author_association | repo | type
---|---|---|---|---|---|---|---|---|---|---|---|---
1173497454 | I_kwDOAMm_X85F8iZu | 6377 | [FEATURE]: Add a replace method | 13662783 | open | 0 | 8 | 2022-03-18T11:46:37Z | 2023-06-25T07:52:46Z | CONTRIBUTOR | 13221727 | issue

body:

### Is your feature request related to a problem?

If I have a DataArray of values:
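For example (an illustrative array; the specific values are only for the sake of the sketch):

```python
import xarray as xr

# A small DataArray of integer codes, some of which should be mapped to new values,
# e.g. 1 -> 10 and 2 -> 20, leaving everything else untouched.
da = xr.DataArray([0, 1, 2, 2, 5, 4])
```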
There's no easy way to do this, like pandas' `DataFrame.replace` (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html). (Apologies if I've missed related issues; searching for "replace" gives many hits, as the word is obviously used quite often.)

### Describe the solution you'd like
I've had a try at a relatively efficient implementation below. I'm wondering whether it would be a worthwhile addition to xarray?

### Describe alternatives you've considered

Ignoring issues such as dealing with NaNs, chunks, etc., a simple dict lookup:
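Something along these lines (a minimal sketch rather than the exact `dict_replace` used in the timings below; it assumes an in-memory numpy-backed array and leaves unmapped values untouched):

```python
import numpy as np
import xarray as xr


def dict_replace(da: xr.DataArray, to_replace, value) -> xr.DataArray:
    # Build a mapping and look up every element, keeping values that are not in the mapping.
    lookup = dict(zip(to_replace, value))
    data = np.array([lookup.get(v, v) for v in da.values.ravel()])
    return da.copy(data=data.reshape(da.shape))
```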
Alternatively, leveraging pandas:
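Roughly like this (again a sketch, not necessarily the exact `pandas_replace` benchmarked below; it round-trips the data through a pandas Series):

```python
import pandas as pd
import xarray as xr


def pandas_replace(da: xr.DataArray, to_replace, value) -> xr.DataArray:
    # Flatten to a Series, use pandas' replace, then restore the original shape.
    replaced = pd.Series(da.values.ravel()).replace(list(to_replace), list(value))
    return da.copy(data=replaced.to_numpy().reshape(da.shape))
```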
But I also tried my hand at a custom implementation:
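One way such a custom implementation could work is a sorted lookup via np.searchsorted (a sketch under the assumption that to_replace is sortable and free of duplicates; not necessarily the exact `custom_replace` timed below):

```python
import numpy as np
import xarray as xr


def custom_replace(da: xr.DataArray, to_replace, value) -> xr.DataArray:
    to_replace = np.asarray(to_replace)
    value = np.asarray(value)
    flat = da.values.ravel()

    # Locate each element's position among the sorted to_replace values.
    sorter = np.argsort(to_replace)
    insertion = np.searchsorted(to_replace, flat, sorter=sorter)
    insertion = np.clip(insertion, 0, to_replace.size - 1)

    # Only overwrite elements that actually match one of the to_replace values.
    found = to_replace[sorter[insertion]] == flat
    data = flat.copy()
    data[found] = value[sorter[insertion[found]]]
    return da.copy(data=data.reshape(da.shape))
```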
Such an approach seems like it's consistently the fastest:

```python
da = xr.DataArray(np.random.randint(0, 100, 100_000))
to_replace = np.random.choice(np.arange(100), 10, replace=False)
value = to_replace * 200

test1 = custom_replace(da, to_replace, value)
test2 = pandas_replace(da, to_replace, value)
test3 = dict_replace(da, to_replace, value)
assert test1.equals(test2)
assert test1.equals(test3)

%timeit custom_replace(da, to_replace, value)
6.93 ms ± 295 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit pandas_replace(da, to_replace, value)
9.37 ms ± 212 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit dict_replace(da, to_replace, value)
26.8 ms ± 1.59 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

With the advantage growing with the number of values involved:

```python
da = xr.DataArray(np.random.randint(0, 10_000, 100_000))
to_replace = np.random.choice(np.arange(10_000), 10_000, replace=False)
value = to_replace * 200

test1 = custom_replace(da, to_replace, value)
test2 = pandas_replace(da, to_replace, value)
test3 = dict_replace(da, to_replace, value)
assert test1.equals(test2)
assert test1.equals(test3)

%timeit custom_replace(da, to_replace, value)
21.6 ms ± 990 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit pandas_replace(da, to_replace, value)
3.12 s ± 574 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit dict_replace(da, to_replace, value)
42.7 ms ± 1.47 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

In my real-life example, with a DataArray of approx. 110 000 elements and 60 000 values to replace, the custom one takes 33 ms, the dict one takes 135 ms, while pandas takes 26 s (!).

### Additional context

In all cases, we need to deal with NaNs, check the input, etc.:

```python
def replace(da: xr.DataArray, to_replace: Any, value: Any):
    from xarray.core.utils import is_scalar
    ...
```

I think it should be easy to use; e.g. let it operate on the numpy arrays so that e.g. apply_ufunc will work (see the sketch further below). The primary issue is whether the values can be sorted; if they cannot, the dict lookup might be an okay fallback? I've had a peek at the pandas implementation, but didn't become much wiser. Anyway, for your consideration! I'd be happy to submit a PR.
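To illustrate that point about apply_ufunc (a sketch only, with a hypothetical helper `_replace_numpy`; it reuses the searchsorted-based lookup from above and declares the promoted output dtype up front):

```python
import numpy as np
import xarray as xr


def _replace_numpy(data: np.ndarray, to_replace: np.ndarray, value: np.ndarray) -> np.ndarray:
    # Same sorted-lookup idea as the custom_replace sketch above, on a bare numpy array.
    sorter = np.argsort(to_replace)
    insertion = np.clip(np.searchsorted(to_replace, data, sorter=sorter), 0, to_replace.size - 1)
    found = to_replace[sorter[insertion]] == data
    out = data.astype(np.result_type(data.dtype, value.dtype), copy=True)
    out[found] = value[sorter[insertion[found]]]
    return out


def replace(da: xr.DataArray, to_replace, value) -> xr.DataArray:
    to_replace = np.asarray(to_replace)
    value = np.asarray(value)
    return xr.apply_ufunc(
        _replace_numpy,
        da,
        kwargs={"to_replace": to_replace, "value": value},
        dask="parallelized",  # apply block-wise when the DataArray is dask-backed
        output_dtypes=[np.result_type(da.dtype, value.dtype)],
    )
```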
{ "url": "https://api.github.com/repos/pydata/xarray/issues/6377/reactions", "total_count": 9, "+1": 9, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
13221727 | issue |