issues: 1173497454


| field | value |
| --- | --- |
| id | 1173497454 |
| node_id | I_kwDOAMm_X85F8iZu |
| number | 6377 |
| title | [FEATURE]: Add a replace method |
| user | 13662783 |
| state | open |
| locked | 0 |
| comments | 8 |
| created_at | 2022-03-18T11:46:37Z |
| updated_at | 2023-06-25T07:52:46Z |
| author_association | CONTRIBUTOR |
| repo | 13221727 |
| type | issue |

Is your feature request related to a problem?

If I have a DataArray of values:

```python
import numpy as np
import pandas as pd
import xarray as xr

da = xr.DataArray([0, 1, 2, 3, 4, 5])
```

And I'd like to replace `to_replace=[1, 3, 5]` by `value=[10, 30, 50]`, but there's no method `da.replace(to_replace, value)` to do this.

Unlike pandas, which offers DataFrame.replace (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html), there's no easy way to do this.
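For reference, the pandas behaviour I mean looks roughly like this (a minimal illustration; the column name is arbitrary):

```python
import pandas as pd

df = pd.DataFrame({"values": [0, 1, 2, 3, 4, 5]})
# pandas accepts equal-length list-likes and replaces the values pairwise
df.replace([1, 3, 5], [10, 30, 50])
# -> the "values" column becomes [0, 10, 2, 30, 4, 50]
```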

(Apologies if I've missed related issues, searching for "replace" gives many hits as the word is obviously used quite often.)

Describe the solution you'd like

```python
da = xr.DataArray([0, 1, 2, 3, 4, 5])
replaced = da.replace([1, 3, 5], [10, 30, 50])
print(replaced)
```

```
<xarray.DataArray (dim_0: 6)>
array([ 0, 10,  2, 30,  4, 50])
Dimensions without coordinates: dim_0
```

I've had a try at a relatively efficient implementation below, and I'm wondering whether it would be a worthwhile addition to xarray.

Describe alternatives you've considered

Ignoring issues such as dealing with NaNs, chunks, etc., a simple dict lookup:

```python
def dict_replace(da, to_replace, value):
    # Map each value to its replacement; np.vectorize applies the
    # (Python-level) dict lookup elementwise.
    d = {k: v for k, v in zip(to_replace, value)}
    out = np.vectorize(lambda x: d.get(x, x))(da.values)
    return da.copy(data=out)
```

Alternatively, leveraging pandas:

```python
def pandas_replace(da, to_replace, value):
    # Round-trip through a pandas column to reuse pandas' replace machinery.
    df = pd.DataFrame()
    df["values"] = da.values.ravel()
    df["values"].replace(to_replace, value, inplace=True)
    return da.copy(data=df["values"].values.reshape(da.shape))
```

But I also tried my hand at a custom implementation, letting np.unique do the heavy lifting:

```python
def custom_replace(da, to_replace, value):
    # Use np.unique to create an inverse index
    flat = da.values.ravel()
    uniques, index = np.unique(flat, return_inverse=True)
    replaceable = np.isin(flat, to_replace)

    # Create a replacement array in which there is a 1:1 relation between
    # uniques and the replacement values, so that we can use the inverse index
    # to select replacement values.
    valid = np.isin(to_replace, uniques, assume_unique=True)
    # Remove to_replace values that are not present in da. If no overlap
    # exists between to_replace and the values in da, just return a copy.
    if not valid.any():
        return da.copy()
    to_replace = to_replace[valid]
    value = value[valid]

    replacement = np.zeros_like(uniques)
    replacement[np.searchsorted(uniques, to_replace)] = value

    out = flat.copy()
    out[replaceable] = replacement[index[replaceable]]
    return da.copy(data=out.reshape(da.shape))
```

Such an approach seems to be consistently the fastest:

```python
da = xr.DataArray(np.random.randint(0, 100, 100_000))
to_replace = np.random.choice(np.arange(100), 10, replace=False)
value = to_replace * 200

test1 = custom_replace(da, to_replace, value)
test2 = pandas_replace(da, to_replace, value)
test3 = dict_replace(da, to_replace, value)

assert test1.equals(test2)
assert test1.equals(test3)

%timeit custom_replace(da, to_replace, value)
# 6.93 ms ± 295 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit pandas_replace(da, to_replace, value)
# 9.37 ms ± 212 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit dict_replace(da, to_replace, value)
# 26.8 ms ± 1.59 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

With the advantage growing as the number of values involved increases:

```python
da = xr.DataArray(np.random.randint(0, 10_000, 100_000))
to_replace = np.random.choice(np.arange(10_000), 10_000, replace=False)
value = to_replace * 200

test1 = custom_replace(da, to_replace, value)
test2 = pandas_replace(da, to_replace, value)
test3 = dict_replace(da, to_replace, value)

assert test1.equals(test2)
assert test1.equals(test3)

%timeit custom_replace(da, to_replace, value)
# 21.6 ms ± 990 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit pandas_replace(da, to_replace, value)
# 3.12 s ± 574 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit dict_replace(da, to_replace, value)
# 42.7 ms ± 1.47 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

In my real-life example, with a DataArray of approximately 110 000 elements and 60 000 values to replace, the custom one takes 33 ms, the dict one takes 135 ms, while pandas takes 26 s (!).

Additional context

In all cases, we need to deal with NaNs, check the input, etc.:

```python
from typing import Any


def replace(da: xr.DataArray, to_replace: Any, value: Any):
    from xarray.core.utils import is_scalar

    if is_scalar(to_replace):
        if not is_scalar(value):
            raise TypeError("if to_replace is scalar, then value must be a scalar")
        if np.isnan(to_replace):
            return da.fillna(value)
        else:
            return da.where(da != to_replace, other=value)
    else:
        to_replace = np.asarray(to_replace)
        if to_replace.ndim != 1:
            raise ValueError("to_replace must be 1D or scalar")
        if is_scalar(value):
            value = np.full_like(to_replace, value)
        else:
            value = np.asarray(value)
            if to_replace.shape != value.shape:
                raise ValueError(
                    f"Replacement arrays must match in shape. "
                    f"Expecting {to_replace.shape} got {value.shape} "
                )

    _, counts = np.unique(to_replace, return_counts=True)
    if (counts > 1).any():
        raise ValueError("to_replace contains duplicates")

    # Replace NaN values separately, as they will show up as separate values
    # from numpy.unique.
    isnan = np.isnan(to_replace)
    if isnan.any():
        i = np.nonzero(isnan)[0]
        da = da.fillna(value[i])

    # Use np.unique to create an inverse index
    flat = da.values.ravel()
    uniques, index = np.unique(flat, return_inverse=True)
    replaceable = np.isin(flat, to_replace)

    # Create a replacement array in which there is a 1:1 relation between
    # uniques and the replacement values, so that we can use the inverse index
    # to select replacement values.
    valid = np.isin(to_replace, uniques, assume_unique=True)
    # Remove to_replace values that are not present in da. If no overlap
    # exists between to_replace and the values in da, just return a copy.
    if not valid.any():
        return da.copy()
    to_replace = to_replace[valid]
    value = value[valid]

    replacement = np.zeros_like(uniques)
    replacement[np.searchsorted(uniques, to_replace)] = value

    out = flat.copy()
    out[replaceable] = replacement[index[replaceable]]
    return da.copy(data=out.reshape(da.shape))
```
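For example, this is how I'd expect the sketch above to behave, including NaN handling (expected results shown as comments):

```python
da = xr.DataArray([0.0, np.nan, 2.0, 3.0])

# Scalar case: NaN goes through fillna, everything else through where
replace(da, to_replace=np.nan, value=-1.0)   # [ 0., -1.,  2.,  3.]
replace(da, to_replace=3.0, value=30.0)      # [ 0., nan,  2., 30.]

# List case, with a NaN entry in to_replace
replace(da, to_replace=[np.nan, 3.0], value=[-1.0, 30.0])  # [ 0., -1.,  2., 30.]
```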

I think it should be easy to generalize, e.g. let it operate on the numpy arrays so that e.g. apply_ufunc will work. The primary issue is whether the values can be sorted; if they cannot, the dict lookup might be an okay fallback? I've had a peek at the pandas implementation, but didn't become much wiser.
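Roughly, I imagine something along these lines (just a sketch: the helper and wrapper names are made up, and input validation, NaN handling and dtype promotion are skipped):

```python
import numpy as np
import xarray as xr


def _replace_values(arr, to_replace, value):
    # Hypothetical helper: the same np.unique/searchsorted trick as above,
    # but operating on a bare numpy array so it can be wrapped by apply_ufunc.
    flat = arr.ravel().copy()
    uniques, index = np.unique(flat, return_inverse=True)
    replaceable = np.isin(flat, to_replace)
    valid = np.isin(to_replace, uniques)
    if valid.any():
        replacement = np.zeros_like(uniques)
        replacement[np.searchsorted(uniques, to_replace[valid])] = value[valid]
        flat[replaceable] = replacement[index[replaceable]]
    return flat.reshape(arr.shape)


def replace_via_apply_ufunc(da, to_replace, value):
    # Hypothetical wrapper: the replacement is purely elementwise, so
    # dask="parallelized" can apply it blockwise to chunked arrays.
    to_replace = np.asarray(to_replace)
    value = np.asarray(value)
    return xr.apply_ufunc(
        _replace_values,
        da,
        kwargs={"to_replace": to_replace, "value": value},
        dask="parallelized",
        output_dtypes=[da.dtype],
    )
```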

Anyway, for your consideration! I'd be happy to submit a PR.

```json
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/6377/reactions",
    "total_count": 9,
    "+1": 9,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
```
