id,node_id,number,title,user,state,locked,assignee,milestone,comments,created_at,updated_at,closed_at,author_association,active_lock_reason,draft,pull_request,body,reactions,performed_via_github_app,state_reason,repo,type 1173497454,I_kwDOAMm_X85F8iZu,6377,[FEATURE]: Add a replace method,13662783,open,0,,,8,2022-03-18T11:46:37Z,2023-06-25T07:52:46Z,,CONTRIBUTOR,,,,"### Is your feature request related to a problem? If I have a DataArray of values: ```python da = xr.DataArray([0, 1, 2, 3, 4, 5]) ``` And I'd like to replace `to_replace=[1, 3, 5]` by `value=[10, 30, 50]`, there's no method `da.replace(to_replace, value)` to do this. There's no easy way like pandas (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html) to do this. (Apologies if I've missed related issues, searching for ""replace"" gives many hits as the word is obviously used quite often.) ### Describe the solution you'd like ```python da = xr.DataArray([0, 1, 2, 3, 4, 5]) replaced = da.replace([1, 3, 5], [10, 30, 50]) print(replaced) ``` ``` array([ 0, 10, 2, 30, 4, 50]) Dimensions without coordinates: dim_0 ``` I've had a try at a relatively efficient implementation below. I'm wondering whether it's a worthwhile addition to xarray? ### Describe alternatives you've considered Ignoring issues such as dealing with NaNs, chunks, etc., a simple dict lookup: ```python def dict_replace(da, to_replace, value): d = {k: v for k, v in zip(to_replace, value)} out = np.vectorize(lambda x: d.get(x, x))(da.values) return da.copy(data=out) ``` Alternatively, leveraging pandas: ```python def pandas_replace(da, to_replace, value): df = pd.DataFrame() df[""values""] = da.values.ravel() df[""values""].replace(to_replace, value, inplace=True) return da.copy(data=df[""values""].values.reshape(da.shape)) ``` But I also tried my hand at a custom implementation, letting `np.unique` do the heavy lifting: ```python def custom_replace(da, to_replace, value): # Use np.unique to create an inverse index flat = da.values.ravel() uniques, index = np.unique(flat, return_inverse=True) replaceable = np.isin(flat, to_replace) # Create a replacement array in which there is a 1:1 relation between # uniques and the replacement values, so that we can use the inverse index # to select replacement values. valid = np.isin(to_replace, uniques, assume_unique=True) # Remove to_replace values that are not present in da. If no overlap # exists between to_replace and the values in da, just return a copy. if not valid.any(): return da.copy() to_replace = to_replace[valid] value = value[valid] replacement = np.zeros_like(uniques) replacement[np.searchsorted(uniques, to_replace)] = value out = flat.copy() out[replaceable] = replacement[index[replaceable]] return da.copy(data=out.reshape(da.shape)) ``` Such an approach seems like it's consistently the fastest: ```python da = xr.DataArray(np.random.randint(0, 100, 100_000)) to_replace = np.random.choice(np.arange(100), 10, replace=False) value = to_replace * 200 test1 = custom_replace(da, to_replace, value) test2 = pandas_replace(da, to_replace, value) test3 = dict_replace(da, to_replace, value) assert test1.equals(test2) assert test1.equals(test3) # 6.93 ms ± 295 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) %timeit custom_replace(da, to_replace, value) # 9.37 ms ± 212 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) %timeit pandas_replace(da, to_replace, value) # 26.8 ms ± 1.59 ms per loop (mean ± std. dev. 
of 7 runs, 10 loops each) %timeit dict_replace(da, to_replace, value) ``` With the advantage growing with the number of values involved: ```python da = xr.DataArray(np.random.randint(0, 10_000, 100_000)) to_replace = np.random.choice(np.arange(10_000), 10_000, replace=False) value = to_replace * 200 test1 = custom_replace(da, to_replace, value) test2 = pandas_replace(da, to_replace, value) test3 = dict_replace(da, to_replace, value) assert test1.equals(test2) assert test1.equals(test3) # 21.6 ms ± 990 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) %timeit custom_replace(da, to_replace, value) # 3.12 s ± 574 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) %timeit pandas_replace(da, to_replace, value) # 42.7 ms ± 1.47 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) %timeit dict_replace(da, to_replace, value) ``` In my real-life example, with a DataArray of approx 110 000 elements, with 60 000 values to replace, the custom one takes 33 ms, the dict one takes 135 ms, while pandas takes 26 s (!). ### Additional context In all cases, we need to deal with NaNs, check the input, etc.: ```python def replace(da: xr.DataArray, to_replace: Any, value: Any): from xarray.core.utils import is_scalar if is_scalar(to_replace): if not is_scalar(value): raise TypeError(""if to_replace is scalar, then value must be a scalar"") if np.isnan(to_replace): return da.fillna(value) else: return da.where(da != to_replace, other=value) else: to_replace = np.asarray(to_replace) if to_replace.ndim != 1: raise ValueError(""to_replace must be 1D or scalar"") if is_scalar(value): value = np.full_like(to_replace, value) else: value = np.asarray(value) if to_replace.shape != value.shape: raise ValueError( f""Replacement arrays must match in shape. "" f""Expecting {to_replace.shape} got {value.shape} "" ) _, counts = np.unique(to_replace, return_counts=True) if (counts > 1).any(): raise ValueError(""to_replace contains duplicates"") # Replace NaN values separately, as they will show up as separate values # from numpy.unique. isnan = np.isnan(to_replace) if isnan.any(): i = np.nonzero(isnan)[0] da = da.fillna(value[i]) # Use np.unique to create an inverse index flat = da.values.ravel() uniques, index = np.unique(flat, return_inverse=True) replaceable = np.isin(flat, to_replace) # Create a replacement array in which there is a 1:1 relation between # uniques and the replacement values, so that we can use the inverse index # to select replacement values. valid = np.isin(to_replace, uniques, assume_unique=True) # Remove to_replace values that are not present in da. If no overlap # exists between to_replace and the values in da, just return a copy. if not valid.any(): return da.copy() to_replace = to_replace[valid] value = value[valid] replacement = np.zeros_like(uniques) replacement[np.searchsorted(uniques, to_replace)] = value out = flat.copy() out[replaceable] = replacement[index[replaceable]] return da.copy(data=out.reshape(da.shape)) ``` I think it should be easy to reuse, e.g. by letting it operate on the numpy arrays so that apply_ufunc will work. The primary issue is whether the values can be sorted; if they cannot, the dict lookup might be an okay fallback? I've had a peek at the pandas implementation, but didn't become much wiser. Anyway, for your consideration!
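For the non-sortable case, a rough sketch of what a fallback could look like (not a proposed API, just dispatching between the `custom_replace` and `dict_replace` helpers above; `np.unique` raises a `TypeError` when the values cannot be ordered):

```python
import numpy as np

def replace_with_fallback(da, to_replace, value):
    # Sketch only: prefer the sort-based implementation, which relies on
    # np.unique and therefore on the values being orderable; fall back to
    # the slower dict lookup when sorting fails (e.g. mixed object dtypes).
    try:
        return custom_replace(da, np.asarray(to_replace), np.asarray(value))
    except TypeError:
        return dict_replace(da, to_replace, value)
```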
I'd be happy to submit a PR.","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/6377/reactions"", ""total_count"": 9, ""+1"": 9, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,issue 445745470,MDExOlB1bGxSZXF1ZXN0MjgwMTIwNzIz,2972,ENH: Preserve monotonic descending index order when merging,13662783,open,0,,,4,2019-05-18T19:12:11Z,2022-06-09T14:50:17Z,,CONTRIBUTOR,,0,pydata/xarray/pulls/2972,"* Addresses GH2947 * When indexes were joined in a dataset merge, they would always get sorted in ascending order. This is awkward for geospatial grids, which are nearly always descending in the ""y"" coordinate. * This also caused an inconsistency: when a merge is called on datasets with identical descending indexes, the resulting index is descending. When a merge is called with non-identical descending indexes, the resulting index is ascending. * When indexes are mixed ascending and descending, or non-monotonic, the resulting index is still sorted in ascending order. - [x] Closes #2947 - [x] Tests added - [ ] Fully documented, including `whats-new.rst` for all changes and `api.rst` for new API ## Comments I was doing some work and I kept running into the issue described in #2947, so I had a try at a fix. It was somewhat of a hassle to understand the issue because I kept running into seeming inconsistencies. This is caused by the fact that the joiner doesn't sort with a single index: ```python def _get_joiner(join): if join == 'outer': return functools.partial(functools.reduce, operator.or_) ``` That makes sense, since I'm guessing `pandas.Index.union` isn't getting called at all. (I still find the workings of `functools` a little hard to infer.) I also noticed that an outer join gets called with e.g. an `.isel` operation, even though there's only one index (so there's not really anything to join). However, skipping the join completely in that case makes several tests fail since dimension labels end up missing (I guess the `joiner` call takes care of it). It's just checking for the specific case now, but it feels like a very specific issue anyway... The merge behavior is slightly different now, which is reflected in the updated test outcomes in `test_dataset.py`. These tests were turning monotonic decreasing indexes into an increasing index; now the decreasing order is maintained.","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/2972/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,pull 620468256,MDU6SXNzdWU2MjA0NjgyNTY=,4076,Zarr ZipStore versus DirectoryStore: ZipStore requires .close(),13662783,open,0,,,4,2020-05-18T19:58:21Z,2022-04-28T22:37:48Z,,CONTRIBUTOR,,,," I was saving my dataset into a ZipStore -- apparently successfully -- but then I couldn't reopen it. The issue appears to be that a regular DirectoryStore behaves a little differently: it doesn't need to be closed, while a ZipStore does. (I'm not sure how this relates to #2586, the remarks there don't appear to be applicable anymore.)
#### MCVE Code Sample This errors: ```python import xarray as xr import zarr # works as expected ds = xr.Dataset({'foo': [2,3,4], 'bar': ('x', [1, 2]), 'baz': 3.14}) ds.to_zarr(zarr.DirectoryStore(""test.zarr"")) print(xr.open_zarr(zarr.DirectoryStore(""test.zarr""))) # error with ValueError ""group not found at path '' ds.to_zarr(zarr.ZipStore(""test.zip"")) print(xr.open_zarr(zarr.ZipStore(""test.zip""))) ``` Calling close, or using `with` does the trick: ```python store = zarr.ZipStore(""test2.zip"") ds.to_zarr(store) store.close() print(xr.open_zarr(zarr.ZipStore(""test2.zip""))) with zarr.ZipStore(""test3.zip"") as store: ds.to_zarr(store) print(xr.open_zarr(zarr.ZipStore(""test3.zip""))) ``` #### Expected Output I think it would be preferable to close the ZipStore in this case. But I might be missing something? #### Problem Description Because `to_zarr` works in this situation with a DirectoryStore, it's easy to assume a ZipStore will work similarly. However, I couldn't get it to read my data back in this case. #### Versions
Output of xr.show_versions() INSTALLED VERSIONS ------------------ commit: None python: 3.7.6 | packaged by conda-forge | (default, Jan 7 2020, 21:48:41) [MSC v.1916 64 bit (AMD64)] python-bits: 64 OS: Windows OS-release: 10 machine: AMD64 processor: Intel64 Family 6 Model 158 Stepping 9, GenuineIntel byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: None.None libhdf5: 1.10.5 libnetcdf: 4.7.3 xarray: 0.15.2.dev41+g8415eefa.d20200419 pandas: 0.25.3 numpy: 1.17.5 scipy: 1.3.1 netCDF4: 1.5.3 pydap: None h5netcdf: None h5py: 2.10.0 Nio: None zarr: 2.4.0 cftime: 1.0.4.2 nc_time_axis: None PseudoNetCDF: None rasterio: 1.1.2 cfgrib: None iris: None bottleneck: 1.3.2 dask: 2.14.0+23.gbea4c9a2 distributed: 2.14.0 matplotlib: 3.1.2 cartopy: None seaborn: 0.10.0 numbagg: None pint: None setuptools: 46.1.3.post20200325 pip: 20.0.2 conda: None pytest: 5.3.4 IPython: 7.13.0
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/4076/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,issue 386596872,MDU6SXNzdWUzODY1OTY4NzI=,2587,"DataArray constructor still coerces to np.datetime64[ns], not cftime in 0.11.0",13662783,open,0,,,3,2018-12-02T20:34:36Z,2022-04-18T16:06:12Z,,CONTRIBUTOR,,,,"#### Code Sample ```python import xarray as xr import numpy as np from datetime import datetime time = [np.datetime64(datetime.strptime(""10000101"", ""%Y%m%d""))] print(time[0]) print(np.dtype(time[0])) da = xr.DataArray(time, (""time"",), {""time"":time}) print(da) ``` Results in: ``` 1000-01-01T00:00:00.000000 datetime64[us] array(['2169-02-08T23:09:07.419103232'], dtype='datetime64[ns]') Coordinates: * time (time) datetime64[ns] 2169-02-08T23:09:07.419103232 ``` #### Problem description I was happy to see `cftime` as default in the release notes for 0.11.0: > Xarray will now always use `cftime.datetime` objects, rather than by default trying to coerce them into `np.datetime64[ns]` objects. A `CFTimeIndex` will be used for indexing along time coordinates in these cases. However, it seems that the DataArray constructor does not use `cftime` (yet?), and coerces to `np.datetime64[ns]` here: https://github.com/pydata/xarray/blob/0d6056e8816e3d367a64f36c7f1a5c4e1ce4ed4e/xarray/core/variable.py#L183-L189 #### Expected Output I think I'd expect `cftime.datetime` in this case as well. Some coercion happens anyway as pandas timestamps are turned into `np.datetime64[ns]`. (But perhaps this was already on your radar, and am I just a little too eager!) #### Output of ``xr.show_versions()``
``` INSTALLED VERSIONS ------------------ commit: None python: 3.6.5.final.0 python-bits: 64 OS: Windows OS-release: 7 machine: AMD64 processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel byteorder: little LC_ALL: None LANG: en LOCALE: None.None xarray: 0.11.0 pandas: 0.23.3 numpy: 1.15.3 scipy: 1.1.0 netCDF4: 1.3.1 h5netcdf: 0.6.1 h5py: 2.8.0 Nio: None zarr: None cftime: 1.0.0 PseudonetCDF: None rasterio: 1.0.0 iris: None bottleneck: 1.2.1 cyordereddict: None dask: 0.19.2 distributed: 1.23.2 matplotlib: 2.2.2 cartopy: 0.16.0 seaborn: 0.9.0 setuptools: 40.5.0 pip: 18.1 conda: None pytest: 3.6.3 IPython: 6.4.0 sphinx: 1.7.5 ```
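As an aside, a possible workaround sketch (not the constructor fix asked for above, and untested against 0.11 specifically): building the time coordinate with `xr.cftime_range` gives a `CFTimeIndex` that is not coerced to `np.datetime64[ns]`:

```python
import xarray as xr

# Workaround sketch, not the requested constructor behaviour: create the
# times as cftime objects up front, so there is no datetime64[ns] coercion
# (and thus no out-of-range overflow for year 1000).
time = xr.cftime_range('1000-01-01', periods=1, calendar='proleptic_gregorian')
da = xr.DataArray(range(len(time)), coords={'time': time}, dims=('time',))
print(da.indexes['time'])  # a CFTimeIndex holding cftime datetime objects
```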
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/2587/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,issue 441341340,MDU6SXNzdWU0NDEzNDEzNDA=,2947,xr.merge always sorts indexes ascending,13662783,open,0,,,2,2019-05-07T17:06:06Z,2019-05-07T21:07:26Z,,CONTRIBUTOR,,,,"#### Code Sample, a copy-pastable example if possible ```python import xarray as xr import numpy as np nrow, ncol = (4, 5) dx, dy = (1.0, -1.0) xmins = (0.0, 3.0, 3.0, 0.0) xmaxs = (5.0, 8.0, 8.0, 5.0) ymins = (0.0, 2.0, 0.0, 2.0) ymaxs = (4.0, 6.0, 4.0, 6.0) data = np.ones((nrow, ncol), dtype=np.float64) das = [] for xmin, xmax, ymin, ymax in zip(xmins, xmaxs, ymins, ymaxs): kwargs = dict( name=""example"", dims=(""y"", ""x""), coords={""y"": np.arange(ymax, ymin, dy), ""x"": np.arange(xmin, xmax, dx)}, ) das.append(xr.DataArray(data, **kwargs)) xr.merge(das) # This won't flip the coordinate: xr.merge([das[0])) ``` #### Problem description Let's say I have a number of geospatial grids that I'd like to merge (for example, loaded with `xr.open_rasterio`). To quote [https://www.perrygeo.com/python-affine-transforms.html](url) > The typical geospatial coordinate reference system is defined on a cartesian plane with the 0,0 origin in the bottom left and X and Y increasing as you go up and to the right. But raster data, coming from its image processing origins, uses a different referencing system to access pixels. We refer to rows and columns with the 0,0 origin in the upper left and rows increase and you move down while the columns increase as you go right. Still a cartesian plane but not the same one. `xr.merge` will alway return the result with ascending coordinates, which creates some issues / confusion later on if you try to write it back to a GDAL format, for example (I've been scratching my head for some time looking at upside-down .tifs). #### Expected Output I think the expected output for these geospatial grids is that; if you provide only DataArrays with positive dx, negative dy; that the merged result comes out with a positive dx and a negative dy as well. When the DataArrays to merge are mixed in coordinate direction (some with ascending, some with descending coordinate values), defaulting to an ascending sort seems sensible. #### A suggestion I saw that the sort is occurring [here, in pandas](https://github.com/pandas-dev/pandas/blob/2bbc0c2c198374546408cb15fff447c1e306f99f/pandas/core/indexes/base.py#L2260-L2265); and that there's a `is_monotonic_decreasing` property in [pandas.core.indexes.base.Index](https://github.com/pandas-dev/pandas/blob/2bbc0c2c198374546408cb15fff447c1e306f99f/pandas/core/indexes/base.py#L1601) I think this could work (it solves my issue at least), in [xarray.core.alignment](https://github.com/pydata/xarray/blob/5aaa6547cd14a713f89dfc7c22643d86fce87916/xarray/core/alignment.py#L125) ```python index = joiner(matching_indexes) if all( (matching_index.is_monotonic_decreasing for matching_index in matching_indexes) ): index = index[::-1] joined_indexes[dim] = index ``` But I lack the knowledge to say whether this plays nice in all cases. And does `index[::-1]` return a view or a copy? (And does it matter?) 
For reference, this is what it currently looks like: ```python if (any(not matching_indexes[0].equals(other) for other in matching_indexes[1:]) or dim in unlabeled_dim_sizes): if join == 'exact': raise ValueError( 'indexes along dimension {!r} are not equal' .format(dim)) index = joiner(matching_indexes) joined_indexes[dim] = index else: index = matching_indexes[0] ``` It's also worth highlighting that the `else` branch causes, arguably, some inconsistency: if the indexes are equal, the joiner is never called, so a descending index passes through unsorted. ","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/2947/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,issue