# netCDF4 indexing: `reindex_like` is very slow if dataset not loaded into memory

*pydata/xarray issue #8945 · opened 2024-04-15 · closed as completed 2024-04-23 · 4 comments*

### What is your issue?

Reindexing a dataset without first loading it into memory is very slow: about 1000x slower than reindexing after loading it into memory. Here is a minimal working example:

```python
import numpy as np
import pandas as pd
import xarray as xr

times = 100
nlat = 200
nlon = 300

fp = xr.Dataset(
    {"fp": (["time", "lat", "lon"], np.arange(times * nlat * nlon).reshape(times, nlat, nlon))},
    coords={
        "time": pd.date_range(start="2019-01-01T02:00:00", periods=times, freq="1H"),
        "lat": np.arange(nlat),
        "lon": np.arange(nlon),
    },
)
# the coordinates of flux are jittered slightly relative to those of fp
flux = xr.Dataset(
    {"flux": (["time", "lat", "lon"], np.arange(nlat * nlon).reshape(1, nlat, nlon))},
    coords={
        "time": [pd.to_datetime("2019-01-01")],
        "lat": np.arange(nlat) + np.random.normal(0.0, 0.01, nlat),
        "lon": np.arange(nlon) + np.random.normal(0.0, 0.01, nlon),
    },
)

fp.to_netcdf("combine_datasets_tests/fp.nc")
flux.to_netcdf("combine_datasets_tests/flux.nc")

fp1 = xr.open_dataset("combine_datasets_tests/fp.nc")
flux1 = xr.open_dataset("combine_datasets_tests/flux.nc")
```

Then

```python
flux1 = flux1.reindex_like(fp1, method="ffill", tolerance=None)
```

takes over a minute, while

```python
flux1 = flux1.load().reindex_like(fp1, method="ffill", tolerance=None)
```

is almost instantaneous (timeit says 91 ms, including opening the dataset; I'm not sure whether caching influences this).

Profiling the "reindex without load" cell:

```
804936 function calls (804622 primitive calls) in 93.285 seconds

Ordered by: internal time

       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
            1   92.211   92.211   93.191   93.191 {built-in method _operator.getitem}
            1    0.289    0.289    0.980    0.980 utils.py:81(_StartCountStride)
            6    0.239    0.040    0.613    0.102 shape_base.py:267(apply_along_axis)
        72656    0.109    0.000    0.109    0.000 utils.py:429()
        72656    0.085    0.000    0.136    0.000 utils.py:430()
        72661    0.051    0.000    0.051    0.000 {built-in method numpy.arange}
       145318    0.048    0.000    0.115    0.000 shape_base.py:370()
            2    0.045    0.023    0.046    0.023 indexing.py:1334(__getitem__)
            6    0.044    0.007    0.044    0.007 numeric.py:136(ones)
       145318    0.044    0.000    0.067    0.000 index_tricks.py:690(__next__)
           14    0.033    0.002    0.033    0.002 {built-in method numpy.empty}
145333/145325    0.023    0.000    0.023    0.000 {built-in method builtins.next}
            1    0.020    0.020   93.275   93.275 duck_array_ops.py:317(where)
           21    0.018    0.001    0.018    0.001 {method 'astype' of 'numpy.ndarray' objects}
       145330    0.013    0.000    0.013    0.000 {built-in method numpy.asanyarray}
            1    0.002    0.002    0.002    0.002 {built-in method _functools.reduce}
            1    0.002    0.002   93.279   93.279 variable.py:821(_getitem_with_mask)
           18    0.001    0.000    0.001    0.000 {built-in method numpy.zeros}
            1    0.000    0.000    0.000    0.000 file_manager.py:226(close)
```

The `getitem` call at the top comes from `xarray.backends.netCDF4_`, line 114. Because of the jittered coordinates in `flux`, I'm assuming that the index passed to netCDF4 is not a sequence of consecutive, strictly monotonic integers (0, 1, 2, 3, ...). This has caused problems in the past: https://github.com/Unidata/netcdf4-python/issues/680.
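In the meantime, loading the smaller dataset before reindexing avoids the slow path entirely. A minimal sketch of that workaround, assuming `flux.nc` comfortably fits in memory:

```python
# Workaround sketch: pull the small dataset into memory first, so that
# reindex_like operates on in-memory NumPy arrays instead of issuing a
# fancy-indexed read through the netCDF4 backend.
flux1 = xr.open_dataset("combine_datasets_tests/flux.nc").load()
flux1 = flux1.reindex_like(fp1, method="ffill", tolerance=None)
```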
In my venv, netCDF4 was installed from a wheel with the following versions:

```
netcdf4-python version: 1.6.5
HDF5 lib version: 1.12.2
netcdf lib version: 4.9.3-development
```

This is with xarray 2023.12.0, numpy 1.26, and pandas 1.5.3. I will try to investigate more and hopefully simplify the example. (I can't quite justify spending more time on it at work, because this is only to tag a version that was used in some experiments before we switch to zarr as a backend, so hopefully it won't be relevant at that point.)

---

# `rename_vars` followed by `swap_dims` and `merge` causes swapped dim to reappear

*pydata/xarray issue #8646 · opened 2024-01-23 · open · 16 comments*

### What happened?

I wanted to rename a dimension coordinate in two datasets before merging them: `ds = ds.rename_vars(y="z").swap_dims(y="z")`, and likewise for the second dataset. After merging the datasets, the merged result has the dimension "y" in addition to "z". Swapping the order of `rename_vars` and `swap_dims` before merging works in the sense that "y" does not reappear, but then "z" is listed as a non-dimension coordinate. Doing `rename_vars` followed by `swap_dims` *after* merging gives the result I wanted, but if I merge again, the same issue occurs. My current solution is to rename dimension coordinates only just before saving to netCDF (a possible alternative is sketched at the end of this report).

### What did you expect to happen?

Merging two datasets with the same coordinates and dimensions (but different data variables) should result in a single dataset with all of the data variables from the two datasets and exactly the same coordinates and dimensions.
### Minimal Complete Verifiable Example

```Python
import numpy as np
import xarray as xr
from xarray.core.utils import Frozen

A = np.arange(4).reshape((2, 2))
B = np.arange(4).reshape((2, 2)) + 4

ds1 = xr.Dataset(
    {"A": (["x", "y"], A), "B": (["x", "y"], B)},
    coords={"x": ("x", [1, 2]), "y": ("y", [1, 2])},
)
ds2 = xr.Dataset(
    {"C": (["x", "y"], A), "D": (["x", "y"], B)},
    coords={"x": ("x", [1, 2]), "y": ("y", [1, 2])},
)
assert ds1.dims == Frozen({"x": 2, "y": 2})
assert ds2.dims == Frozen({"x": 2, "y": 2})

ds1_swap = ds1.rename_vars(y="z").swap_dims(y="z")
ds2_swap = ds2.rename_vars(y="z").swap_dims(y="z")
assert ds1_swap.dims == Frozen({"x": 2, "z": 2})
assert ds2_swap.dims == Frozen({"x": 2, "z": 2})

# merging makes the dimension "y" reappear (I would expect this assertion to fail):
assert xr.merge([ds1_swap, ds2_swap]).dims == Frozen({"x": 2, "z": 2, "y": 2})

# renaming and swapping after the merge causes issues later:
ds12 = xr.merge([ds1, ds2]).rename_vars(y="z").swap_dims(y="z")
ds3 = xr.Dataset(
    {"E": (["x", "z"], A), "F": (["x", "z"], B)},
    coords={"x": ("x", [1, 2]), "z": ("z", [1, 2])},
)

# ds12 and ds3 have the same dimensions:
assert ds12.dims == Frozen({"x": 2, "z": 2})
assert ds3.dims == Frozen({"x": 2, "z": 2})

# but merging brings back "y":
ds123 = xr.merge([ds12, ds3])
assert ds123.dims == Frozen({"x": 2, "z": 2, "y": 2})

# as do other operations:
ds12_as = ds12.assign_coords(x=(ds12.x + 1))
assert ds12_as.sizes == Frozen({"x": 2, "z": 2, "y": 2})
```

### MVCE confirmation

- [X] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
- [X] Complete example — the example is self-contained, including all data and the text of any traceback.
- [X] Verifiable example — the example copy & pastes into an IPython prompt or [Binder notebook](https://mybinder.org/v2/gh/pydata/xarray/main?urlpath=lab/tree/doc/examples/blank_template.ipynb), returning the result.
- [X] New issue — a search of GitHub Issues suggests this is not a duplicate.
- [x] Recent environment — the issue occurs with the latest version of xarray and its dependencies.

### Relevant log output

_No response_

### Anything else we need to know?

_No response_

### Environment

The MVCE reproduces the issue in all venvs I've tried, including:
```
INSTALLED VERSIONS
------------------
commit: None
python: 3.10.13 (main, Nov 10 2023, 15:02:19) [GCC 11.4.0]
python-bits: 64
OS: Linux
OS-release: 6.5.0-14-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: ('en_GB', 'UTF-8')
libhdf5: 1.12.2
libnetcdf: 4.9.3-development

xarray: 2023.11.0
pandas: 1.5.3
numpy: 1.26.2
scipy: 1.11.4
netCDF4: 1.6.5
pydap: None
h5netcdf: 1.3.0
h5py: 3.10.0
Nio: None
zarr: None
cftime: 1.6.3
nc_time_axis: 1.4.1
iris: None
bottleneck: None
dask: 2023.12.0
distributed: None
matplotlib: 3.8.2
cartopy: 0.22.0
seaborn: 0.13.0
numbagg: None
fsspec: 2023.12.1
cupy: None
pint: None
sparse: 0.15.1
flox: None
numpy_groupies: None
setuptools: 69.0.2
pip: 23.3.1
conda: None
pytest: 7.4.3
mypy: None
IPython: 8.18.1
sphinx: None

/home/brendan/Documents/inversions/.pymc_venv/lib/python3.10/site-packages/_distutils_hack/__init__.py:33: UserWarning: Setuptools is replacing distutils.
  warnings.warn("Setuptools is replacing distutils.")
```
and:

```
INSTALLED VERSIONS
------------------
commit: None
python: 3.9.7 (default, Sep 16 2021, 13:09:58) [GCC 7.5.0]
python-bits: 64
OS: Linux
OS-release: 3.10.0-1160.81.1.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: ('en_GB', 'UTF-8')
libhdf5: None
libnetcdf: None

xarray: 2024.1.0
pandas: 2.2.0
numpy: 1.26.3
scipy: None
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: None
nc_time_axis: None
iris: None
bottleneck: None
dask: None
distributed: None
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
fsspec: None
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 69.0.3
pip: 23.3.2
conda: None
pytest: None
mypy: None
IPython: 8.18.1
sphinx: None
```
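A possible alternative to renaming only at save time (a sketch based on my assumption that `Dataset.rename` keeps the dimension and its index coordinate consistent, not anything confirmed above) is to rename in a single step rather than combining `rename_vars` with `swap_dims`:

```python
# Possible workaround sketch: Dataset.rename renames the dimension and its
# coordinate together, so no stale "y" index should be left behind for
# merge to resurrect. This is an assumption, not a confirmed fix.
ds1_ren = ds1.rename(y="z")
ds2_ren = ds2.rename(y="z")
merged = xr.merge([ds1_ren, ds2_ren])
assert merged.dims == Frozen({"x": 2, "z": 2})
```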
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/8646/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,issue