issues: 2243685081

id	node_id	number	title	user	state	locked	assignee	milestone	comments	created_at	updated_at	closed_at	author_association	active_lock_reason	draft	pull_request	body	reactions	performed_via_github_app	state_reason	repo	type
2243685081	I_kwDOAMm_X86Fu-rZ	8945	netCDF4 indexing: `reindex_like` is very slow if dataset not loaded into memory	11130776	closed	0			4	2024-04-15T13:26:08Z	2024-04-23T21:49:28Z	2024-04-23T15:33:36Z	NONE				What is your issue?

Reindexing a dataset without loading it into memory seems to be very slow (about 1000x slower than reindexing after loading into memory).

Here is a minimal working example:

```
import os

import numpy as np
import pandas as pd
import xarray as xr

times = 100
nlat = 200
nlon = 300

# Regular grid: integer lat/lon, hourly time steps.
fp = xr.Dataset(
    {"fp": (["time", "lat", "lon"], np.arange(times * nlat * nlon).reshape(times, nlat, nlon))},
    coords={
        "time": pd.date_range(start="2019-01-01T02:00:00", periods=times, freq="1H"),
        "lat": np.arange(nlat),
        "lon": np.arange(nlon),
    },
)

# Single time step on a slightly jittered lat/lon grid.
flux = xr.Dataset(
    {"flux": (["time", "lat", "lon"], np.arange(nlat * nlon).reshape(1, nlat, nlon))},
    coords={
        "time": [pd.to_datetime("2019-01-01")],
        "lat": np.arange(nlat) + np.random.normal(0.0, 0.01, nlat),
        "lon": np.arange(nlon) + np.random.normal(0.0, 0.01, nlon),
    },
)

# The output directory must exist before writing.
os.makedirs("combine_datasets_tests", exist_ok=True)
fp.to_netcdf("combine_datasets_tests/fp.nc")
flux.to_netcdf("combine_datasets_tests/flux.nc")

fp1 = xr.open_dataset("combine_datasets_tests/fp.nc")
flux1 = xr.open_dataset("combine_datasets_tests/flux.nc")
```

Then `flux1 = flux1.reindex_like(fp1, method="ffill", tolerance=None)` takes over a minute, while `flux1 = flux1.load().reindex_like(fp1, method="ffill", tolerance=None)` is almost instantaneous (timeit says 91 ms, including opening the dataset... I'm not sure if caching is influencing this).
Profiling the "reindex without load" cell:

```
         804936 function calls (804622 primitive calls) in 93.285 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1   92.211   92.211   93.191   93.191 {built-in method _operator.getitem}
        1    0.289    0.289    0.980    0.980 utils.py:81(_StartCountStride)
        6    0.239    0.040    0.613    0.102 shape_base.py:267(apply_along_axis)
    72656    0.109    0.000    0.109    0.000 utils.py:429(<lambda>)
    72656    0.085    0.000    0.136    0.000 utils.py:430(<lambda>)
    72661    0.051    0.000    0.051    0.000 {built-in method numpy.arange}
   145318    0.048    0.000    0.115    0.000 shape_base.py:370(<genexpr>)
        2    0.045    0.023    0.046    0.023 indexing.py:1334(getitem)
        6    0.044    0.007    0.044    0.007 numeric.py:136(ones)
   145318    0.044    0.000    0.067    0.000 index_tricks.py:690(next)
       14    0.033    0.002    0.033    0.002 {built-in method numpy.empty}
145333/145325    0.023    0.000    0.023    0.000 {built-in method builtins.next}
        1    0.020    0.020   93.275   93.275 duck_array_ops.py:317(where)
       21    0.018    0.001    0.018    0.001 {method 'astype' of 'numpy.ndarray' objects}
   145330    0.013    0.000    0.013    0.000 {built-in method numpy.asanyarray}
        1    0.002    0.002    0.002    0.002 {built-in method _functools.reduce}
        1    0.002    0.002   93.279   93.279 variable.py:821(_getitem_with_mask)
       18    0.001    0.000    0.001    0.000 {built-in method numpy.zeros}
        1    0.000    0.000    0.000    0.000 file_manager.py:226(close)
```

The `getitem` call at the top is from `xarray.backends.netCDF4_.py`, line 114. Because of the jittered coordinates in `flux`, I'm assuming that the index passed to netCDF4 is not consecutive/strictly monotonic integers (0, 1, 2, 3, ...). In the past, this has caused issues: https://github.com/Unidata/netcdf4-python/issues/680.

In my venv, netCDF4 was installed from a wheel with the following versions: `netcdf4-python version: 1.6.5`, `HDF5 lib version: 1.12.2`, `netcdf lib version: 4.9.3-development`. This is with xarray version 2023.12.0, numpy 1.26, and pandas 1.5.3.

I will try to investigate more and hopefully simplify the example.
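
To illustrate the guess about non-consecutive indices, here is a minimal sketch in plain pandas, with no netCDF4 file involved. It assumes that reindexing with `method="ffill"` resolves labels roughly the way `pandas.Index.get_indexer` does; that xarray goes through this path internally is an assumption here, not something the profile confirms. The coordinate values and the helper `is_consecutive` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical, hand-picked "jittered" coordinate (stands in for flux's lat/lon).
source = pd.Index([0.01, 0.99, 2.01, 2.99])
# Regular grid (stands in for fp's lat/lon).
target = np.arange(4.0)

# Forward-fill lookup: for each target label, the position of the last
# source label that is <= the target, or -1 when no such label exists.
indexer = source.get_indexer(target, method="ffill")


def is_consecutive(idx: np.ndarray) -> bool:
    """Hypothetical helper: True if idx is non-negative and increases by exactly 1."""
    return bool(np.all(idx >= 0) and np.all(np.diff(idx) == 1))


print(indexer.tolist())         # [-1, 1, 1, 3]
print(is_consecutive(indexer))  # False
```

The `-1` marks a target label with no match, which would have to be masked afterwards, and the repeated `1` means the same source position is requested twice, so an indexer like this cannot be expressed as a single slice.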
(Can't quite justify spending more time on it at work because this is just to tag a version that was used in some experiments before we switch to zarr as a backend, so hopefully it won't be relevant at that point.)	{ "url": "https://api.github.com/repos/pydata/xarray/issues/8945/reactions", "total_count": 1, "+1": 1, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }		completed	13221727	issue
