issues: 560860376


id	node_id	number	title	user	state	locked	assignee	milestone	comments	created_at	updated_at	closed_at	author_association	active_lock_reason	draft	pull_request	body	reactions	performed_via_github_app	state_reason	repo	type
560860376	MDU6SXNzdWU1NjA4NjAzNzY=	3755	Performance problem when doing computation between two arrays with discontinuous indexes	33070178	open	0			0	2020-02-06T08:44:40Z	2020-12-03T18:16:02Z		NONE				MCVE Code Sample ```python import xarray as xr import numpy as np Creating array ds = xr.Dataset() ds["longitude"] = np.arange(0.1,2000.1,1) ds["latitude"] = range(3000) ds["step"] = range(50) ds["field"] = (("step","latitude","longitude"),np.random.randn(50,3000,2000)) ds.to_netcdf("big_array.nc") Create another array ds = xr.Dataset() ds["longitude"] = np.arange(500.1,600.1,1) # Coordinates are a continuous subset of the first array ds["latitude"] = np.arange(510,660) ds["id"] = range(50) ds["field"] = (("longitude","latitude","id"),np.random.randn(100,150,50)) ds.to_netcdf("slicing.nc") Create another array with "discontinuity" in longitude dimension ds = xr.Dataset() ds["longitude"] = list(np.arange(500.1,598.1,1)) +[622.1,640.1] ds["latitude"] = range(510,660) ds["id"] = range(10) ds["mask"] = (("longitude","latitude","id"),np.random.randn(100,150,10)) ds.to_netcdf("no_slicing.nc") Load the three arrays da = xr.open_dataset("big_array.nc") db = xr.open_dataset("slicing.nc").isel(id=0) dc = xr.open_dataarray("no_slicing.nc").isel(id=0) %timeit da*db 32.3 ms ± 5.02 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) %timeit da*dc 2.13 s ± 15.4 ms per loop (mean ± std. dev. 
of 7 runs, 1 loop each) def slicing_operation(dc,da): """ Slicing when knowing that dc is a subpart of da """ import operator min_lat = np.max([dc.latitude.values.min(),da.latitude.values.min()]) max_lat = np.min([dc.latitude.values.max(),da.latitude.values.max()]) index_lat_field = operator.and_(da.latitude >= min_lat,da.latitude <= max_lat) min_lon = np.max([dc.longitude.values.min(),da.longitude.values.min()]) max_lon = np.min([dc.longitude.values.max(),da.longitude.values.max()]) index_lon_field = operator.and_(da.longitude >= min_lon,da.longitude <= max_lon) # Extending dc such that it covers all longitude of da # Creating an empty array. dempty = xr.Dataset() dempty["longitude"] = da.longitude.isel(longitude=index_lon_field).values dempty["latitude"] = da.latitude.isel(latitude=index_lat_field).values dc_ext = dc.broadcast_like(dempty) # Performing multiplication and align res,_ = xr.align((dc_ext * da),dc) return res %timeit slicing_operation(dc,da) 43.9 ms ± 1.94 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) Checking if we get the same results res_0 = dc * da res = slicing_operation(dc,da) print(abs(res_0 - res).max()) ``` Problem Description A performance problem occurred when performing an operation between da and dc. Computing `da * db` takes ~32 ms. While db and dc have the same shape, the operation `dc * da` takes ~2.13 s (roughly 60 times longer). This seems linked to the fact that dc includes discontinuous indexes of da. Indeed, when extending dc (such that its longitudes are continuous when compared to da), the computation takes ~44 ms. It seems to me that when doing `da * dc`, the da dataset is fully loaded in memory first, while only the part involved in the computation is loaded when doing `da * db`. Indeed, when doing `%time da.load()` we get the following result: ``` CPU times: user 0 ns, sys: 1.88 s, total: 1.88 s Wall time: 1.89 s ``` On top of that, `da * db` scales with the number of `id` selected, while this is not the case for `da * dc`. 
Output of `xr.show_versions()` INSTALLED VERSIONS ------------------ commit: None python: 3.7.6 | packaged by conda-forge | (default, Jan 7 2020, 22:33:48) [GCC 7.3.0] python-bits: 64 OS: Linux OS-release: 3.10.0-957.27.2.el7.x86_64 machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: en_US.UTF-8 libhdf5: 1.10.5 libnetcdf: 4.7.3 xarray: 0.14.1 pandas: 0.25.3 numpy: 1.17.4 scipy: 1.4.1 netCDF4: 1.5.3 pydap: None h5netcdf: None h5py: 2.10.0 Nio: None zarr: None cftime: 1.0.4.2 nc_time_axis: None PseudoNetCDF: None rasterio: None cfgrib: 0.9.7.6 iris: None bottleneck: 1.3.1 dask: 2.9.1 distributed: 2.9.1 matplotlib: 3.1.2 cartopy: None seaborn: 0.9.0 numbagg: None setuptools: 44.0.0.post20200106 pip: 19.3.1 conda: None pytest: 5.3.2 IPython: 7.11.1 sphinx: 2.3.1	{ "url": "https://api.github.com/repos/pydata/xarray/issues/3755/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }			13221727	issue
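
The workaround in `slicing_operation` amounts to reading one contiguous window of the large array that covers the small array's coordinate range, even when the small array's own coordinates have gaps. The sketch below illustrates only that idea with the standard library. The helper `bounding_window` is hypothetical and is not part of xarray, the coordinate is assumed to be sorted in ascending order, and integer coordinates are used in place of the issue's float longitudes to avoid floating-point comparison concerns.

```python
from bisect import bisect_left, bisect_right


def bounding_window(big_coord, small_coord):
    """Return a slice of big_coord covering [min(small_coord), max(small_coord)].

    Hypothetical helper for illustration; assumes big_coord is sorted ascending.
    """
    start = bisect_left(big_coord, min(small_coord))
    stop = bisect_right(big_coord, max(small_coord))
    return slice(start, stop)


# Integer stand-ins for the longitudes used in the issue.
big_lon = list(range(2000))
contiguous = list(range(500, 600))                   # like db
discontinuous = list(range(500, 598)) + [622, 640]   # like dc

win_c = bounding_window(big_lon, contiguous)
win_d = bounding_window(big_lon, discontinuous)

# Both cases reduce to a single contiguous read of the large coordinate;
# the discontinuous one is simply a slightly wider window.
print(win_c.stop - win_c.start, win_d.stop - win_d.start)
```

Under these assumptions, such a slice could be passed to `da.isel(longitude=...)` before the multiplication, so that only the window is involved rather than the full array. Whether this matches what xarray does internally for the contiguous case is not established by the issue.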

Links from other tables

1 row from issues_id in issues_labels
0 rows from issue in issue_comments