xarray issue #5790: combining 2 arrays with xr.merge() causes temporary spike in memory usage ~3x the combined size of the arrays

id: 995207525 · state: closed · author_association: NONE · comments: 6 · created: 2021-09-13 · closed: 2022-04-09

What happened: When combining two arrays of sizes b1 and b2 bytes with xr.merge([da1, da2]), I observe that memory usage temporarily increases by roughly 3*(b1+b2) bytes. Once the operation finishes, the net memory increase is (b1+b2) bytes, which is what I would expect, since that's the size of the merged array I just created. What I did not expect was the temporary spike of ~3*(b1+b2) bytes.

For small arrays this temporary spike in memory is fine, but for larger arrays this means we are essentially limited to combining arrays of total size below 1/3rd of an instance's memory limit. Anything above that and the temporary spike causes the instance to crash.

What you expected to happen: I expected there to be only a memory increase of b1+b2 bytes, the amount needed to store the merged array. I did not expect memory increase to go higher than that during the merge operation.

Minimal Complete Verifiable Example:

```python
import tracemalloc

import numpy as np
import xarray as xr

tracemalloc.start()

print("(current, peak) memory at start:")
print(tracemalloc.get_traced_memory())

# Create the test data (each is a 100 x 100 x 10 array of random floats).
# Their A and B coordinates are completely matching.
# Their C coordinates are completely disjoint.
data1 = np.random.rand(100, 100, 10)
da1 = xr.DataArray(
    data1,
    dims=("A", "B", "C"),
    coords={
        "A": [f"A{i}" for i in range(100)],
        "B": [f"B{i}" for i in range(100)],
        "C": [f"C{i}" for i in range(10)],
    },
)
da1.name = "da"

data2 = np.random.rand(100, 100, 10)
da2 = xr.DataArray(
    data2,
    dims=("A", "B", "C"),
    coords={
        "A": [f"A{i}" for i in range(100)],
        "B": [f"B{i}" for i in range(100)],
        "C": [f"C{i+10}" for i in range(10)],
    },
)
da2.name = "da"

print("(current, peak) memory after creation of arrays to be combined:")
print(tracemalloc.get_traced_memory())
print(f"da1.nbytes = {da1.nbytes}")
print(f"da2.nbytes = {da2.nbytes}")

da_combined = xr.merge([da1, da2]).to_array()

print("(current, peak) memory after merging; note that peak usage is now much higher:")
print(tracemalloc.get_traced_memory())
print(f"da_combined.nbytes = {da_combined.nbytes}")

print(da_combined)
```

Anything else we need to know?:

Interestingly, when I try merging 3 arrays at once (sizes b1, b2, b3), I observe a temporary memory increase of about 5*(b1+b2+b3) bytes. I have a hunch that all arrays get aligned to the final merged coordinate space (which is much bigger) before they are combined, which means that at some point in the middle of the process we have several arrays in memory that have each been inflated to the size of the final output array.
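The alignment hunch above can be checked directly at a small scale. This is a sketch (not a trace of merge internals, which I haven't read): `xr.align(..., join="outer")` reindexes each input onto the union of their coordinates, and with disjoint `C` coordinates each aligned copy already has the full shape of the final output.

```python
import numpy as np
import xarray as xr

# Two small arrays mirroring the MCVE: matching A, disjoint C coordinates.
da1 = xr.DataArray(
    np.random.rand(4, 3),
    dims=("A", "C"),
    coords={"A": [f"A{i}" for i in range(4)], "C": [f"C{i}" for i in range(3)]},
)
da2 = xr.DataArray(
    np.random.rand(4, 3),
    dims=("A", "C"),
    coords={"A": [f"A{i}" for i in range(4)], "C": [f"C{i+3}" for i in range(3)]},
)

# Outer alignment reindexes both inputs onto the union of coordinates,
# filling the missing C positions with NaN.
a1, a2 = xr.align(da1, da2, join="outer")
print(a1.shape, a2.shape)          # both (4, 6): inflated to the merged space
print(da1.nbytes, a1.nbytes)       # each aligned copy is 2x its input here
```

With two inputs, the two inflated intermediates plus the final output account for roughly three output-sized arrays being alive at once, which is consistent with the ~3*(b1+b2) spike observed above.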

If that's the case, it seems like it should be possible to make this operation more efficient by allocating just one inflated array and writing the data from the input arrays into it in place. Or is this an expected and unavoidable behavior of merging? (FWIW, this also affects several other combination methods, presumably because they use merge() under the hood.)
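As a rough sketch of the suggested workaround for this specific disjoint-`C` case (a hand-rolled combine, not xarray's actual merge code): preallocate one output-sized buffer and copy each input into its slice in place, so only a single output-sized allocation ever exists.

```python
import numpy as np
import xarray as xr

# Small stand-ins for the MCVE arrays: matching A, disjoint C coordinates.
da1 = xr.DataArray(
    np.random.rand(2, 3),
    dims=("A", "C"),
    coords={"A": ["A0", "A1"], "C": ["C0", "C1", "C2"]},
)
da2 = xr.DataArray(
    np.random.rand(2, 3),
    dims=("A", "C"),
    coords={"A": ["A0", "A1"], "C": ["C3", "C4", "C5"]},
)

out = np.empty((2, 6))       # the one output-sized allocation
out[:, :3] = da1.values      # copy each input into its slice in place
out[:, 3:] = da2.values
combined = xr.DataArray(
    out,
    dims=("A", "C"),
    coords={
        "A": da1["A"].values,
        "C": np.concatenate([da1["C"].values, da2["C"].values]),
    },
)
print(combined.shape)  # (2, 6)
```

This only works because the inputs tile the output exactly along one dimension; a general merge also has to handle overlapping and partially missing coordinates, which is presumably where the extra intermediate copies come from.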

Environment:

Output of <tt>xr.show_versions()</tt>

INSTALLED VERSIONS
------------------
commit: None
python: 3.9.6 | packaged by conda-forge | (default, Jul 11 2021, 03:39:48) [GCC 9.3.0]
python-bits: 64
OS: Linux
OS-release: 4.19.121-linuxkit
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: None
LOCALE: en_US.UTF-8
libhdf5: 1.10.6
libnetcdf: 4.8.0

xarray: 0.17.0
pandas: 1.2.3
numpy: 1.19.5
scipy: 1.6.0
netCDF4: 1.5.6
pydap: None
h5netcdf: 0.11.0
h5py: 3.3.0
Nio: None
zarr: None
cftime: 1.5.0
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: None
distributed: None
matplotlib: 3.4.2
cartopy: None
seaborn: None
numbagg: None
pint: 0.16.1
setuptools: 57.4.0
pip: 21.2.4
conda: None
pytest: 6.2.2
IPython: 7.23.1
sphinx: 3.5.2
