xarray issue #5790: combining 2 arrays with xr.merge() causes temporary spike in memory usage ~3x the combined size of the arrays

id: 995207525 · state: closed · author_association: NONE · comments: 6 · created: 2021-09-13 · closed: 2022-04-09

What happened: When combining two arrays of sizes b1 and b2 bytes with xr.merge([da1, da2]), I observe that memory usage temporarily increases by roughly 3*(b1+b2) bytes. Once the operation finishes, the net memory increase is (b1+b2) bytes, which is what I would expect, since that's the size of the merged array I just created. What I did not expect was the temporary spike of ~3*(b1+b2) bytes.

For small arrays this temporary spike in memory is fine, but for larger arrays this means we are essentially limited to combining arrays of total size below 1/3rd of an instance's memory limit. Anything above that and the temporary spike causes the instance to crash.

What you expected to happen: I expected there to be only a memory increase of b1+b2 bytes, the amount needed to store the merged array. I did not expect memory increase to go higher than that during the merge operation.

Minimal Complete Verifiable Example:

```python
import tracemalloc

import numpy as np
import xarray as xr

tracemalloc.start()

print("(current, peak) memory at start:")
print(tracemalloc.get_traced_memory())

# Create the test data (each is a 100 x 100 x 10 array of random floats).
# Their A and B coordinates are completely matching.
# Their C coordinates are completely disjoint.
data1 = np.random.rand(100, 100, 10)
da1 = xr.DataArray(
    data1,
    dims=("A", "B", "C"),
    coords={
        "A": [f"A{i}" for i in range(100)],
        "B": [f"B{i}" for i in range(100)],
        "C": [f"C{i}" for i in range(10)],
    },
)
da1.name = "da"

data2 = np.random.rand(100, 100, 10)
da2 = xr.DataArray(
    data2,
    dims=("A", "B", "C"),
    coords={
        "A": [f"A{i}" for i in range(100)],
        "B": [f"B{i}" for i in range(100)],
        "C": [f"C{i+10}" for i in range(10)],
    },
)
da2.name = "da"

print("(current, peak) memory after creation of arrays to be combined:")
print(tracemalloc.get_traced_memory())
print(f"da1.nbytes = {da1.nbytes}")
print(f"da2.nbytes = {da2.nbytes}")

da_combined = xr.merge([da1, da2]).to_array()

print("(current, peak) memory after merging; note that peak usage is now much higher:")
print(tracemalloc.get_traced_memory())
print(f"da_combined.nbytes = {da_combined.nbytes}")

print(da_combined)
```

Anything else we need to know?:

Interestingly, when I try merging 3 arrays at once (sizes b1, b2, b3), I observe a temporary memory increase of about 5*(b1+b2+b3) bytes. I have a hunch that all arrays get aligned to the final merged coordinate space (which is much bigger) before they are combined, which means that at some point in the middle of the process we have several arrays in memory that have each been inflated to the size of the final output array.
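The alignment hunch above can be checked directly at a small scale. This is a sketch (not a trace of merge internals, which I haven't read): `xr.align(..., join="outer")` reindexes each input onto the union of their coordinates, and with disjoint `C` coordinates each aligned copy already has the full shape of the final output.

```python
import numpy as np
import xarray as xr

# Two small arrays mirroring the MCVE: matching A, disjoint C coordinates.
da1 = xr.DataArray(
    np.random.rand(4, 3),
    dims=("A", "C"),
    coords={"A": [f"A{i}" for i in range(4)], "C": [f"C{i}" for i in range(3)]},
)
da2 = xr.DataArray(
    np.random.rand(4, 3),
    dims=("A", "C"),
    coords={"A": [f"A{i}" for i in range(4)], "C": [f"C{i+3}" for i in range(3)]},
)

# Outer alignment reindexes both inputs onto the union of coordinates,
# filling the missing C positions with NaN.
a1, a2 = xr.align(da1, da2, join="outer")
print(a1.shape, a2.shape)          # both (4, 6): inflated to the merged space
print(da1.nbytes, a1.nbytes)       # each aligned copy is 2x its input here
```

With two inputs, the two inflated intermediates plus the final output account for roughly three output-sized arrays being alive at once, which is consistent with the ~3*(b1+b2) spike observed above.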

If that's the case, it seems like it should be possible to make this operation more efficient by allocating just one inflated array and writing the data from the input arrays into it in place. Or is this an expected and unavoidable behavior of merging? (FWIW, this also affects several other combination methods, presumably because they use merge() under the hood.)
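As a rough sketch of the suggested workaround for this specific disjoint-`C` case (a hand-rolled combine, not xarray's actual merge code): preallocate one output-sized buffer and copy each input into its slice in place, so only a single output-sized allocation ever exists.

```python
import numpy as np
import xarray as xr

# Small stand-ins for the MCVE arrays: matching A, disjoint C coordinates.
da1 = xr.DataArray(
    np.random.rand(2, 3),
    dims=("A", "C"),
    coords={"A": ["A0", "A1"], "C": ["C0", "C1", "C2"]},
)
da2 = xr.DataArray(
    np.random.rand(2, 3),
    dims=("A", "C"),
    coords={"A": ["A0", "A1"], "C": ["C3", "C4", "C5"]},
)

out = np.empty((2, 6))       # the one output-sized allocation
out[:, :3] = da1.values      # copy each input into its slice in place
out[:, 3:] = da2.values
combined = xr.DataArray(
    out,
    dims=("A", "C"),
    coords={
        "A": da1["A"].values,
        "C": np.concatenate([da1["C"].values, da2["C"].values]),
    },
)
print(combined.shape)  # (2, 6)
```

This only works because the inputs tile the output exactly along one dimension; a general merge also has to handle overlapping and partially missing coordinates, which is presumably where the extra intermediate copies come from.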

Environment:

Output of <tt>xr.show_versions()</tt>

INSTALLED VERSIONS
------------------
commit: None
python: 3.9.6 | packaged by conda-forge | (default, Jul 11 2021, 03:39:48) [GCC 9.3.0]
python-bits: 64
OS: Linux
OS-release: 4.19.121-linuxkit
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: None
LOCALE: en_US.UTF-8
libhdf5: 1.10.6
libnetcdf: 4.8.0

xarray: 0.17.0
pandas: 1.2.3
numpy: 1.19.5
scipy: 1.6.0
netCDF4: 1.5.6
pydap: None
h5netcdf: 0.11.0
h5py: 3.3.0
Nio: None
zarr: None
cftime: 1.5.0
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: None
distributed: None
matplotlib: 3.4.2
cartopy: None
seaborn: None
numbagg: None
pint: 0.16.1
setuptools: 57.4.0
pip: 21.2.4
conda: None
pytest: 6.2.2
IPython: 7.23.1
sphinx: 3.5.2
