pydata/xarray issue #4112: Unexpected chunking behavior when using `xr.align` with `join='outer'`

State: open · Author: user 14314623 (CONTRIBUTOR) · Comments: 6 · Created: 2020-05-29 · Updated: 2020-10-06

I just came across some unexpected behavior when using `xr.align` with the option `join='outer'` on two DataArrays that contain dask arrays and have different dimension lengths.

MCVE Code Sample

```python
import numpy as np
import xarray as xr

short_time = xr.cftime_range('2000', periods=12)
long_time = xr.cftime_range('2000', periods=120)

data_short = np.random.rand(len(short_time))
data_long = np.random.rand(len(long_time))

a = xr.DataArray(data_short, dims=['time'], coords={'time': short_time}).chunk({'time': 3})
b = xr.DataArray(data_long, dims=['time'], coords={'time': long_time}).chunk({'time': 3})

a, b = xr.align(a, b, join='outer')
```

Expected Output

As expected, `a` is filled with missing values:

```python
a.plot()
b.plot()
```

But the filled values do not replicate the chunking along the time dimension in `b`. Instead, the padded values end up in one single chunk, which can be substantially larger than the others.

```python
a.data
```

```python
b.data
```
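For readers without the HTML repr, the mismatch is also visible from the `.chunks` attribute. A minimal check, assuming the MCVE above (the tuples in the comments are illustrative, showing what the behavior described here implies):

```python
print(a.chunks)  # ((3, 3, 3, 3, 108),): the 108-element pad is one single chunk
print(b.chunks)  # ((3, 3, 3, ..., 3),): forty uniform chunks of length 3
```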

(Quick shoutout for the amazing HTML representation. This made diagnosing this problem super easy! 🥳)

Problem Description

I think for many problems it would be more appropriate if the padded portion of the array had a chunking scheme like that of the longer array.

A practical example (which brought me to this issue) comes from the CMIP6 data archive, where some models provide output for several members, some of which run longer than others, and this leads to problems when the members are combined (see intake-esm/#225). For that particular model there are 5 members with a runtime of 100 years and one member with a runtime of 300 years. I think using `xr.align` immediately creates a chunk that is 200 years long, which blows up the memory on every system I have tried this on.
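To put a rough number on that blow-up (all sizes below are hypothetical, chosen only for scale; they are not taken from the actual model output):

```python
# Back-of-envelope for a single padded chunk; every number is hypothetical.
years = 200                  # pad needed to stretch a 100-year member to 300 years
time_steps = years * 12      # assuming monthly output
cells = 180 * 360            # assuming a 1-degree lat/lon grid
bytes_per_value = 8          # float64
print(time_steps * cells * bytes_per_value / 1e9)  # ~1.24 GB in one chunk
```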

Is there a way to work around this, or is this behavior intended and I am missing something?
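One way to work around it for now (a sketch of my own, not a confirmed fix from this thread) is to rechunk explicitly after aligning, so the padded block is split back into uniform chunks:

```python
# Sketch: force uniform chunking after the outer join.
# The chunk size 3 matches the toy example above; in practice you
# would pick the chunk size of the longer array.
a, b = xr.align(a, b, join='outer')
a = a.chunk({'time': 3})
```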

cc'ing @dcherian @andersy005

Versions

Output of `xr.show_versions()`:

```
INSTALLED VERSIONS
------------------
commit: None
python: 3.8.2 | packaged by conda-forge | (default, Apr 24 2020, 08:20:52) [GCC 7.3.0]
python-bits: 64
OS: Linux
OS-release: 3.10.0-1127.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.5
libnetcdf: 4.7.4
xarray: 0.15.1
pandas: 1.0.3
numpy: 1.18.4
scipy: 1.4.1
netCDF4: 1.5.3
pydap: None
h5netcdf: 0.8.0
h5py: 2.10.0
Nio: None
zarr: 2.4.0
cftime: 1.1.2
nc_time_axis: 1.2.0
PseudoNetCDF: None
rasterio: 1.1.3
cfgrib: None
iris: None
bottleneck: None
dask: 2.15.0
distributed: 2.15.2
matplotlib: 3.2.1
cartopy: 0.18.0
seaborn: None
numbagg: None
setuptools: 46.1.3.post20200325
pip: 20.1
conda: None
pytest: 5.4.2
IPython: 7.14.0
sphinx: None
```