issue comment 822566735 · https://github.com/pydata/xarray/issues/4554#issuecomment-822566735 · 2021-04-19 · author_association: CONTRIBUTOR

Took a look and it seems to originate from the stacking part and something in dask.

In polyfit, we rearrange the DataArrays into 2D arrays so we can run the least squares with np/dsa.apply_along_axis (a rough sketch of that pattern follows the MWE below). But I checked and the chunking problem appears before any call of that sort. MWE:

```python3
import xarray as xr
import dask.array as dsa

nz, ny, nx = (10, 20, 30)
data = dsa.ones((nz, ny, nx), chunks=(1, 5, nx))
da = xr.DataArray(data, dims=['z', 'y', 'x'])
da.chunks
# ((1, 1, 1, 1, 1, 1, 1, 1, 1, 1), (5, 5, 5, 5), (30,))

stk = da.stack(zy=['z', 'y'])
print(stk.dims, stk.chunks)
# ('x', 'zy') ((30,), (20, 20, 20, 20, 20, 20, 20, 20, 20, 20))
# Merged chunks!
```
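
For reference, here is a rough, hypothetical sketch of that 2D + apply_along_axis pattern (not the actual polyfit code; `np.polyfit` and the helper name just stand in for xarray's real least-squares routine):

```python
import dask.array as dsa
import numpy as np

nz, ny, nx = (10, 20, 30)
data = dsa.ones((nz, ny, nx), chunks=(1, 5, nx))

# Simplified, illustrative version of the pattern: flatten everything except
# the fit dimension into a 2D array (x first), then fit each column separately.
rhs = data.transpose(2, 0, 1).reshape((nx, -1))  # (x, z*y), roughly what polyfit sees
xcoord = np.arange(nx)

def _fit_1d(col):
    # one 1-D least-squares fit per stacked (z, y) point
    return np.polyfit(xcoord, col, deg=1)

coeffs = dsa.apply_along_axis(_fit_1d, 0, rhs, shape=(2,), dtype=rhs.dtype)
coeffs.shape  # (2, 200)
```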

And then I went down the rabbit hole (ok, it's not that deep) and it all goes down to here: https://github.com/pydata/xarray/blob/e0358e586079c12525ce60c4a51b591dc280713b/xarray/core/variable.py#L1507

In Variable._stack_once the stacking is performed and Variable.data.reshape is called. Dask itself rechunks the output, merging the chunks. There is a merge_chunks kwarg for reshape, but I think it has a bug:

```python
# Let's stack as xarray does: x, z, y -> x, zy
data_t = data.transpose(2, 0, 1)  # dask array with shape (30, 10, 20), the same reordering as in Variable._stack_once

new_data = data_t.reshape((30, -1), merge_chunks=True)  # True is the default; this is the same call as in xarray
new_data.chunks
# ((30,), (20, 20, 20, 20, 20, 20, 20, 20, 20, 20))

new_data = data_t.reshape((30, -1), merge_chunks=False)
new_data.shape  # printing shape because chunks is too large, but see the bug:
# (30, 6000)  # instead of (30, 200)!!!

# Doesn't happen when we do not transpose. So let's reshape data as z, y, x -> zy, x
new_data = data.reshape((-1, 30), merge_chunks=True)
new_data.chunks
# ((5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5), (30,))
# Chunks were not merged? But this is the output expected by paigem.

new_data = data.reshape((-1, 30), merge_chunks=False)
new_data.chunks
# ((5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5), (30,))
# That's what I expected with merge_chunks=False.
```
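
As a blunt workaround (assuming one can afford the extra rechunk step), the small chunks can be forced back after stacking; this is just an illustration with the MWE arrays above, not something polyfit currently does:

```python
stk = da.stack(zy=['z', 'y'])  # `da` from the MWE above
stk.chunk({'zy': 5}).chunks    # rechunk the stacked dim back to the fine layout
# ((30,), (5, 5, ..., 5))      # 40 chunks of 5, i.e. the un-merged layout
```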

For polyfit itself, the apply_along_axis call could be changed to an apply_ufunc call with vectorize=True; I think this would avoid the problem and behave the same on the user's side. It would need some refactoring.
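
Something along these lines (a hypothetical sketch, not the actual refactor; the helper name, the degree size of 2, and the kwargs are illustrative):

```python
import dask.array as dsa
import numpy as np
import xarray as xr

def _fit_1d(y, x, deg=1):
    # np.polyfit stands in for the real least-squares routine
    return np.polyfit(x, y, deg=deg)

da = xr.DataArray(dsa.ones((10, 20, 30), chunks=(1, 5, 30)), dims=['z', 'y', 'x'])
x = np.arange(da.sizes['x'])

coeffs = xr.apply_ufunc(
    _fit_1d,
    da,
    kwargs={'x': x, 'deg': 1},
    input_core_dims=[['x']],
    output_core_dims=[['degree']],
    vectorize=True,
    dask='parallelized',
    output_dtypes=[da.dtype],
    dask_gufunc_kwargs={'output_sizes': {'degree': 2}},
)
coeffs.chunks  # ((1, 1, ..., 1), (5, 5, 5, 5), (2,))
```

Since nothing gets stacked on the way in, the original chunking over z and y would survive into the output.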
