issue comment 822566735 · https://github.com/pydata/xarray/issues/4554#issuecomment-822566735 · 2021-04-19 · author_association: CONTRIBUTOR

Took a look and it seems to originate from the stacking part and something in dask.

In polyfit, we rearrange the DataArrays into 2D arrays so we can run the least squares with np/dsa.apply_along_axis (a rough sketch of that pattern follows the MWE below). But I checked and the chunking problem appears before any call of that sort. MWE:

```python3
import xarray as xr
import dask.array as dsa

nz, ny, nx = (10, 20, 30)
data = dsa.ones((nz, ny, nx), chunks=(1, 5, nx))
da = xr.DataArray(data, dims=['z', 'y', 'x'])
da.chunks
# ((1, 1, 1, 1, 1, 1, 1, 1, 1, 1), (5, 5, 5, 5), (30,))

stk = da.stack(zy=['z', 'y'])
print(stk.dims, stk.chunks)
# ('x', 'zy') ((30,), (20, 20, 20, 20, 20, 20, 20, 20, 20, 20))
# Merged chunks!
```
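
For reference, here is a rough, hypothetical sketch of that 2D + apply_along_axis pattern (not the actual polyfit code; `np.polyfit` and the helper name just stand in for xarray's real least-squares routine):

```python
import dask.array as dsa
import numpy as np

nz, ny, nx = (10, 20, 30)
data = dsa.ones((nz, ny, nx), chunks=(1, 5, nx))

# Simplified, illustrative version of the pattern: flatten everything except
# the fit dimension into a 2D array (x first), then fit each column separately.
rhs = data.transpose(2, 0, 1).reshape((nx, -1))  # (x, z*y), roughly what polyfit sees
xcoord = np.arange(nx)

def _fit_1d(col):
    # one 1-D least-squares fit per stacked (z, y) point
    return np.polyfit(xcoord, col, deg=1)

coeffs = dsa.apply_along_axis(_fit_1d, 0, rhs, shape=(2,), dtype=rhs.dtype)
coeffs.shape  # (2, 200)
```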

And then I went down the rabbit hole (ok, it's not that deep) and it all goes down to here: https://github.com/pydata/xarray/blob/e0358e586079c12525ce60c4a51b591dc280713b/xarray/core/variable.py#L1507

In Variable._stack_once the stacking is performed and Variable.data.reshape is called. Dask itself rechunks the output, merging the chunks. There is a merge_chunks kwarg for reshape, but I think it has a bug:

```python
# Let's stack as xarray does: x, z, y -> x, zy
data_t = data.transpose(2, 0, 1)  # dask array with shape (30, 10, 20), the same reordering as in Variable._stack_once

new_data = data_t.reshape((30, -1), merge_chunks=True)  # True is the default; this is the same call as in xarray
new_data.chunks
# ((30,), (20, 20, 20, 20, 20, 20, 20, 20, 20, 20))

new_data = data_t.reshape((30, -1), merge_chunks=False)
new_data.shape  # printing shape because chunks is too large, but see the bug:
# (30, 6000)  # instead of (30, 200)!!!

# Doesn't happen when we do not transpose. So let's reshape data as z, y, x -> zy, x
new_data = data.reshape((-1, 30), merge_chunks=True)
new_data.chunks
# ((5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5), (30,))
# Chunks were not merged? But this is the output expected by paigem.

new_data = data.reshape((-1, 30), merge_chunks=False)
new_data.chunks
# ((5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5), (30,))
# That's what I expected with merge_chunks=False.
```
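
As a blunt workaround (assuming one can afford the extra rechunk step), the small chunks can be forced back after stacking; this is just an illustration with the MWE arrays above, not something polyfit currently does:

```python
stk = da.stack(zy=['z', 'y'])  # `da` from the MWE above
stk.chunk({'zy': 5}).chunks    # rechunk the stacked dim back to the fine layout
# ((30,), (5, 5, ..., 5))      # 40 chunks of 5, i.e. the un-merged layout
```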

For polyfit itself, the apply_along_axis call could be changed to an apply_ufunc call with vectorize=True; I think this would avoid the problem and behave the same on the user's side. It would need some refactoring.
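
Something along these lines (a hypothetical sketch, not the actual refactor; the helper name, the degree size of 2, and the kwargs are illustrative):

```python
import dask.array as dsa
import numpy as np
import xarray as xr

def _fit_1d(y, x, deg=1):
    # np.polyfit stands in for the real least-squares routine
    return np.polyfit(x, y, deg=deg)

da = xr.DataArray(dsa.ones((10, 20, 30), chunks=(1, 5, 30)), dims=['z', 'y', 'x'])
x = np.arange(da.sizes['x'])

coeffs = xr.apply_ufunc(
    _fit_1d,
    da,
    kwargs={'x': x, 'deg': 1},
    input_core_dims=[['x']],
    output_core_dims=[['degree']],
    vectorize=True,
    dask='parallelized',
    output_dtypes=[da.dtype],
    dask_gufunc_kwargs={'output_sizes': {'degree': 2}},
)
coeffs.chunks  # ((1, 1, ..., 1), (5, 5, 5, 5), (2,))
```

Since nothing gets stacked on the way in, the original chunking over z and y would survive into the output.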
