issues: 1020282789

id: 1020282789
node_id: I_kwDOAMm_X8480Eel
number: 5843
title: Why are `da.chunks` and `ds.chunks` properties inconsistent?
user: 35968931
state: closed
locked: 0
comments: 6
created_at: 2021-10-07T17:21:01Z
updated_at: 2021-10-29T18:12:22Z
closed_at: 2021-10-29T18:12:22Z
author_association: MEMBER

Basically the title, but what I'm referring to is this:

```python
In [2]: da = xr.DataArray([[0, 1], [2, 3]], name='foo').chunk(1)

In [3]: ds = da.to_dataset()

In [4]: da.chunks
Out[4]: ((1, 1), (1, 1))

In [5]: ds.chunks
Out[5]: Frozen({'dim_0': (1, 1), 'dim_1': (1, 1)})
```

Why does DataArray.chunks return a tuple and Dataset.chunks return a frozen dictionary?

This seems a bit silly, for a few reasons:

1) it means that some perfectly reasonable code might fail unnecessarily if passed a DataArray instead of a Dataset or vice versa, such as

```python
def is_core_dim_chunked(obj, core_dim):
    return len(obj.chunks[core_dim]) > 1
```
which works as intended for a Dataset but raises a `TypeError` for a DataArray, because a tuple cannot be indexed by a dimension name.

2) it breaks the pattern we use for .sizes, where

```python
In [14]: da.sizes
Out[14]: Frozen({'dim_0': 2, 'dim_1': 2})

In [15]: ds.sizes
Out[15]: Frozen({'dim_0': 2, 'dim_1': 2})
```

3) if you want the chunks as a tuple they are always accessible via da.data.chunks, which is a more sensible place to look to find the chunks without dimension names.

4) It's an undocumented difference, as the docstrings for ds.chunks and da.chunks both only say

`"""Block dimensions for this dataset’s data or None if it’s not a dask array."""`

which doesn't tell me anything about the return type, or warn me that the return types are different.

EDIT: In fact `DataArray.chunks` doesn't even appear to be listed on the API docs page at all.
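In the meantime, the failure described in point 1 can be sidestepped by normalizing both return types to a plain dict keyed by dimension name. The following is only a minimal sketch: `chunks_by_dim` is a hypothetical helper, not part of xarray, and it assumes the object exposes `.chunks` (and `.dims` for the tuple form) as in the outputs shown above. The `SimpleNamespace` objects are stand-ins that mimic those outputs so the snippet runs without xarray or dask installed.

```python
from collections.abc import Mapping
from types import SimpleNamespace


def chunks_by_dim(obj):
    """Return {dim_name: chunk_sizes} for a DataArray-like or Dataset-like object.

    Hypothetical helper, not an xarray API.
    """
    chunks = obj.chunks
    if chunks is None:
        # No chunk information available (e.g. not backed by a dask array).
        return {}
    if isinstance(chunks, Mapping):
        # Dataset-style: already keyed by dimension name.
        return dict(chunks)
    # DataArray-style: a tuple ordered like obj.dims.
    return dict(zip(obj.dims, chunks))


def is_core_dim_chunked(obj, core_dim):
    # Same check as in point 1, but on the normalized mapping.
    return len(chunks_by_dim(obj)[core_dim]) > 1


# Stand-ins mimicking da.chunks and ds.chunks from the session above;
# real code would pass the xarray objects themselves.
fake_da = SimpleNamespace(dims=("dim_0", "dim_1"), chunks=((1, 1), (1, 1)))
fake_ds = SimpleNamespace(chunks={"dim_0": (1, 1), "dim_1": (1, 1)})

print(is_core_dim_chunked(fake_da, "dim_0"))
print(is_core_dim_chunked(fake_ds, "dim_0"))
```

With this normalization the check gives the same answer for both kinds of object, which is the behaviour a consistent `.chunks` property would provide directly.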

In our codebase this difference is mostly washed out by us using ._to_temp_dataset() all the time, and also by the way that the .chunk() method accepts both the tuple and dict form, so both of these invariants hold (but in different ways):

`ds == ds.chunk(ds.chunks)` and `da == da.chunk(da.chunks)`

I'm not sure whether making this consistent is worth the effort of a significant breaking change though :confused:

(Sort of related to https://github.com/pydata/xarray/issues/2103)

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/5843/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
state_reason: completed
repo: 13221727
type: issue
