
issue_comments: 1200110315


html_url issue_url id node_id user created_at updated_at author_association body reactions performed_via_github_app issue
https://github.com/pydata/xarray/issues/4285#issuecomment-1200110315 https://api.github.com/repos/pydata/xarray/issues/4285 1200110315 IC_kwDOAMm_X85HiDrr 35968931 2022-07-30T07:40:59Z 2022-07-30T07:40:59Z MEMBER

So I actually think we can do this, with some caveats.

I recently found a cool dataset with ragged-like data which has rekindled my interest in this interfacing, and given me a real example to try it out with.

As far as I understand it the main problem is that awkward arrays don't define a `shape` or `dtype` attribute. Instead they follow a different model (the "datashape" model). Xarray expects `shape` and `dtype` to be defined, and given that those attributes are in the data API standard, this is a pretty reasonable expectation for most cases. (There is a useful discussion here on the data-apis consortium repo about why awkward arrays don't define these attributes in general.)

Conceptually though, it seems to me that `shape` and `dtype` do make sense for Awkward arrays, at least for some subset of them, because Awkward's "type" is clearly related to the normal notion of shape and dtype.

Let's take an Awkward array that can be coerced directly to a numpy array:

```python
In [27]: rect = ak.Array([[1, 2, 3], [4, 5, 6]])
    ...: rect
Out[27]: <Array [[1, 2, 3], [4, 5, 6]] type='2 * var * int64'>

In [28]: np.array(rect)
Out[28]:
array([[1, 2, 3],
       [4, 5, 6]])
```
Here there is a clear correspondence: the first axis of the awkward array has length 2, and because *in this case* the second axis has a consistent length of 3, we can coerce this to a numpy array with `shape=(2,3)`. The dtype also makes sense, because *in this case* the awkward array only contains data of one type, an `int64`.

Now imagine a "ragged" (or "jagged") array, which is like a numpy array except that the lengths along one (or more) of the axes can be variable. Awkward allows this, e.g.

```python
In [29]: ragged = ak.Array([[1, 2, 3, 100], [4, 5, 6]])
    ...: ragged
Out[29]: <Array [[1, 2, 3, 100], [4, 5, 6]] type='2 * var * int64'>
```
but a direct coercion to numpy will fail.

However we still conceptually have a "shape". It's either (2, "var"), where "var" means a variable length across the other axes, or alternatively we could say the shape is (2, 4), where 4 is simply the maximum length along the variable-length axis. The latter interpretation is kind of similar to sparse arrays.
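To make the two readings of "shape" concrete, here is a minimal sketch that uses plain nested Python lists as a stand-in for an awkward array. The helper name `ragged_shapes` is hypothetical and not part of awkward or xarray. Note that it only reports `"var"` when the row lengths actually differ, whereas awkward's own type string says `var` even for the rectangular example above.

```python
from typing import List, Tuple, Union


def ragged_shapes(
    rows: List[List[int]],
) -> Tuple[Tuple[Union[int, str], ...], Tuple[int, ...]]:
    """Return both candidate shapes for a 2D list of lists (hypothetical helper)."""
    lengths = [len(row) for row in rows]
    # Reading 1: mark the inner axis as "var" unless every row has the same length.
    inner: Union[int, str] = lengths[0] if len(set(lengths)) == 1 else "var"
    # Reading 2: report the maximum row length, as if shorter rows were padded.
    padded = (len(rows), max(lengths, default=0))
    return (len(rows), inner), padded


symbolic, padded = ragged_shapes([[1, 2, 3, 100], [4, 5, 6]])
print(symbolic)  # (2, 'var')
print(padded)    # (2, 4)
```

The second reading is the one that can be handed to code expecting a tuple of integers, at the cost of overstating how many elements are really stored.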

In the second case you can still read off the dtype too. However awkward also allows "Union types", which basically means that one array can contain data of multiple numpy dtypes. Unfortunately this seems to completely break the numpy / xarray model, but we can completely ignore this problem if we simply say that xarray should only try to wrap awkward arrays with non-Union types. I think that's okay - a ragged-length array with a fixed dtype would still be extremely useful!
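The "non-Union only" restriction can be illustrated with the same nested-list stand-in. This is a rough sketch, not awkward's type system: the function name `common_element_type` is hypothetical, and Python's built-in types play the role of numpy dtypes.

```python
from typing import Any, List


def common_element_type(rows: List[List[Any]]) -> type:
    """Return the single element type of a ragged list, or raise (hypothetical helper)."""
    kinds = {type(value) for row in rows for value in row}
    if len(kinds) != 1:
        # Either the data mixes types (a union-like case) or it is empty,
        # so no single element type can be reported.
        raise ValueError(f"expected exactly one element type, found {len(kinds)}")
    return kinds.pop()


print(common_element_type([[1, 2, 3, 100], [4, 5, 6]]))  # <class 'int'>
```

Ragged lengths do not affect the result; only mixed element types are rejected.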


So if we want to wrap a (non-union type) awkward array instance like `ragged` in xarray we have to do one of two things:

1) Generalise xarray to allow for variable-length dimensions

This seems hard. Xarray's whole model is built assuming that `dims` has type `Mapping[Hashable, int]`. It also breaks our normal concept of alignment, which we need to put coordinate variables in DataArrays alongside data variables.

It would also mean a big change to xarray in order to support one unusual type of array, that goes beyond the data API standard. That breaks xarray's general design philosophy of providing a general wrapper and delegating to domain-specific array implementations / backends / etc. for specificity.

2) Expose a version of `shape` and `dtype` on Awkward arrays

This doesn't seem as hard, at least for non-union type awkward arrays. In fact this crude monkey-patching seems to mostly work:

```python
In [1]: from awkward import Array, num
   ...: import numpy as np

In [2]: def get_dtype(self) -> np.dtype:
   ...:     if "Union" in str(self.type):
   ...:         raise ValueError("awkward arrays with Union types can't be expressed in terms of a single numpy dtype")
   ...:
   ...:     datatype = str(self.type).split(" * ")[-1]
   ...:
   ...:     if datatype == "string":
   ...:         return np.dtype("str")
   ...:     else:
   ...:         return np.dtype(datatype)
   ...:

In [3]: def get_shape(self):
   ...:     if "Union" in str(self.type):
   ...:         raise ValueError("awkward arrays with Union types can't be expressed in terms of a single numpy dtype")
   ...:
   ...:     lengths = str(self.type).split(" * ")[:-1]
   ...:
   ...:     for axis in range(self.ndim):
   ...:         if lengths[axis] == "var":
   ...:             lengths[axis] = np.max(num(self, axis))
   ...:         else:
   ...:             lengths[axis] = int(lengths[axis])
   ...:
   ...:     return tuple(lengths)
   ...:

In [4]: def get_size(self):
   ...:     return np.prod(get_shape(self))
   ...:

In [5]: setattr(Array, 'dtype', property(get_dtype))
   ...: setattr(Array, 'shape', property(get_shape))
   ...: setattr(Array, 'size', property(get_size))
```
Now if we make the same ragged array but with the monkey-patched class, we have a sensible return value for `dtype`, `shape`, and `size`, which means that the xarray constructors will accept our Array now!

```python
In [6]: ragged = Array([[1, 2, 3, 100], [4, 5, 6]])

In [7]: import xarray as xr

In [8]: da = xr.DataArray(ragged, dims=['x', 't'])

In [17]: da
Out[17]:
<xarray.DataArray (x: 2, t: 4)>
<Array [[1, 2, 3, 100], [4, 5, 6]] type='2 * var * int64'>
Dimensions without coordinates: x, t

In [18]: da.dtype
Out[18]: dtype('int64')

In [19]: da.size
Out[19]: 8

In [20]: da.shape
Out[20]: (2, 4)
```
Promising...
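The monkey-patch leans entirely on splitting awkward's type string, so that step can be looked at on its own without awkward installed. This is a sketch: `parse_type_string` is a hypothetical helper, and it only handles flat types such as `'2 * var * int64'`. The union check is case-insensitive here because, as far as I recall, awkward prints union types in lowercase (`union[...]`); treat that as an assumption to verify against your awkward version.

```python
from typing import List, Optional, Tuple


def parse_type_string(type_string: str) -> Tuple[List[Optional[int]], str]:
    """Split a flat datashape-style string into axis lengths and an element type."""
    if "union" in type_string.lower():
        raise ValueError("union types have no single element type")
    # Everything before the last " * " is an axis length, the rest is the element type.
    *lengths, datatype = type_string.split(" * ")
    # None marks a variable-length axis whose size must be measured from the data.
    axes = [None if length == "var" else int(length) for length in lengths]
    return axes, datatype


axes, datatype = parse_type_string("2 * var * int64")
print(axes, datatype)  # [2, None] int64
```

Any `None` entry is where `get_shape` above falls back to measuring the data with `num` and taking the maximum.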

Let's try indexing:
```python
In [21]: da.isel(t=2)
Out[21]:
<xarray.DataArray (x: 2)>
<Array [3, 6] type='2 * int64'>
Dimensions without coordinates: x

In [22]: da.isel(t=4)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [22], in <cell line: 1>()
----> 1 da.isel(t=4)

...

File ~/miniconda3/envs/hummingbirds/lib/python3.10/site-packages/awkward/highlevel.py:991, in Array.__getitem__(self, where)
    579 """
    580 Args:
    581     where (many types supported; see below): Index of positions to
   (...)
    988 have the same dimension as the array being indexed.
    989 """
    990 if not hasattr(self, "_tracers"):
--> 991     tmp = ak._util.wrap(self.layout[where], self._behavior)
    992 else:
    993     tmp = ak._connect._jax.jax_utils._jaxtracers_getitem(self, where)

ValueError: in ListOffsetArray64 attempting to get 4, index out of range

(https://github.com/scikit-hep/awkward-1.0/blob/1.8.0/src/cpu-kernels/awkward_NumpyArray_getitem_next_at.cpp#L21)
```
That's what should happen - xarray delegates the indexing to the underlying array, which throws an error if there is a problem.

Arithmetic also seems to work
```python
In [23]: da * 2
Out[23]:
<xarray.DataArray (x: 2, t: 4)>
<Array [[2, 4, 6, 200], [8, 10, 12]] type='2 * var * int64'>
Dimensions without coordinates: x, t
```

But we hit snags with numpy functions
```python
In [24]: np.mean(da)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [24], in <cell line: 1>()
----> 1 np.mean(da)

File <__array_function__ internals>:180, in mean(*args, **kwargs)

File ~/miniconda3/envs/hummingbirds/lib/python3.10/site-packages/numpy/core/fromnumeric.py:3430, in mean(a, axis, dtype, out, keepdims, where)
   3428         pass
   3429     else:
-> 3430         return mean(axis=axis, dtype=dtype, out=out, **kwargs)
   3432 return _methods._mean(a, axis=axis, dtype=dtype,
   3433                       out=out, **kwargs)

File ~/Documents/Work/Code/xarray/xarray/core/_reductions.py:1478, in DataArrayReductions.mean(self, dim, skipna, keep_attrs, **kwargs)
   1403 def mean(
   1404     self,
   1405     dim: None | Hashable | Sequence[Hashable] = None,
   (...)
   1409     **kwargs: Any,
   1410 ) -> DataArray:
   1411     """
   1412     Reduce this DataArray's data by applying mean along some dimension(s).
   1413
   (...)
   1476     array(nan)
   1477     """
-> 1478     return self.reduce(
   1479         duck_array_ops.mean,
   1480         dim=dim,
   1481         skipna=skipna,
   1482         keep_attrs=keep_attrs,
   1483         **kwargs,
   1484     )

File ~/Documents/Work/Code/xarray/xarray/core/dataarray.py:2930, in DataArray.reduce(self, func, dim, axis, keep_attrs, keepdims, **kwargs)
   2887 def reduce(
   2888     self: T_DataArray,
   2889     func: Callable[..., Any],
   (...)
   2895     **kwargs: Any,
   2896 ) -> T_DataArray:
   2897     """Reduce this array by applying func along some dimension(s).
   2898
   2899     Parameters
   (...)
   2927         summarized data and the indicated dimension(s) removed.
   2928     """
-> 2930     var = self.variable.reduce(func, dim, axis, keep_attrs, keepdims, **kwargs)
   2931     return self._replace_maybe_drop_dims(var)

File ~/Documents/Work/Code/xarray/xarray/core/variable.py:1854, in Variable.reduce(self, func, dim, axis, keep_attrs, keepdims, **kwargs)
   1852         data = func(self.data, axis=axis, **kwargs)
   1853     else:
-> 1854         data = func(self.data, **kwargs)
   1856 if getattr(data, "shape", ()) == self.shape:
   1857     dims = self.dims

File ~/Documents/Work/Code/xarray/xarray/core/duck_array_ops.py:579, in mean(array, axis, skipna, **kwargs)
    577     return _to_pytimedelta(mean_timedeltas, unit="us") + offset
    578 else:
--> 579     return _mean(array, axis=axis, skipna=skipna, **kwargs)

File ~/Documents/Work/Code/xarray/xarray/core/duck_array_ops.py:341, in _create_nan_agg_method.<locals>.f(values, axis, skipna, **kwargs)
    339     with warnings.catch_warnings():
    340         warnings.filterwarnings("ignore", "All-NaN slice encountered")
--> 341         return func(values, axis=axis, **kwargs)
    342 except AttributeError:
    343     if not is_duck_dask_array(values):

File <__array_function__ internals>:180, in mean(*args, **kwargs)

File ~/miniconda3/envs/hummingbirds/lib/python3.10/site-packages/awkward/highlevel.py:1434, in Array.__array_function__(self, func, types, args, kwargs)
   1417 def __array_function__(self, func, types, args, kwargs):
   1418     """
   1419     Intercepts attempts to pass this Array to those NumPy functions other
   1420     than universal functions that have an Awkward equivalent.
   (...)
   1432     See also #__array_ufunc__.
   1433     """
-> 1434     return ak._connect._numpy.array_function(func, types, args, kwargs)

File ~/miniconda3/envs/hummingbirds/lib/python3.10/site-packages/awkward/_connect/_numpy.py:43, in array_function(func, types, args, kwargs)
     41     return out
     42 else:
---> 43     return function(*args, **kwargs)

TypeError: mean() got an unexpected keyword argument 'dtype'
```
This seems fixable though.

In fact I think if we changed https://github.com/pydata/xarray/issues/6845 (@dcherian) then this alternative would already work

```python
In [25]: import awkward as ak

In [26]: ak.mean(da)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [26], in <cell line: 1>()
----> 1 ak.mean(da)

File ~/miniconda3/envs/hummingbirds/lib/python3.10/site-packages/awkward/operations/reducers.py:971, in mean(x, weight, axis, keepdims, mask_identity)
    969 with np.errstate(invalid="ignore"):
    970     if weight is None:
--> 971         sumw = count(x, axis=axis, keepdims=keepdims, mask_identity=mask_identity)
    972         sumwx = sum(x, axis=axis, keepdims=keepdims, mask_identity=mask_identity)
    973     else:

File ~/miniconda3/envs/hummingbirds/lib/python3.10/site-packages/awkward/operations/reducers.py:79, in count(array, axis, keepdims, mask_identity)
     10 def count(array, axis=None, keepdims=False, mask_identity=False):
     11     """
     12     Args:
     13         array: Data in which to count elements.
   (...)
     77     to turn the None values into something that would be counted.
     78     """
---> 79     layout = ak.operations.convert.to_layout(
     80         array, allow_record=False, allow_other=False
     81     )
     82     if axis is None:
     84         def reduce(xs):

File ~/miniconda3/envs/hummingbirds/lib/python3.10/site-packages/awkward/operations/convert.py:1917, in to_layout(array, allow_record, allow_other, numpytype)
   1914     return from_iter([array], highlevel=False)
   1916 elif isinstance(array, Iterable):
-> 1917     return from_iter(array, highlevel=False)
   1919 elif not allow_other:
   1920     raise TypeError(
   1921         f"{array} cannot be converted into an Awkward Array"
   1922         + ak._util.exception_suffix(__file__)
   1923     )

File ~/miniconda3/envs/hummingbirds/lib/python3.10/site-packages/awkward/operations/convert.py:891, in from_iter(iterable, highlevel, behavior, allow_record, initial, resize)
    889 out = ak.layout.ArrayBuilder(initial=initial, resize=resize)
    890 for x in iterable:
--> 891     out.fromiter(x)
    892 layout = out.snapshot()
    893 return ak._util.maybe_wrap(layout, behavior, highlevel)

ValueError: cannot convert <xarray.DataArray ()>
array(1) (type DataArray) to an array element

(https://github.com/scikit-hep/awkward-1.0/blob/1.8.0/src/python/content.cpp#L974)
```


Suggestion: How about awkward offers a specialized array class which uses the same fast code underneath but disallows Union types, and follows the array API standard, implementing `shape`, `dtype` etc. as described above? That should then "just work" in xarray, in the same way that sparse arrays already do.
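As a rough picture of what such a class might expose, here is a pure-Python sketch. Everything in it is hypothetical: `RaggedArray` is not an awkward or xarray class, it only covers two dimensions, it stores plain lists rather than awkward's buffers, and a Python type stands in for a numpy dtype. It reports the padded shape discussed above.

```python
from typing import Any, List, Optional, Tuple


class RaggedArray:
    """Hypothetical 2D ragged container with a single element type."""

    def __init__(self, rows: List[List[Any]]) -> None:
        kinds = {type(value) for row in rows for value in row}
        if len(kinds) > 1:
            # Union-like data is rejected up front, as the suggestion proposes.
            raise TypeError("mixed element types are not supported")
        self._rows = [list(row) for row in rows]
        # Stand-in for a numpy dtype; None when there are no elements at all.
        self.dtype: Optional[type] = kinds.pop() if kinds else None

    @property
    def ndim(self) -> int:
        return 2

    @property
    def shape(self) -> Tuple[int, int]:
        # The variable-length axis is reported as its maximum length.
        return (len(self._rows), max((len(row) for row in self._rows), default=0))

    @property
    def size(self) -> int:
        # Product of the padded shape, not the number of stored elements.
        n_rows, n_cols = self.shape
        return n_rows * n_cols


ragged = RaggedArray([[1, 2, 3, 100], [4, 5, 6]])
print(ragged.shape, ragged.size, ragged.dtype)  # (2, 4) 8 <class 'int'>
```

The `size` of 8 matches what the monkey-patched `da.size` reported earlier, even though only 7 values are stored; whether that is the right convention is one of the caveats.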

Am I missing anything here? @jpivarski


tl;dr We probably could support awkward arrays, at least instances where all values have the same dtype.

{
    "total_count": 4,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 1,
    "rocket": 1,
    "eyes": 2
}
  667864088