issue_comments: 1200110315
html_url | issue_url | id | node_id | user | created_at | updated_at | author_association | body | reactions | performed_via_github_app | issue |
---|---|---|---|---|---|---|---|---|---|---|---|
https://github.com/pydata/xarray/issues/4285#issuecomment-1200110315 | https://api.github.com/repos/pydata/xarray/issues/4285 | 1200110315 | IC_kwDOAMm_X85HiDrr | 35968931 | 2022-07-30T07:40:59Z | 2022-07-30T07:40:59Z | MEMBER | So I actually think we can do this, with some caveats. I recently found a cool dataset with ragged-like data which has rekindled my interest in this interfacing, and given me a real example to try it out with.

As far as I understand it the main problem is that awkward arrays don't define a Conceptually though, it seems to me that

Let's take an Awkward array that can be coerced directly to a numpy array:

```python
In [27]: rect = ak.Array([[1, 2, 3], [4, 5, 6]])
    ...: rect
Out[27]: <Array [[1, 2, 3], [4, 5, 6]] type='2 * var * int64'>

In [28]: np.array(rect)
Out[28]:
array([[1, 2, 3],
       [4, 5, 6]])
```

Now imagine a "ragged" (or "jagged") array, which is like a numpy array except that the lengths along one (or more) of the axes can be variable. Awkward allows this, e.g.
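
To make "ragged" concrete without depending on Awkward itself, here is a minimal pure-Python sketch that uses nested lists as a stand-in for array data. The helper names `row_lengths` and `is_ragged` are hypothetical and exist only for this illustration; `row_lengths` plays a role loosely similar to the per-axis counting that the `num` function does later in this comment.

```python
# Nested lists stand in for array data here; this is an illustration only,
# not how Awkward stores or inspects its arrays.
rect = [[1, 2, 3], [4, 5, 6]]
ragged = [[1, 2, 3, 100], [4, 5, 6]]


def row_lengths(rows):
    """Return the length of each inner list (hypothetical helper)."""
    return [len(row) for row in rows]


def is_ragged(rows):
    """Return True when the inner lists do not all share one length."""
    return len(set(row_lengths(rows))) > 1


print(row_lengths(rect), is_ragged(rect))      # [3, 3] False
print(row_lengths(ragged), is_ragged(ragged))  # [4, 3] True
```
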

However we still conceptually have a "shape". It's either In the second case you can still read off the dtype too.

However awkward also allows "Union types", which basically means that one array can contain data of multiple numpy dtypes. Unfortunately this seems to completely break the numpy / xarray model, but we can completely ignore this problem if we simply say that xarray should only try to wrap awkward arrays with non-Union types. I think that's okay - a ragged-length array with a fixed dtype would still be extremely useful!

So if we want to wrap a (non-Union type) awkward array instance like

1) Generalise xarray to allow for variable-length dimensions

This seems hard. Xarray's whole model is built assuming that It would also mean a big change to xarray in order to support one unusual type of array, that goes beyond the data API standard. That breaks xarray's general design philosophy of providing a general wrapper and delegating to domain-specific array implementations / backends / etc. for specificity.

2) Expose a version of

This doesn't seem as hard, at least for non-Union type awkward arrays.
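
The second option can be sketched in plain Python, with nested lists standing in for a non-Union ragged array. The class name `RaggedView` and its behaviour (reporting the longest row as the length of the variable axis, and refusing mixed element types) are assumptions made for illustration, not an existing Awkward or xarray API.

```python
from math import prod


class RaggedView:
    """Hypothetical wrapper giving a 2D ragged nested list array-like attributes."""

    def __init__(self, rows):
        self._rows = [list(row) for row in rows]

    @property
    def dtype(self):
        # Mixed element types play the role of an Awkward "Union" type here.
        kinds = {type(value) for row in self._rows for value in row}
        if len(kinds) > 1:
            raise ValueError("mixed element types cannot map to a single dtype")
        return kinds.pop() if kinds else None

    @property
    def shape(self):
        # Report the longest row as the size of the variable-length axis.
        longest = max((len(row) for row in self._rows), default=0)
        return (len(self._rows), longest)

    @property
    def size(self):
        # Size of the bounding box, which can over-count the real elements.
        return prod(self.shape)


view = RaggedView([[1, 2, 3, 100], [4, 5, 6]])
print(view.shape, view.size, view.dtype)  # (2, 4) 8 <class 'int'>
```

Note that the bounding-box `size` of 8 includes one slot that holds no data, since the nested list only has 7 real elements. Any wrapper of this kind has to decide how to report that.
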

In fact this crude monkey-patching seems to mostly work:

```python
In [1]: from awkward import Array, num
   ...: import numpy as np

In [2]: def get_dtype(self) -> np.dtype:
   ...:     if "Union" in str(self.type):
   ...:         raise ValueError("awkward arrays with Union types can't be expressed in terms of a single numpy dtype")
   ...:
   ...:     datatype = str(self.type).split(" * ")[-1]
   ...:
   ...:     if datatype == "string":
   ...:         return np.dtype("str")
   ...:     else:
   ...:         return np.dtype(datatype)
   ...:

In [3]: def get_shape(self):
   ...:     if "Union" in str(self.type):
   ...:         raise ValueError("awkward arrays with Union types can't be expressed in terms of a single numpy dtype")
   ...:
   ...:     lengths = str(self.type).split(" * ")[:-1]
   ...:
   ...:     for axis in range(self.ndim):
   ...:         if lengths[axis] == "var":
   ...:             lengths[axis] = np.max(num(self, axis))
   ...:         else:
   ...:             lengths[axis] = int(lengths[axis])
   ...:
   ...:     return tuple(lengths)
   ...:

In [4]: def get_size(self):
   ...:     return np.prod(get_shape(self))
   ...:

In [5]: setattr(Array, 'dtype', property(get_dtype))
   ...: setattr(Array, 'shape', property(get_shape))
   ...: setattr(Array, 'size', property(get_size))
```

```python
In [6]: ragged = Array([[1, 2, 3, 100], [4, 5, 6]])

In [7]: import xarray as xr

In [8]: da = xr.DataArray(ragged, dims=['x', 't'])

In [17]: da
Out[17]:
<xarray.DataArray (x: 2, t: 4)>
<Array [[1, 2, 3, 100], [4, 5, 6]] type='2 * var * int64'>
Dimensions without coordinates: x, t

In [18]: da.dtype
Out[18]: dtype('int64')

In [19]: da.size
Out[19]: 8

In [20]: da.shape
Out[20]: (2, 4)
```

Promising... Let's try indexing:

```python
In [21]: da.isel(t=2)
Out[21]:
<xarray.DataArray (x: 2)>
<Array [3, 6] type='2 * int64'>
Dimensions without coordinates: x

In [22]: da.isel(t=4)
ValueError                                Traceback (most recent call last)
Input In [22], in <cell line: 1>()
----> 1 da.isel(t=4)

...

File ~/miniconda3/envs/hummingbirds/lib/python3.10/site-packages/awkward/highlevel.py:991, in Array.__getitem__(self, where)
    579 """
    580 Args:
    581     where (many types supported; see below): Index of positions to
   (...)
    988 have the same dimension as the array being indexed.
    989 """
    990 if not hasattr(self, "_tracers"):
--> 991     tmp = ak._util.wrap(self.layout[where], self._behavior)
    992 else:
    993     tmp = ak._connect._jax.jax_utils._jaxtracers_getitem(self, where)

ValueError: in ListOffsetArray64 attempting to get 4, index out of range

(https://github.com/scikit-hep/awkward-1.0/blob/1.8.0/src/cpu-kernels/awkward_NumpyArray_getitem_next_at.cpp#L21)
```

That's what should happen - xarray delegates the indexing to the underlying array, which throws an error if there is a problem.

Arithmetic also seems to work

But we hit snags with numpy functions

```python
In [24]: np.mean(da)
TypeError                                 Traceback (most recent call last)
Input In [24], in <cell line: 1>()
----> 1 np.mean(da)

File <__array_function__ internals>:180, in mean(*args, **kwargs)

File ~/miniconda3/envs/hummingbirds/lib/python3.10/site-packages/numpy/core/fromnumeric.py:3430, in mean(a, axis, dtype, out, keepdims, where)
   3428         pass
   3429     else:
-> 3430         return mean(axis=axis, dtype=dtype, out=out, **kwargs)
   3432 return _methods._mean(a, axis=axis, dtype=dtype,
   3433                       out=out, **kwargs)

File ~/Documents/Work/Code/xarray/xarray/core/_reductions.py:1478, in DataArrayReductions.mean(self, dim, skipna, keep_attrs, **kwargs)
   1403 def mean(
   1404     self,
   1405     dim: None | Hashable | Sequence[Hashable] = None,
   (...)
   1409     **kwargs: Any,
   1410 ) -> DataArray:
   1411     """
   1412     Reduce this DataArray's data by applying

File ~/Documents/Work/Code/xarray/xarray/core/dataarray.py:2930, in DataArray.reduce(self, func, dim, axis, keep_attrs, keepdims, **kwargs)
   2887 def reduce(
   2888     self: T_DataArray,
   2889     func: Callable[..., Any],
   (...)
   2895     **kwargs: Any,
   2896 ) -> T_DataArray:
   2897     """Reduce this array by applying

File ~/Documents/Work/Code/xarray/xarray/core/variable.py:1854, in Variable.reduce(self, func, dim, axis, keep_attrs, keepdims, **kwargs)
   1852     data = func(self.data, axis=axis, **kwargs)
   1853 else:
-> 1854     data = func(self.data, **kwargs)
   1856 if getattr(data, "shape", ()) == self.shape:
   1857     dims = self.dims

File ~/Documents/Work/Code/xarray/xarray/core/duck_array_ops.py:579, in mean(array, axis, skipna, **kwargs)
    577     return _to_pytimedelta(mean_timedeltas, unit="us") + offset
    578 else:
--> 579     return _mean(array, axis=axis, skipna=skipna, **kwargs)

File ~/Documents/Work/Code/xarray/xarray/core/duck_array_ops.py:341, in _create_nan_agg_method.<locals>.f(values, axis, skipna, **kwargs)
    339     with warnings.catch_warnings():
    340         warnings.filterwarnings("ignore", "All-NaN slice encountered")
--> 341         return func(values, axis=axis, **kwargs)
    342 except AttributeError:
    343     if not is_duck_dask_array(values):

File <__array_function__ internals>:180, in mean(*args, **kwargs)

File ~/miniconda3/envs/hummingbirds/lib/python3.10/site-packages/awkward/highlevel.py:1434, in Array.__array_function__(self, func, types, args, kwargs)
   1417 def __array_function__(self, func, types, args, kwargs):
   1418     """
   1419     Intercepts attempts to pass this Array to those NumPy functions other
   1420     than universal functions that have an Awkward equivalent.
   (...)
   1432     See also #__array_ufunc__.
   1433     """
-> 1434     return ak._connect._numpy.array_function(func, types, args, kwargs)

File ~/miniconda3/envs/hummingbirds/lib/python3.10/site-packages/awkward/_connect/_numpy.py:43, in array_function(func, types, args, kwargs)
     41     return out
     42 else:
---> 43     return function(*args, **kwargs)

TypeError: mean() got an unexpected keyword argument 'dtype'
```

This seems fixable though.
In fact I think if we changed https://github.com/pydata/xarray/issues/6845 (@dcherian) then this alternative would already work

```python
In [25]: import awkward as ak

In [26]: ak.mean(da)
ValueError                                Traceback (most recent call last)
Input In [26], in <cell line: 1>()
----> 1 ak.mean(da)

File ~/miniconda3/envs/hummingbirds/lib/python3.10/site-packages/awkward/operations/reducers.py:971, in mean(x, weight, axis, keepdims, mask_identity)
    969 with np.errstate(invalid="ignore"):
    970     if weight is None:
--> 971         sumw = count(x, axis=axis, keepdims=keepdims, mask_identity=mask_identity)
    972         sumwx = sum(x, axis=axis, keepdims=keepdims, mask_identity=mask_identity)
    973     else:

File ~/miniconda3/envs/hummingbirds/lib/python3.10/site-packages/awkward/operations/reducers.py:79, in count(array, axis, keepdims, mask_identity)
     10 def count(array, axis=None, keepdims=False, mask_identity=False):
     11     """
     12     Args:
     13         array: Data in which to count elements.
   (...)
     77     to turn the None values into something that would be counted.
     78     """
---> 79     layout = ak.operations.convert.to_layout(
     80         array, allow_record=False, allow_other=False
     81     )
     82     if axis is None:
     84         def reduce(xs):

File ~/miniconda3/envs/hummingbirds/lib/python3.10/site-packages/awkward/operations/convert.py:1917, in to_layout(array, allow_record, allow_other, numpytype)
   1914     return from_iter([array], highlevel=False)
   1916 elif isinstance(array, Iterable):
-> 1917     return from_iter(array, highlevel=False)
   1919 elif not allow_other:
   1920     raise TypeError(
   1921         f"{array} cannot be converted into an Awkward Array"
   1922         + ak._util.exception_suffix(__file__)
   1923     )

File ~/miniconda3/envs/hummingbirds/lib/python3.10/site-packages/awkward/operations/convert.py:891, in from_iter(iterable, highlevel, behavior, allow_record, initial, resize)
    889 out = ak.layout.ArrayBuilder(initial=initial, resize=resize)
    890 for x in iterable:
--> 891     out.fromiter(x)
    892 layout = out.snapshot()
    893 return ak._util.maybe_wrap(layout, behavior, highlevel)

ValueError: cannot convert <xarray.DataArray ()>
array(1) (type DataArray) to an array element

(https://github.com/scikit-hep/awkward-1.0/blob/1.8.0/src/python/content.cpp#L974)
```

Suggestion: How about awkward offer a specialized array class which uses the same fast code underneath but disallows Union types, and follows the array API standard, implementing

Am I missing anything here? @jpivarski

tl;dr We probably could support awkward arrays, at least instances where all values have the same dtype. |
{ "total_count": 4, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 1, "rocket": 1, "eyes": 2 } |
667864088 |