issue_comments: 1208723159

html_url issue_url id node_id user created_at updated_at author_association body reactions performed_via_github_app issue
https://github.com/pydata/xarray/issues/4285#issuecomment-1208723159 https://api.github.com/repos/pydata/xarray/issues/4285 1208723159 IC_kwDOAMm_X85IC6bX 1852447 2022-08-08T23:30:12Z 2022-08-10T00:02:44Z NONE

Given that you have an array of only list-type, regular-type, and numpy-type (which the prepare function, above, guarantees), here's a one-pass function to get the dtype and shape:

```python
def shape_dtype(layout, lateral_context, **kwargs):
    if layout.is_RegularType:
        lateral_context["shape"].append(layout.size)
    elif layout.is_ListType:
        max_size = ak.max(ak.num(layout))
        lateral_context["shape"].append(max_size)
    elif layout.is_NumpyType:
        lateral_context["dtype"] = layout.dtype
    else:
        raise AssertionError(f"what? {layout.form.type}")

context = {"shape": [len(array)]}
array.layout.recursively_apply(
    shape_dtype, lateral_context=context, return_array=False
)

# check context for "shape" and "dtype"
```

Here's the application on an array of mixed regular and irregular lists:

```python
>>> array = ak.to_regular(ak.Array([[[[1, 2, 3], []]], [[[4], [5]], [[6, 7], [8]]]]), axis=2)
>>> print(array.type)
2 * var * 2 * var * int64

>>> context = {"shape": [len(array)]}
>>> array.layout.recursively_apply(
...     shape_dtype, lateral_context=context, return_array=False
... )
>>> context
{'shape': [2, 2, 2, 3], 'dtype': dtype('int64')}
```
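To make the meaning of that "shape" concrete without Awkward Array, here is a minimal pure-Python sketch that computes the same quantity on plain nested lists: the outer length, then the maximum list length found at each deeper level. The name ragged_shape is hypothetical and not part of any library, and the sketch assumes every level is either all lists or all scalars.

```python
def ragged_shape(nested):
    """Bounding shape of a nested list: the maximum length at each depth.

    Hypothetical helper for illustration only; assumes each depth holds
    either only lists or only scalars (no mixing).
    """
    shape = []
    level = [nested]  # all lists found at the current depth
    while level and all(isinstance(x, list) for x in level):
        shape.append(max(len(x) for x in level))
        # Flatten one level: gather the children of every list at this depth.
        level = [item for x in level for item in x]
    return shape


data = [[[[1, 2, 3], []]], [[[4], [5]], [[6, 7], [8]]]]
print(ragged_shape(data))  # [2, 2, 2, 3]
```

This works breadth-first on Python lists, whereas the shape_dtype function above gets the same numbers in one pass over the layout's nested buffers.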

(This recursively_apply is a Swiss Army knife for restructuring or getting data out of layouts that we use internally all over the codebase, and intend to make public in v2: https://github.com/scikit-hep/awkward/issues/516.)

To answer your question about monkey-patching, I think it would be best to make a wrapper. You don't want to give all ak.Array instances properties named shape and dtype, since those properties won't make sense for general types. This is exactly the reason we had to back off on making ak.Array inherit from pandas.api.extensions.ExtensionArray: Pandas wanted it to have methods with names and behaviors that would have been misleading for Awkward Arrays. We think we'll be able to reintroduce Awkward Arrays as Pandas columns by wrapping them—that's what we're doing differently this time.

Here's a start of a wrapper:

```python
class RaggedArray:
    def __init__(self, array_like):
        layout = ak.to_layout(array_like, allow_record=False, allow_other=False)
        behavior = None
        if isinstance(array_like, ak.Array):
            behavior = array_like.behavior
        self._array = ak.Array(layout.recursively_apply(prepare), behavior=behavior)

        context = {"shape": [len(self._array)]}
        self._array.layout.recursively_apply(
            shape_dtype, lateral_context=context, return_array=False
        )
        self._shape = context["shape"]
        self._dtype = context["dtype"]

    def __repr__(self):
        # this is pretty cheesy
        return "<Ragged" + repr(self._array)[1:]

    @property
    def dtype(self):
        return self._dtype

    @property
    def shape(self):
        return self._shape

    def __getitem__(self, where):
        if isinstance(where, RaggedArray):
            where = where._array
        if isinstance(where, tuple):
            where = tuple(x._array if isinstance(x, RaggedArray) else x for x in where)

        out = self._array[where]
        if isinstance(out, ak.Array):
            return RaggedArray(out)
        else:
            return out

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        inputs = [x._array if isinstance(x, RaggedArray) else x for x in inputs]
        out = self._array.__array_ufunc__(ufunc, method, *inputs, **kwargs)
        return RaggedArray(out)

    def sum(self, axis=None, keepdims=False, mask_identity=False):
        out = ak.sum(self._array, axis=axis, keepdims=keepdims, mask_identity=mask_identity)
        if isinstance(out, ak.Array):
            return RaggedArray(out)
        else:
            return out
```

It keeps an _array (ak.Array), performs all internal operations on the ak.Array level (unwrapping RaggedArrays if necessary), but returns RaggedArrays (if the output is not scalar). It handles only the operations you want it to: this one handles all the complex slicing, NumPy ufuncs, and one reducer, sum.

Thus, it can act as a gatekeeper of what kinds of operations are allowed: ak.* won't recognize RaggedArray, which is good because some ak.* functions would take you out of this "ragged array" subset of types. You can add some non-ufunc NumPy functions with __array_function__, but only the ones that make sense for this subset of types.


I meant to say something earlier about why we go for full generality in types: some of the things we want to do, such as ak.cartesian, require more complex types, and as soon as one function needs them, the whole space of types has to be enlarged. For the first year of Awkward Array use, most users wanted it for plain ragged arrays (judging by their bug reports and questions), but after about a year they were asking about missing values and records, too, because you eventually need them unless you intend to work within a narrow set of functions.

Union arrays are still not widely used, but they can come from some file formats. Some GeoJSON files that I looked at had (longitude, latitude) points at different list depths because some features were points and some were polygons, disambiguated by a string label. That's not good to work with (we can't handle that in Numba, for instance), but if you select all points with one slice and put them in one array, then select all polygons with another slice and put them in their own array, each of these becomes a trivial union. That's why I added the squashing of trivial unions to the prepare function example above.
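As a rough pure-Python illustration of that splitting step (not Awkward Array code), the sketch below uses made-up GeoJSON-like records; the features data and the depth helper are hypothetical. Selecting by the string label turns one mixed-depth collection into two collections that each have a single, uniform depth.

```python
# Hypothetical GeoJSON-like records: coordinates sit at different list depths.
features = [
    {"type": "Point", "coordinates": [10.0, 20.0]},
    {"type": "Polygon",
     "coordinates": [[[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]]]},
    {"type": "Point", "coordinates": [30.0, 40.0]},
]


def depth(x):
    """Nesting depth of a list; assumes non-empty lists of uniform depth."""
    d = 0
    while isinstance(x, list):
        d += 1
        x = x[0]
    return d


# Select by the disambiguating label, one homogeneous collection per kind.
points = [f["coordinates"] for f in features if f["type"] == "Point"]
polygons = [f["coordinates"] for f in features if f["type"] == "Polygon"]

mixed_depths = {depth(f["coordinates"]) for f in features}  # more than one depth
point_depths = {depth(p) for p in points}                   # a single depth
polygon_depths = {depth(p) for p in polygons}               # a single depth
print(mixed_depths, point_depths, polygon_depths)
```

In Awkward Array terms, the mixed collection corresponds to a union type, while each selected collection has only one possibility left, which is the trivial-union case that prepare squashes.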

{
    "total_count": 3,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 3,
    "rocket": 0,
    "eyes": 0
}
  667864088