html_url,issue_url,id,node_id,user,created_at,updated_at,author_association,body,reactions,performed_via_github_app,issue
https://github.com/pydata/xarray/issues/4285#issuecomment-1302254100,https://api.github.com/repos/pydata/xarray/issues/4285,1302254100,IC_kwDOAMm_X85NntIU,1852447,2022-11-03T15:07:36Z,2022-11-03T15:07:36Z,NONE,"Send me an email address, and I'll send you the Zoom URL. The email that you have listed here:

http://tom-nicholas.com/contact/

doesn't work (bounced back).","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,667864088
https://github.com/pydata/xarray/issues/4285#issuecomment-1297615976,https://api.github.com/repos/pydata/xarray/issues/4285,1297615976,IC_kwDOAMm_X85NWAxo,1852447,2022-10-31T20:04:12Z,2022-10-31T20:04:12Z,NONE,"@milancurcic, @joshmoore, and I are all available on Thursday, November 3 at 11am U.S. Central (12pm U.S. Eastern/Florida, 5pm Central European/Germany: note the unusual U.S.-Europe difference this week, 16:00 UTC). Let's meet then!

I sent a Google calendar invitation to both of you at that time, which contains a Zoom URL. If anyone else is interested, let me know and I'll send you the Zoom URL as well (just not on a public GitHub comment).","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,667864088
https://github.com/pydata/xarray/issues/4285#issuecomment-1295475130,https://api.github.com/repos/pydata/xarray/issues/4285,1295475130,IC_kwDOAMm_X85NN2G6,1852447,2022-10-28T21:15:55Z,2022-10-28T21:15:55Z,NONE,"> What do you think should be the next step? Should we plan a video call to explore options?

Everyone who is interested in this, but particularly @milancurcic, please fill out this poll: https://www.when2meet.com/?17481732-uGwNn and we'll meet by Zoom (URL to be distributed later) to talk about RaggedArray. I'll pick a best time from these responses on Monday.

Thanks!","{""total_count"": 2, ""+1"": 2, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,667864088
https://github.com/pydata/xarray/issues/4285#issuecomment-1283043390,https://api.github.com/repos/pydata/xarray/issues/4285,1283043390,IC_kwDOAMm_X85MebA-,1852447,2022-10-18T21:46:29Z,2022-10-18T21:46:29Z,NONE,"This sounds good to me! `RaggedArray` can be a well-defined subset of types that have clear use-cases (the ones @TomNicholas listed). The thing I was worried about is that most functions in Awkward Array ignore the boundary line between `RaggedArray` and non-`RaggedArray`; defining it as a new type with its own collection of functions (or just methods) in CloudDrift isolates it in a way that you can ensure that your functions stay within that boundary.

To represent a `RaggedArray` without wrapping an Awkward Array, you can store it as a sequence of _n_ offsets arrays for depth-_n_ lists of lists and numerical contents. (Whereas an Awkward Array is a tree of 1D buffers, `RaggedArray` can be a sequence of 1D buffers.) We can then make sure that when you do want to convert it to and from Awkward Arrays, you can do so in a zero-copy way. That way, if you want to define some functions, particularly `__getitem__` with complex slices, by calling the corresponding Awkward function, you can do it by converting and then converting back, knowing that the conversion overhead is all _O(1)_.

(Same for `xarray.Dataset`!)

**I'm in favor of a video call meeting to discuss this. In general, I'm busiest on U.S. mornings, on Wednesday and Thursday, but perhaps you can send a [when2meet](https://www.when2meet.com/) or equivalent poll?**

---------

One thing that could be discussed in writing (maybe more easily) is what data types you would consider in scope for `RaggedArray`. (I've reminded myself of the use-cases above, but it doesn't fully answer this question.)

That is,

  1. You'll want the numerical data, the end of your sequence of 1D buffers, to be arbitrary NumPy types or some reasonable subset. That's a given.
  2. You'll want ragged arrays of those. Ragged arrays can be represented in several ways:
    * `offsets` buffer whose length is 1 more than the length of the array. Every neighboring pair of integers is the starting and stopping index of the content of a nested list. The integers must be non-decreasing since they are the cumulative sum of list lengths. This is pyarrow's `ListArray` and Awkward's `ListOffsetArray`.
    * `starts` and `stops` buffers of the same length as the length of the array. Every `starts[i]` and `stops[i]` is the starting and stopping index of the content of a nested list. These may be equivalent to an `offsets` buffer or they can be in a random order, not cover all of the content, or cover the content multiple times. The value of such a thing is that reordering, filtering, or duplicating the set of lists is not an operation that needs to propagate through every level of the sequence of buffers, so it's good for intermediate calculations. pyarrow has no equivalent, but it's Awkward's `ListArray`.
    * a `parents` buffer with the same length as the _content_; each `parents[j]` indicates which list `j` the `content[j]` belongs to. They may be contiguous or not. This is a Normal Form in database renormalization, and it's useful for some operations that propagate upward, such as reducers (sum, max, etc.). Neither pyarrow nor Awkward have this as a native type. It can't encode empty lists at the _end_ of an array, so another integer would be needed to preserve that information.
  3. Will you want regular arrays? If some dimensions of the array are variable-length (ragged) and some are fixed-length, that can be accomplished by adding a node without any `offsets`/`starts`/`stops`/`parents` buffer, just an integer `size`. Multiple fixed dimensions could be multiple nested nodes (which is easier) or a tuple of integers `shape`. Is the nested data constrained to be contiguous or would you also have `strides`? (They can't be byte-strides, as in NumPy, they have to count numbers of items.) Can the `size` be zero? If so, you'll need another integer for the length of this array. This is pyarrow's `FixedSizeListArray` and Awkward's `RegularArray` (neither of which have `strides`, and only Awkward allows `size=0`).
  4. Will you want to allow for missing data? Only missing numerical values or also missing lists? Some functions naturally return missing data, such as `max` of an `axis` with variable length lists, some of which can be zero length. There's a variety of ways to represent missing data, though in a system of only nested lists, a bit-mask or byte-mask is probably best. All pyarrow node types are potentially missing, represented by a bit-mask, and Awkward as four node types for missing data, including `BitMaskedArray` and `ByteMaskedArray`.

You don't want record-types or union-types, so the only questions are how to implement (2) and whether you want (3) and (4). Including a type, such as missing data, allows for more function return values but obliges you to consider that type for all function arguments. You'll want to choose carefully how you close your system.

_(Maybe this block of details can be copied to an issue where you're doing the development of `RaggedArray` in CloudDrift. It got longer than I had intended.)_","{""total_count"": 1, ""+1"": 1, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,667864088
https://github.com/pydata/xarray/issues/4285#issuecomment-1211328405,https://api.github.com/repos/pydata/xarray/issues/4285,1211328405,IC_kwDOAMm_X85IM2eV,1852447,2022-08-10T22:01:55Z,2022-08-10T22:01:55Z,NONE,"This is a wonderful list; thank you!

----------------

> I'm not sure whether the `RaggedArray` class being proposed here would work for that use case [Alleles in Genomics]?

I believe that this use-case benefits from being able to mix regular and ragged dimensions, that the data have 3 regular dimensions and 1 ragged dimension, with the ragged one as the innermost. (The RaggedArray described above has this feature.)

----------------

> > Support for event data, a particular form of sparse data.

I might have been misinterpreting the word ""sparse data"" in conversations about this. I had thought that ""sparse data"" is logically rectilinear but represented in memory with the zeros removed, so the internal machinery has to deal with irregular structures, but the outward API it presents is regular (dimensionality is completely described by a `shape: tuple[int]`). But this usage,

> > 1-D (or N-D) array of random-length lists, with very small list entries.

is definitely what we mean by a ragged array (again with the ragged dimension potentially within zero or more regular dimensions).","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,667864088
https://github.com/pydata/xarray/issues/4285#issuecomment-1210718350,https://api.github.com/repos/pydata/xarray/issues/4285,1210718350,IC_kwDOAMm_X85IKhiO,1852447,2022-08-10T14:01:48Z,2022-08-10T14:01:48Z,NONE,"Also on the digression, I just want to clarify where we're coming from, why we did the things we did.

> (Digression: From my perspective part of the problem is that _merely_ generalising numpy arrays to be ragged would have been useful for lots of people, but `awkward.Array` goes a lot further. It also generalises the type system, adds things like Records, and possibly adds [xarray-like features](https://github.com/scikit-hep/awkward/issues/1391). That puts `awkward.Array` in a somewhat ill-defined place within the wider scientific python ecosystem: it's kind of a numpy-like duck array, but can't be treated as one, it's also a more general type system, and it might even get features of higher-level data structures like xarray.)

I can see how minimal extensions of the NumPy array model to include ragged arrays represent the majority of use-cases, though it wouldn't have been enough for our first use-case in particle physics, which looks roughly like this (with made-up numbers):

```python
collision_events = ak.Array([
    {
        ""event"": 0,
        ""primary_vertex"": {""x"": 0.001, ""y"": -0.002, ""z"": 0.3},
        ""electrons"": [
            {""px"": 1.1, ""py"": 2.2, ""pz"": 3.3, ""E"": 4.4, ""EoverP"": 1.07},
            {""px"": 1.0, ""py"": 3.2, ""pz"": 3.4, ""E"": 4.5, ""EoverP"": 0.93},
        ],
        ""muons"": [
            {""px"": 0.1, ""py"": 2.3, ""pz"": 4.3, ""E"": 5.4, ""isolation"": 0},
            {""px"": 1.1, ""py"": 2.2, ""pz"": 3.3, ""E"": 4.4, ""isolation"": 0.9},
            {""px"": 1.1, ""py"": 2.2, ""pz"": 3.3, ""E"": 4.4, ""isolation"": 0.8},
        ],
    },
    {
        ""event"": 1,
        ""primary_vertex"": {""x"": -0.001, ""y"": 0.002, ""z"": 0.4},
        ""electrons"": [
            {""px"": 1.0, ""py"": 3.2, ""pz"": 3.4, ""E"": 4.5, ""EoverP"": 0.93},
        ],
        ""muons"": [],
    },
    ...,
])
```

We needed ""records with differently typed fields"" and ""variable-length lists"" to be nestable within each other. It's even sometimes the case that one of the inner records representing a particle has another variable-length list within it, identifying the indexes of particles in the collision event that it's close to. We deliberated on whether those cross-links should allow the structure to be non-tree-like, either a DAG or to actually have cycles (https://github.com/scikit-hep/awkward/issues/178). The prior art is a C++ infrastructure that _does_ have a full graph model: collision events represented as arbitrary C++ class instances, and those arbitrary C++ data are serialized to disk in exabytes of ROOT files. Our first problem was to get a high-performance representation of these data in Python.

For that, we didn't need missing data or heterogeneous unions (`std::optional` and `std::variant` are rare in particle physics data models), but it seemed like a good idea to complete the type system because functions might close over the larger space of types. That ended up being true for missing data: they've come to be useful in such functions as [ak.fill_none](https://awkward-array.readthedocs.io/en/latest/_auto/ak.fill_none.html) and filtering without changing array lengths ([ak.mask and array.mask](https://awkward-array.readthedocs.io/en/latest/_auto/ak.Array.html#ak-array-mask)). Heterogeneous unions have _not_ been very useful. Some input data produce such types, like GeoJSON with mixed points and polygons, but the first thing one usually wants to do is restructure the data into non-unions.

Another consideration is that this scope exactly matches Apache Arrow (including the lack of cross-references). As such, we can use Arrow as an interchange format and Parquet as a disk format without having to exclude a subspace of types in either direction. We don't use Arrow as an internal format for performance reasons—we have node types that are lazier than Arrow's so they're better as intermediate arrays in a multi-step calculation—but it's important to have one-to-one, minimal computation (sometimes zero-copy) transformations to and from Arrow.

That said, as we've been looking for use-cases beyond particle physics, most of them would be handled well by simple ragged arrays. Also, we've found the ""just ragged arrays"" part of Arrow to be the most developed or at least the first to be developed, driven by SQL-like applications. Our unit tests in Awkward Array have revealed a lot of unhandled cases in Arrow, particularly the Parquet serialization, that we report in JIRA (and they quickly get resolved).

Two possible conclusions:

   1. Ragged arrays are all that's really needed in most sciences; particle physics is a weird exception, and Arrow is over-engineered.
   2. Arrays of general type have as-yet undiscovered utility in most sciences: datasets have been cast into the forms they currently take so that they can make better use of the tools that exist, not because it's the most natural way to frame and use the data. Computing in particle physics had been siloed for decades, not heavily affected by SQL/relational/tidy ways of talking about data: maybe this is like a first contact of foreign cultures that each have something to contribute. (Particle physics analysis has been changing _a lot_ by becoming less bound to an edit-compile-run cycle.)

If it turns out that conclusion (1) is right or more right than (2), then at least a subset of what we're working on is going to be useful to the wider community. If it's (2), though, then it's a big opportunity.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,667864088
https://github.com/pydata/xarray/issues/4285#issuecomment-1210023820,https://api.github.com/repos/pydata/xarray/issues/4285,1210023820,IC_kwDOAMm_X85IH3-M,1852447,2022-08-10T00:36:42Z,2022-08-10T00:36:42Z,NONE,"> If you want a `RaggedArray` class that is more specific (i.e. defines more attributes) than `awkward.Array`, then surely the ""correct"" thing to do would be be to subclass though?

It shouldn't be a subclass because it doesn't satisfy a substitution principle: `ak.combinations(array: ak.Array, n: int) -> ak.Array`, but `ak.combinations(array: RaggedArray, n: int) -> ⊥` (at best, would raise an exception because `RaggedArray` isn't closed under `ak.combinations`).

Since `RaggedArray` can't be used everywhere that an `ak.Array` can be used, it shouldn't be a subclass.

> I mean for eventual integration of `RaggedArray` within awkward's codebase.

Oh.......

I hadn't been thinking that RaggedArray is something we'd put in the general Awkward Array library. I was thinking of it only as a way to define ""the subset of Awkward Arrays that xarray uses,"" which would live in xarray.

I don't want to introduce another level of type-specificity to the system, since that would make things harder to understand. (Imagine reading the docs and it says, ""You can apply this function to ak.Array, but not to ak.RaggedArray."" Or ""this is an ak.Array that happens to be ragged, but not a ak.RaggedArray."")

So let me rethink your original idea of adding `shape` and `dtype` properties to _all_ ak.Arrays. Perhaps they should raise exceptions when the array is not a ragged array? People don't usually expect properties to raise exceptions, and you really need them to be properties with the exact spellings ""`shape`"" and ""`dtype`"" to get what you want.

If that point is negotiable, I could introduce an `ak.shape_dtype(array)` function that returns a shape and dtype if `array` has the right properties and raise an exception if it doesn't. That would be more normal: you're asking if it satisfies a specific constraint, and if so, to get some information about it. Then we would also be able to deal with the fact that

  * some people are going to want the `shape` to specify the maximum of ""var"" dimensions (what you asked for): ""virtually padding"",
  * some people are going to want the `shape` to specify the minimum of ""var"" dimensions because that tells you what upper bounds are legal to slice: ""virtually truncating"",
  * and some people are going to want the string `""var""` or maybe `None` or maybe `np.nan` in place of ""var"" dimensions because no integer is correct. Then they would have to deal with the fact that this `shape` is not a tuple of integers.

Or maybe the best way to present it is with a `min_shape` and a `max_shape`, whose items are equal where the array is regular.

Anyway, you can see why I'm loath to add a property to ak.Array that's just named ""`shape`""? That has the potential for misinterpretation. (Pandas wanted arrays to have a `shape` that is always equal to `(len(array),)`; if we satisfied that constraint, we couldn't satisfy yours!) In fact, ""`dtype`"" of a general array would be misleading, too, though a list of unique ""`dtypes`"" of all the leaf-nodes could be a useful thing to have. (2 shapes and _n_ dtypes!)

But if I'm providing it as an extra function, or as a trio of properties named `min_shape`, `max_shape`, and `dtypes` which are all spelled differently from the `shape` and `dtype` you want, you'd then be forced to wrap it as a RaggedArray type within xarray again, anyway. Which, as a reminder, is what we're doing for Pandas: https://github.com/intake/awkward-pandas lives outside the Awkward codebase and it wraps ak.Array to put them in Series.

So in the end, I just came back to where we started: xarray would own the RaggedArray wrapper. Or it could be a third package, as awkward-pandas is to awkward and pandas.

-------------------------------------

> (I expected `2` and `(3, 2)` respectively). I think perhaps `context[""shape""]` is being overwritten as it recurses through the data structure, when it should be being appended?

No, I initialized it incorrectly: it should have started as

```python
context = {""shape"": [len(array)]}
```

and then recurse from there. My previous example also had the wrong output, but I didn't count square brackets carefully enough to have caught it. (By the way, not copying the context is why it's called ""lateral""; if a copied dict is needed, it's ""depth_context"". I just went back and checked: yes, they're being handled appropriately.)

I fixed the code that I wrote in the comments above for posterity.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,667864088
https://github.com/pydata/xarray/issues/4285#issuecomment-1208723159,https://api.github.com/repos/pydata/xarray/issues/4285,1208723159,IC_kwDOAMm_X85IC6bX,1852447,2022-08-08T23:30:12Z,2022-08-10T00:02:44Z,NONE,"Given that you have an array of only list-type, regular-type, and numpy-type (which the `prepare` function, above, guarantees), here's a one-pass function to get the dtype and shape:

```python
def shape_dtype(layout, lateral_context, **kwargs):
    if layout.is_RegularType:
        lateral_context[""shape""].append(layout.size)
    elif layout.is_ListType:
        max_size = ak.max(ak.num(layout))
        lateral_context[""shape""].append(max_size)
    elif layout.is_NumpyType:
        lateral_context[""dtype""] = layout.dtype
    else:
        raise AssertionError(f""what? {layout.form.type}"")

context = {""shape"": [len(array)]}
array.layout.recursively_apply(
    shape_dtype, lateral_context=context, return_array=False
)
# check context for ""shape"" and ""dtype""
```

Here's the application on an array of mixed regular and irregular lists:

```python
>>> array = ak.to_regular(ak.Array([[[[1, 2, 3], []]], [[[4], [5]], [[6, 7], [8]]]]), axis=2)
>>> print(array.type)
2 * var * 2 * var * int64

>>> context = {""shape"": [len(array)]}
>>> array.layout.recursively_apply(
...     shape_dtype, lateral_context=context, return_array=False
... )
>>> context
{'shape': [2, 2, 2, 3], 'dtype': dtype('int64')}
```

(This `recursively_apply` is a Swiss Army knife for restructuring or getting data out of layouts that we use internally all over the codebase, and intend to make public in v2: https://github.com/scikit-hep/awkward/issues/516.)

To answer your question about monkey-patching, I think it would be best to make a wrapper. You don't want to give _all_ `ak.Array` instances properties named `shape` and `dtype`, since those properties won't make sense for general types. This is exactly [the reason we had to back off](https://github.com/scikit-hep/awkward/issues/350) on making `ak.Array` inherit from `pandas.api.extensions.ExtensionArray`: Pandas wanted it to have methods with names and behaviors that would have been misleading for Awkward Arrays. We think we'll be able to [reintroduce Awkward Arrays as Pandas columns](https://github.com/intake/awkward-pandas) by wrapping them—that's what we're doing differently this time.

Here's a start of a wrapper:

```python
class RaggedArray:
    def __init__(self, array_like):
        layout = ak.to_layout(array_like, allow_record=False, allow_other=False)
        behavior = None
        if isinstance(array_like, ak.Array):
            behavior = array_like.behavior
        self._array = ak.Array(layout.recursively_apply(prepare), behavior=behavior)

        context = {""shape"": [len(self._array)]}
        self._array.layout.recursively_apply(
            shape_dtype, lateral_context=context, return_array=False
        )
        self._shape = context[""shape""]
        self._dtype = context[""dtype""]

    def __repr__(self):
        # this is pretty cheesy
        return ""<Ragged"" + repr(self._array)[1:]

    @property
    def dtype(self):
        return self._dtype

    @property
    def shape(self):
        return self._shape

    def __getitem__(self, where):
        if isinstance(where, RaggedArray):
            where = where._array
        if isinstance(where, tuple):
            where = tuple(x._array if isinstance(x, RaggedArray) else x for x in where)

        out = self._array[where]
        if isinstance(out, ak.Array):
            return RaggedArray(out)
        else:
            return out

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        inputs = [x._array if isinstance(x, RaggedArray) else x for x in inputs]
        out = self._array.__array_ufunc__(ufunc, method, *inputs, **kwargs)
        return RaggedArray(out)

    def sum(self, axis=None, keepdims=False, mask_identity=False):
        out = ak.sum(self._array, axis=axis, keepdims=keepdims, mask_identity=mask_identity)
        if isinstance(out, ak.Array):
            return RaggedArray(out)
        else:
            return out
```

It keeps an `_array` (`ak.Array`), performs all internal operations on the `ak.Array` level (unwrapping RaggedArrays if necessary), but returns RaggedArrays (if the output is not scalar). It handles only the operations you want it to: this one handles all the complex slicing, NumPy ufuncs, and one reducer, `sum`.

Thus, it can act as a gatekeeper of what kinds of operations are allowed: `ak.*` won't recognize RaggedArray, which is good because some `ak.*` functions would take you out of this ""ragged array"" subset of types. You can add some non-ufunc NumPy functions with `__array_function__`, but only the ones that make sense for this subset of types.

---------------------

I meant to say something earlier about why we go for full generality in types: it's because some of the things we want to do, such as [ak.cartesian](https://awkward-array.readthedocs.io/en/latest/_auto/ak.cartesian.html), require more complex types, and as soon as one function needs it, the whole space needs to be enlarged. For the first year of Awkward Array use, most users wanted it for plain ragged arrays (based on their bug-reports and questions), but after about a year, they were asking about missing values and records, too, because you eventually need them unless you intend to work within a narrow set of functions.

Union arrays are still not widely used, but they can come from some file formats. Some GeoJSON files that I looked at had longitude, latitude points in different list depths because some were points and some were polygons, disambiguated by a string label. That's not good to work with (we can't handle that in Numba, for instance), but if you select all points with some slice, put them in one array, and select all polygons with another slice, putting them in their own array, these each become trivial unions, and that's why I added the squashing of trivial unions to the `prepare` function example above.","{""total_count"": 3, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 3, ""rocket"": 0, ""eyes"": 0}",,667864088
https://github.com/pydata/xarray/issues/4285#issuecomment-1208646168,https://api.github.com/repos/pydata/xarray/issues/4285,1208646168,IC_kwDOAMm_X85ICnoY,1852447,2022-08-08T21:46:57Z,2022-08-08T21:46:57Z,NONE,"The passing on of behavior is just to not break applications that depend on it. I did that just for correctness.

Monkey-patching will add the desired properties to the `ak.Array` class (a part of the problem I haven't addressed yet), though it would do so globally for all `ak.Arrays`, including those that are not simple ragged arrays. The function I wrote above would take a general array and simplify it to a ragged array or die trying.

Ragged array is not a specialized subset of types within Awkward Array. There are `ak.*` functions that would take you out of this subset. However (thinking it through...) I don't think slices, ufuncs, or reducers would take you out of this subset.

More later...","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,667864088
https://github.com/pydata/xarray/issues/4285#issuecomment-1208568777,https://api.github.com/repos/pydata/xarray/issues/4285,1208568777,IC_kwDOAMm_X85ICUvJ,1852447,2022-08-08T20:18:30Z,2022-08-08T20:18:30Z,NONE,"You mentioned union arrays, but for completeness, the type system in Awkward Array has

  * numeric primitives: `issubclass(dtype.type, (np.bool_, np.number, np.datetime64, np.timedelta64))` (including complex)
  * variable-length lists
  * regular-length lists
  * missing data (through masks and indexes, not NaNs)
  * record structures
  * heterogeneous unions

You're interested in a subset of this type system, but that subset doesn't just exclude unions, it also excludes records. If you have an xarray, you don't need top-level records since those could just be the columns of an xarray, but some data source might provide records nested within variable-length lists (very common in HEP) or other nesting. It would have to be explicitly excluded.

That leaves the possibility of missing lists and missing numeric primitives. Missing lists could be turned into empty lists (Google projects like Protocol Buffers often make that equivalence) and missing numbers could be turned into NaN if you're willing to lose integer-ness.

Here's a way to determine if an `array` (Python type `ak.Array`) is in that subset and to pre-process it, ensuring that you only have numbers, variable-length, and regular-length lists (in Awkward version 2, so note the ""`_v2`""):

```python
import awkward._v2 as ak
import numpy as np

def prepare(layout, continuation, **kwargs):
    if layout.is_RecordType:
        raise NotImplementedError(""no records!"")
    elif layout.is_UnionType:
        if len(layout) == 0 or np.all(layout.tags) == layout.tags[0]:
            return layout.project(layout.tags[0]).recursively_apply(prepare)
        else:
            raise NotImplementedError(""no non-trivial unions!"")
    elif layout.is_OptionType:
        next = continuation()  # fully recurse
        content_type = next.content.form.type
        if isinstance(content_type, ak.types.NumpyType):
            return ak.fill_none(next, np.nan, axis=0, highlevel=False)
        elif isinstance(content_type, ak.types.ListType):
            return ak.fill_none(next, [], axis=0, highlevel=False)
        elif isinstance(content_type, ak.types.RegularType):
            return ak.fill_none(next.toListOffsetArray64(False), [], axis=0, highlevel=False)
        else:
            raise AssertionError(f""what? {content_type}"")

ak.Array(array.layout.recursively_apply(prepare), behavior=array.behavior)
```

It should catch all the cases and doesn't rely on string-processing the type's DataShape representation.

Given that you're working within that subset, it would be possible to define `shape` with some token for the variable-length dimensions and `dtype`. I can follow up with another message (I have to deal with something else at the moment).

Oh, if you're replacing variable-length dimensions with the maximum length in that dimension, what about actually padding the array with [ak.pad_none](https://awkward-array.readthedocs.io/en/latest/_auto/ak.pad_none.html)?

```python
ak.fill_none(ak.pad_none(array, ak.max(ak.num(array))), np.nan)
```

The above would have to be expanded to get every `axis`, but it makes all nested lists have the length of the longest one by padding with `None`, then replaces those `None` values with `np.nan`. That uses all the memory of a padded array, but it's what people use now if they want to convert Awkward data into non-Awkward data (maybe passing the final step to [ak.to_numpy](https://awkward-array.readthedocs.io/en/latest/_auto/ak.to_numpy.html)).","{""total_count"": 1, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 1}",,667864088
https://github.com/pydata/xarray/issues/4285#issuecomment-1203295236,https://api.github.com/repos/pydata/xarray/issues/4285,1203295236,IC_kwDOAMm_X85HuNQE,1852447,2022-08-02T23:03:16Z,2022-08-02T23:03:16Z,NONE,Hi! I will be looking deeply into this when I get back from traveling (next week). Just to let you know that I saw this and I'm interested. Thanks!,"{""total_count"": 2, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 2, ""rocket"": 0, ""eyes"": 0}",,667864088
https://github.com/pydata/xarray/issues/5648#issuecomment-890260105,https://api.github.com/repos/pydata/xarray/issues/5648,890260105,IC_kwDOAMm_X841EEqJ,1852447,2021-07-31T00:06:14Z,2021-07-31T00:06:14Z,NONE,"> * Awkward? (@jpivarski)

I'm interested. Let us know when the time will be or if there's a poll for picking a time. Thanks!","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,956103236
https://github.com/pydata/xarray/issues/4285#issuecomment-707321343,https://api.github.com/repos/pydata/xarray/issues/4285,707321343,MDEyOklzc3VlQ29tbWVudDcwNzMyMTM0Mw==,1852447,2020-10-12T20:08:32Z,2020-10-12T20:08:32Z,NONE,"Copied from https://gitter.im/pangeo-data/Lobby :

I've been using Xarray with argopy recently, and the immediate value I see is the documentation of columns, which is semi-lacking in Awkward (one user has been passing this information through an Awkward tree as a scikit-hep/awkward-1.0#422). I should also look into Xarray's indexing, which I've always seen as being the primary difference between NumPy and Pandas; Awkward Array has no indexing, though every node has an optional [Identities](https://awkward-array.readthedocs.io/en/latest/ak.layout.Identities.html) which would be used to track such information through Awkward manipulations—Identities would have a bijection with externally supplied indexes. They haven't been used for anything yet.

Although the elevator pitch for Xarray is ""n-dimensional Pandas,"" it's rather different, isn't it? The contextual metadata is more extensive than anything I've seen in Pandas, and Xarray can be partitioned for out-of-core analysis: Xarray wraps Dask, unlike Dask's array collection, which wraps NumPy. I had troubles getting Pandas to wrap Awkward array (scikit-hep/awkward-1.0#350 ), but maybe these won't be issues for Xarray.

One last thing (in this very rambly message): the main difficulty I think we would have in that is that Awkward Arrays don't have shape and dtype, since those define a rectilinear array of numbers. The data model is [Datashape](https://datashape.readthedocs.io/) plus [union types](https://github.com/blaze/datashape/issues/237). There is a sense in which ndim is defined: the number of nested lists before reaching the first record, which may split it into different depths for each field, but even this can be ill-defined with union types:

```python
>>> import awkward1 as ak
>>> array = ak.Array([1, 2, [3, 4, 5], [[6, 7, 8]]])
>>> array
<Array [1, 2, [3, 4, 5], [[6, 7, 8]]] type='4 * union[int64, var * union[int64, ...'>
>>> array.type
4 * union[int64, var * union[int64, var * int64]]
>>> array.ndim
-1
```

So if we wanted to have an Xarray of Awkward Arrays, we'd have to take stock of all the assumptions Xarray makes about the arrays it contains.","{""total_count"": 5, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 5, ""rocket"": 0, ""eyes"": 0}",,667864088
https://github.com/pydata/xarray/issues/4285#issuecomment-665740365,https://api.github.com/repos/pydata/xarray/issues/4285,665740365,MDEyOklzc3VlQ29tbWVudDY2NTc0MDM2NQ==,1852447,2020-07-29T15:40:24Z,2020-07-29T15:40:24Z,NONE,"I'm linking myself here, to follow this: @jpivarski.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,667864088