issue_comments: 1208568777

html_url	issue_url	id	node_id	user	created_at	updated_at	author_association	body	reactions	performed_via_github_app	issue
https://github.com/pydata/xarray/issues/4285#issuecomment-1208568777	https://api.github.com/repos/pydata/xarray/issues/4285	1208568777	IC_kwDOAMm_X85ICUvJ	1852447	2022-08-08T20:18:30Z	2022-08-08T20:18:30Z	NONE	You mentioned union arrays, but for completeness, the type system in Awkward Array has

* numeric primitives: `issubclass(dtype.type, (np.bool_, np.number, np.datetime64, np.timedelta64))` (including complex)
* variable-length lists
* regular-length lists
* missing data (through masks and indexes, not NaNs)
* record structures
* heterogeneous unions

You're interested in a subset of this type system, but that subset doesn't just exclude unions, it also excludes records. If you have an xarray, you don't need top-level records since those could just be the columns of an xarray, but some data source might provide records nested within variable-length lists (very common in HEP) or other nesting. It would have to be explicitly excluded. That leaves the possibility of missing lists and missing numeric primitives. Missing lists could be turned into empty lists (Google projects like Protocol Buffers often make that equivalence) and missing numbers could be turned into NaN if you're willing to lose integer-ness.
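As a plain-Python illustration of the two replacements just described (a sketch only, not Awkward Array's API; the helper names `fill_missing_numbers` and `fill_missing_lists` are hypothetical):

```python
import math


def fill_missing_numbers(values):
    """Replace None with NaN; integers become floats so NaN is representable."""
    return [math.nan if v is None else float(v) for v in values]


def fill_missing_lists(lists):
    """Replace a missing (None) list with an empty list."""
    return [[] if sub is None else sub for sub in lists]


numbers = fill_missing_numbers([1, None, 3])   # integer-ness is lost here
lists = fill_missing_lists([[1, 2], None, [3]])
```

Note that after the first replacement a missing number and a genuine NaN can no longer be told apart, and after the second a missing list and an empty list can no longer be told apart.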
Here's a way to determine if an `array` (Python type `ak.Array`) is in that subset and to pre-process it, ensuring that you only have numbers, variable-length, and regular-length lists (in Awkward version 2, so note the "`_v2`"):

```python
import awkward._v2 as ak
import numpy as np

def prepare(layout, continuation, **kwargs):
    if layout.is_RecordType:
        raise NotImplementedError("no records!")
    elif layout.is_UnionType:
        # A union is trivial if it is empty or every element has the same tag.
        if len(layout) == 0 or np.all(np.asarray(layout.tags) == layout.tags[0]):
            return layout.project(layout.tags[0]).recursively_apply(prepare)
        else:
            raise NotImplementedError("no non-trivial unions!")
    elif layout.is_OptionType:
        next = continuation()  # fully recurse
        content_type = next.content.form.type
        if isinstance(content_type, ak.types.NumpyType):
            # missing numbers become NaN
            return ak.fill_none(next, np.nan, axis=0, highlevel=False)
        elif isinstance(content_type, ak.types.ListType):
            # missing lists become empty lists
            return ak.fill_none(next, [], axis=0, highlevel=False)
        elif isinstance(content_type, ak.types.RegularType):
            # convert to variable-length first so that an empty list fits
            return ak.fill_none(next.toListOffsetArray64(False), [], axis=0, highlevel=False)
        else:
            raise AssertionError(f"what? {content_type}")
    # any other layout: return None so that the recursion continues

ak.Array(array.layout.recursively_apply(prepare), behavior=array.behavior)
```

It should catch all the cases and doesn't rely on string-processing the type's DataShape representation.

Given that you're working within that subset, it would be possible to define `shape` with some token for the variable-length dimensions and `dtype`. I can follow up with another message (I have to deal with something else at the moment).

Oh, if you're replacing variable-length dimensions with the maximum length in that dimension, what about actually padding the array with `ak.pad_none`?

```python
ak.fill_none(ak.pad_none(array, ak.max(ak.num(array))), np.nan)
```

The above would have to be expanded to get every `axis`, but it makes all nested lists have the length of the longest one by padding with `None`, then replaces those `None` values with `np.nan`.
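To make the effect of that padding concrete, here is a minimal pure-Python sketch for a single level of nesting. It does not use Awkward Array, and the helper name `pad_ragged` is hypothetical; it only mimics what the pad-then-fill combination does to the data.

```python
import math


def pad_ragged(rows, fill=math.nan):
    """Pad every inner list to the length of the longest one."""
    # default=0 keeps an empty outer list from raising in max()
    width = max((len(row) for row in rows), default=0)
    return [list(row) + [fill] * (width - len(row)) for row in rows]


padded = pad_ragged([[1.0, 2.0, 3.0], [], [4.0]])
# every row of `padded` now has length 3; the added slots hold NaN
```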
That uses all the memory of a padded array, but it's what people use now if they want to convert Awkward data into non-Awkward data (maybe passing the final step to ak.to_numpy).	{ "total_count": 1, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 1 }		667864088