issue_comments: 1208568777

html_url issue_url id node_id user created_at updated_at author_association body reactions performed_via_github_app issue
https://github.com/pydata/xarray/issues/4285#issuecomment-1208568777 https://api.github.com/repos/pydata/xarray/issues/4285 1208568777 IC_kwDOAMm_X85ICUvJ 1852447 2022-08-08T20:18:30Z 2022-08-08T20:18:30Z NONE

You mentioned union arrays, but for completeness, the type system in Awkward Array has

  • numeric primitives: issubclass(dtype.type, (np.bool_, np.number, np.datetime64, np.timedelta64)) (including complex)
  • variable-length lists
  • regular-length lists
  • missing data (through masks and indexes, not NaNs)
  • record structures
  • heterogeneous unions
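As a small illustration of the first bullet, the `issubclass` test can be wrapped in a helper and tried on a few NumPy dtypes. This is only a sketch: the name `is_numeric_primitive` is made up for this example and is not part of Awkward Array.

```python
import numpy as np

# The scalar types named in the "numeric primitives" bullet above.
NUMERIC_PRIMITIVES = (np.bool_, np.number, np.datetime64, np.timedelta64)

def is_numeric_primitive(dtype):
    """Hypothetical helper: True if the dtype counts as a numeric primitive."""
    return issubclass(np.dtype(dtype).type, NUMERIC_PRIMITIVES)

checks = {
    "int64": is_numeric_primitive("int64"),
    "complex128": is_numeric_primitive("complex128"),  # complex counts
    "datetime64[ns]": is_numeric_primitive("datetime64[ns]"),
    "unicode string": is_numeric_primitive("U3"),       # strings do not
    "object": is_numeric_primitive(object),             # neither do objects
}
```

Strings and Python objects fail the check, which is consistent with lists, records, and unions being separate entries in the type system rather than dtypes.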

You're interested in a subset of this type system, but that subset doesn't just exclude unions; it also excludes records. If you have an xarray, you don't need top-level records, since those could just be the columns of an xarray, but some data source might provide records nested within variable-length lists (very common in HEP) or within other nesting. Those nested records would have to be explicitly excluded.

That leaves the possibility of missing lists and missing numeric primitives. Missing lists could be turned into empty lists (Google projects like Protocol Buffers often make that equivalence) and missing numbers could be turned into NaN if you're willing to lose integer-ness.
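To make the two substitutions concrete, here is a plain-Python sketch that does not use Awkward at all. The function names are invented for the illustration; the point is only that a missing list becomes `[]`, and a missing number becomes NaN at the cost of turning every value into a float.

```python
import math

def fill_missing_lists(lists):
    """Replace each missing list (None) with an empty list."""
    return [[] if item is None else list(item) for item in lists]

def fill_missing_numbers(values):
    """Replace each missing number (None) with NaN.

    NaN only exists for floats, so every value is converted to float:
    this is the loss of integer-ness mentioned above.
    """
    return [math.nan if value is None else float(value) for value in values]

lists = fill_missing_lists([[1, 2], None, [3]])
numbers = fill_missing_numbers([1, None, 3])
```

After this, `lists` no longer distinguishes "missing" from "empty", and `numbers` holds floats where the input held integers.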

Here's a way to determine if an array (Python type ak.Array) is in that subset and to pre-process it, ensuring that you only have numbers, variable-length lists, and regular-length lists (this is for Awkward version 2, so note the "_v2" in the import):

```python
import awkward._v2 as ak
import numpy as np

def prepare(layout, continuation, **kwargs):
    if layout.is_RecordType:
        raise NotImplementedError("no records!")
    elif layout.is_UnionType:
        # A union is trivial if it is empty or every element has the same tag.
        # (The comparison has to happen inside np.all.)
        if len(layout) == 0 or np.all(layout.tags == layout.tags[0]):
            return layout.project(layout.tags[0]).recursively_apply(prepare)
        else:
            raise NotImplementedError("no non-trivial unions!")
    elif layout.is_OptionType:
        next_layout = continuation()  # fully recurse
        content_type = next_layout.content.form.type
        if isinstance(content_type, ak.types.NumpyType):
            # missing numbers become NaN
            return ak.fill_none(next_layout, np.nan, axis=0, highlevel=False)
        elif isinstance(content_type, ak.types.ListType):
            # missing variable-length lists become empty lists
            return ak.fill_none(next_layout, [], axis=0, highlevel=False)
        elif isinstance(content_type, ak.types.RegularType):
            # regular-length lists are converted to variable-length first
            return ak.fill_none(next_layout.toListOffsetArray64(False), [], axis=0, highlevel=False)
        else:
            raise AssertionError(f"what? {content_type}")

# `array` is an existing ak.Array
ak.Array(array.layout.recursively_apply(prepare), behavior=array.behavior)
```

It should catch all the cases and doesn't rely on string-processing the type's DataShape representation.

Given that you're working within that subset, it would be possible to define a shape (with some token standing in for the variable-length dimensions) and a dtype. I can follow up with another message (I have to deal with something else at the moment).
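As a rough idea of what such a shape could look like, here is a plain-Python sketch over nested lists. Everything in it is an assumption made for illustration: the function name `ragged_shape`, the choice of the string "var" as the token, and the decision to infer regularity from the data itself. Awkward tracks regular versus variable-length lists in the type, so a real implementation would read the type rather than inspect lengths.

```python
def ragged_shape(data, var_token="var"):
    """Hypothetical helper: shape of nested lists, with var_token where lengths differ."""
    shape = []
    level = [data]  # every list found at the current depth
    while level and all(isinstance(item, list) for item in level):
        lengths = {len(item) for item in level}
        # One common length gives a regular dimension; otherwise use the token.
        shape.append(lengths.pop() if len(lengths) == 1 else var_token)
        # Descend one level by concatenating the contents of all lists.
        level = [child for item in level for child in item]
    return tuple(shape)

regular = ragged_shape([[1, 2], [3, 4]])
ragged = ragged_shape([[1, 2], [3]])
nested = ragged_shape([[[1], [2, 3]], [[4], [5]]])
```

Here `regular` is `(2, 2)`, `ragged` is `(2, "var")`, and `nested` is `(2, 2, "var")`.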

Oh, if you're replacing variable-length dimensions with the maximum length in that dimension, what about actually padding the array with ak.pad_none?

```python
ak.fill_none(ak.pad_none(array, ak.max(ak.num(array))), np.nan)
```

The above would have to be expanded to get every axis, but it makes all nested lists have the length of the longest one by padding with None, then replaces those None values with np.nan. That uses all the memory of a padded array, but it's what people use now if they want to convert Awkward data into non-Awkward data (maybe passing the final step to ak.to_numpy).
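For readers who want to see the padding step spelled out, here is a plain-Python analogue for a single ragged axis. It is a sketch of the idea only, not how Awkward does it internally, and `pad_to_longest` is a name made up for this example.

```python
import math

def pad_to_longest(ragged, fill=math.nan):
    """Pad every inner list to the length of the longest one (one axis only)."""
    longest = max((len(row) for row in ragged), default=0)
    return [list(row) + [fill] * (longest - len(row)) for row in ragged]

padded = pad_to_longest([[1.0, 2.0, 3.0], [], [4.0]])
row_lengths = [len(row) for row in padded]
```

Every row in `padded` has length 3, so the result is rectangular, and the memory cost is the number of rows times the longest length regardless of how short most rows are.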

{
    "total_count": 1,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 1
}
  667864088