issue_comments: 1210718350


html_url: https://github.com/pydata/xarray/issues/4285#issuecomment-1210718350
issue_url: https://api.github.com/repos/pydata/xarray/issues/4285
id: 1210718350
node_id: IC_kwDOAMm_X85IKhiO
user: 1852447
created_at: 2022-08-10T14:01:48Z
updated_at: 2022-08-10T14:01:48Z
author_association: NONE

Also, on the digression: I just want to clarify where we're coming from and why we made the choices we did.

> (Digression: From my perspective part of the problem is that merely generalising numpy arrays to be ragged would have been useful for lots of people, but awkward.Array goes a lot further. It also generalises the type system, adds things like Records, and possibly adds xarray-like features. That puts awkward.Array in a somewhat ill-defined place within the wider scientific python ecosystem: it's kind of a numpy-like duck array, but can't be treated as one, it's also a more general type system, and it might even get features of higher-level data structures like xarray.)

I can see how minimal extensions of the NumPy array model to include ragged arrays would cover the majority of use-cases, though that wouldn't have been enough for our first use-case in particle physics, which looks roughly like this (with made-up numbers):

```python
collision_events = ak.Array([
    {
        "event": 0,
        "primary_vertex": {"x": 0.001, "y": -0.002, "z": 0.3},
        "electrons": [
            {"px": 1.1, "py": 2.2, "pz": 3.3, "E": 4.4, "EoverP": 1.07},
            {"px": 1.0, "py": 3.2, "pz": 3.4, "E": 4.5, "EoverP": 0.93},
        ],
        "muons": [
            {"px": 0.1, "py": 2.3, "pz": 4.3, "E": 5.4, "isolation": 0},
            {"px": 1.1, "py": 2.2, "pz": 3.3, "E": 4.4, "isolation": 0.9},
            {"px": 1.1, "py": 2.2, "pz": 3.3, "E": 4.4, "isolation": 0.8},
        ],
    },
    {
        "event": 1,
        "primary_vertex": {"x": -0.001, "y": 0.002, "z": 0.4},
        "electrons": [
            {"px": 1.0, "py": 3.2, "pz": 3.4, "E": 4.5, "EoverP": 0.93},
        ],
        "muons": [],
    },
    ...,
])
```

We needed "records with differently typed fields" and "variable-length lists" to be nestable within each other. It's even sometimes the case that one of the inner records representing a particle has another variable-length list within it, identifying the indexes of particles in the collision event that it's close to. We deliberated on whether those cross-links should allow the structure to be non-tree-like, either a DAG or to actually have cycles (https://github.com/scikit-hep/awkward/issues/178). The prior art is a C++ infrastructure that does have a full graph model: collision events represented as arbitrary C++ class instances, and those arbitrary C++ data are serialized to disk in exabytes of ROOT files. Our first problem was to get a high-performance representation of these data in Python.

For that, we didn't need missing data or heterogeneous unions (std::optional and std::variant are rare in particle physics data models), but it seemed like a good idea to complete the type system, because functions might close over the larger space of types. That ended up being true for missing data: they've come to be useful in functions such as ak.fill_none and in filtering without changing array lengths (ak.mask and array.mask). Heterogeneous unions have not been very useful. Some input data produce such types, like GeoJSON with mixed points and polygons, but the first thing one usually wants to do is restructure the data into non-unions.
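
For concreteness, a small sketch of those two operations (the array values are invented):

```python
import awkward as ak

jets_pt = ak.Array([[1.1, 2.2, 3.3], [], [4.4, 5.5]])

# array.mask filters without changing the outer length:
# events that fail the cut become None instead of disappearing
nonempty = jets_pt.mask[ak.num(jets_pt) > 0]
# [[1.1, 2.2, 3.3], None, [4.4, 5.5]]

# the first jet of each event is None where there are no jets...
leading = ak.firsts(jets_pt)
# [1.1, None, 4.4]

# ...and ak.fill_none replaces the missing values with a default
filled = ak.fill_none(leading, 0.0)
# [1.1, 0.0, 4.4]
```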

Another consideration is that this scope exactly matches Apache Arrow's (including the lack of cross-references). As such, we can use Arrow as an interchange format and Parquet as a disk format without having to exclude a subspace of types in either direction. We don't use Arrow as an internal format for performance reasons (we have node types that are lazier than Arrow's, so they're better as intermediate arrays in a multi-step calculation), but it's important to have one-to-one, minimal-computation (sometimes zero-copy) transformations to and from Arrow, as sketched below.
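
That round trip looks like this (a sketch; it requires pyarrow, and the array and file name are made up):

```python
import awkward as ak

events = ak.Array([
    {"event": 0, "hits": [1.1, 2.2, 3.3]},
    {"event": 1, "hits": []},
])

# one-to-one conversion to an Arrow array and back
arrow_array = ak.to_arrow(events)
roundtrip = ak.from_arrow(arrow_array)

# Parquet as the disk format, through the same type mapping
ak.to_parquet(events, "events.parquet")
from_disk = ak.from_parquet("events.parquet")
```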

That said, as we've been looking for use-cases beyond particle physics, we've found that most of them would be handled well by simple ragged arrays. We've also found the "just ragged arrays" part of Arrow to be the most developed, or at least the first to be developed, driven by SQL-like applications. Our unit tests in Awkward Array have revealed a lot of unhandled cases in Arrow, particularly in the Parquet serialization, which we report in JIRA (and they quickly get resolved).

Two possible conclusions:

  1. Ragged arrays are all that's really needed in most sciences; particle physics is a weird exception, and Arrow is over-engineered.
  2. Arrays of general type have as-yet undiscovered utility in most sciences: datasets have been cast into their current forms so that they can make better use of the tools that exist, not because those forms are the most natural way to frame and use the data. Computing in particle physics was siloed for decades, not heavily affected by SQL/relational/tidy ways of talking about data: maybe this is like a first contact between foreign cultures that each have something to contribute. (Particle physics analysis has been changing a lot by becoming less bound to an edit-compile-run cycle.)

If it turns out that conclusion (1) is right or more right than (2), then at least a subset of what we're working on is going to be useful to the wider community. If it's (2), though, then it's a big opportunity.
