issue_comments: 1210718350


html_url: https://github.com/pydata/xarray/issues/4285#issuecomment-1210718350
issue_url: https://api.github.com/repos/pydata/xarray/issues/4285
id: 1210718350
node_id: IC_kwDOAMm_X85IKhiO
user: 1852447
created_at: 2022-08-10T14:01:48Z
updated_at: 2022-08-10T14:01:48Z
author_association: NONE

Also, on the digression: I just want to clarify where we're coming from and why we made the choices we did.

> (Digression: From my perspective part of the problem is that merely generalising numpy arrays to be ragged would have been useful for lots of people, but awkward.Array goes a lot further. It also generalises the type system, adds things like Records, and possibly adds xarray-like features. That puts awkward.Array in a somewhat ill-defined place within the wider scientific python ecosystem: it's kind of a numpy-like duck array, but can't be treated as one, it's also a more general type system, and it might even get features of higher-level data structures like xarray.)

I can see how minimal extensions of the NumPy array model to include ragged arrays would cover the majority of use-cases, though that wouldn't have been enough for our first use-case in particle physics, which looks roughly like this (with made-up numbers):

```python
collision_events = ak.Array([
    {
        "event": 0,
        "primary_vertex": {"x": 0.001, "y": -0.002, "z": 0.3},
        "electrons": [
            {"px": 1.1, "py": 2.2, "pz": 3.3, "E": 4.4, "EoverP": 1.07},
            {"px": 1.0, "py": 3.2, "pz": 3.4, "E": 4.5, "EoverP": 0.93},
        ],
        "muons": [
            {"px": 0.1, "py": 2.3, "pz": 4.3, "E": 5.4, "isolation": 0},
            {"px": 1.1, "py": 2.2, "pz": 3.3, "E": 4.4, "isolation": 0.9},
            {"px": 1.1, "py": 2.2, "pz": 3.3, "E": 4.4, "isolation": 0.8},
        ],
    },
    {
        "event": 1,
        "primary_vertex": {"x": -0.001, "y": 0.002, "z": 0.4},
        "electrons": [
            {"px": 1.0, "py": 3.2, "pz": 3.4, "E": 4.5, "EoverP": 0.93},
        ],
        "muons": [],
    },
    ...,
])
```

We needed "records with differently typed fields" and "variable-length lists" to be nestable within each other. It's even sometimes the case that one of the inner records representing a particle has another variable-length list within it, identifying the indexes of particles in the collision event that it's close to. We deliberated on whether those cross-links should allow the structure to be non-tree-like, either a DAG or to actually have cycles (https://github.com/scikit-hep/awkward/issues/178). The prior art is a C++ infrastructure that does have a full graph model: collision events represented as arbitrary C++ class instances, and those arbitrary C++ data are serialized to disk in exabytes of ROOT files. Our first problem was to get a high-performance representation of these data in Python.

For that, we didn't need missing data or heterogeneous unions (std::optional and std::variant are rare in particle physics data models), but it seemed like a good idea to complete the type system, because functions might close over the larger space of types. That ended up being true for missing data: they've come to be useful in functions such as ak.fill_none and in filtering without changing array lengths (ak.mask and array.mask). Heterogeneous unions have not been very useful. Some input data produce such types, like GeoJSON with mixed points and polygons, but the first thing one usually wants to do is restructure the data into non-unions.
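
For concreteness, a small sketch of those two operations (the array values are invented):

```python
import awkward as ak

jets_pt = ak.Array([[1.1, 2.2, 3.3], [], [4.4, 5.5]])

# array.mask filters without changing the outer length:
# events that fail the cut become None instead of disappearing
nonempty = jets_pt.mask[ak.num(jets_pt) > 0]
# [[1.1, 2.2, 3.3], None, [4.4, 5.5]]

# the first jet of each event is None where there are no jets...
leading = ak.firsts(jets_pt)
# [1.1, None, 4.4]

# ...and ak.fill_none replaces the missing values with a default
filled = ak.fill_none(leading, 0.0)
# [1.1, 0.0, 4.4]
```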

Another consideration is that this scope exactly matches Apache Arrow's (including the lack of cross-references). As such, we can use Arrow as an interchange format and Parquet as a disk format without having to exclude a subspace of types in either direction. We don't use Arrow as an internal format for performance reasons (we have node types that are lazier than Arrow's, so they're better as intermediate arrays in a multi-step calculation), but it's important to have one-to-one, minimal-computation (sometimes zero-copy) transformations to and from Arrow, as sketched below.
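
That round trip looks like this (a sketch; it requires pyarrow, and the array and file name are made up):

```python
import awkward as ak

events = ak.Array([
    {"event": 0, "hits": [1.1, 2.2, 3.3]},
    {"event": 1, "hits": []},
])

# one-to-one conversion to an Arrow array and back
arrow_array = ak.to_arrow(events)
roundtrip = ak.from_arrow(arrow_array)

# Parquet as the disk format, through the same type mapping
ak.to_parquet(events, "events.parquet")
from_disk = ak.from_parquet("events.parquet")
```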

That said, as we've been looking for use-cases beyond particle physics, we've found that most of them would be handled well by simple ragged arrays. We've also found the "just ragged arrays" part of Arrow to be the most developed, or at least the first to be developed, driven by SQL-like applications. Our unit tests in Awkward Array have revealed a lot of unhandled cases in Arrow, particularly in the Parquet serialization, which we report in JIRA (and they quickly get resolved).

Two possible conclusions:

  1. Ragged arrays are all that's really needed in most sciences; particle physics is a weird exception, and Arrow is over-engineered.
  2. Arrays of general type have as-yet undiscovered utility in most sciences: datasets have been cast into their current forms so that they can make better use of the tools that exist, not because those forms are the most natural way to frame and use the data. Computing in particle physics was siloed for decades, not heavily affected by SQL/relational/tidy ways of talking about data: maybe this is like a first contact between foreign cultures that each have something to contribute. (Particle physics analysis has been changing a lot by becoming less bound to an edit-compile-run cycle.)

If it turns out that conclusion (1) is right or more right than (2), then at least a subset of what we're working on is going to be useful to the wider community. If it's (2), though, then it's a big opportunity.
