github: issue_comments: 37 rows where issue = 667864088 sorted by updated

37 rows where issue = 667864088 sorted by updated_at descending

id	html_url	issue_url	node_id	user	created_at	updated_at ▲	author_association	body	reactions	issue
1288374461	https://github.com/pydata/xarray/issues/4285#issuecomment-1288374461	https://api.github.com/repos/pydata/xarray/issues/4285	IC_kwDOAMm_X85Mywi9	SimonHeybrock 12912489	2022-10-24T03:44:44Z	2022-11-03T17:04:15Z	NONE	Also note the Ragged Array Summit on Scientific Python.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Awkward array backend? 667864088
1302293898	https://github.com/pydata/xarray/issues/4285#issuecomment-1302293898	https://api.github.com/repos/pydata/xarray/issues/4285	IC_kwDOAMm_X85Nn22K	TomNicholas 35968931	2022-11-03T15:34:57Z	2022-11-03T15:34:57Z	MEMBER	The email that you have listed here doesn't work (bounced back). Oops - use thomas dot nicholas at columbia dot edu please!	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Awkward array backend? 667864088
1302254100	https://github.com/pydata/xarray/issues/4285#issuecomment-1302254100	https://api.github.com/repos/pydata/xarray/issues/4285	IC_kwDOAMm_X85NntIU	jpivarski 1852447	2022-11-03T15:07:36Z	2022-11-03T15:07:36Z	NONE	Send me an email address, and I'll send you the Zoom URL. The email that you have listed here: http://tom-nicholas.com/contact/ doesn't work (bounced back).	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Awkward array backend? 667864088
1302240686	https://github.com/pydata/xarray/issues/4285#issuecomment-1302240686	https://api.github.com/repos/pydata/xarray/issues/4285	IC_kwDOAMm_X85Nnp2u	TomNicholas 35968931	2022-11-03T14:58:11Z	2022-11-03T14:58:11Z	MEMBER	I should be able to join today as well @jpivarski ! Will need the zoom address	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Awkward array backend? 667864088
1297615976	https://github.com/pydata/xarray/issues/4285#issuecomment-1297615976	https://api.github.com/repos/pydata/xarray/issues/4285	IC_kwDOAMm_X85NWAxo	jpivarski 1852447	2022-10-31T20:04:12Z	2022-10-31T20:04:12Z	NONE	@milancurcic, @joshmoore, and I are all available on Thursday, November 3 at 11am U.S. Central (12pm U.S. Eastern/Florida, 5pm Central European/Germany: note the unusual U.S.-Europe difference this week, 16:00 UTC). Let's meet then! I sent a Google calendar invitation to both of you at that time, which contains a Zoom URL. If anyone else is interested, let me know and I'll send you the Zoom URL as well (just not on a public GitHub comment).	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Awkward array backend? 667864088
1295475130	https://github.com/pydata/xarray/issues/4285#issuecomment-1295475130	https://api.github.com/repos/pydata/xarray/issues/4285	IC_kwDOAMm_X85NN2G6	jpivarski 1852447	2022-10-28T21:15:55Z	2022-10-28T21:15:55Z	NONE	What do you think should be the next step? Should we plan a video call to explore options? Everyone who is interested in this, but particularly @milancurcic, please fill out this poll: https://www.when2meet.com/?17481732-uGwNn and we'll meet by Zoom (URL to be distributed later) to talk about RaggedArray. I'll pick a best time from these responses on Monday. Thanks!	{ "total_count": 2, "+1": 2, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Awkward array backend? 667864088
1287028512	https://github.com/pydata/xarray/issues/4285#issuecomment-1287028512	https://api.github.com/repos/pydata/xarray/issues/4285	IC_kwDOAMm_X85Mtn8g	TomNicholas 35968931	2022-10-21T14:15:44Z	2022-10-21T14:15:44Z	MEMBER	That sounds extremely exciting @milancurcic ! Someone dedicated who wants to make a widely-useful tool is exactly what is needed. I think there are many technical questions (and tbh I didn't really follow a lot of the details of your last comment @jpivarski), but the answers to those will likely depend on intended use cases. I'm happy to attend a video call to discuss this, and think that organising one with people interested in ragged arrays and xarray across disciplines would be a sensible next step. (You should also advertise such a meeting on the pangeo discourse - we could start a new pangeo working group like this if it goes well.)	{ "total_count": 1, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 1, "rocket": 0, "eyes": 0 }	Awkward array backend? 667864088
1283416324	https://github.com/pydata/xarray/issues/4285#issuecomment-1283416324	https://api.github.com/repos/pydata/xarray/issues/4285	IC_kwDOAMm_X85Mf2EE	SimonHeybrock 12912489	2022-10-19T04:39:06Z	2022-10-19T04:39:06Z	NONE	A possibly relevant distinction that had not occurred to me previously is the example by @milancurcic: If I understand this correctly then this type of data is essentially an array of variable-length time-series (essentially a list of lists?), i.e., there is an order within each inner list. This is conceptually different from the data I am typically dealing with, where each inner list is a list of records without specific ordering.	{ "total_count": 1, "+1": 1, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Awkward array backend? 667864088
1283043390	https://github.com/pydata/xarray/issues/4285#issuecomment-1283043390	https://api.github.com/repos/pydata/xarray/issues/4285	IC_kwDOAMm_X85MebA-	jpivarski 1852447	2022-10-18T21:46:29Z	2022-10-18T21:46:29Z	NONE	This sounds good to me! `RaggedArray` can be a well-defined subset of types that have clear use-cases (the ones @TomNicholas listed). The thing I was worried about is that most functions in Awkward Array ignore the boundary line between `RaggedArray` and non-`RaggedArray`; defining it as a new type with its own collection of functions (or just methods) in CloudDrift isolates it in a way that you can ensure that your functions stay within that boundary. To represent a `RaggedArray` without wrapping an Awkward Array, you can store it as a sequence of n offsets arrays for depth-n lists of lists and numerical contents. (Whereas an Awkward Array is a tree of 1D buffers, `RaggedArray` can be a sequence of 1D buffers.) We can then make sure that when you do want to convert it to and from Awkward Arrays, you can do so in a zero-copy way. That way, if you want to define some functions, particularly `__getitem__` with complex slices, by calling the corresponding Awkward function, you can do it by converting and then converting back, knowing that the conversion overhead is all O(1). (Same for `xarray.Dataset`!) I'm in favor of a video call meeting to discuss this. In general, I'm busiest on U.S. mornings, on Wednesday and Thursday, but perhaps you can send a when2meet or equivalent poll? One thing that could be discussed in writing (maybe more easily) is what data types you would consider in scope for `RaggedArray`. (I've reminded myself of the use-cases above, but it doesn't fully answer this question.) That is: (1) You'll want the numerical data, the end of your sequence of 1D buffers, to be arbitrary NumPy types or some reasonable subset. That's a given. (2) You'll want ragged arrays of those. 
Ragged arrays can be represented in several ways: (a) an `offsets` buffer whose length is 1 more than the length of the array. Every neighboring pair of integers is the starting and stopping index of the content of a nested list. The integers must be non-decreasing since they are the cumulative sum of list lengths. This is pyarrow's `ListArray` and Awkward's `ListOffsetArray`. (b) `starts` and `stops` buffers of the same length as the length of the array. Every `starts[i]` and `stops[i]` is the starting and stopping index of the content of a nested list. These may be equivalent to an `offsets` buffer or they can be in a random order, not cover all of the content, or cover the content multiple times. The value of such a thing is that reordering, filtering, or duplicating the set of lists is not an operation that needs to propagate through every level of the sequence of buffers, so it's good for intermediate calculations. pyarrow has no equivalent, but it's Awkward's `ListArray`. (c) a `parents` buffer with the same length as the content; each `parents[j]` indicates which list `i` the `content[j]` belongs to. They may be contiguous or not. This is a Normal Form in database normalization, and it's useful for some operations that propagate upward, such as reducers (sum, max, etc.). Neither pyarrow nor Awkward has this as a native type. It can't encode empty lists at the end of an array, so another integer would be needed to preserve that information. (3) Will you want regular arrays? If some dimensions of the array are variable-length (ragged) and some are fixed-length, that can be accomplished by adding a node without any `offsets`/`starts`/`stops`/`parents` buffer, just an integer `size`. Multiple fixed dimensions could be multiple nested nodes (which is easier) or a tuple of integers `shape`. Is the nested data constrained to be contiguous or would you also have `strides`? (They can't be byte-strides, as in NumPy, they have to count numbers of items.) Can the `size` be zero? 
If so, you'll need another integer for the length of this array. This is pyarrow's `FixedSizeListArray` and Awkward's `RegularArray` (neither of which have `strides`, and only Awkward allows `size=0`). (4) Will you want to allow for missing data? Only missing numerical values or also missing lists? Some functions naturally return missing data, such as `max` of an `axis` with variable length lists, some of which can be zero length. There's a variety of ways to represent missing data, though in a system of only nested lists, a bit-mask or byte-mask is probably best. All pyarrow node types are potentially missing, represented by a bit-mask, and Awkward has four node types for missing data, including `BitMaskedArray` and `ByteMaskedArray`. You don't want record-types or union-types, so the only questions are how to implement (2) and whether you want (3) and (4). Including a type, such as missing data, allows for more function return values but obliges you to consider that type for all function arguments. You'll want to choose carefully how you close your system. (Maybe this block of details can be copied to an issue where you're doing the development of `RaggedArray` in CloudDrift. It got longer than I had intended.)	{ "total_count": 1, "+1": 1, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Awkward array backend? 667864088
1282946332	https://github.com/pydata/xarray/issues/4285#issuecomment-1282946332	https://api.github.com/repos/pydata/xarray/issues/4285	IC_kwDOAMm_X85MeDUc	milancurcic 4133310	2022-10-18T20:11:30Z	2022-10-18T20:12:51Z	NONE	Hi All, Thank you for the detailed discussion and thank you @TomNicholas for pointing it out to me. I read the thread last week and have been digesting it. There are many details that go over my head and will keep re-reading them to develop a better understanding of the problem. Two weeks ago I started working part-time on CloudDrift. This is an NSF EarthCube-funded project led by @selipot. @philippemiron was the lead developer in the first year of the project and he laid the foundation of the data structure that we need and example notebooks. The project's purpose is to make working with Lagrangian data (primarily ocean but generalizable to other kinds) easier for scientists who consume such data while also optimizing the storage of such data. This is use case 1 in Tom's list of use cases here. Clouddrift currently provides an implementation of a `RaggedArray` class. Once instantiated with user-provided data (a collection of variable-length arrays, either manually or from dataset-specific adapters), this class allows you to get either an `awkward.Array` or an `xarray.Dataset`, and from there store to a parquet file (via awkward) or a NetCDF file (via xarray). On either end (awkward or xarray), you get the indexing convenience that comes with these libraries, and once indexed you get the NumPy set of functionality. So, `RaggedArray` serves as an intermediate structure to get you to an `awkward.Array` or an `xarray.Dataset` representations of the data, but it does not itself wrap either. Other goals of the project include providing example and tutorial notebooks, writing adapters for canonical ocean Lagrangian datasets, writing methods for oceanographic diagnostics, and more general developer/scientist advocacy kind of work. 
I am very much interested in making our `RaggedArray` class more generally useful in other fields and use cases. I am also interested in designing and implementing it toward a closer integration with xarray, since there seems to be an appetite for that. `clouddrift.RaggedArray` becoming part of xarray (via core or contrib or otherwise) would be a success story for us. However, I will need help from all of you here given your deep understanding of the internals of awkward and xarray to make it work. I'll be paid half of my day-job salary to work on this for the next two years. So, at least you know that somebody will be committing time to it, but again, I will need guidance. What do you think should be the next step? Should we plan a video call to explore options?	{ "total_count": 2, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 2, "eyes": 0 }	Awkward array backend? 667864088
1254775144	https://github.com/pydata/xarray/issues/4285#issuecomment-1254775144	https://api.github.com/repos/pydata/xarray/issues/4285	IC_kwDOAMm_X85Kyllo	joshmoore 88113	2022-09-22T09:36:26Z	2022-09-22T09:36:26Z	NONE	Definitely interested and interestedly watching 🍿 jpivarski commented on Aug 10 It shouldn't be a subclass because it doesn't satisfy a substitution principle: ak.combinations(array: ak.Array, n: int) -> ak.Array, but ak.combinations(array: RaggedArray, n: int) -> ⊥ (at best, would raise an exception because RaggedArray isn't closed under ak.combinations). A question from the perspective of Zarr (v3), does it make sense to think of this potential `RaggedArray` as a base extension that `AwkwardArray` could then build on top of? (i.e. the reverse) Or more something to keep separate and it's just a matter of the same library could be used to read (de/serialize) either? More generally, :heart: for all of this, and interested to see how I/we can possibly help.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Awkward array backend? 667864088
1252772587	https://github.com/pydata/xarray/issues/4285#issuecomment-1252772587	https://api.github.com/repos/pydata/xarray/issues/4285	IC_kwDOAMm_X85Kq8rr	jakirkham 3019665	2022-09-20T18:48:47Z	2022-09-20T18:48:47Z	NONE	cc @ivirshup @joshmoore (who may be interested in this as well)	{ "total_count": 1, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 1, "rocket": 0, "eyes": 0 }	Awkward array backend? 667864088
1227649280	https://github.com/pydata/xarray/issues/4285#issuecomment-1227649280	https://api.github.com/repos/pydata/xarray/issues/4285	IC_kwDOAMm_X85JLHEA	pbranson 13491008	2022-08-25T19:01:39Z	2022-08-25T19:01:39Z	NONE	Just adding another use-case for this discussion, Argo float data. These are oceanographic instruments that vertically profile the ocean, and the length of each profile changes: https://argopy.readthedocs.io/en/latest/data_fetching.html	{ "total_count": 2, "+1": 2, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Awkward array backend? 667864088
1227114966	https://github.com/pydata/xarray/issues/4285#issuecomment-1227114966	https://api.github.com/repos/pydata/xarray/issues/4285	IC_kwDOAMm_X85JJEnW	benbovy 4160723	2022-08-25T11:09:06Z	2022-08-25T11:09:06Z	MEMBER	Does anyone see any other potential use case? Similarly to @TomNicholas' oceanography observation data example, I was thinking about trajectories or more generally any collection of geospatial "features". An alternative approach to "ragged" arrays (i.e., arrays of lists as described by @SimonHeybrock) would be to have regular arrays with opaque objects as elements. For example, pygeos (now merged in shapely 2.0) provides vectorized operations (numpy ufuncs) for arrays of `Geometry` objects (points, polylines, polygons, etc.) and is used as a backend for spatial operations in geopandas data frames (side note: that would probably be useful in xarray too). This approach is more specific than "ragged" arrays (it heavily relies on the tools provided for dealing with the opaque objects) and in some cases it might not be ideal, e.g., when the data collected for the geospatial features is not "geometry invariant" (like for the trajectories of buoys floating and drifting in the oceans). But perhaps both approaches may co-exist in a single xarray Dataset? (e.g., feature geometry as coordinates of shapely object arrays and feature data as ragged arrays, if necessary). This seems hard. Xarray's whole model is built assuming that `dims` has type `Mapping[Hashable, int]`. It also breaks our normal concept of alignment, which we need to put coordinate variables in DataArrays alongside data variables. Scipp's ragged data can be considered a "partial sorting", to build a sort of "index". Sounds like a good use case for Xarray explicit indexes to support some of Scipp's features. I can imagine a "ragged" array coordinate coupled with a custom Xarray index to perform data selection (slicing) and alignment along the "regular" dimension(s) of the ragged array.	
{ "total_count": 2, "+1": 2, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Awkward array backend? 667864088
1216208075	https://github.com/pydata/xarray/issues/4285#issuecomment-1216208075	https://api.github.com/repos/pydata/xarray/issues/4285	IC_kwDOAMm_X85IfdzL	SimonHeybrock 12912489	2022-08-16T06:38:32Z	2022-08-16T06:42:28Z	NONE	@jpivarski Support for event data, a particular form of sparse data. I might have been misinterpreting the word "sparse data" in conversations about this. I had thought that "sparse data" is logically rectilinear but represented in memory with the zeros removed, so the internal machinery has to deal with irregular structures, but the outward API it presents is regular (dimensionality is completely described by a `shape: tuple[int]`). You are right that "sparse" is misleading. Since it is indeed most commonly used for sparse matrix/array representations we are now usually avoiding this term (and refer to it as binned data, or ragged data instead). Obviously our title page needs an update 😬 . logically rectilinear This does actually apply to Scipp's binned data. A `scipp.Variable` may have `shape=(N,M)` and be "ragged". But the "ragged" dimension is in addition to the two regular dimensions. That is, in this case we have (conceptually) a 2-D array of lists.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Awkward array backend? 667864088
1216107702	https://github.com/pydata/xarray/issues/4285#issuecomment-1216107702	https://api.github.com/repos/pydata/xarray/issues/4285	IC_kwDOAMm_X85IfFS2	SimonHeybrock 12912489	2022-08-16T03:43:29Z	2022-08-16T05:11:50Z	NONE	Generalise xarray to allow for variable-length dimensions This seems hard. Xarray's whole model is built assuming that `dims` has type `Mapping[Hashable, int]`. It also breaks our normal concept of alignment, which we need to put coordinate variables in DataArrays alongside data variables. Anecdotal evidence that this is indeed not a good solution: scipp's "ragged data" implementation was originally implemented with such a variable-length dimension support. This led to a whole series of problems, including significantly complicating `scipp.DataArray`, both in terms of code and conceptually. After this experience we switched to the current model, which exposes only the regular, aligned dimensions.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Awkward array backend? 667864088
1216144957	https://github.com/pydata/xarray/issues/4285#issuecomment-1216144957	https://api.github.com/repos/pydata/xarray/issues/4285	IC_kwDOAMm_X85IfOY9	SimonHeybrock 12912489	2022-08-16T04:54:25Z	2022-08-16T04:54:25Z	NONE	Is anyone here going to EuroScipy (two weeks from now) and interested in having a chat/discussion about ragged data?	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Awkward array backend? 667864088
1216125098	https://github.com/pydata/xarray/issues/4285#issuecomment-1216125098	https://api.github.com/repos/pydata/xarray/issues/4285	IC_kwDOAMm_X85IfJiq	SimonHeybrock 12912489	2022-08-16T04:17:52Z	2022-08-16T04:17:52Z	NONE	@danielballan mentioned that the photon community (synchrotrons/X-ray scattering) is starting to talk more and more about ragged data related to "event mode" data collection as well.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Awkward array backend? 667864088
1216123818	https://github.com/pydata/xarray/issues/4285#issuecomment-1216123818	https://api.github.com/repos/pydata/xarray/issues/4285	IC_kwDOAMm_X85IfJOq	SimonHeybrock 12912489	2022-08-16T04:15:24Z	2022-08-16T04:15:24Z	NONE	5. Neutron scattering data Scipp is an xarray-like labelled data structure for neutron scattering experiment data. On their FAQ Q titled "Why is xarray not enough", one of the things they quote is Support for event data, a particular form of sparse data. More concretely, this is essentially a 1-D (or N-D) array of random-length lists, with very small list entries. This type of data arises in time-resolved detection of neutrons in pixelated detectors. Would a `RaggedArray` class that's wrappable in xarray help with this? (cc @SimonHeybrock) Partially, but the bigger challenge may be the related algorithms, e.g., for getting data into this layout, and for switching to other ragged layouts. For context, one of the main reasons for our data layout is the ability to make cuts/slices quickly. We frequently deal with 2-D, 3-D, and 4-D data. For example, a 3-D case may be the momentum transfer $\vec Q$ in a scattering process, with a "record" for every detected neutron. Desired final resolution may exceed 1000 per dimension (of the 3 components of $\vec Q$). On top of this there may be additional dimensions relating to environment parameters of the sample under study, such as temperature, pressure, or strain. This would lead to bin-counts that cannot be handled easily (in single-node memory). A naive solution could be to simply work with something like `pandas.DataFrame`, with columns for the components of $\vec Q$ as well as the sample environment parameters. Those could then be used for grouping/histogramming to the desired 2-D cuts or slices. However, as many such slices are frequently required, this can quickly become inefficient (though there are certainly cases where it would work well, providing a simpler solution than scipp). 
Scipp's ragged data can be considered a "partial sorting", to build a sort of "index". Based on all this we can then, e.g., quickly compute high-resolution cuts. Say we are in 3-D (Qx, Qy, Qz). We would not have bin sizes that match the final resolution required by the science. Instead we could use 50x50x50 bins. Then we can very quickly produce a high-res 2-D plot (say (1000x1000), Qx, Qz or whatever), since our binned data format reduces the data/memory you have to load and consider by a factor of up to 50 (in this example).	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Awkward array backend? 667864088
1211328405	https://github.com/pydata/xarray/issues/4285#issuecomment-1211328405	https://api.github.com/repos/pydata/xarray/issues/4285	IC_kwDOAMm_X85IM2eV	jpivarski 1852447	2022-08-10T22:01:55Z	2022-08-10T22:01:55Z	NONE	This is a wonderful list; thank you! I'm not sure whether the `RaggedArray` class being proposed here would work for that use case [Alleles in Genomics]? I believe that this use-case benefits from being able to mix regular and ragged dimensions, that the data have 3 regular dimensions and 1 ragged dimension, with the ragged one as the innermost. (The RaggedArray described above has this feature.) Support for event data, a particular form of sparse data. I might have been misinterpreting the word "sparse data" in conversations about this. I had thought that "sparse data" is logically rectilinear but represented in memory with the zeros removed, so the internal machinery has to deal with irregular structures, but the outward API it presents is regular (dimensionality is completely described by a `shape: tuple[int]`). But this usage, 1-D (or N-D) array of random-length lists, with very small list entries. is definitely what we mean by a ragged array (again with the ragged dimension potentially within zero or more regular dimensions).	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Awkward array backend? 667864088
1211197176	https://github.com/pydata/xarray/issues/4285#issuecomment-1211197176	https://api.github.com/repos/pydata/xarray/issues/4285	IC_kwDOAMm_X85IMWb4	TomNicholas 35968931	2022-08-10T19:51:43Z	2022-08-10T19:56:02Z	MEMBER	Also on the digression, I just want to clarify where we're coming from, why we did the things we did. Very interesting @jpivarski - that would make a good blog post / think piece if you ever felt like it. Two possible conclusions: I'm biased in thinking that (1) is true, but then I'm not a particle physicist - the closest I came was using ROOT in undergrad extremely briefly :smile: . If it turns out that conclusion (1) is right or more right than (2), then at least a subset of what we're working on is going to be useful to the wider community. That said, as we've been looking for use-cases beyond particle physics, most of them would be handled well by simple ragged arrays. Either way, I would definitely encourage figuring out some actual use-cases before building this out :) Does anyone see any other potential use case? Now seems like a good time to list some potential use cases for a `RaggedArray` that's wrappable by xarray, and tag people who might be interested in taking the development on as a project. 1) Oceanography observation data NOAA's Global Drifter Program tracks the movement of floating buoys, each of which takes measurements at specified time intervals as it moves along. As each drifter may take a completely different path across the ocean, the length of their trajectories is variable. @dhruvbalwada pointed me to this notebook which compares analyzing drifter data using 1) xarray wrapping rectilinear arrays 2) pandas 3) `awkward.Array` Reading the notebook it seems that a new option (4) of ragged data within xarray might well be the best of both worlds for this particular use case. 
@selipot @philippemiron, is creating a `RaggedArray` class in order to wrap awkward data in xarray something that could be tackled as part of the @Cloud-Drift project? (cc @Marioherreroglez too) 2) Alleles in Genomics Allele data can have a wide variation in the number of alt alleles (most variants will have one, but a few could have thousands), as mentioned by @tomwhite in https://github.com/pystatgen/sgkit/issues/634. I'm not sure whether the `RaggedArray` class being proposed here would work for that use case? I'm also unclear if this would be useful for ANNData https://github.com/scverse/anndata/issues/744 (cc @ivirshup) 3) Neutron scattering data Scipp is an xarray-like labelled data structure for neutron scattering experiment data. On their FAQ Q titled "Why is xarray not enough", one of the things they quote is Support for event data, a particular form of sparse data. More concretely, this is essentially a 1-D (or N-D) array of random-length lists, with very small list entries. This type of data arises in time-resolved detection of neutrons in pixelated detectors. Would a `RaggedArray` class that's wrappable in xarray help with this? (cc @simonheybrock) 4) Other "Record"-like data A "Record" is for when you want to store multiple pieces of information (of possibly different types) about an "event". In `awkward` a `Record` can be contained within an `awkward.Array`. Whilst I don't think we can store awkward arrays containing Records directly in xarray (though after @shoyer's comment I'm not so sure...), what we could do is have multiple named data variables, each of which contains a `RaggedArray` of the same shape. This should be roughly equivalent IIUC. As an example of a quirky use case for record-like data, a biologist friend recently showed me a dataset of hummingbird feeding patterns. He had strapped RFID tags to hundreds of hummingbirds, then set up feeder stations equipped with radio antennae. When the birds came to feed an event would be recorded. 
As the resulting data varied with bird ID, date, and feeder, but each individual bird could visit any particular feeder any number of times on a given day, I thought he could store this data in a Ragged array within xarray with the dimension representing number of visits having variable length. There are probably a lot more possible use cases for a `RaggedArray` in xarray that I'm not currently aware of!	{ "total_count": 3, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 3, "rocket": 0, "eyes": 0 }	Awkward array backend? 667864088
1210718350	https://github.com/pydata/xarray/issues/4285#issuecomment-1210718350	https://api.github.com/repos/pydata/xarray/issues/4285	IC_kwDOAMm_X85IKhiO	jpivarski 1852447	2022-08-10T14:01:48Z	2022-08-10T14:01:48Z	NONE	Also on the digression, I just want to clarify where we're coming from, why we did the things we did. (Digression: From my perspective part of the problem is that merely generalising numpy arrays to be ragged would have been useful for lots of people, but `awkward.Array` goes a lot further. It also generalises the type system, adds things like Records, and possibly adds xarray-like features. That puts `awkward.Array` in a somewhat ill-defined place within the wider scientific python ecosystem: it's kind of a numpy-like duck array, but can't be treated as one, it's also a more general type system, and it might even get features of higher-level data structures like xarray.) I can see how minimal extensions of the NumPy array model to include ragged arrays represent the majority of use-cases, though it wouldn't have been enough for our first use-case in particle physics, which looks roughly like this (with made-up numbers):
```python
import awkward as ak

collision_events = ak.Array([
    {
        "event": 0,
        "primary_vertex": {"x": 0.001, "y": -0.002, "z": 0.3},
        "electrons": [
            {"px": 1.1, "py": 2.2, "pz": 3.3, "E": 4.4, "EoverP": 1.07},
            {"px": 1.0, "py": 3.2, "pz": 3.4, "E": 4.5, "EoverP": 0.93},
        ],
        "muons": [
            {"px": 0.1, "py": 2.3, "pz": 4.3, "E": 5.4, "isolation": 0},
            {"px": 1.1, "py": 2.2, "pz": 3.3, "E": 4.4, "isolation": 0.9},
            {"px": 1.1, "py": 2.2, "pz": 3.3, "E": 4.4, "isolation": 0.8},
        ],
    },
    {
        "event": 1,
        "primary_vertex": {"x": -0.001, "y": 0.002, "z": 0.4},
        "electrons": [
            {"px": 1.0, "py": 3.2, "pz": 3.4, "E": 4.5, "EoverP": 0.93},
        ],
        "muons": [],
    },
    ...,
])
```
We needed "records with differently typed fields" and "variable-length lists" to be nestable within each other. 
It's even sometimes the case that one of the inner records representing a particle has another variable-length list within it, identifying the indexes of particles in the collision event that it's close to. We deliberated on whether those cross-links should allow the structure to be non-tree-like, either a DAG or to actually have cycles (https://github.com/scikit-hep/awkward/issues/178). The prior art is a C++ infrastructure that does have a full graph model: collision events represented as arbitrary C++ class instances, and those arbitrary C++ data are serialized to disk in exabytes of ROOT files. Our first problem was to get a high-performance representation of these data in Python. For that, we didn't need missing data or heterogeneous unions (`std::optional` and `std::variant` are rare in particle physics data models), but it seemed like a good idea to complete the type system because functions might close over the larger space of types. That ended up being true for missing data: they've come to be useful in such functions as ak.fill_none and filtering without changing array lengths (ak.mask and array.mask). Heterogeneous unions have not been very useful. Some input data produce such types, like GeoJSON with mixed points and polygons, but the first thing one usually wants to do is restructure the data into non-unions. Another consideration is that this scope exactly matches Apache Arrow (including the lack of cross-references). As such, we can use Arrow as an interchange format and Parquet as a disk format without having to exclude a subspace of types in either direction. We don't use Arrow as an internal format for performance reasons—we have node types that are lazier than Arrow's so they're better as intermediate arrays in a multi-step calculation—but it's important to have one-to-one, minimal computation (sometimes zero-copy) transformations to and from Arrow. 
That said, as we've been looking for use-cases beyond particle physics, most of them would be handled well by simple ragged arrays. Also, we've found the "just ragged arrays" part of Arrow to be the most developed or at least the first to be developed, driven by SQL-like applications. Our unit tests in Awkward Array have revealed a lot of unhandled cases in Arrow, particularly the Parquet serialization, that we report in JIRA (and they quickly get resolved). Two possible conclusions:
1. Ragged arrays are all that's really needed in most sciences; particle physics is a weird exception, and Arrow is over-engineered.
2. Arrays of general type have as-yet undiscovered utility in most sciences: datasets have been cast into the forms they currently take so that they can make better use of the tools that exist, not because it's the most natural way to frame and use the data. Computing in particle physics had been siloed for decades, not heavily affected by SQL/relational/tidy ways of talking about data: maybe this is like a first contact of foreign cultures that each have something to contribute. (Particle physics analysis has been changing a lot by becoming less bound to an edit-compile-run cycle.)
If it turns out that conclusion (1) is right or more right than (2), then at least a subset of what we're working on is going to be useful to the wider community. If it's (2), though, then it's a big opportunity.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Awkward array backend? 667864088
1210268984	https://github.com/pydata/xarray/issues/4285#issuecomment-1210268984	https://api.github.com/repos/pydata/xarray/issues/4285	IC_kwDOAMm_X85IIz04	Illviljan 14371165	2022-08-10T07:24:55Z	2022-08-10T07:25:24Z	MEMBER	If that point is negotiable, I could introduce an `ak.shape_dtype(array)` function that returns a shape and dtype if `array` has the right properties and raise an exception if it doesn't. That would be more normal: you're asking if it satisfies a specific constraint, and if so, to get some information about it. Then we would also be able to deal with the fact that * some people are going to want the `shape` to specify the maximum of "var" dimensions (what you asked for): "virtually padding", * some people are going to want the `shape` to specify the minimum of "var" dimensions because that tells you what upper bounds are legal to slice: "virtually truncating", * and some people are going to want the string `"var"` or maybe `None` or maybe `np.nan` in place of "var" dimensions because no integer is correct. Then they would have to deal with the fact that this `shape` is not a tuple of integers. Getting non-ints out of `.shape` is not uncommon already. dask, which is one of our favorite duck arrays, uses `np.nan` to imply undefined shape (`-> tuple(int \| type(np.nan), ...)`), which happens EVERY time a dask array is masked. I don't think it would be super strange to get `-> tuple(int \| type(np.nan) \| Literal["var"], ...)` out of awkward's shape.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Awkward array backend? 667864088
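To make the three `shape` conventions listed above concrete, here is a minimal sketch on plain nested Python lists rather than on `ak.Array`. The helper names `lengths_per_axis` and `ragged_shape`, and the `var` mode strings, are hypothetical and exist only for this illustration; nothing here is Awkward's or xarray's API.

```python
import math


def lengths_per_axis(nested):
    """For each nesting depth, collect the lengths of all lists at that depth."""
    per_axis = []
    level = [nested]
    while level and all(isinstance(x, list) for x in level):
        per_axis.append([len(x) for x in level])
        level = [item for x in level for item in x]
    return per_axis


def ragged_shape(nested, var="max"):
    """Hypothetical helper: report a shape under one of the conventions above.

    var="max"  -> "virtually padding"
    var="min"  -> "virtually truncating"
    var="nan"  -> dask-style np.nan-like placeholder
    otherwise  -> the string token "var"
    """
    shape = []
    for lengths in lengths_per_axis(nested):
        lo, hi = min(lengths), max(lengths)
        if lo == hi:
            shape.append(hi)          # regular axis: one well-defined length
        elif var == "max":
            shape.append(hi)
        elif var == "min":
            shape.append(lo)
        elif var == "nan":
            shape.append(math.nan)
        else:
            shape.append("var")
    return tuple(shape)


ragged = [[1, 2, 3, 100], [4, 5, 6]]
print(ragged_shape(ragged, "max"))    # (2, 4)
print(ragged_shape(ragged, "min"))    # (2, 3)
print(ragged_shape(ragged, "token"))  # (2, 'var')
```

Only the first two conventions return a tuple of integers, which is the form xarray currently assumes for `shape`.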
1209967070	https://github.com/pydata/xarray/issues/4285#issuecomment-1209967070	https://api.github.com/repos/pydata/xarray/issues/4285	IC_kwDOAMm_X85IHqHe	TomNicholas 35968931	2022-08-09T22:47:24Z	2022-08-10T05:50:40Z	MEMBER	Thanks for the huge response there @jpivarski ! Ragged array is not a specialized subset of types within Awkward Array. There are `ak.*` functions that would take you out of this subset. However (thinking it through...) I don't think slices, ufuncs, or reducers would take you out of this subset. This is an important point which I meant to ask about earlier. We need a `RaggedArray` class which always returns other `RaggedArray` instances (i.e. the set of ragged arrays is closed under the set of numpy-like methods / functions that xarray might call upon it). To answer your question about monkey-patching, I think it would be best to make a wrapper. You don't want to give all `ak.Array` instances properties named shape and dtype, since those properties won't make sense for general types. This is exactly the reason we had to back off on making `ak.Array` inherit from `pandas.api.extensions.ExtensionArray`: Pandas wanted it to have methods with names and behaviors that would have been misleading for Awkward Arrays. If you want a `RaggedArray` class that is more specific (i.e. defines more attributes) than `awkward.Array`, then surely the "correct" thing to do would be to subclass though? I mean for eventual integration of `RaggedArray` within awkward's codebase. Thus, it can act as a gatekeeper of what kinds of operations are allowed: `ak.*` won't recognize `RaggedArray`, which is good because some `ak.*` functions would take you out of this "ragged array" subset of types. You can add some non-ufunc NumPy functions with `__array_function__`, but only the ones that make sense for this subset of types. That makes sense.
And if you subclassed then I guess you would also need to change those `ak.*` functions to not accept `RaggedArray`, so maybe wrapping is better... Thanks for the wrapping example! I think there is a bug with your `.shape` method though - if I put your code snippets in a file then they return the wrong results:
```python
In [1]: from ragged import RaggedArray

In [2]: ra = RaggedArray([[1, 2, 3], [4, 5]])

In [3]: ra.ndim
Out[3]: 1

In [4]: ra.shape
Out[4]: [3]
```
(I expected `2` and `(2, 3)` respectively). I think perhaps `context["shape"]` is being overwritten as it recurses through the data structure, when it should be being appended? I would really like to try testing the `RaggedArray` class with our WIP public framework for testing duck array compatibility (#6894). If we can get a very basic wrapper then I could make a PR to add `RaggedArray` to awkward, and import xarray's new tests to test it with.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Awkward array backend? 667864088
1210190649	https://github.com/pydata/xarray/issues/4285#issuecomment-1210190649	https://api.github.com/repos/pydata/xarray/issues/4285	IC_kwDOAMm_X85IIgs5	shoyer 1217238	2022-08-10T05:48:47Z	2022-08-10T05:48:47Z	MEMBER	I am tempted to suggest that the right way to handle Awkward array is to treat "var" dimensions similar to NumPy's structured dtypes, with `shape` only handling non-variable dimensions. The uniform dimensions are the only ones for which Xarray's API is going to work properly out of the box, and Awkward array properly already has the right tools for working with ragged dimensions. Either way, I would definitely encourage figuring out some actual use-cases before building this out :)	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Awkward array backend? 667864088
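One way to read the suggestion above is that `shape` would report only the leading uniform axes and leave everything from the first "var" axis onward to the element type. A rough sketch on plain nested lists follows; `leading_uniform_shape` is a hypothetical name used only here and is not part of xarray or Awkward Array.

```python
def leading_uniform_shape(nested):
    """Return the lengths of the leading regular axes, stopping at the first ragged one."""
    shape = []
    level = [nested]
    while level and all(isinstance(x, list) for x in level):
        lengths = {len(x) for x in level}
        if len(lengths) != 1:
            break                      # first variable-length axis: stop here
        shape.append(lengths.pop())
        level = [item for x in level for item in x]
    return tuple(shape)


print(leading_uniform_shape([[1, 2, 3, 100], [4, 5, 6]]))  # (2,)
print(leading_uniform_shape([[1, 2, 3], [4, 5, 6]]))       # (2, 3)
```

Under this reading, the ragged example from earlier in the thread would look one-dimensional to xarray, with each element being a variable-length list.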
1210175870	https://github.com/pydata/xarray/issues/4285#issuecomment-1210175870	https://api.github.com/repos/pydata/xarray/issues/4285	IC_kwDOAMm_X85IIdF-	TomNicholas 35968931	2022-08-10T05:25:17Z	2022-08-10T05:32:13Z	MEMBER	Since `RaggedArray` can't be used everywhere that an `ak.Array` can be used, it shouldn't be a subclass. I see, makes sense. I hadn't been thinking that RaggedArray is something we'd put in the general Awkward Array library. Oh I was just thinking if we're building a new class that is tightly coupled to `awkward.Array` then it should live in `awkward`. (I also would like someone else to maintain it ideally! :sweat_smile: ) I was thinking of it only as a way to define "the subset of Awkward Arrays that xarray uses," which would live in xarray. I don't think it's within scope of xarray to offer a numpy-like array class in our main library - we don't do this for any other case! Or it could be a third package, as awkward-pandas is to awkward and pandas. However we could definitely have a separate `awkward-xarray` package that lives in xarray-contrib and provides a `RaggedArray` class. (see pint-xarray for something sort of similar.) That seems fine, all it takes is some keen bean to take our prototypes here and turn them into something usable... (Imagine reading the docs and it says, "You can apply this function to ak.Array, but not to ak.RaggedArray." Or "this is an ak.Array that happens to be ragged, but not a ak.RaggedArray.") Yeah that wouldn't be ideal. (Digression: From my perspective part of the problem is that merely generalising numpy arrays to be ragged would have been useful for lots of people, but `awkward.Array` goes a lot further. It also generalises the type system, adds things like Records, and possibly adds xarray-like features. 
That puts `awkward.Array` in a somewhat ill-defined place within the wider scientific python ecosystem: it's kind of a numpy-like duck array, but can't be treated as one, it's also a more general type system, and it might even get features of higher-level data structures like xarray.) some people are going to want the `shape` to specify the maximum of "var" dimensions (what you asked for): "virtually padding", some people are going to want the `shape` to specify the minimum of "var" dimensions because that tells you what upper bounds are legal to slice: "virtually truncating", and some people are going to want the string "var" or maybe `None` or maybe `np.nan` in place of "var" dimensions because no integer is correct. Then they would have to deal with the fact that this shape is not a tuple of integers. That's very interesting. I'm not immediately sure which of those would be best for xarray wrapping - I think it's plausible that we could eventually support any of those options... ((3) through the issues Deepak linked to (#5168, #2801).) I fixed the code that I wrote in the comments above for posterity. Thanks for fixing that, and for all the explanations!	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Awkward array backend? 667864088
1210023820	https://github.com/pydata/xarray/issues/4285#issuecomment-1210023820	https://api.github.com/repos/pydata/xarray/issues/4285	IC_kwDOAMm_X85IH3-M	jpivarski 1852447	2022-08-10T00:36:42Z	2022-08-10T00:36:42Z	NONE	If you want a `RaggedArray` class that is more specific (i.e. defines more attributes) than `awkward.Array`, then surely the "correct" thing to do would be be to subclass though? It shouldn't be a subclass because it doesn't satisfy a substitution principle: `ak.combinations(array: ak.Array, n: int) -> ak.Array`, but `ak.combinations(array: RaggedArray, n: int) -> ⊥` (at best, would raise an exception because `RaggedArray` isn't closed under `ak.combinations`). Since `RaggedArray` can't be used everywhere that an `ak.Array` can be used, it shouldn't be a subclass. I mean for eventual integration of `RaggedArray` within awkward's codebase. Oh....... I hadn't been thinking that RaggedArray is something we'd put in the general Awkward Array library. I was thinking of it only as a way to define "the subset of Awkward Arrays that xarray uses," which would live in xarray. I don't want to introduce another level of type-specificity to the system, since that would make things harder to understand. (Imagine reading the docs and it says, "You can apply this function to ak.Array, but not to ak.RaggedArray." Or "this is an ak.Array that happens to be ragged, but not a ak.RaggedArray.") So let me rethink your original idea of adding `shape` and `dtype` properties to all ak.Arrays. Perhaps they should raise exceptions when the array is not a ragged array? People don't usually expect properties to raise exceptions, and you really need them to be properties with the exact spellings "`shape`" and "`dtype`" to get what you want. If that point is negotiable, I could introduce an `ak.shape_dtype(array)` function that returns a shape and dtype if `array` has the right properties and raise an exception if it doesn't. 
That would be more normal: you're asking if it satisfies a specific constraint, and if so, to get some information about it. Then we would also be able to deal with the fact that some people are going to want the `shape` to specify the maximum of "var" dimensions (what you asked for): "virtually padding", some people are going to want the `shape` to specify the minimum of "var" dimensions because that tells you what upper bounds are legal to slice: "virtually truncating", and some people are going to want the string `"var"` or maybe `None` or maybe `np.nan` in place of "var" dimensions because no integer is correct. Then they would have to deal with the fact that this `shape` is not a tuple of integers. Or maybe the best way to present it is with a `min_shape` and a `max_shape`, whose items are equal where the array is regular. Anyway, you can see why I'm loath to add a property to ak.Array that's just named "`shape`"? That has the potential for misinterpretation. (Pandas wanted arrays to have a `shape` that is always equal to `(len(array),)`; if we satisfied that constraint, we couldn't satisfy yours!) In fact, "`dtype`" of a general array would be misleading, too, though a list of unique "`dtypes`" of all the leaf-nodes could be a useful thing to have. (2 shapes and n dtypes!) But if I'm providing it as an extra function, or as a trio of properties named `min_shape`, `max_shape`, and `dtypes` which are all spelled differently from the `shape` and `dtype` you want, you'd then be forced to wrap it as a RaggedArray type within xarray again, anyway. Which, as a reminder, is what we're doing for Pandas: https://github.com/intake/awkward-pandas lives outside the Awkward codebase and it wraps ak.Array to put them in Series. So in the end, I just came back to where we started: xarray would own the RaggedArray wrapper. Or it could be a third package, as awkward-pandas is to awkward and pandas. (I expected `2` and `(3, 2)` respectively). 
I think perhaps `context["shape"]` is being overwritten as it recurses through the data structure, when it should be being appended? No, I initialized it incorrectly: it should have started as `context = {"shape": [len(array)]}` and then recurse from there. My previous example also had the wrong output, but I didn't count square brackets carefully enough to have caught it. (By the way, not copying the context is why it's called "lateral"; if a copied dict is needed, it's "depth_context". I just went back and checked: yes, they're being handled appropriately.) I fixed the code that I wrote in the comments above for posterity.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Awkward array backend? 667864088
1208723159	https://github.com/pydata/xarray/issues/4285#issuecomment-1208723159	https://api.github.com/repos/pydata/xarray/issues/4285	IC_kwDOAMm_X85IC6bX	jpivarski 1852447	2022-08-08T23:30:12Z	2022-08-10T00:02:44Z	NONE	Given that you have an array of only list-type, regular-type, and numpy-type (which the `prepare` function, above, guarantees), here's a one-pass function to get the dtype and shape:
```python
def shape_dtype(layout, lateral_context, **kwargs):
    if layout.is_RegularType:
        lateral_context["shape"].append(layout.size)
    elif layout.is_ListType:
        max_size = ak.max(ak.num(layout))
        lateral_context["shape"].append(max_size)
    elif layout.is_NumpyType:
        lateral_context["dtype"] = layout.dtype
    else:
        raise AssertionError(f"what? {layout.form.type}")

context = {"shape": [len(array)]}
array.layout.recursively_apply(
    shape_dtype, lateral_context=context, return_array=False
)
# check context for "shape" and "dtype"
```
Here's the application on an array of mixed regular and irregular lists:
```python
>>> array = ak.to_regular(ak.Array([[[[1, 2, 3], []]], [[[4], [5]], [[6, 7], [8]]]]), axis=2)
>>> print(array.type)
2 * var * 2 * var * int64
>>> context = {"shape": [len(array)]}
>>> array.layout.recursively_apply(
...     shape_dtype, lateral_context=context, return_array=False
... )
>>> context
{'shape': [2, 2, 2, 3], 'dtype': dtype('int64')}
```
(This `recursively_apply` is a Swiss Army knife for restructuring or getting data out of layouts that we use internally all over the codebase, and intend to make public in v2: https://github.com/scikit-hep/awkward/issues/516.) To answer your question about monkey-patching, I think it would be best to make a wrapper. You don't want to give all `ak.Array` instances properties named `shape` and `dtype`, since those properties won't make sense for general types.
This is exactly the reason we had to back off on making `ak.Array` inherit from `pandas.api.extensions.ExtensionArray`: Pandas wanted it to have methods with names and behaviors that would have been misleading for Awkward Arrays. We think we'll be able to reintroduce Awkward Arrays as Pandas columns by wrapping them; that's what we're doing differently this time. Here's a start of a wrapper:
```python
class RaggedArray:
    def __init__(self, array_like):
        layout = ak.to_layout(array_like, allow_record=False, allow_other=False)
        behavior = None
        if isinstance(array_like, ak.Array):
            behavior = array_like.behavior
        self._array = ak.Array(layout.recursively_apply(prepare), behavior=behavior)

        context = {"shape": [len(self._array)]}
        self._array.layout.recursively_apply(
            shape_dtype, lateral_context=context, return_array=False
        )
        self._shape = context["shape"]
        self._dtype = context["dtype"]

    def __repr__(self):
        # this is pretty cheesy
        return "<Ragged" + repr(self._array)[1:]

    @property
    def dtype(self):
        return self._dtype

    @property
    def shape(self):
        return self._shape

    def __getitem__(self, where):
        # unwrap RaggedArray indexers before delegating to ak.Array
        if isinstance(where, RaggedArray):
            where = where._array
        if isinstance(where, tuple):
            where = tuple(x._array if isinstance(x, RaggedArray) else x for x in where)

        out = self._array[where]
        if isinstance(out, ak.Array):
            return RaggedArray(out)
        else:
            return out

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        inputs = [x._array if isinstance(x, RaggedArray) else x for x in inputs]
        out = self._array.__array_ufunc__(ufunc, method, *inputs, **kwargs)
        return RaggedArray(out)

    def sum(self, axis=None, keepdims=False, mask_identity=False):
        out = ak.sum(self._array, axis=axis, keepdims=keepdims, mask_identity=mask_identity)
        if isinstance(out, ak.Array):
            return RaggedArray(out)
        else:
            return out
```
It keeps an `_array` (`ak.Array`), performs all internal operations on the `ak.Array` level (unwrapping RaggedArrays if necessary), but returns RaggedArrays (if the output is not scalar).
It handles only the operations you want it to: this one handles all the complex slicing, NumPy ufuncs, and one reducer, `sum`. Thus, it can act as a gatekeeper of what kinds of operations are allowed: `ak.*` won't recognize RaggedArray, which is good because some `ak.*` functions would take you out of this "ragged array" subset of types. You can add some non-ufunc NumPy functions with `__array_function__`, but only the ones that make sense for this subset of types. I meant to say something earlier about why we go for full generality in types: it's because some of the things we want to do, such as ak.cartesian, require more complex types, and as soon as one function needs it, the whole space needs to be enlarged. For the first year of Awkward Array use, most users wanted it for plain ragged arrays (based on their bug-reports and questions), but after about a year, they were asking about missing values and records, too, because you eventually need them unless you intend to work within a narrow set of functions. Union arrays are still not widely used, but they can come from some file formats. Some GeoJSON files that I looked at had longitude, latitude points in different list depths because some were points and some were polygons, disambiguated by a string label. That's not good to work with (we can't handle that in Numba, for instance), but if you select all points with some slice, put them in one array, and select all polygons with another slice, putting them in their own array, these each become trivial unions, and that's why I added the squashing of trivial unions to the `prepare` function example above.	{ "total_count": 3, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 3, "rocket": 0, "eyes": 0 }	Awkward array backend? 667864088
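The gatekeeping idea via `__array_function__` can be sketched without Awkward at all. The example below wraps a plain NumPy array and follows the dispatch pattern described in NEP 18; `GatedArray`, `HANDLED`, and `implements` are hypothetical names for this illustration, and a real `RaggedArray` would forward to `ak.sum` and friends instead.

```python
import numpy as np

HANDLED = {}  # NumPy function -> implementation allowed for the wrapper


def implements(numpy_function):
    """Register an implementation for one NumPy function (hypothetical helper)."""
    def decorator(func):
        HANDLED[numpy_function] = func
        return func
    return decorator


class GatedArray:
    """Toy wrapper: only explicitly registered NumPy functions are accepted."""

    def __init__(self, data):
        self._data = np.asarray(data)

    def __array_function__(self, func, types, args, kwargs):
        if func not in HANDLED:
            return NotImplemented  # NumPy turns this into a TypeError
        return HANDLED[func](*args, **kwargs)


@implements(np.sum)
def _sum(wrapped, axis=None):
    return np.sum(wrapped._data, axis=axis)


g = GatedArray([1, 2, 3])
print(np.sum(g))  # 6
try:
    np.mean(g)    # not registered, so the wrapper refuses it
except TypeError as err:
    print("rejected:", type(err).__name__)
```

The allow-list is the point of the design: any function that is not registered fails loudly instead of silently leaving the supported subset of types.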
1208646168	https://github.com/pydata/xarray/issues/4285#issuecomment-1208646168	https://api.github.com/repos/pydata/xarray/issues/4285	IC_kwDOAMm_X85ICnoY	jpivarski 1852447	2022-08-08T21:46:57Z	2022-08-08T21:46:57Z	NONE	The passing on of behavior is just to not break applications that depend on it. I did that just for correctness. Monkey-patching will add the desired properties to the `ak.Array` class (a part of the problem I haven't addressed yet), though it would do so globally for all `ak.Arrays`, including those that are not simple ragged arrays. The function I wrote above would take a general array and simplify it to a ragged array or die trying. Ragged array is not a specialized subset of types within Awkward Array. There are `ak.*` functions that would take you out of this subset. However (thinking it through...) I don't think slices, ufuncs, or reducers would take you out of this subset. More later...	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Awkward array backend? 667864088
1208617600	https://github.com/pydata/xarray/issues/4285#issuecomment-1208617600	https://api.github.com/repos/pydata/xarray/issues/4285	IC_kwDOAMm_X85ICgqA	TomNicholas 35968931	2022-08-08T21:15:01Z	2022-08-08T21:15:27Z	MEMBER	You mentioned union arrays, but for completeness, the type system in Awkward Array has ... Here's a way to determine if an array (Python type ak.Array) is in that subset That's very helpful, thank you! `python ak.Array(array.layout.recursively_apply(prepare), behavior=array.behavior)` (FWIW I find the "behavior" stuff very confusing in general, even after reading the docs page on it. I don't really understand why I can't just reimplement my monkey-patched example above by subclassing `ak.Array`, or should I be wrapping it?) it would be possible to define shape with some token for the variable-length dimensions and dtype. How would I do this without monkey-patching? All I really want (and I hazard all that most xarray users want) is to be able to import some class from `awkward` that offers only the simplest possible Ragged Array, that conforms to the data API standard (i.e. defines `shape` and `dtype`). Oh, if you're replacing variable-length dimensions with the maximum length in that dimension, what about actually padding the array with ak.pad_none? What's the benefit of doing this over just using `ak.num` on each axis like I did above? That uses all the memory of a padded array, but it's what people use now if they want to convert Awkward data into non-Awkward data (maybe passing the final step to ak.to_numpy). I can see that this might be useful in xarray's `.to_numpy` methods though. This is exciting though @jpivarski !	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Awkward array backend? 667864088
1208568777	https://github.com/pydata/xarray/issues/4285#issuecomment-1208568777	https://api.github.com/repos/pydata/xarray/issues/4285	IC_kwDOAMm_X85ICUvJ	jpivarski 1852447	2022-08-08T20:18:30Z	2022-08-08T20:18:30Z	NONE	You mentioned union arrays, but for completeness, the type system in Awkward Array has
* numeric primitives: `issubclass(dtype.type, (np.bool_, np.number, np.datetime64, np.timedelta64))` (including complex)
* variable-length lists
* regular-length lists
* missing data (through masks and indexes, not NaNs)
* record structures
* heterogeneous unions
You're interested in a subset of this type system, but that subset doesn't just exclude unions, it also excludes records. If you have an xarray, you don't need top-level records since those could just be the columns of an xarray, but some data source might provide records nested within variable-length lists (very common in HEP) or other nesting. It would have to be explicitly excluded. That leaves the possibility of missing lists and missing numeric primitives. Missing lists could be turned into empty lists (Google projects like Protocol Buffers often make that equivalence) and missing numbers could be turned into NaN if you're willing to lose integer-ness.
Here's a way to determine if an `array` (Python type `ak.Array`) is in that subset and to pre-process it, ensuring that you only have numbers, variable-length, and regular-length lists (in Awkward version 2, so note the "`_v2`"):
```python
import awkward._v2 as ak
import numpy as np

def prepare(layout, continuation, **kwargs):
    if layout.is_RecordType:
        raise NotImplementedError("no records!")
    elif layout.is_UnionType:
        # a union is "trivial" if every element carries the same tag
        if len(layout) == 0 or np.all(layout.tags == layout.tags[0]):
            return layout.project(layout.tags[0]).recursively_apply(prepare)
        else:
            raise NotImplementedError("no non-trivial unions!")
    elif layout.is_OptionType:
        next = continuation()  # fully recurse
        content_type = next.content.form.type
        if isinstance(content_type, ak.types.NumpyType):
            return ak.fill_none(next, np.nan, axis=0, highlevel=False)
        elif isinstance(content_type, ak.types.ListType):
            return ak.fill_none(next, [], axis=0, highlevel=False)
        elif isinstance(content_type, ak.types.RegularType):
            return ak.fill_none(next.toListOffsetArray64(False), [], axis=0, highlevel=False)
        else:
            raise AssertionError(f"what? {content_type}")

ak.Array(array.layout.recursively_apply(prepare), behavior=array.behavior)
```
It should catch all the cases and doesn't rely on string-processing the type's DataShape representation. Given that you're working within that subset, it would be possible to define `shape` with some token for the variable-length dimensions and `dtype`. I can follow up with another message (I have to deal with something else at the moment). Oh, if you're replacing variable-length dimensions with the maximum length in that dimension, what about actually padding the array with ak.pad_none?
```python
ak.fill_none(ak.pad_none(array, ak.max(ak.num(array))), np.nan)
```
The above would have to be expanded to get every `axis`, but it makes all nested lists have the length of the longest one by padding with `None`, then replaces those `None` values with `np.nan`.
That uses all the memory of a padded array, but it's what people use now if they want to convert Awkward data into non-Awkward data (maybe passing the final step to ak.to_numpy).	{ "total_count": 1, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 1 }	Awkward array backend? 667864088
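For readers without Awkward installed, the effect of the pad-then-fill step described above can be sketched with NumPy alone, for one level of nesting. `pad_to_rectangle` is a hypothetical helper written for this illustration; it is not the `ak.pad_none` / `ak.fill_none` implementation, only the same idea on a list of lists.

```python
import numpy as np


def pad_to_rectangle(rows, fill=np.nan):
    """Pad a list of variable-length rows to the longest row, filling gaps with `fill`."""
    width = max((len(r) for r in rows), default=0)
    out = np.full((len(rows), width), fill, dtype=float)  # float, so NaN is representable
    for i, r in enumerate(rows):
        out[i, :len(r)] = r
    return out


padded = pad_to_rectangle([[1, 2, 3, 100], [4, 5, 6]])
print(padded.shape)  # (2, 4)
print(padded)
```

As noted above, this costs the full memory of the padded array, and integer data becomes floating point because NaN has no integer representation.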
1203295236	https://github.com/pydata/xarray/issues/4285#issuecomment-1203295236	https://api.github.com/repos/pydata/xarray/issues/4285	IC_kwDOAMm_X85HuNQE	jpivarski 1852447	2022-08-02T23:03:16Z	2022-08-02T23:03:16Z	NONE	Hi! I will be looking deeply into this when I get back from traveling (next week). Just to let you know that I saw this and I'm interested. Thanks!	{ "total_count": 2, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 2, "rocket": 0, "eyes": 0 }	Awkward array backend? 667864088
1203256821	https://github.com/pydata/xarray/issues/4285#issuecomment-1203256821	https://api.github.com/repos/pydata/xarray/issues/4285	IC_kwDOAMm_X85HuD31	dcherian 2448579	2022-08-02T22:01:43Z	2022-08-02T22:01:43Z	MEMBER	Cool experiment Tom. Generalise xarray to allow for variable-length dimensions This is somewhat similar to supporting nan-shaped dask arrays (https://github.com/pydata/xarray/issues/5168, https://github.com/pydata/xarray/issues/2801).	{ "total_count": 1, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 1 }	Awkward array backend? 667864088
1200110315	https://github.com/pydata/xarray/issues/4285#issuecomment-1200110315	https://api.github.com/repos/pydata/xarray/issues/4285	IC_kwDOAMm_X85HiDrr	TomNicholas 35968931	2022-07-30T07:40:59Z	2022-07-30T07:40:59Z	MEMBER	So I actually think we can do this, with some caveats. I recently found a cool dataset with ragged-like data which has rekindled my interest in this interfacing, and given me a real example to try it out with. As far as I understand it the main problem is that awkward arrays don't define a `shape` or `dtype` attribute. Instead they follow a different model (the "datashape" model). Xarray expects `shape` and `dtype` to be defined, and given that those attributes are in the data API standard, this is a pretty reasonable expectation for most cases. (There is a useful discussion here on the data-apis consortium repo about why awkward arrays don't define these attributes in general.) Conceptually though, it seems to me that `shape` and `dtype` do make sense for Awkward arrays, at least for some subset of them, because Awkward's "type" is clearly related to the normal notion of `shape` and `dtype`. Let's take an Awkward array that can be coerced directly to a numpy array:
```python
In [27]: rect = ak.Array([[1, 2, 3], [4, 5, 6]])
    ...: rect
Out[27]: <Array [[1, 2, 3], [4, 5, 6]] type='2 * var * int64'>

In [28]: np.array(rect)
Out[28]:
array([[1, 2, 3],
       [4, 5, 6]])
```
Here there is a clear correspondence: the first axis of the awkward array has length 2, and because in this case the second axis has a consistent length of 3, we can coerce this to a numpy array with `shape=(2,3)`. The dtype also makes sense, because in this case the awkward array only contains data of one type, an `int64`. Now imagine a "ragged" (or "jagged") array, which is like a numpy array except that the lengths along one (or more) of the axes can be variable. Awkward allows this, e.g.
```python
In [29]: ragged = ak.Array([[1, 2, 3, 100], [4, 5, 6]])
    ...: ragged
Out[29]: <Array [[1, 2, 3, 100], [4, 5, 6]] type='2 * var * int64'>
```
but a direct coercion to numpy will fail. However we still conceptually have a "shape". It's either `(2, "var")`, where "var" means a variable length across the other axes, or alternatively we could say the shape is `(2, 4)`, where `4` is simply the maximum length along the variable-length axis. The latter interpretation is kind of similar to sparse arrays. In the second case you can still read off the dtype too. However awkward also allows "Union types", which basically means that one array can contain data of multiple numpy dtypes. Unfortunately this seems to completely break the numpy / xarray model, but we can completely ignore this problem if we simply say that xarray should only try to wrap awkward arrays with non-Union types. I think that's okay - a ragged-length array with a fixed dtype would still be extremely useful! So if we want to wrap an (non-union type) awkward array instance like `ragged` in xarray we have to do one of two things:
1) Generalise xarray to allow for variable-length dimensions
This seems hard. Xarray's whole model is built assuming that `dims` has type `Mapping[Hashable, int]`. It also breaks our normal concept of alignment, which we need to put coordinate variables in DataArrays alongside data variables. It would also mean a big change to xarray in order to support one unusual type of array, that goes beyond the data API standard. That breaks xarray's general design philosophy of providing a general wrapper and delegating to domain-specific array implementations / backends / etc. for specificity.
2) Expose a version of `shape` and `dtype` on Awkward arrays
This doesn't seem as hard, at least for non-union type awkward arrays.
In fact this crude monkey-patching seems to mostly work:
```python
In [1]: from awkward import Array, num
   ...: import numpy as np

In [2]: def get_dtype(self) -> np.dtype:
   ...:     if "Union" in str(self.type):
   ...:         raise ValueError("awkward arrays with Union types can't be expressed in terms of a single numpy dtype")
   ...:
   ...:     datatype = str(self.type).split(" * ")[-1]
   ...:
   ...:     if datatype == "string":
   ...:         return np.dtype("str")
   ...:     else:
   ...:         return np.dtype(datatype)
   ...:

In [3]: def get_shape(self):
   ...:     if "Union" in str(self.type):
   ...:         raise ValueError("awkward arrays with Union types can't be expressed in terms of a single numpy dtype")
   ...:
   ...:     lengths = str(self.type).split(" * ")[:-1]
   ...:
   ...:     for axis in range(self.ndim):
   ...:         if lengths[axis] == "var":
   ...:             lengths[axis] = np.max(num(self, axis))
   ...:         else:
   ...:             lengths[axis] = int(lengths[axis])
   ...:
   ...:     return tuple(lengths)
   ...:

In [4]: def get_size(self):
   ...:     return np.prod(get_shape(self))
   ...:

In [5]: setattr(Array, 'dtype', property(get_dtype))
   ...: setattr(Array, 'shape', property(get_shape))
   ...: setattr(Array, 'size', property(get_size))
```
Now if we make the same ragged array but with the monkey-patched class, we have a sensible return value for `dtype`, `shape`, and `size`, which means that the xarray constructors will accept our Array now!
```python
In [6]: ragged = Array([[1, 2, 3, 100], [4, 5, 6]])

In [7]: import xarray as xr

In [8]: da = xr.DataArray(ragged, dims=['x', 't'])

In [17]: da
Out[17]:
<xarray.DataArray (x: 2, t: 4)>
<Array [[1, 2, 3, 100], [4, 5, 6]] type='2 * var * int64'>
Dimensions without coordinates: x, t

In [18]: da.dtype
Out[18]: dtype('int64')

In [19]: da.size
Out[19]: 8

In [20]: da.shape
Out[20]: (2, 4)
```
Promising...
Let's try indexing:

```python
In [21]: da.isel(t=2)
Out[21]:
<xarray.DataArray (x: 2)>
<Array [3, 6] type='2 * int64'>
Dimensions without coordinates: x

In [22]: da.isel(t=4)
ValueError                                Traceback (most recent call last)
Input In [22], in <cell line: 1>()
----> 1 da.isel(t=4)

...

File ~/miniconda3/envs/hummingbirds/lib/python3.10/site-packages/awkward/highlevel.py:991, in Array.__getitem__(self, where)
    579 """
    580 Args:
    581     where (many types supported; see below): Index of positions to
   (...)
    988 have the same dimension as the array being indexed.
    989 """
    990 if not hasattr(self, "_tracers"):
--> 991     tmp = ak._util.wrap(self.layout[where], self._behavior)
    992 else:
    993     tmp = ak._connect._jax.jax_utils._jaxtracers_getitem(self, where)

ValueError: in ListOffsetArray64 attempting to get 4, index out of range

(https://github.com/scikit-hep/awkward-1.0/blob/1.8.0/src/cpu-kernels/awkward_NumpyArray_getitem_next_at.cpp#L21)
```

That's what should happen - xarray delegates the indexing to the underlying array, which throws an error if there is a problem.
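One thing worth keeping in mind with the padded shape `(2, 4)` is that an index can lie inside the reported bounds and still be missing from a shorter row. A small pure-Python model of indexing along the ragged axis makes this concrete. This is only a sketch: `take_along_ragged_axis` is a hypothetical name, and it is not how awkward implements indexing.

```python
def take_along_ragged_axis(rows, index):
    """Pick element `index` from every row, failing if any row is too short."""
    picked = []
    for row_number, row in enumerate(rows):
        if not -len(row) <= index < len(row):
            raise IndexError(
                f"index {index} out of range for row {row_number} of length {len(row)}"
            )
        picked.append(row[index])
    return picked


ragged = [[1, 2, 3, 100], [4, 5, 6]]
third = take_along_ragged_axis(ragged, 2)  # every row has an element at position 2

try:
    # position 3 is inside the padded shape (2, 4), but the second row has only 3 items
    take_along_ragged_axis(ragged, 3)
    failed = False
except IndexError:
    failed = True
```

In this model the failure for position 3 comes from the data rather than from the reported shape, so the shape alone cannot be used to validate an index. The session does not show what awkward itself does for `t=3`.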
Arithmetic also seems to work:

```python
In [23]: da * 2
Out[23]:
<xarray.DataArray (x: 2, t: 4)>
<Array [[2, 4, 6, 200], [8, 10, 12]] type='2 * var * int64'>
Dimensions without coordinates: x, t
```

But we hit snags with numpy functions:

```python
In [24]: np.mean(da)
TypeError                                 Traceback (most recent call last)
Input In [24], in <cell line: 1>()
----> 1 np.mean(da)

File <__array_function__ internals>:180, in mean(*args, **kwargs)

File ~/miniconda3/envs/hummingbirds/lib/python3.10/site-packages/numpy/core/fromnumeric.py:3430, in mean(a, axis, dtype, out, keepdims, where)
   3428     pass
   3429 else:
-> 3430     return mean(axis=axis, dtype=dtype, out=out, **kwargs)
   3432 return _methods._mean(a, axis=axis, dtype=dtype,
   3433                       out=out, **kwargs)

File ~/Documents/Work/Code/xarray/xarray/core/_reductions.py:1478, in DataArrayReductions.mean(self, dim, skipna, keep_attrs, **kwargs)
   1403 def mean(
   1404     self,
   1405     dim: None | Hashable | Sequence[Hashable] = None,
   (...)
   1409     **kwargs: Any,
   1410 ) -> DataArray:
   1411     """
   1412     Reduce this DataArray's data by applying `mean` along some dimension(s).
   1413
   (...)
   1476     array(nan)
   1477     """
-> 1478     return self.reduce(
   1479         duck_array_ops.mean,
   1480         dim=dim,
   1481         skipna=skipna,
   1482         keep_attrs=keep_attrs,
   1483         **kwargs,
   1484     )

File ~/Documents/Work/Code/xarray/xarray/core/dataarray.py:2930, in DataArray.reduce(self, func, dim, axis, keep_attrs, keepdims, **kwargs)
   2887 def reduce(
   2888     self: T_DataArray,
   2889     func: Callable[..., Any],
   (...)
   2895     **kwargs: Any,
   2896 ) -> T_DataArray:
   2897     """Reduce this array by applying `func` along some dimension(s).
   2898
   2899     Parameters
   (...)
   2927         summarized data and the indicated dimension(s) removed.
   2928     """
-> 2930     var = self.variable.reduce(func, dim, axis, keep_attrs, keepdims, **kwargs)
   2931     return self._replace_maybe_drop_dims(var)

File ~/Documents/Work/Code/xarray/xarray/core/variable.py:1854, in Variable.reduce(self, func, dim, axis, keep_attrs, keepdims, **kwargs)
   1852     data = func(self.data, axis=axis, **kwargs)
   1853 else:
-> 1854     data = func(self.data, **kwargs)
   1856 if getattr(data, "shape", ()) == self.shape:
   1857     dims = self.dims

File ~/Documents/Work/Code/xarray/xarray/core/duck_array_ops.py:579, in mean(array, axis, skipna, **kwargs)
    577     return _to_pytimedelta(mean_timedeltas, unit="us") + offset
    578 else:
--> 579     return _mean(array, axis=axis, skipna=skipna, **kwargs)

File ~/Documents/Work/Code/xarray/xarray/core/duck_array_ops.py:341, in _create_nan_agg_method.<locals>.f(values, axis, skipna, **kwargs)
    339 with warnings.catch_warnings():
    340     warnings.filterwarnings("ignore", "All-NaN slice encountered")
--> 341     return func(values, axis=axis, **kwargs)
    342 except AttributeError:
    343     if not is_duck_dask_array(values):

File <__array_function__ internals>:180, in mean(*args, **kwargs)

File ~/miniconda3/envs/hummingbirds/lib/python3.10/site-packages/awkward/highlevel.py:1434, in Array.__array_function__(self, func, types, args, kwargs)
   1417 def __array_function__(self, func, types, args, kwargs):
   1418     """
   1419     Intercepts attempts to pass this Array to those NumPy functions other
   1420     than universal functions that have an Awkward equivalent.
   (...)
   1432     See also #__array_ufunc__.
   1433     """
-> 1434     return ak._connect._numpy.array_function(func, types, args, kwargs)

File ~/miniconda3/envs/hummingbirds/lib/python3.10/site-packages/awkward/_connect/_numpy.py:43, in array_function(func, types, args, kwargs)
     41     return out
     42 else:
---> 43     return function(*args, **kwargs)

TypeError: mean() got an unexpected keyword argument 'dtype'
```

This seems fixable though.
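The traceback shows NumPy-style keyword arguments such as `dtype` being forwarded to a function that does not declare them. One possible shape for a fix is to filter keywords against the target's signature before dispatching. The sketch below is pure Python and assumes that dropping arguments left at `None` is acceptable. `call_with_supported_kwargs` and `ragged_mean` are hypothetical names, not xarray or awkward API.

```python
import inspect


def call_with_supported_kwargs(func, *args, **kwargs):
    """Call func, dropping None-valued keywords that func does not accept."""
    params = inspect.signature(func).parameters
    accepts_any = any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
    )
    passed = {}
    for name, value in kwargs.items():
        if accepts_any or name in params:
            passed[name] = value
        elif value is not None:
            # An explicitly requested option must not be silently ignored.
            raise TypeError(f"{func.__name__}() does not support {name}={value!r}")
    return func(*args, **passed)


def ragged_mean(rows, axis=None):
    """Stand-in reducer that, like the one in the traceback, has no dtype/out."""
    values = [x for row in rows for x in row]
    return sum(values) / len(values)


result = call_with_supported_kwargs(
    ragged_mean, [[1, 2, 3, 100], [4, 5, 6]], axis=None, dtype=None, out=None
)
```

The design choice here is to drop only keywords that were left at their `None` default and to raise for any unsupported keyword that carries a real value, so a caller who explicitly asks for `dtype="float32"` still gets an error rather than a silently different result.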
In fact I think if we changed https://github.com/pydata/xarray/issues/6845 (@dcherian) then this alternative would already work:

```python
In [25]: import awkward as ak

In [26]: ak.mean(da)
ValueError                                Traceback (most recent call last)
Input In [26], in <cell line: 1>()
----> 1 ak.mean(da)

File ~/miniconda3/envs/hummingbirds/lib/python3.10/site-packages/awkward/operations/reducers.py:971, in mean(x, weight, axis, keepdims, mask_identity)
    969 with np.errstate(invalid="ignore"):
    970     if weight is None:
--> 971         sumw = count(x, axis=axis, keepdims=keepdims, mask_identity=mask_identity)
    972         sumwx = sum(x, axis=axis, keepdims=keepdims, mask_identity=mask_identity)
    973     else:

File ~/miniconda3/envs/hummingbirds/lib/python3.10/site-packages/awkward/operations/reducers.py:79, in count(array, axis, keepdims, mask_identity)
     10 def count(array, axis=None, keepdims=False, mask_identity=False):
     11     """
     12     Args:
     13         array: Data in which to count elements.
   (...)
     77     to turn the None values into something that would be counted.
     78     """
---> 79     layout = ak.operations.convert.to_layout(
     80         array, allow_record=False, allow_other=False
     81     )
     82     if axis is None:
     84         def reduce(xs):

File ~/miniconda3/envs/hummingbirds/lib/python3.10/site-packages/awkward/operations/convert.py:1917, in to_layout(array, allow_record, allow_other, numpytype)
   1914     return from_iter([array], highlevel=False)
   1916 elif isinstance(array, Iterable):
-> 1917     return from_iter(array, highlevel=False)
   1919 elif not allow_other:
   1920     raise TypeError(
   1921         f"{array} cannot be converted into an Awkward Array"
   1922         + ak._util.exception_suffix(__file__)
   1923     )

File ~/miniconda3/envs/hummingbirds/lib/python3.10/site-packages/awkward/operations/convert.py:891, in from_iter(iterable, highlevel, behavior, allow_record, initial, resize)
    889 out = ak.layout.ArrayBuilder(initial=initial, resize=resize)
    890 for x in iterable:
--> 891     out.fromiter(x)
    892 layout = out.snapshot()
    893 return ak._util.maybe_wrap(layout, behavior, highlevel)

ValueError: cannot convert <xarray.DataArray ()>
array(1) (type DataArray) to an array element

(https://github.com/scikit-hep/awkward-1.0/blob/1.8.0/src/python/content.cpp#L974)
```

Suggestion: How about awkward offers a specialized array class which uses the same fast code underneath but disallows Union types, and follows the array API standard, implementing `shape`, `dtype` etc. as described above? That should then "just work" in xarray, in the same way that `sparse` arrays already do.

Am I missing anything here? @jpivarski

tl;dr We probably could support awkward arrays, at least instances where all values have the same dtype.	{ "total_count": 4, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 1, "rocket": 1, "eyes": 2 }	Awkward array backend? 667864088
707321343	https://github.com/pydata/xarray/issues/4285#issuecomment-707321343	https://api.github.com/repos/pydata/xarray/issues/4285	MDEyOklzc3VlQ29tbWVudDcwNzMyMTM0Mw==	jpivarski 1852447	2020-10-12T20:08:32Z	2020-10-12T20:08:32Z	NONE	Copied from https://gitter.im/pangeo-data/Lobby : I've been using Xarray with argopy recently, and the immediate value I see is the documentation of columns, which is semi-lacking in Awkward (one user has been passing this information through an Awkward tree as a scikit-hep/awkward-1.0#422). I should also look into Xarray's indexing, which I've always seen as being the primary difference between NumPy and Pandas; Awkward Array has no indexing, though every node has an optional Identities which would be used to track such information through Awkward manipulations; Identities would have a bijection with externally supplied indexes. They haven't been used for anything yet. Although the elevator pitch for Xarray is "n-dimensional Pandas," it's rather different, isn't it? The contextual metadata is more extensive than anything I've seen in Pandas, and Xarray can be partitioned for out-of-core analysis: Xarray wraps Dask, unlike Dask's array collection, which wraps NumPy. I had trouble getting Pandas to wrap Awkward Array (scikit-hep/awkward-1.0#350), but maybe these won't be issues for Xarray. One last thing (in this very rambly message): the main difficulty I think we would have is that Awkward Arrays don't have shape and dtype, since those define a rectilinear array of numbers. The data model is Datashape plus union types.
There is a sense in which ndim is defined: the number of nested lists before reaching the first record, which may split it into different depths for each field, but even this can be ill-defined with union types:

```python
>>> import awkward1 as ak
>>> array = ak.Array([1, 2, [3, 4, 5], [[6, 7, 8]]])
>>> array
<Array [1, 2, [3, 4, 5], [[6, 7, 8]]] type='4 * union[int64, var * union[int64, ...'>
>>> array.type
4 * union[int64, var * union[int64, var * int64]]
>>> array.ndim
-1
```

So if we wanted to have an Xarray of Awkward Arrays, we'd have to take stock of all the assumptions Xarray makes about the arrays it contains.	{ "total_count": 5, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 5, "rocket": 0, "eyes": 0 }	Awkward array backend? 667864088
667637217	https://github.com/pydata/xarray/issues/4285#issuecomment-667637217	https://api.github.com/repos/pydata/xarray/issues/4285	MDEyOklzc3VlQ29tbWVudDY2NzYzNzIxNw==	crusaderky 6213168	2020-08-02T06:56:23Z	2020-08-02T06:56:23Z	MEMBER	I think that xarray should offer a "compatibility test toolkit" to any numpy-like, NEP18-compatible library that wants to integrate with it. Instead of having a module full of tests specifically for pint, one for sparse, one for cupy, one for awkward, etc., those projects could just write a minimal test module like this:

```python
import xarray
import sparse

xarray.testing.test_nep18_module(
    sparse,
    # TODO: lambda to create an array
    # TODO: list of xfails
)
```

which would automatically expand into a comprehensive suite of tests thanks to pytest parametrize/fixture magic. This would allow developers of numpy-like libraries to test their package against what's expected from a generic NEP-18 compliant package.	{ "total_count": 2, "+1": 2, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Awkward array backend? 667864088
665740365	https://github.com/pydata/xarray/issues/4285#issuecomment-665740365	https://api.github.com/repos/pydata/xarray/issues/4285	MDEyOklzc3VlQ29tbWVudDY2NTc0MDM2NQ==	jpivarski 1852447	2020-07-29T15:40:24Z	2020-07-29T15:40:24Z	NONE	I'm linking myself here, to follow this: @jpivarski.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Awkward array backend? 667864088


CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);
