issue_comments: 1283043390


html_url	issue_url	id	node_id	user	created_at	updated_at	author_association	body	reactions	performed_via_github_app	issue
https://github.com/pydata/xarray/issues/4285#issuecomment-1283043390	https://api.github.com/repos/pydata/xarray/issues/4285	1283043390	IC_kwDOAMm_X85MebA-	1852447	2022-10-18T21:46:29Z	2022-10-18T21:46:29Z	NONE	This sounds good to me! `RaggedArray` can be a well-defined subset of types that have clear use-cases (the ones @TomNicholas listed). The thing I was worried about is that most functions in Awkward Array ignore the boundary line between `RaggedArray` and non-`RaggedArray`; defining it as a new type with its own collection of functions (or just methods) in CloudDrift isolates it so that you can ensure that your functions stay within that boundary.
To represent a `RaggedArray` without wrapping an Awkward Array, you can store it as a sequence of n offsets arrays for depth-n lists of lists and numerical contents. (Whereas an Awkward Array is a tree of 1D buffers, `RaggedArray` can be a sequence of 1D buffers.) We can then make sure that when you do want to convert it to and from Awkward Arrays, you can do so in a zero-copy way. That way, if you want to define some functions, particularly `__getitem__` with complex slices, by calling the corresponding Awkward function, you can do it by converting and then converting back, knowing that the conversion overhead is all O(1). (Same for `xarray.Dataset`!)
I'm in favor of a video call meeting to discuss this. In general, I'm busiest on U.S. mornings, on Wednesday and Thursday, but perhaps you can send a when2meet or equivalent poll?
One thing that could be discussed in writing (maybe more easily) is what data types you would consider in scope for `RaggedArray`. (I've reminded myself of the use-cases above, but it doesn't fully answer this question.) That is,
1. You'll want the numerical data, the end of your sequence of 1D buffers, to be arbitrary NumPy types or some reasonable subset. That's a given.
2. You'll want ragged arrays of those. Ragged arrays can be represented in several ways:
   * an `offsets` buffer whose length is 1 more than the length of the array. Every neighboring pair of integers is the starting and stopping index of the content of a nested list. The integers must be non-decreasing since they are the cumulative sum of list lengths. This is pyarrow's `ListArray` and Awkward's `ListOffsetArray`.
   * `starts` and `stops` buffers of the same length as the length of the array. Every `starts[i]` and `stops[i]` is the starting and stopping index of the content of a nested list. These may be equivalent to an `offsets` buffer or they can be in a random order, not cover all of the content, or cover the content multiple times. The value of such a thing is that reordering, filtering, or duplicating the set of lists is not an operation that needs to propagate through every level of the sequence of buffers, so it's good for intermediate calculations. pyarrow has no equivalent, but it's Awkward's `ListArray`.
   * a `parents` buffer with the same length as the content; each `parents[j]` indicates which list the item `content[j]` belongs to. They may be contiguous or not. This is a Normal Form in database normalization, and it's useful for some operations that propagate upward, such as reducers (sum, max, etc.). Neither pyarrow nor Awkward has this as a native type. It can't encode empty lists at the end of an array, so another integer would be needed to preserve that information.
3. Will you want regular arrays? If some dimensions of the array are variable-length (ragged) and some are fixed-length, that can be accomplished by adding a node without any `offsets`/`starts`/`stops`/`parents` buffer, just an integer `size`. Multiple fixed dimensions could be multiple nested nodes (which is easier) or a tuple of integers `shape`. Is the nested data constrained to be contiguous or would you also have `strides`? (They can't be byte-strides, as in NumPy; they have to count numbers of items.) Can the `size` be zero? If so, you'll need another integer for the length of this array. This is pyarrow's `FixedSizeListArray` and Awkward's `RegularArray` (neither of which has `strides`, and only Awkward allows `size=0`).
4. Will you want to allow for missing data? Only missing numerical values or also missing lists? Some functions naturally return missing data, such as `max` of an `axis` with variable-length lists, some of which can be zero length. There's a variety of ways to represent missing data, though in a system of only nested lists, a bit-mask or byte-mask is probably best. All pyarrow node types are potentially missing, represented by a bit-mask, and Awkward has four node types for missing data, including `BitMaskedArray` and `ByteMaskedArray`.
5. You don't want record-types or union-types, so the only questions are how to implement (2) and whether you want (3) and (4). Including a type, such as missing data, allows for more function return values but obliges you to consider that type for all function arguments. You'll want to choose carefully how you close your system.
(Maybe this block of details can be copied to an issue where you're doing the development of `RaggedArray` in CloudDrift. It got longer than I had intended.)	{ "total_count": 1, "+1": 1, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }		667864088
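The three ragged-array encodings described in the comment (`offsets`, `starts`/`stops`, and `parents`) may be easier to compare side by side. The following is only a minimal sketch, assuming plain Python lists stand in for the 1D buffers; every function name here is hypothetical and is not part of Awkward Array, pyarrow, or CloudDrift.

```python
from itertools import accumulate


def offsets_from_lists(lists):
    """Flatten a list of lists into (offsets, content).

    offsets has len(lists) + 1 entries and is the cumulative sum of
    the list lengths, so it is always non-decreasing.
    """
    offsets = [0] + list(accumulate(len(x) for x in lists))
    content = [item for x in lists for item in x]
    return offsets, content


def starts_stops_from_offsets(offsets):
    """Neighboring pairs of offsets are the (start, stop) of each list."""
    return offsets[:-1], offsets[1:]


def parents_from_offsets(offsets):
    """For each content item j, record the index of the list that owns it."""
    parents = []
    for i in range(len(offsets) - 1):
        parents.extend([i] * (offsets[i + 1] - offsets[i]))
    return parents


def lists_from_starts_stops(starts, stops, content):
    """Rebuild nested lists; starts/stops may be in any order."""
    return [content[a:b] for a, b in zip(starts, stops)]


def lists_from_parents(parents, content, length):
    """Rebuild nested lists; `length` is the extra integer needed to
    recover empty lists at the end of the array."""
    out = [[] for _ in range(length)]
    for i, item in zip(parents, content):
        out[i].append(item)
    return out


ragged = [[1.1, 2.2, 3.3], [], [4.4, 5.5], []]

offsets, content = offsets_from_lists(ragged)
starts, stops = starts_stops_from_offsets(offsets)
parents = parents_from_offsets(offsets)

# Reordering only touches starts/stops; content is left untouched.
reordered = lists_from_starts_stops(
    [starts[2], starts[0]], [stops[2], stops[0]], content
)

# parents alone cannot see the trailing empty list.
inferred_length = max(parents) + 1
```

In this sketch, `reordered` is built without copying or rearranging `content`, which is the property that makes `starts`/`stops` convenient for intermediate calculations. `inferred_length` comes out one short of `len(ragged)`, which is why the `parents` form needs a separate length to round-trip.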