issue_comments: 1283043390

html_url issue_url id node_id user created_at updated_at author_association body reactions performed_via_github_app issue
https://github.com/pydata/xarray/issues/4285#issuecomment-1283043390 https://api.github.com/repos/pydata/xarray/issues/4285 1283043390 IC_kwDOAMm_X85MebA- 1852447 2022-10-18T21:46:29Z 2022-10-18T21:46:29Z NONE

This sounds good to me! RaggedArray can be a well-defined subset of types that have clear use-cases (the ones @TomNicholas listed). The thing I was worried about is that most functions in Awkward Array ignore the boundary line between RaggedArray and non-RaggedArray; defining it as a new type with its own collection of functions (or just methods) in CloudDrift isolates it in a way that you can ensure that your functions stay within that boundary.

To represent a RaggedArray without wrapping an Awkward Array, you can store it as a sequence of n offsets arrays for depth-n lists of lists and numerical contents. (Whereas an Awkward Array is a tree of 1D buffers, RaggedArray can be a sequence of 1D buffers.) We can then make sure that when you do want to convert it to and from Awkward Arrays, you can do so in a zero-copy way. That way, if you want to define some functions, particularly __getitem__ with complex slices, by calling the corresponding Awkward function, you can do it by converting and then converting back, knowing that the conversion overhead is all O(1).
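
As a minimal sketch of the "sequence of 1D buffers" idea, the snippet below rebuilds nested lists from n offsets buffers plus flat numerical content. The function name `to_lists` and the variable names are hypothetical, and plain Python lists stand in for buffers that a real implementation would presumably hold as NumPy arrays. In Awkward terms this layout would correspond to nested ListOffsetArray nodes over a single numerical leaf, which is what makes a zero-copy conversion plausible.

```python
def to_lists(offsets_seq, content):
    """Rebuild nested Python lists from offsets buffers (outermost first) and flat content."""
    result = list(content)
    # Apply the innermost offsets first, then work outward.
    for offsets in reversed(offsets_seq):
        result = [result[start:stop] for start, stop in zip(offsets[:-1], offsets[1:])]
    return result


# Depth-2 example representing [[[1.1, 2.2], []], [[3.3]], []]
outer = [0, 2, 3, 3]       # 3 outer lists holding 2, 1, and 0 inner lists
inner = [0, 2, 2, 3]       # 3 inner lists holding 2, 0, and 1 numbers
content = [1.1, 2.2, 3.3]  # the numerical data at the end of the sequence

nested = to_lists([outer, inner], content)
```

Each offsets buffer is one element longer than the number of lists it describes, so the whole structure is n + 1 flat buffers regardless of how ragged the data is.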

(Same for xarray.Dataset!)

I'm in favor of a video call meeting to discuss this. In general, I'm busiest on U.S. mornings, on Wednesday and Thursday, but perhaps you can send a when2meet or equivalent poll?


One thing that could be discussed in writing (maybe more easily) is what data types you would consider in scope for RaggedArray. (I've reminded myself of the use-cases above, but it doesn't fully answer this question.)

That is,

  1. You'll want the numerical data, the end of your sequence of 1D buffers, to be arbitrary NumPy types or some reasonable subset. That's a given.
  2. You'll want ragged arrays of those. Ragged arrays can be represented in several ways:
    • offsets buffer whose length is 1 more than the length of the array. Every neighboring pair of integers is the starting and stopping index of the content of a nested list. The integers must be non-decreasing since they are the cumulative sum of list lengths. This is pyarrow's ListArray and Awkward's ListOffsetArray.
    • starts and stops buffers of the same length as the length of the array. Every starts[i] and stops[i] is the starting and stopping index of the content of a nested list. These may be equivalent to an offsets buffer or they can be in a random order, not cover all of the content, or cover the content multiple times. The value of such a thing is that reordering, filtering, or duplicating the set of lists is not an operation that needs to propagate through every level of the sequence of buffers, so it's good for intermediate calculations. pyarrow has no equivalent, but it's Awkward's ListArray.
    • a parents buffer with the same length as the content; each parents[j] is the index of the list that content[j] belongs to. They may be contiguous or not. This is a normal form in the sense of database normalization, and it's useful for some operations that propagate upward, such as reducers (sum, max, etc.). Neither pyarrow nor Awkward has this as a native type. It can't encode empty lists at the end of an array, so another integer would be needed to preserve that information.
  3. Will you want regular arrays? If some dimensions of the array are variable-length (ragged) and some are fixed-length, that can be accomplished by adding a node without any offsets/starts/stops/parents buffer, just an integer size. Multiple fixed dimensions could be multiple nested nodes (which is easier) or a tuple of integers shape. Is the nested data constrained to be contiguous or would you also have strides? (They can't be byte-strides, as in NumPy; they have to count numbers of items.) Can the size be zero? If so, you'll need another integer for the length of this array. This is pyarrow's FixedSizeListArray and Awkward's RegularArray (neither of which has strides, and only Awkward allows size=0).
  4. Will you want to allow for missing data? Only missing numerical values or also missing lists? Some functions naturally return missing data, such as max of an axis with variable length lists, some of which can be zero length. There's a variety of ways to represent missing data, though in a system of only nested lists, a bit-mask or byte-mask is probably best. All pyarrow node types are potentially missing, represented by a bit-mask, and Awkward has four node types for missing data, including BitMaskedArray and ByteMaskedArray.
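
The three ragged representations in (2) can be compared side by side with a small sketch. All function names here are hypothetical, and Python lists again stand in for 1D buffers. The example ends in an empty list to show why the parents form needs an extra integer for the array length.

```python
def offsets_to_starts_stops(offsets):
    """An offsets buffer is equivalent to starts = offsets[:-1], stops = offsets[1:]."""
    return offsets[:-1], offsets[1:]


def offsets_to_parents(offsets):
    """Return (parents, length); length is needed to recover trailing empty lists."""
    parents = []
    for i, (start, stop) in enumerate(zip(offsets[:-1], offsets[1:])):
        parents.extend([i] * (stop - start))
    return parents, len(offsets) - 1


def lists_from_starts_stops(starts, stops, content):
    """Materialize lists; starts/stops may be reordered, overlapping, or partial."""
    return [content[a:b] for a, b in zip(starts, stops)]


def lists_from_parents(parents, length, content):
    """Group content items by the list index stored in parents."""
    out = [[] for _ in range(length)]
    for p, x in zip(parents, content):
        out[p].append(x)
    return out


content = [1.1, 2.2, 3.3, 4.4, 5.5]
offsets = [0, 3, 3, 5, 5]  # [[1.1, 2.2, 3.3], [], [4.4, 5.5], []]

starts, stops = offsets_to_starts_stops(offsets)
parents, length = offsets_to_parents(offsets)

# Reordering and filtering only touch starts/stops; content is left alone.
reordered = lists_from_starts_stops([3, 0], [5, 3], content)
```

Without `length`, the parents buffer `[0, 0, 0, 2, 2]` alone would describe only three lists and silently drop the final empty one.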

You don't want record-types or union-types, so the only questions are how to implement (2) and whether you want (3) and (4). Including a type, such as missing data, allows for more function return values but obliges you to consider that type for all function arguments. You'll want to choose carefully how you close your system.
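
The missing-data trade-off shows up as soon as a reducer meets an empty list. The sketch below is a hypothetical `ragged_max` over one level of offsets that returns a values buffer plus a byte-mask. The convention that 1 means valid and the 0.0 placeholder are assumptions made for this illustration only.

```python
def ragged_max(offsets, content):
    """Max of each list; empty lists produce a masked-out placeholder."""
    values, mask = [], []
    for start, stop in zip(offsets[:-1], offsets[1:]):
        if stop > start:
            values.append(max(content[start:stop]))
            mask.append(1)      # valid entry
        else:
            values.append(0.0)  # placeholder, must be ignored by consumers
            mask.append(0)      # missing: max of an empty list is undefined
    return values, mask


content = [1.1, 2.2, 3.3, 4.4, 5.5]
offsets = [0, 3, 3, 5]  # [[1.1, 2.2, 3.3], [], [4.4, 5.5]]

values, mask = ragged_max(offsets, content)
```

Once `ragged_max` can return a mask, every other function in the system has to decide what to do when handed one, which is the cost of including (4).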

(Maybe this block of details can be copied to an issue where you're doing the development of RaggedArray in CloudDrift. It got longer than I had intended.)

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  667864088