awkward-array

Plotting jagged arrays of different lengths


I have a problem plotting a 2D histogram or graph from jagged arrays whose inner lists have different lengths.

Here is a simple example. Suppose there are 7 events, each with a gen-level pT and the corresponding Et values.

pT = [ [46.8], [31.7], [21], [29.9], [13.9], [41.2], [15.7] ]
Et = [ [41.4], [25.5, 20], [19.6], [27.4], [12, 3.47], [37.8], [10] ]

Here, some events (the 2nd and 5th) have two y values corresponding to one x value. I want to make a graph or 2D histogram with x = pT and y = Et, plotting both points for such events, i.e. (31.7, 25.5) and (31.7, 20).

How can I align these values for plotting?


Solution

  • What you want to do is "broadcast" the two arrays:

    Awkward broadcasting is a generalization of NumPy broadcasting to include variable-length lists.

    Broadcasting usually happens automatically when you're performing a mathematical calculation:

    >>> import awkward1 as ak
    >>> ak.Array([[1, 2, 3], [], [4, 5]]) + ak.Array([100, 200, 300])
    <Array [[101, 102, 103], [], [304, 305]] type='3 * var * int64'>
    

    but you can also do it manually:

    >>> ak.broadcast_arrays(ak.Array([[1, 2, 3], [], [4, 5]]),
    ...                     ak.Array([100, 200, 300]))
    [<Array [[1, 2, 3], [], [4, 5]] type='3 * var * int64'>,
     <Array [[100, 100, 100], [], [300, 300]] type='3 * var * int64'>]
    

    When the two arrays have different depths (different "dimensions" in NumPy terminology), scalars from one are replicated to align with all elements of lists in the other.

    You have two lists of the same depth:

    >>> pT = ak.Array([ [46.8], [31.7], [21], [29.9], [13.9], [41.2], [15.7] ])
    >>> Et = ak.Array([ [41.4], [25.5, 20], [19.6], [27.4], [12, 3.47], [37.8], [10] ])
    

    To manually broadcast them, you could reduce the depth of pT by taking the first element from each list.

    >>> pT[:, 0]
    <Array [46.8, 31.7, 21, ... 13.9, 41.2, 15.7] type='7 * float64'>
    

    Then you can broadcast each scalar of pT into each list of Et.

    >>> ak.broadcast_arrays(pT[:, 0], Et)
    [<Array [[46.8], [31.7, 31.7, ... 41.2], [15.7]] type='7 * var * float64'>,
     <Array [[41.4], [25.5, 20], ... [37.8], [10]] type='7 * var * float64'>]
    

    This will be clearer if I print them in their entirety by turning them into Python lists:

    >>> pT_broadcasted, Et = ak.broadcast_arrays(pT[:, 0], Et)
    >>> pT_broadcasted.tolist()
    [[46.8], [31.7, 31.7], [21.0], [29.9], [13.9, 13.9], [41.2], [15.7]]
    >>> Et.tolist()
    [[41.4], [25.5, 20.0], [19.6], [27.4], [12.0, 3.47], [37.8], [10.0]]
    

    Now you see that the 31.7 has been duplicated to align with each value in [25.5, 20.0].
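
    For comparison, the same duplication can be sketched in plain NumPy with np.repeat. This is only an illustration of what the broadcast does, not how Awkward Array implements it; the names pT_first, Et_lists, counts, x and y are made up for this sketch.

```python
import numpy as np

# One pT per event (what pT[:, 0] gives) and the jagged Et values as plain lists.
pT_first = np.array([46.8, 31.7, 21.0, 29.9, 13.9, 41.2, 15.7])
Et_lists = [[41.4], [25.5, 20.0], [19.6], [27.4], [12.0, 3.47], [37.8], [10.0]]

# Number of Et values in each event; this plays the role of ak.num(Et).
counts = np.array([len(values) for values in Et_lists])

# Repeat each pT once per Et value in the same event, and concatenate the Et lists.
x = np.repeat(pT_first, counts)
y = np.concatenate([np.asarray(values, dtype=float) for values in Et_lists])

print(x.tolist())
print(y.tolist())
```

    Here x and y are already flat and have 9 entries each, one per (pT, Et) pair.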

    In NumPy, you'll often see examples of broadcasting a dimension of length 1, rather than creating a dimension, like this:

    >>> import numpy as np
    >>> np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]) + np.array([[100], [200], [300]])
    array([[101, 102, 103],
           [204, 205, 206],
           [307, 308, 309]])
    

    Awkward Array follows this rule, but only if the dimension has length "exactly 1," not "a bunch of variable-length lists that happen to each have length 1." The way I've written pT, it has the latter:

    >>> ak.type(pT)     # 7 lists with variable length
    7 * var * float64
    >>> ak.num(pT)      # they happen to each have length 1... this time...
    <Array [1, 1, 1, 1, 1, 1, 1] type='7 * int64'>
    

    Since these lists are variable in principle, they don't broadcast the way that length-1 NumPy arrays would.

    >>> ak.broadcast_arrays(pT, Et)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/home/jpivarski/irishep/awkward-1.0/awkward1/operations/structure.py", line 699, in broadcast_arrays
        out = awkward1._util.broadcast_and_apply(inputs, getfunction, behavior)
      File "/home/jpivarski/irishep/awkward-1.0/awkward1/_util.py", line 972, in broadcast_and_apply
        out = apply(broadcast_pack(inputs, isscalar), 0)
      File "/home/jpivarski/irishep/awkward-1.0/awkward1/_util.py", line 745, in apply
        outcontent = apply(nextinputs, depth + 1)
      File "/home/jpivarski/irishep/awkward-1.0/awkward1/_util.py", line 786, in apply
        nextinputs.append(x.broadcast_tooffsets64(offsets).content)
    ValueError: in ListOffsetArray64, cannot broadcast nested list
    
    (https://github.com/scikit-hep/awkward-1.0/blob/0.3.2/src/cpu-kernels/operations.cpp#L778)
    

    If you explicitly cast the array as NumPy, it will have regular types. (Note to self: it would be nice to have a way to turn a variable-length dimension regular or vice-versa without converting the whole array to NumPy.)

    >>> ak.type(pT)
    7 * var * float64
    >>> ak.type(ak.to_numpy(pT))
    7 * 1 * float64
    

    So an alternate way to get the same broadcasting is to convert pT to NumPy, instead of picking out the first element of each list with pT[:, 0].

    >>> ak.broadcast_arrays(ak.to_numpy(pT), Et)
    [<Array [[46.8], [31.7, 31.7, ... 41.2], [15.7]] type='7 * var * float64'>,
     <Array [[41.4], [25.5, 20], ... [37.8], [10]] type='7 * var * float64'>]
    

    Either way, an assumption is being made that pT consists of lists of length 1. The pT[:, 0] expression assumes this because it requires something to have index 0 in each list (so the length is at least 1) and it ignores whatever else might be there. The ak.to_numpy(pT) expression will raise an exception if the pT array isn't regular, that is, if its shape can't be expressed in NumPy.

    Now that you have pT_broadcasted and Et aligned with the same structure, you'll have to flatten them both to pass them to a plotting routine (which expects non-jagged data).

    >>> ak.flatten(pT_broadcasted), ak.flatten(Et)
    (<Array [46.8, 31.7, 31.7, ... 13.9, 41.2, 15.7] type='9 * float64'>,
     <Array [41.4, 25.5, 20, ... 3.47, 37.8, 10] type='9 * float64'>)
    

    The plotting routine will probably call np.asarray on each of these, which is equivalent to ak.to_numpy and works because these flattened arrays are regular. If you have doubly jagged data or something more complex, you'd have to flatten more.
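
    As a rough sketch of the last step, the flattened values can be binned with np.histogram2d. The values are written out by hand here so that the example runs with NumPy alone, and the binning (5 x 5 bins from 0 to 50) is an arbitrary choice for illustration.

```python
import numpy as np

# Flattened values from the example above (9 points from 7 events).
x = np.array([46.8, 31.7, 31.7, 21.0, 29.9, 13.9, 13.9, 41.2, 15.7])  # pT
y = np.array([41.4, 25.5, 20.0, 19.6, 27.4, 12.0, 3.47, 37.8, 10.0])  # Et

# Assumed binning: 5 x 5 bins covering 0 to 50 on both axes.
counts, x_edges, y_edges = np.histogram2d(x, y, bins=5, range=[[0, 50], [0, 50]])

# Every point falls inside the chosen range, so the bin contents add up to 9.
print(counts.sum())
```

    The same x and y could instead be handed to a plotting call such as Matplotlib's plt.hist2d(x, y) or plt.scatter(x, y).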