python, numpy, tensorflow, numpy-ndarray, awkward-array

How can I convert the datatype of a numpy array sourced from an awkward array?


I have a numpy array that I converted from an awkward array with the to_numpy() function, and the resulting array has the datatype dtype=[('phi', '<f8'), ('eta', '<f8')]. I want it to be made of plain (float32, float32) tuples instead, because otherwise it does not convert into a TensorFlow tensor.

I tried the usual astype functions, but all I get is errors.

>>> array = ak.Array([{"phi": 1.1, "eta": 2.2}, {"phi": 3.3, "eta": 4.4}])
>>> ak.to_numpy(array)
array([(1.1, 2.2), (3.3, 4.4)], dtype=[('phi', '<f8'), ('eta', '<f8')])
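
Roughly the kind of thing I tried (a sketch; I've left out the exact tracebacks):

>>> import numpy as np
>>> import tensorflow as tf
>>> structured = ak.to_numpy(array)
>>> structured.astype(np.float32)     # errors: NumPy won't cast a multi-field structured dtype to a plain dtype
>>> tf.convert_to_tensor(structured)  # errors: TensorFlow doesn't accept structured (compound) dtypes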

Solution

  • I believe your problem is equivalent to this: you have some Awkward Array with record structure,

    >>> array = ak.Array([{"phi": 1.1, "eta": 2.2}, {"phi": 3.3, "eta": 4.4}])
    

    and when you convert that with ak.to_numpy, it turns the record fields into NumPy structured array fields:

    >>> ak.to_numpy(array)
    array([(1.1, 2.2), (3.3, 4.4)], dtype=[('phi', '<f8'), ('eta', '<f8')])
    

    ML libraries like TensorFlow and PyTorch want the feature vectors not to have named fields, but instead to be 2D arrays in which the second dimension ranges over all of the features. If all of the structured array's field dtypes are identical, as they are all <f8 in this example, you could view it:

    >>> ak.to_numpy(array).view("<f8").reshape(len(array), -1)
    array([[1.1, 2.2],
           [3.3, 4.4]])
    

    But this is unsafe. If, for example, some of your fields are 32-bit and others are 64-bit, or some are integers and others are floating-point, view will just reinterpret the memory, losing the meaning of the numbers:

    >>> bad = np.array([(1, 2, 3.3), (4, 5, 6.6)], dtype=[("x", "<i4"), ("y", "<i4"), ("z", "<f8")])
    >>> bad.view("<f8").reshape(len(bad), -1)
    array([[4.24399158e-314, 3.30000000e+000],
           [1.06099790e-313, 6.60000000e+000]])
    

    (z's 3.3 and 6.6 survive because they were already <f8, but each row's pair of 32-bit integers x and y gets reinterpreted, byte for byte, as a single meaningless 64-bit float.)

    Instead, we should make the structure appropriate in Awkward, which has the tools to do exactly this sort of thing, and afterward convert it to NumPy (and from there to TensorFlow or PyTorch).

    So, we're starting with an array of records with named fields:

    >>> array
    <Array [{phi: 1.1, eta: 2.2}, {...}] type='2 * {phi: float64, eta: float64}'>
    

    We want the named fields to go away, leaving one plain array per field. That's ak.unzip.

    >>> ak.unzip(array)
    (<Array [1.1, 3.3] type='2 * float64'>, <Array [2.2, 4.4] type='2 * float64'>)
    

    (The first in the tuple is from phi, the second is from eta.)
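
    If you want to confirm which tuple element corresponds to which field, ak.fields reports the field order (a small aside; not needed for the conversion itself):

    >>> ak.fields(array)
    ['phi', 'eta']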

    We want to get the values for each field together into the same input vector for the ML model. That is, 1.1 and 2.2 should be in a vector [1.1, 2.2], and 3.3 and 4.4 should be in a vector [3.3, 4.4]. That's a concatenation of the arrays in this tuple, but not an axis=0 concatenation, which would make [1.1, 3.3, 2.2, 4.4]; it has to be a concatenation along the deeper axis=1. That axis doesn't exist yet, but we can always make length-1 axes with np.newaxis.

    >>> ak.unzip(array[:, np.newaxis])
    (<Array [[1.1], [3.3]] type='2 * 1 * float64'>, <Array [[2.2], [4.4]] type='2 * 1 * float64'>)
    

    Now ak.concatenate with axis=1 will concatenate [1.1] and [2.2] into [1.1, 2.2], etc.

    >>> ak.concatenate(ak.unzip(array[:, np.newaxis]), axis=1)
    <Array [[1.1, 2.2], [3.3, 4.4]] type='2 * 2 * float64'>
    

    So in the end, here's a one-liner that you can pass to TensorFlow that will work even if your record fields have different dtypes:

    >>> ak.to_numpy(ak.concatenate(ak.unzip(array[:, np.newaxis]), axis=1))
    array([[1.1, 2.2],
           [3.3, 4.4]])
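
    Since the question asks for float32 specifically, here's a quick sketch of handing that result to TensorFlow (assuming import tensorflow as tf; the cast can happen on the plain 2D NumPy array, or earlier on the Awkward side with ak.values_astype):

    >>> import tensorflow as tf
    >>> features = ak.to_numpy(ak.concatenate(ak.unzip(array[:, np.newaxis]), axis=1))
    >>> tf.convert_to_tensor(features.astype(np.float32))  # tf.Tensor of shape (2, 2), dtype=float32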
    

    Or, actually, maybe you can skip the ak.to_numpy and go straight to ak.to_tensorflow.
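
    For example (a sketch, assuming a recent Awkward 2.x release in which ak.to_tensorflow is available):

    >>> ak.to_tensorflow(ak.concatenate(ak.unzip(array[:, np.newaxis]), axis=1))  # tf.Tensor of shape (2, 2)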