assembly arm intrinsics neon

Transpose 4x4 int32 matrix using NEON


How can I efficiently transpose a matrix represented as four int32x4_t values? I cannot use vld4q_s32 and vst4q_s32.


Solution

  • Transposing is typically implemented in the recursive form:

       [A B]' == [A' C']
       [C D]     [B' D']
    

    where each of A, B, C, D can be thought of as a square matrix (and A' == A for 1x1 matrices).

    Unfortunately, the transpose instruction only helps at the level that transposes the lowest 2x2 matrices in parallel: the ARM64 instruction set lacks an instruction for the highest level, i.e. swapping 64 bits at a time, which ARMv7 did with vswp on d registers (for example vswp d1, d4 when the rows are held in q0..q3).
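
    The recursive identity above can be checked with a scalar model before mapping it onto vector instructions. The following is only a sketch: Mat4, transpose_block and transpose_recursive are hypothetical names chosen for illustration, and the code uses plain arrays rather than NEON registers.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <utility>

using Mat4 = std::array<std::array<int, 4>, 4>;

// Transpose, in place, the n x n block whose top-left corner is (r, c),
// using [A B; C D]' == [A' C'; B' D'].
void transpose_block(Mat4& m, std::size_t r, std::size_t c, std::size_t n) {
    assert(n == 1 || n % 2 == 0);
    if (n == 1) return;                  // A' == A for a 1x1 block
    const std::size_t h = n / 2;
    // Swap the off-diagonal blocks B and C element by element.
    for (std::size_t i = 0; i < h; ++i)
        for (std::size_t j = 0; j < h; ++j)
            std::swap(m[r + i][c + h + j], m[r + h + i][c + j]);
    // Transpose each of the four half-size blocks.
    transpose_block(m, r, c, h);
    transpose_block(m, r, c + h, h);
    transpose_block(m, r + h, c, h);
    transpose_block(m, r + h, c + h, h);
}

// Convenience wrapper for the full 4x4 case.
Mat4 transpose_recursive(Mat4 m) {
    transpose_block(m, 0, 0, 4);
    return m;
}
```

    In this model the top-level swap of B and C corresponds to the 64-bit exchange, and the 2x2 level corresponds to what a 32-bit transpose instruction does for two blocks at once.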

    There is, however, another, iterative formula for the transpose: repeatedly zipping or unzipping the input.

    In Octave/MATLAB notation, define the zip operator as one that takes the first N/2 elements of an array a and interleaves them with the last N/2 elements of the same array (the chained indexing used below is Octave syntax).

      zip = @(a) [a(1:end/2);a(end/2+1:end)](:)';
      zip(0:15) %--> 0  8  1  9  2 10  3 11  4 12  5 13  6 14  7 15
      zip(ans)  %--> 0  4  8 12  1  5  9 13  2  6 10 14  3  7 11 15
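
    For readers without Octave at hand, the same operator can be modelled in scalar C++. This is a sketch only; zip_halves is a hypothetical name, not part of any library.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Scalar model of the zip operator: interleave the first half of `a`
// with its second half.
std::vector<int> zip_halves(const std::vector<int>& a) {
    assert(a.size() % 2 == 0);           // needs an even number of elements
    const std::size_t half = a.size() / 2;
    std::vector<int> out;
    out.reserve(a.size());
    for (std::size_t i = 0; i < half; ++i) {
        out.push_back(a[i]);             // element from the first half
        out.push_back(a[half + i]);      // matching element from the second half
    }
    return out;
}
```

    Applying zip_halves twice to the sequence 0..15 reproduces the two Octave results shown above, the second of which is the row-major 4x4 matrix transposed.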
    

    Thus the complete code using intrinsics could be something like:

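    #include <arm_neon.h>

    // One zip pass over all 16 elements: interleave rows 0-1 with rows 2-3.
    inline int32x4x4_t zip(int32x4x4_t a) {
       return {{ vzip1q_s32(a.val[0], a.val[2]),
                 vzip2q_s32(a.val[0], a.val[2]),
                 vzip1q_s32(a.val[1], a.val[3]),
                 vzip2q_s32(a.val[1], a.val[3]) }};
    }
    
    // Two zip passes transpose the 4x4 matrix.
    int32x4x4_t transpose(int32x4x4_t a) {
        return zip(zip(a));
    }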
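
    A possible way to exercise the double zip from memory to memory is sketched below. It assumes a GCC or Clang style compiler that defines __aarch64__ on AArch64 targets; elsewhere it falls back to a scalar reference so the same function still builds. transpose4x4 is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstdint>
#if defined(__aarch64__)
#include <arm_neon.h>
#endif

// Transpose a row-major 4x4 int32 matrix from `in` into `out`.
// `in` and `out` must not alias.
void transpose4x4(const std::int32_t in[16], std::int32_t out[16]) {
    assert(in != out);
#if defined(__aarch64__)
    // Load the four rows.
    int32x4x4_t a = {{ vld1q_s32(in), vld1q_s32(in + 4),
                       vld1q_s32(in + 8), vld1q_s32(in + 12) }};
    // Two zip passes, each interleaving rows 0-1 with rows 2-3.
    for (int pass = 0; pass < 2; ++pass) {
        int32x4x4_t b = {{ vzip1q_s32(a.val[0], a.val[2]),
                           vzip2q_s32(a.val[0], a.val[2]),
                           vzip1q_s32(a.val[1], a.val[3]),
                           vzip2q_s32(a.val[1], a.val[3]) }};
        a = b;
    }
    // Store the four transposed rows.
    for (int i = 0; i < 4; ++i)
        vst1q_s32(out + 4 * i, a.val[i]);
#else
    // Scalar reference for builds without AArch64 NEON.
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            out[4 * j + i] = in[4 * i + j];
#endif
}
```

    Comparing the NEON path against the scalar fallback on the same input is a cheap way to validate the zip ordering.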
    
    

    On some architectures one can also use table lookups (vtbl), provided there are 8 extra registers to spare (4 for the indices and 4 for the output):

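    // idx0..idx3 are uint8x16_t byte-index vectors, assumed to be set up
    // beforehand (their contents are described below).
    uint8x16x4_t transpose_vtbl(uint8x16x4_t a) {
       return {{
          vqtbl4q_u8(a, idx0),
          vqtbl4q_u8(a, idx1),
          vqtbl4q_u8(a, idx2),
          vqtbl4q_u8(a, idx3)}};
    }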
    

    Here the int32x4x4_t input needs to be cast element-wise with vreinterpretq_u8_s32(input.val[i]), and the result cast back with vreinterpretq_s32_u8. idx0 holds the byte indices 0 1 2 3 16 17 18 19 32 33 34 35 48 49 50 51, idx1 == idx0 + 4, and so on.
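
    The index pattern can be generated and checked with a scalar model of the 64-byte table lookup. This is a sketch under stated assumptions: make_index, tbl64 and transpose_tbl_model are hypothetical names, and tbl64 only imitates the lookup semantics of vqtbl4q_u8 (out-of-range indices yield 0) on plain byte arrays.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

using Bytes16 = std::array<std::uint8_t, 16>;
using Bytes64 = std::array<std::uint8_t, 64>;

// Index vector for output row k: the 4 bytes of element k from each input row.
// k = 0 gives 0 1 2 3 16 17 18 19 32 33 34 35 48 49 50 51; each further k adds 4.
Bytes16 make_index(unsigned k) {
    assert(k < 4);
    Bytes16 idx{};
    for (unsigned row = 0; row < 4; ++row)
        for (unsigned b = 0; b < 4; ++b)
            idx[4 * row + b] = static_cast<std::uint8_t>(16 * row + 4 * k + b);
    return idx;
}

// Scalar model of a 64-byte table lookup: out[i] = table[idx[i]],
// or 0 when idx[i] is out of range.
Bytes16 tbl64(const Bytes64& table, const Bytes16& idx) {
    Bytes16 out{};
    for (std::size_t i = 0; i < 16; ++i)
        out[i] = idx[i] < 64 ? table[idx[i]] : 0;
    return out;
}

// Transpose a row-major 4x4 int32 matrix with four table lookups.
std::array<std::int32_t, 16> transpose_tbl_model(const std::array<std::int32_t, 16>& m) {
    Bytes64 table;
    std::memcpy(table.data(), m.data(), 64);
    std::array<std::int32_t, 16> out;
    for (unsigned k = 0; k < 4; ++k) {
        const Bytes16 row = tbl64(table, make_index(k));
        std::memcpy(out.data() + 4 * k, row.data(), 16);
    }
    return out;
}
```

    Because each 32-bit element is moved as a unit of 4 consecutive bytes, the index pattern does not depend on the byte order within an element.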

    On M1 or M2 the table lookups seem to execute with instruction-level parallelism, unlike on e.g. a moderately recent Cortex-A75 (v8.2), where vtbl takes 3N+1 clock cycles with no dual-issue capability.