How can I efficiently transpose a matrix represented as four int32x4_t values? I cannot use vld4q_s32 and vst4q_s32.
Transposing is typically implemented in the recursive form:

[A B]'    [A' C']
[C D]  == [B' D']

where each of A, B, C, D can be thought of as a square matrix (and A' == A for 1x1 matrices).
Using the transpose instructions (trn1/trn2) will unfortunately only help you at the level that transposes the lowest 2x2 matrices in parallel -- the ARM64 instruction set lacks an instruction to transpose at the highest level, i.e. 64 bits at a time, which on ARMv7 was done with vswp d2, d0.
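For illustration, here is a minimal sketch (not part of the original answer, names are mine) of what trn1/trn2 buy you on 32-bit lanes -- the 2x2 sub-blocks get transposed, but the 64-bit halves still sit in the wrong registers:

#include <arm_neon.h>

// trn1/trn2 transpose the 2x2 sub-blocks of two rows in parallel;
// the outer 64-bit swap still has to be done by other means.
static inline void trn_2x2(int32x4_t r0, int32x4_t r1,
                           int32x4_t *even, int32x4_t *odd) {
    *even = vtrn1q_s32(r0, r1);  // r0[0] r1[0] r0[2] r1[2]
    *odd  = vtrn2q_s32(r0, r1);  // r0[1] r1[1] r0[3] r1[3]
}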
There is, however, another, iterative formulation of the transpose: repeatedly zipping (or unzipping) the input.
In Octave/Matlab notation, define the zip operator as one that takes the first N/2 elements of an array a and interleaves them with the last N/2 elements:
zip = @(a) [a(1:end/2);a(end/2+1:end)](:)';
zip(0:15) %--> 0 8 1 9 2 10 3 11 4 12 5 13 6 14 7 15
zip(ans) %--> 0 4 8 12 1 5 9 13 2 6 10 14 3 7 11 15
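The same permutation written as plain scalar C (a reference sketch of my own, useful for validating the vectorised version below):

#include <stdint.h>

// One zip step: interleave the first half of a 16-element array with the second half.
static void zip16(const int32_t *in, int32_t *out) {
    for (int i = 0; i < 8; ++i) {
        out[2 * i]     = in[i];      // from the first half
        out[2 * i + 1] = in[i + 8];  // from the second half
    }
}
// Applying zip16 twice to {0,1,...,15} yields
// {0,4,8,12, 1,5,9,13, 2,6,10,14, 3,7,11,15}, i.e. the transposed order.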
Thus the complete code using intrinsics could be something like:

#include <arm_neon.h>

// One zip step: interleave the first 8 elements (val[0], val[1])
// with the last 8 elements (val[2], val[3]).
inline int32x4x4_t zip(int32x4x4_t a) {
    return { vzip1q_s32(a.val[0], a.val[2]),
             vzip2q_s32(a.val[0], a.val[2]),
             vzip1q_s32(a.val[1], a.val[3]),
             vzip2q_s32(a.val[1], a.val[3]) };
}

// Two zip steps give the full 4x4 transpose, as in the Octave example above.
int32x4x4_t transpose(int32x4x4_t a) {
    return zip(zip(a));
}
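As a usage example (my addition, assuming the transpose() above), a row-major 4x4 int32 matrix in memory can be transposed with plain vld1q_s32/vst1q_s32 loads and stores, i.e. without the de-interleaving vld4q/vst4q:

// Transpose a row-major 4x4 int32 matrix in place.
void transpose4x4(int32_t m[16]) {
    int32x4x4_t a = { vld1q_s32(m + 0),  vld1q_s32(m + 4),
                      vld1q_s32(m + 8),  vld1q_s32(m + 12) };
    int32x4x4_t t = transpose(a);
    vst1q_s32(m + 0,  t.val[0]);
    vst1q_s32(m + 4,  t.val[1]);
    vst1q_s32(m + 8,  t.val[2]);
    vst1q_s32(m + 12, t.val[3]);
}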
On some architectures one can also utilise vtbl, if one has 8 extra registers to spare (4 for the indices and 4 for the output):
uint8x16x4_t transpose_vtbl(uint8x16x4_t a) {
    return { vqtbl4q_u8(a, idx0),
             vqtbl4q_u8(a, idx1),
             vqtbl4q_u8(a, idx2),
             vqtbl4q_u8(a, idx3) };
}
Here the int32x4x4_t input needs to be reinterpreted element-wise with vreinterpretq_u8_s32(input.val[i]) (and cast back afterwards), with idx0 holding the values 0 1 2 3 16 17 18 19 32 33 34 35 48 49 50 51, idx1 == idx0 + 4, and so on.
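Spelled out, a possible wiring of that description (my own untested sketch, not code from the original answer) could look like this:

// TBL-based 4x4 transpose: each output row gathers one column's bytes
// from the 64-byte table formed by the four input rows.
static inline int32x4x4_t transpose_tbl(int32x4x4_t in) {
    static const uint8_t i0[16] = {  0, 1, 2, 3, 16,17,18,19, 32,33,34,35, 48,49,50,51 };
    static const uint8_t i1[16] = {  4, 5, 6, 7, 20,21,22,23, 36,37,38,39, 52,53,54,55 };
    static const uint8_t i2[16] = {  8, 9,10,11, 24,25,26,27, 40,41,42,43, 56,57,58,59 };
    static const uint8_t i3[16] = { 12,13,14,15, 28,29,30,31, 44,45,46,47, 60,61,62,63 };
    uint8x16x4_t t = { vreinterpretq_u8_s32(in.val[0]),
                       vreinterpretq_u8_s32(in.val[1]),
                       vreinterpretq_u8_s32(in.val[2]),
                       vreinterpretq_u8_s32(in.val[3]) };
    int32x4x4_t out;
    out.val[0] = vreinterpretq_s32_u8(vqtbl4q_u8(t, vld1q_u8(i0)));
    out.val[1] = vreinterpretq_s32_u8(vqtbl4q_u8(t, vld1q_u8(i1)));
    out.val[2] = vreinterpretq_s32_u8(vqtbl4q_u8(t, vld1q_u8(i2)));
    out.val[3] = vreinterpretq_s32_u8(vqtbl4q_u8(t, vld1q_u8(i3)));
    return out;
}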
On the Apple M1 or M2, TBL seems to benefit from instruction-level parallelism, contrary to e.g. a moderately recent Cortex-A75 (ARMv8.2), where an N-register TBL takes 3N+1 clock cycles with no dual-issue capability.