assembly · inline-assembly · arm64 · intrinsics · neon

Using Horizontal Neon intrinsics efficiently


Reading the ARM Instruction Set Reference, the operations that perform a horizontal reduction do keep the destination value in a NEON register.

However, both the intrinsics definition and the clang implementation cast the return value to a scalar type:

__ai uint32_t vaddvq_u32(uint32x4_t __p0) {
  uint32_t __ret;
  __ret = (uint32_t) __builtin_neon_vaddvq_u32(__p0);
  return __ret;
}

To me this seems to lose some valuable information: both the implementation and the reference guide leave it only implicit that all the other bits of the destination register are zeroed. So, in order to do

uint16x4_t a(uint8x8_t b) {
    return vdup_n_u16(vaddv_u8(b));
}

I would expect to get the assembly

   addv    b0, v0.8b
   dup     v0.4h, v0.h[0]

instead of

    addv    b0, v0.8b
    fmov    w8, s0
    dup     v0.4h, w8

This is likely a missed optimisation, but to me it also seems to be a design error. The question, then, is whether there is a way to circumvent this cast-to-scalar behaviour, or to implement the operation in inline assembly. What I've tried is

asm( " addv    %0.h, %0.8h " : "+w"(phase4));

but that is "obviously" wrong, as the destination type is not "w" making an invalid substitution addv v30.h, v30.8h, which refuses to compile. So at least I'm missing the register modifier for the 16-bit first element of a vector.


Solution

  • For the inline assembly approach, there are template modifiers to output the b/h/s/d/q name of a v register. They are documented for armclang, but they are also supported by mainline clang and by gcc (though, awkwardly, gcc doesn't document them and doesn't seem interested in doing so).

    So you can do

    asm( " addv    %h0, %0.8h " : "+w"(phase4));
    

    which ought to emit addv h30, v30.8h. (A complete function built around the same modifier is sketched at the end of this answer.)

    I don't know how to get the compiler to emit this by itself. I agree that it is a missed optimization, and a rather unfortunate one, since on many machines transfers between the general and fp/simd registers are expensive. On the Cortex-A72, fmov Wn, Sm is 5 cycles of latency, and dup Vn.xx, Wm is 8 cycles. On the other hand, dup Vn.xx, Vm.y[i] is only 3 cycles. So this missed optimization costs us an unnecessary 10 cycles of latency.

    Incidentally, gcc had the same missed optimization through 11.x - even worse because it threw in an extra unnecessary and w0, w0, #255. But in 12.x and later it optimizes it as we wish, keeping the value in the vector registers.
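For reference, here is a sketch (my own untested adaptation, not part of the original answer) of the question's function a() rewritten with this workaround, using the %b template modifier for the byte-sized name of the register:

    #include <arm_neon.h>

    /* Sketch: a() with inline asm, so the reduction result never leaves the
       vector registers.  ADDV writes the byte sum to lane 0 and zeroes the
       rest of the register, so lane h[0] already holds the sum zero-extended
       to 16 bits. */
    uint16x4_t a(uint8x8_t b) {
        asm("addv %b0, %0.8b" : "+w"(b));                /* addv b0, v0.8b      */
        return vdup_lane_u16(vreinterpret_u16_u8(b), 0); /* dup  v0.4h, v0.h[0] */
    }

At -O2 this ought to come out as the two-instruction sequence the question hoped for, though the asm statement is of course opaque to the optimizer.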