I use the ARM NEON intrinsic vcombine_f32, which has the equivalent instruction DUP Vd.1D,Vn.D[0].
For the purpose of this question, I call such an equivalent instruction a primary instruction (primary for this particular intrinsic). I acknowledge that there may be a better term.
The assembly generated by GCC 9.4.0 contains DUP, which is expected.
The assembly generated by GCC 14.2.0 contains no DUP, which may be unexpected (at first glance).
How can I prevent GCC from generating non-primary instructions for ARM NEON intrinsics?
For example:
x =
__attribute__((do_not_generate_non_primary_instructions_for_intrinsics))
vcombine_f32(a, b); // DUP shall be generated
UPD. Generation of DUP for vcombine_f32 should not be expected, because the "Neon programmers' guide" entry for VCOMBINE has a "Related Instruction" section, which says:
The intrinsic does not generate an instruction.
I don't think the "primary" instruction is DUP. I'd suggest it could be INS (or mov) instead, because DUP broadcasts a single element, whereas mov v0.d[1], v1.d[0] would preserve the d0 part of the q0 register.
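To make the difference concrete, here is a rough scalar model in plain C (not NEON code). It assumes the broadcasting form of DUP meant above is DUP Vd.2D, Vn.D[0], and it treats a 128-bit register as two 64-bit lanes. The type vreg128 and the model_* functions are hypothetical names used only for illustration.

```c
#include <stdint.h>

/* Scalar model of a 128-bit vector register as two 64-bit lanes:
   d[0] is the low half, d[1] is the high half. */
typedef struct { uint64_t d[2]; } vreg128;

/* Models DUP Vd.2D, Vn.D[0]: lane 0 of the source is copied to BOTH lanes. */
static vreg128 model_dup_2d(vreg128 vn) {
    vreg128 vd = { { vn.d[0], vn.d[0] } };
    return vd;
}

/* Models INS Vd.D[1], Vn.D[0]: only the high lane of vd is overwritten,
   the low lane keeps its previous value. */
static vreg128 model_ins_high(vreg128 vd, vreg128 vn) {
    vd.d[1] = vn.d[0];
    return vd;
}

/* Models the result vcombine is defined to produce:
   low half = a, high half = b. */
static vreg128 model_vcombine(uint64_t a, uint64_t b) {
    vreg128 r = { { a, b } };
    return r;
}
```

With a in the low lane of one register and b in the low lane of another, model_ins_high gives the same lanes as model_vcombine(a, b), while model_dup_2d gives (a, a), which is not a combine.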
Some of these intrinsics are likely modelled merely as generic permutations, which is coupled with the register allocator trying (but often failing) to produce the most efficient sequence of instructions.
You can still surround your code with optimisation barriers such as asm("" : "+w"(my_variable));
which prevent the statements/expressions from propagating into further optimisations. It's not a silver bullet, though (especially against LLVM's shuffle optimizer; GCC often stays closer to the literal intrinsic), and it won't even prevent reordering of the instructions.
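#include <arm_neon.h>

uint8x16_t combine_forced(uint8x8_t a, uint8x8_t b) {
    uint8x16_t c = vcombine_u8(a, b);
    // Empty asm with a read-write SIMD register operand: c must be
    // materialised in a vector register here, and the optimizer cannot see through it.
    asm("" : "+w"(c));
    return c;
}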
GCC64: ins v0.d[1], v1.d[0]
clang: mov v0.d[1], v1.d[0]
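If one exact instruction matters more than letting the compiler choose, another option is to spell that instruction out in the asm statement itself. This is a different technique from the empty barrier above. The following is only a sketch: it assumes AArch64 with little-endian lane order and GCC or Clang extended-asm syntax. combine_with_ins and combine_bytes are hypothetical names, and the non-AArch64 branch exists only so the sketch builds on other targets.

```c
#include <stdint.h>
#include <string.h>

#if defined(__aarch64__)
#include <arm_neon.h>

/* Hypothetical helper: build the 128-bit result with an explicit INS
   instead of leaving the instruction choice to the compiler. */
static inline uint8x16_t combine_with_ins(uint8x8_t lo, uint8x8_t hi) {
    /* Low half = lo; the high half is a placeholder that INS overwrites. */
    uint8x16_t out = vcombine_u8(lo, vdup_n_u8(0));
    /* "+w": out is read and written in a SIMD register; "w": hi is an input. */
    __asm__("ins %0.d[1], %1.d[0]" : "+w"(out) : "w"(hi));
    return out;
}
#endif

/* Hypothetical wrapper working on plain byte arrays. */
static void combine_bytes(const uint8_t lo[8], const uint8_t hi[8],
                          uint8_t out[16]) {
#if defined(__aarch64__)
    vst1q_u8(out, combine_with_ins(vld1_u8(lo), vld1_u8(hi)));
#else
    /* Fallback so the sketch still builds off-target; no NEON involved. */
    memcpy(out, lo, 8);
    memcpy(out + 8, hi, 8);
#endif
}
```

Only the INS itself is pinned this way. The compiler still decides how out is initialised and where the asm statement is scheduled.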