When doing integer operations in AArch64/ARM64, is there a performance difference when using 32bit W{n} registers versus 64bit X{n} registers?
For example, is `add W1, W2, W3` any faster than `add X1, X2, X3`? Is `sdiv W1, W2, W3` faster than `sdiv X1, X2, X3`? Could it be different depending on implementation (like Apple M1/M2/M3 vs. a 64bit Qualcomm Snapdragon)?
My intuition is there is a minor performance advantage when using W{n}, but I'm not sure whether it actually matters except in tight loops. I'm interested in official ARM documentation talking about this, if there is one. In assembly code I'm currently writing, I'm using mostly X{n} for consistency, but am wondering whether it's worth switching to W{n} when I know/expect the data to fit into 32 bits.
The links provided by @PeterCordes and comment by @NateEldredge have sent me down some interesting rabbit holes.
tl;dr: For arithmetic instructions like `ADD`, `SUB`, `LSL` and so on, there is no performance difference when using W{n} vs. X{n}. However, there is a slight W{n} advantage when doing `udiv`/`sdiv`. Depending on implementation (Cortex vs. M1), the addressing mode used with `ldp` and `stp` can yield a tiny difference.
Sources:
I suspect this probably applies to most AArch64 implementations.
- `udiv`, `sdiv`: Exec latency of 5 to 12 for W-form, 5 to 20 for X-form.
- No difference for multiply-accumulate (`madd`, `msub`) and thus no difference for "pure" multiply.
- `ldp`, `ldnp` with signed immediate offset: Throughput of 2 for W-form, but just 1 for X-form.
- `stp`, `stnp` with signed immediate offset: Throughput of 2 for W-form, but just 1 for X-form.
- `ldp W0, W1, [SP, #-16]` (no exclamation mark) has a penalty, but `ldp W0, W1, [SP, #-16]!` and `ldp W0, W1, [SP], #16` do not!

The implementation of AArch64 by Apple significantly differs from Cortex versions. Here's what I was able to find out:
- `ldnp` and `stnp` have no difference when using W-form or X-form.
- `ldp` and `stp` are very slightly faster in W-form for pre- and post-index, but for the signed-offset case they're the same speed.
- `ldr` and `str` are also very slightly faster in W-form. The difference seems to be even smaller than with `ldp`/`stp`.
- `udiv`, `sdiv`: Exec latency of 7 to 8 for W-form, 7 to 9 for X-form.
- `mov W0, W1` must be executed, whereas `mov X0, X1` is just a register rename internally.
- `mov` from/to SP is slightly slower with W{n} registers.

As far as I can tell, one of the few cases one might care about W-form vs. X-form is when doing lots of `udiv`/`sdiv` on Cortex. On M1, the difference is tiny. Overall, the differences are small when they do exist and I suspect simply don't matter much in real-life code.
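If you want to check the division numbers on your own hardware, a dependent chain of divides is one way to expose latency. The following is only a rough sketch in C, not a rigorous benchmark: the function names (`div_chain32`, `div_chain64`, `run_div_benchmark`) are made up for illustration, and it assumes the compiler emits W-form `udiv` for `uint32_t` and X-form `udiv` for `uint64_t`, which is worth confirming in the disassembly. Divide latency depends on operand values, so the top bit is forced on to keep the dividend large.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Dependent chain of 32-bit unsigned divides: each quotient feeds the next
   dividend, so the loop is limited by divide latency, not throughput.
   d must be nonzero. */
uint32_t div_chain32(uint32_t x, uint32_t d, long iters)
{
    for (long i = 0; i < iters; i++)
        x = (x / d) | UINT32_C(0x80000000);  /* keep the dividend large */
    return x;
}

/* Same chain with 64-bit operands. */
uint64_t div_chain64(uint64_t x, uint64_t d, long iters)
{
    for (long i = 0; i < iters; i++)
        x = (x / d) | UINT64_C(0x8000000000000000);
    return x;
}

/* Times both chains and prints the average time per divide. The divisors are
   read from volatile variables so the compiler cannot turn the division into
   a multiplication by a known constant. */
void run_div_benchmark(void)
{
    volatile uint32_t d32 = 3;
    volatile uint64_t d64 = 3;
    const long iters = 100000000L;

    clock_t t0 = clock();
    uint32_t r32 = div_chain32(UINT32_MAX, d32, iters);
    clock_t t1 = clock();
    uint64_t r64 = div_chain64(UINT64_MAX, d64, iters);
    clock_t t2 = clock();

    double ns32 = (double)(t1 - t0) / CLOCKS_PER_SEC * 1e9 / (double)iters;
    double ns64 = (double)(t2 - t1) / CLOCKS_PER_SEC * 1e9 / (double)iters;

    /* Printing the results keeps the chains from being optimized away. */
    printf("32-bit: %.2f ns/div (result %" PRIu32 ")\n", ns32, r32);
    printf("64-bit: %.2f ns/div (result %" PRIu64 ")\n", ns64, r64);
}
```

Call `run_div_benchmark()` from `main` in an optimized build. The OR and the loop overhead add a little to each iteration, so treat the two numbers as a relative comparison rather than exact latencies.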
The other scenario that occurs to me where the difference might be important is implementing a `memcpy` with unrolled `ldp`/`stp`: on Cortex, doing signed-offset and just one pre- or post-index call could be slightly faster, while on M1 it's probably better to use pre- or post-index calls for all `ldp`/`stp`.
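To make the Cortex-style variant concrete, here is a minimal sketch of a 64-byte unrolled copy loop that uses signed-offset `ldp`/`stp` for three of the four pairs and a single post-index access to advance each pointer. It is written as C with GCC/Clang extended inline assembly; the function name `copy_blocks64`, the register choices, and the restriction to sizes that are a multiple of 64 are all assumptions of this sketch, and I have not benchmarked it. On non-AArch64 targets it simply falls back to `memcpy`.

```c
#include <stddef.h>
#include <string.h>

/* Copies n bytes from src to dst in 64-byte blocks.
   Assumptions of this sketch: the buffers do not overlap and n is a
   multiple of 64 (any other nonzero n would run past the buffers). */
void copy_blocks64(void *dst, const void *src, size_t n)
{
    if (n == 0)
        return;
#if defined(__aarch64__)
    unsigned char *d = dst;
    const unsigned char *s = src;
    __asm__ volatile(
        "1:\n\t"
        /* Three signed-offset loads; the base register is not modified. */
        "ldp x10, x11, [%[s], #16]\n\t"
        "ldp x12, x13, [%[s], #32]\n\t"
        "ldp x14, x15, [%[s], #48]\n\t"
        /* One post-index load: reads from [s], then s += 64. */
        "ldp x8, x9, [%[s]], #64\n\t"
        /* Same pattern for the stores. */
        "stp x10, x11, [%[d], #16]\n\t"
        "stp x12, x13, [%[d], #32]\n\t"
        "stp x14, x15, [%[d], #48]\n\t"
        "stp x8, x9, [%[d]], #64\n\t"
        /* Count down the remaining bytes and loop until zero. */
        "subs %[n], %[n], #64\n\t"
        "b.ne 1b\n\t"
        : [d] "+r"(d), [s] "+r"(s), [n] "+r"(n)
        :
        : "x8", "x9", "x10", "x11", "x12", "x13", "x14", "x15", "cc", "memory");
#else
    /* Portable fallback so the sketch builds on other architectures. */
    memcpy(dst, src, n);
#endif
}
```

For the M1-style variant, each `ldp`/`stp` would instead use post-index with `#16`, so every access advances its own pointer. Whether either form is measurably faster would need to be timed on the target core.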