Tags: assembly, cpu-architecture, apple-m1, arm64, cpu-registers

Performance advantage of 32-bit registers in AArch64?


When doing integer operations in AArch64/ARM64, is there a performance difference between using the 32-bit W{n} registers and the 64-bit X{n} registers?

For example, is add W1, W2, W3 any faster than add X1, X2, X3? Is sdiv W1, W2, W3 faster than sdiv X1, X2, X3? Could the answer differ between implementations (say, Apple M1/M2/M3 vs. a 64-bit Qualcomm Snapdragon)?

My intuition is that W{n} has a minor performance advantage, but I'm not sure whether it actually matters outside tight loops. I'm interested in official ARM documentation on this, if any exists. In the assembly code I'm currently writing, I mostly use X{n} for consistency, but I'm wondering whether it's worth switching to W{n} when I know or expect the data to fit into 32 bits.


Solution

  • The links provided by @PeterCordes and comment by @NateEldredge have sent me down some interesting rabbit holes.

    tl;dr: For arithmetic and shift instructions such as ADD, SUB, and LSL, there is no performance difference between W{n} and X{n}. However, W{n} has a slight advantage for udiv/sdiv. Depending on the implementation (Cortex vs. M1), the addressing form used with ldp and stp (signed offset vs. pre- or post-index) can make a tiny difference.

    Sources:


    Cortex-A77

    I suspect this applies to most AArch64 implementations.


    Apple M1

    Apple's implementation of AArch64 differs significantly from the Cortex designs. Here's what I was able to find out:


    Conclusion

    As far as I can tell, one of the few cases where W-form vs. X-form might matter is code that does lots of udiv/sdiv on Cortex. On M1, the difference is tiny. Overall, the differences are small where they exist at all, and I suspect they simply don't matter much in real-life code.
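A rough way to check the udiv/sdiv point on a particular core is to time a dependent chain of divisions at both widths. The following is a minimal C sketch, not a rigorous benchmark. The function names (`chain_div32`, `chain_div64`, `report_div_timings`) and the constants are made up for illustration, and it assumes the compiler emits the W-form `udiv` for `uint32_t` and the X-form for `uint64_t`, which is worth confirming in the disassembly.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Dependent chain of 32-bit unsigned divisions: each quotient feeds the
   next dividend, so the loop is bound by divide latency.
   For d >= 2 the running value never exceeds 32 bits, so the 32-bit and
   64-bit chains produce identical results. */
uint32_t chain_div32(uint32_t x, uint32_t d, uint32_t iters)
{
    assert(d != 0);
    for (uint32_t i = 0; i < iters; i++) {
        x = x / d + 1000000000u;
    }
    return x;
}

/* Same chain with 64-bit operands holding the same (32-bit-sized) values,
   so only the operand width differs, not the data. */
uint64_t chain_div64(uint64_t x, uint64_t d, uint32_t iters)
{
    assert(d != 0);
    for (uint32_t i = 0; i < iters; i++) {
        x = x / d + 1000000000u;
    }
    return x;
}

/* Hypothetical driver: prints CPU time for both chains. */
void report_div_timings(uint32_t iters)
{
    /* Reading through a volatile keeps the divisor a runtime value, so the
       compiler cannot replace the division with a multiply by a constant. */
    volatile uint32_t divisor = 7;
    uint32_t d = divisor;

    clock_t t0 = clock();
    uint32_t r32 = chain_div32(100u, d, iters);
    clock_t t1 = clock();
    uint64_t r64 = chain_div64(100u, d, iters);
    clock_t t2 = clock();

    printf("32-bit: %.3f s (result %" PRIu32 ")\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC, r32);
    printf("64-bit: %.3f s (result %" PRIu64 ")\n",
           (double)(t2 - t1) / CLOCKS_PER_SEC, r64);
}
```

Calling `report_div_timings` from `main` with a large iteration count and comparing the two times gives a first impression only. Divide latency on many cores depends on the operand values, so the numbers say little about other data.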

    The other scenario where the difference might matter is a memcpy implemented with unrolled ldp/stp: on Cortex, using signed-offset addressing with just one pre- or post-index update could be slightly faster, while on M1 it's probably better to use the pre- or post-index form for every ldp/stp.
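The two loop shapes can be sketched in C as below. This is only an illustration of the structure under stated assumptions: the function names are hypothetical, the `ldp`/`stp` forms in the comments are what each shape corresponds to conceptually, and a compiler is free to emit the same instructions for both. To pin down the addressing form for real, the loop would have to be written in assembly.

```c
#include <stddef.h>
#include <stdint.h>

/* Variant A: fixed offsets from the current base, then one pointer update
   per iteration (the "signed offset plus a single writeback" shape). */
void copy_words_offsets(uint64_t *dst, const uint64_t *src, size_t n)
{
    while (n >= 4) {
        uint64_t a = src[0], b = src[1];   /* like: ldp a, b, [src]      */
        uint64_t c = src[2], d = src[3];   /* like: ldp c, d, [src, #16] */
        dst[0] = a; dst[1] = b;            /* like: stp a, b, [dst]      */
        dst[2] = c; dst[3] = d;            /* like: stp c, d, [dst, #16] */
        src += 4;
        dst += 4;
        n -= 4;
    }
    while (n > 0) {                        /* leftover words */
        *dst++ = *src++;
        n--;
    }
}

/* Variant B: the pointer advances after every pair (the "post-index on
   every ldp/stp" shape). */
void copy_words_postindex(uint64_t *dst, const uint64_t *src, size_t n)
{
    while (n >= 4) {
        uint64_t a = src[0], b = src[1];   /* like: ldp a, b, [src], #16 */
        src += 2;
        uint64_t c = src[0], d = src[1];   /* like: ldp c, d, [src], #16 */
        src += 2;
        dst[0] = a; dst[1] = b;            /* like: stp a, b, [dst], #16 */
        dst += 2;
        dst[0] = c; dst[1] = d;            /* like: stp c, d, [dst], #16 */
        dst += 2;
        n -= 4;
    }
    while (n > 0) {                        /* leftover words */
        *dst++ = *src++;
        n--;
    }
}
```

Both functions copy `n` 64-bit words and assume the buffers do not overlap. Comparing their disassembly at the optimization level in use shows whether the compiler kept the two shapes distinct.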