Tags: assembly, x86, cpu-registers, micro-optimization, mmx

MMX Register Speed vs Stack for Unsigned Integer Storage


I am contemplating an implementation of SHA3 in pure assembly. SHA3 has an internal (Keccak) state of 25 64-bit unsigned integers, but because of the transformations it uses, the best case would be having 44 such integers available in registers, plus possibly one scratch register. In that case, I would be able to do the entire transform in registers.

But that is unrealistic, and optimisation is possible all the way down to just a few registers. Still, more is potentially better, depending on the answer to this question.

I am thinking of using the MMX registers at least as fast storage, even if I'll need to swap values into other registers for computation. But I'm concerned about MMX being an ancient part of the architecture.

Is data transfer between an MMX register and, say, RAX going to be faster than indexing u64s on the stack and accessing them from what is likely to be L1 cache? And even if it is, are there hidden pitfalls beyond raw speed that I should watch for? I am interested in the general case, so even if one approach were faster than the other on my machine, that might still be inconclusive.

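For concreteness, the two approaches I'm weighing look roughly like this (NASM syntax; the offsets and register numbers are just placeholders):

    ; Option A: state lives on the stack, accessed through L1d
    mov     rax, [rsp + 8*3]    ; load state word 3
    xor     rax, rcx            ; some piece of the transform
    mov     [rsp + 8*3], rax    ; store it back

    ; Option B: state parked in an MMX register
    movq    rax, mm3            ; MMX -> general-purpose register
    xor     rax, rcx            ; same work
    movq    mm3, rax            ; general-purpose register -> MMX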

Solution

  • Using ymm registers as a "memory-like" storage location isn't a win for performance, and MMX wouldn't be either. The use case for that trick is completely avoiding memory accesses that might disturb a micro-benchmark.

    Efficient store-forwarding and fast L1d cache hits make using regular memory (including the stack) perform very well. x86 allows memory operands, like add eax, [rdi], and modern CPUs can decode that to a single uop.

    With MMX you'd need 2 uops, like movd edx, mm0 / add eax, edx. So that's more uops, and more latency: movd or movq latency to or from MMX or XMM registers is comparable to, or worse than, the 3 to 5 cycle store-forwarding latency on typical modern CPUs.

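    In instruction terms the difference looks like this (a sketch; exact uop counts and latencies vary by microarchitecture):

        ; stack version: one fused-domain uop; store-forwarding
        ; handles an earlier store / later reload of the same slot
        add     eax, [rsp + 16]     ; micro-fused load + add

        ; MMX version: two separate uops, plus movd transfer latency
        movd    edx, mm0            ; MMX -> integer register
        add     eax, edx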

    But if you don't need to move data back and forth often, you might be able to usefully keep some of your data in MMX / XMM registers and use pxor mm0, mm1 and so on.

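    For example, a Keccak-style XOR and rotate can stay entirely within MMX (a sketch: mm0 and mm1 stand in for two state lanes, and the rotate count of 1 is only illustrative; MMX has no rotate instruction, so it's emulated with shifts):

        pxor    mm0, mm1        ; 64-bit XOR without touching memory
        movq    mm2, mm0        ; copy for the emulated rotate
        psllq   mm0, 1          ; rotate-left by 1, built as
        psrlq   mm2, 63         ;   (x << 1) | (x >> 63)
        por     mm0, mm2
        ; pitfall: MMX registers alias the x87 registers, so run
        ; emms before any code that uses x87 floating point
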
    If you can schedule your algorithm so that using movd/movq (int<->XMM or int<->MMX) and movq2dq/movdq2q (MMX->XMM / XMM->MMX) instead of stores, memory operands, or loads leaves you with fewer total instructions / uops, then it might be a win.

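    A sketch of those transfer instructions (register choices are arbitrary):

        movq     xmm0, rax      ; int -> XMM
        movq     mm0, rcx       ; int -> MMX
        movq2dq  xmm1, mm0      ; MMX -> XMM
        movdq2q  mm1, xmm0      ; XMM -> MMX
        movq     rdx, xmm1      ; XMM -> int
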
    But on Intel CPUs before Haswell, there are only 3 ALU execution ports, so if you leave the load/store ports idle, the 4-wide superscalar pipeline can bottleneck on ALU throughput (3 uops per clock) before it hits the front-end limit.

    (See https://agner.org/optimize/ and other performance links in the x86 tag wiki.)