After looking at a table of registers in the x86/x64 architecture, I noticed that there's a whole section of 128-, 256-, and 512-bit registers that I've never seen used in assembly or decompiled C/C++ code: XMM(0-15) for 128-bit, YMM(0-15) for 256-bit, and ZMM(0-31) for 512-bit.
After doing a bit of digging, what I've gathered is that you have to use 2 64-bit operations in order to perform math on a 128-bit number, instead of using the generic add, sub, mul, and div operations. If this is the case, then what exactly is the use of having these expanded register sets, and are there any assembly operations you can use to manipulate them?
Those registers are used for SIMD (single instruction, multiple data) operations. Nowadays they're also commonly used in memory operations such as memcpy, because they can move a much larger amount of data per iteration.
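For instance, here's a minimal sketch of that idea with SSE intrinsics (a toy copy loop, not a real memcpy implementation; it assumes n is a multiple of 16):

#include <immintrin.h>
#include <stddef.h>

// Copy 16 bytes per iteration through an XMM register. A real memcpy
// would also handle alignment and the leftover tail bytes.
void copy16(char *dst, const char *src, size_t n)
{
    for (size_t i = 0; i < n; i += 16)
    {
        __m128i v = _mm_loadu_si128((const __m128i *)(src + i));
        _mm_storeu_si128((__m128i *)(dst + i), v);
    }
}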
"you have to use 2 64-bit operations in order to perform math on a 128-bit number"

No, they're not meant for that purpose, and you can't easily use them for 128-bit numbers. It's much, much faster to add two 128-bit numbers with only 2 instructions, add rax, rbx; adc rdx, rcx, instead of the pile of instructions you'd need with XMM registers. They may help in the case of even larger integers, though.
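For instance, with the unsigned __int128 extension in GCC and Clang, the compiler emits exactly that add/adc pair:

// unsigned __int128 is a GCC/Clang extension; on x86-64 this addition
// compiles to one add plus one adc, with no XMM registers involved.
unsigned __int128 add128(unsigned __int128 a, unsigned __int128 b)
{
    return a + b;
}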
Regarding their applications: first, they're used for scalar floating-point operations. If you have a float or double in C or C++, it will most likely be stored in the low part of an XMM register and manipulated by instructions ending in ss (scalar single) or sd (scalar double).
In fact there is another set of eight 80-bit ST(x) registers, which came with the x87 co-processor for doing floating-point math. However, they're slow and less predictable. Slow because operations are done in higher precision by default, which inherently needs more work and also requires a store then a load to round to lower precision if necessary. Unpredictable also because of that high precision: it might feel strange at first, but it's easy to explain; for example, some operations overflow or underflow in float or double precision, but not in long double precision. That causes many bugs or unexpected results between 32-bit and 64-bit builds.¹
Here is a floating-point example on both sets of registers:
// f = x/z + y*z
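In C this corresponds to something like the following (a sketch; the calling convention decides where x, y and z actually live):

float f(float x, float y, float z)
{
    return x / z + y * z;
}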
x87:
fld   dword ptr [esp + 12]   ; load z
fld   st(0)                  ; duplicate z on the register stack
fdivr dword ptr [esp + 4]    ; st(0) = x / z
fxch  st(1)                  ; bring the other copy of z to the top
fmul  dword ptr [esp + 8]    ; st(0) = y * z
faddp st(1)                  ; add and pop: st(0) = x/z + y*z
ret
SSE:
divss xmm0, xmm2   ; xmm0 = x / z
mulss xmm1, xmm2   ; xmm1 = y * z
addss xmm0, xmm1   ; xmm0 = x/z + y*z
ret
AVX:
vdivss xmm0, xmm0, xmm2   ; xmm0 = x / z
vmulss xmm1, xmm1, xmm2   ; xmm1 = y * z
vaddss xmm0, xmm0, xmm1   ; xmm0 = x/z + y*z
ret
The move to the faster and more consistent SSE registers is one of the reasons the 80-bit extended-precision long double type is no longer available in MSVC.
Then Intel introduced the MMX instruction set for SIMD operations, which reuses the same ST(x) registers under the new names MM0-MM7. MMX might stand for Multiple Math eXtension or Matrix Math eXtension, but IMHO it most likely means MultiMedia eXtension, since multimedia and the internet became increasingly important at the time. In multimedia code you very often have to do the same operation to each pixel, texel, sound sample... like this:
for (int i = 0; i < 100000; ++i)
{
    A[i] = B[i] + C[i];
    D[i] = E[i] * F[i];
}
Instead of operating on each element separately, we can speed things up by processing multiple elements at a time, and that's why SIMD was invented. With MMX you can increase the brightness of 8 pixel channels, or the volume of four 16-bit sound samples, at once... An operation on a single element is called scalar; the full register is called a vector, which is a set of scalar values.
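For example, the loop above could look like this with SSE intrinsics (a sketch assuming the arrays hold floats and the count is a multiple of 4):

#include <immintrin.h>

for (int i = 0; i < 100000; i += 4)
{
    __m128 b = _mm_loadu_ps(&B[i]);           // load 4 floats from B
    __m128 c = _mm_loadu_ps(&C[i]);           // load 4 floats from C
    _mm_storeu_ps(&A[i], _mm_add_ps(b, c));   // A[i..i+3] = B + C

    __m128 e = _mm_loadu_ps(&E[i]);
    __m128 f = _mm_loadu_ps(&F[i]);
    _mm_storeu_ps(&D[i], _mm_mul_ps(e, f));   // D[i..i+3] = E * F
}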
Due to MMX's drawbacks (like the reuse of the ST registers and the lack of floating-point support), when extending the SIMD instruction set with Streaming SIMD Extensions (SSE), Intel decided to give it a completely new set of registers named XMM, which are twice as long (128 bits), so now we can operate on 16 bytes at once. SSE also supports multiple floating-point operations at once. Then Intel lengthened XMM to the 256-bit YMM in Advanced Vector Extensions (AVX), and doubled the length once again in AVX-512 (which also increased the number of registers to 32 in 64-bit mode). Now you can work on sixteen 32-bit integers at a time.
From the above you can see the second and most important role of those registers: operating on multiple pieces of data in parallel with a single instruction. For example, SSE4.2 introduced a set of instructions for working on C strings. Now you can count string length, find sub-strings... much faster by checking multiple bytes at once. You can also copy or compare memory a lot faster. Modern memcpy implementations move 16, 32 or 64 bytes at a time, depending on the largest available register width, instead of byte-by-byte as in the simplest C solution.
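As a sketch of the string case, here's a strlen that scans 16 bytes per iteration with SSE2 (it assumes s is 16-byte aligned so the aligned loads can't cross into an unmapped page, and __builtin_ctz is GCC/Clang-specific):

#include <immintrin.h>
#include <stddef.h>

size_t simd_strlen(const char *s)
{
    const __m128i zero = _mm_setzero_si128();
    for (size_t i = 0; ; i += 16)
    {
        __m128i chunk = _mm_load_si128((const __m128i *)(s + i));
        int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, zero)); // 1 bit per zero byte
        if (mask)
            return i + __builtin_ctz(mask);  // index of the first zero byte
    }
}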
Unfortunately compilers are still not great at converting scalar code into parallel code, so most of the time we have to help them, although auto-vectorization keeps getting better and smarter.
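One common way to help is to promise the compiler that arrays don't overlap, for example with C's restrict qualifier (hypothetical function and array names):

// restrict tells the compiler that A, B and C never alias, which lets
// it auto-vectorize the loop safely (compile with optimizations, e.g. -O3).
void add_arrays(float *restrict A, const float *restrict B,
                const float *restrict C, int n)
{
    for (int i = 0; i < n; ++i)
        A[i] = B[i] + C[i];
}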
Due to the importance of SIMD, pretty much every high-performance architecture nowadays has its own version of it, like AltiVec on PowerPC, the V extension on RISC-V, or Neon/SVE on ARM.
¹ Some examples: