After looking at a table of registers in the x86/x64 architecture, I noticed that there's a whole section of 128-, 256-, and 512-bit registers that I've never seen used in assembly or decompiled C/C++ code: XMM(0-15) for 128-bit, YMM(0-15) for 256-bit, and ZMM(0-31) for 512-bit.
After doing a bit of digging, what I've gathered is that you have to use 2 64-bit operations in order to perform math on a 128-bit number, instead of using the generic add, sub, mul, and div operations. If this is the case, then what exactly is the use of having these expanded register sets, and are there any assembly operations you can use to manipulate them?
Those registers are used for SIMD (single instruction, multiple data) operations. Nowadays they're also commonly used in memory operations such as memcpy, because they can move a much larger amount of data per iteration.
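For instance, here's a minimal sketch of that idea with SSE intrinsics (a toy copy loop, not a real memcpy implementation; it assumes n is a multiple of 16):

#include <immintrin.h>
#include <stddef.h>

// Copy 16 bytes per iteration through an XMM register. A real memcpy
// would also handle alignment and the leftover tail bytes.
void copy16(char *dst, const char *src, size_t n)
{
    for (size_t i = 0; i < n; i += 16)
    {
        __m128i v = _mm_loadu_si128((const __m128i *)(src + i));
        _mm_storeu_si128((__m128i *)(dst + i), v);
    }
}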
"you have to use 2 64-bit operations in order to perform math on a 128-bit number"

No, they're not meant for that purpose, and you can't easily use them for 128-bit numbers. It's much, much faster to add two 128-bit numbers with only 2 instructions, add rax, rbx; adc rdx, rcx, instead of the pile of instructions you'd need with XMM registers. They may help in the case of even larger integers, though.
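For instance, with the unsigned __int128 extension in GCC and Clang, the compiler emits exactly that add/adc pair:

// unsigned __int128 is a GCC/Clang extension; on x86-64 this addition
// compiles to one add plus one adc, with no XMM registers involved.
unsigned __int128 add128(unsigned __int128 a, unsigned __int128 b)
{
    return a + b;
}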
Regarding their applications: first, they're used for scalar floating-point operations. If you have a float or double in C or C++, it will most likely be stored in the low part of an XMM register and manipulated by instructions ending in ss (scalar single) or sd (scalar double).
In fact there is another set of eight 80-bit ST(x) registers, which came with the x87 co-processor for doing floating-point math. However, they're slow and less predictable. Slow because operations are done in higher precision by default, which inherently needs more work and also requires a store then a load to round to lower precision if necessary. Unpredictable also because of that high precision: it might feel strange at first, but it's easy to explain; for example, some operations overflow or underflow in float or double precision, but not in long double precision. That causes many bugs or unexpected results between 32-bit and 64-bit builds.¹
Here is a floating-point example on both sets of registers:
// f = x/z + y*z
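In C this corresponds to something like the following (a sketch; the calling convention decides where x, y and z actually live):

float f(float x, float y, float z)
{
    return x / z + y * z;
}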
x87:
fld   dword ptr [esp + 12]   ; load z
fld   st(0)                  ; duplicate z on the register stack
fdivr dword ptr [esp + 4]    ; st(0) = x / z
fxch  st(1)                  ; bring the other copy of z to the top
fmul  dword ptr [esp + 8]    ; st(0) = y * z
faddp st(1)                  ; add and pop: st(0) = x/z + y*z
ret
SSE:
divss xmm0, xmm2   ; xmm0 = x / z
mulss xmm1, xmm2   ; xmm1 = y * z
addss xmm0, xmm1   ; xmm0 = x/z + y*z
ret
AVX:
vdivss xmm0, xmm0, xmm2   ; xmm0 = x / z
vmulss xmm1, xmm1, xmm2   ; xmm1 = y * z
vaddss xmm0, xmm0, xmm1   ; xmm0 = x/z + y*z
ret
The move to the faster and more consistent SSE registers is one of the reasons the 80-bit extended-precision long double type is no longer available in MSVC.
Then Intel introduced the MMX instruction set for SIMD operations, which reuses the same ST(x) registers under the new names MM0-MM7. MMX might stand for Multiple Math eXtension or Matrix Math eXtension, but IMHO it most likely means MultiMedia eXtension, since multimedia and the internet became increasingly important at the time. In multimedia code you very often have to do the same operation to each pixel, texel, sound sample... like this:
for (int i = 0; i < 100000; ++i)
{
    A[i] = B[i] + C[i];
    D[i] = E[i] * F[i];
}
Instead of operating on each element separately, we can speed things up by processing multiple elements at a time, and that's why SIMD was invented. With MMX you can increase the brightness of 8 pixel channels, or the volume of four 16-bit sound samples, at once... An operation on a single element is called scalar; the full register is called a vector, which is a set of scalar values.
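For example, the loop above could look like this with SSE intrinsics (a sketch assuming the arrays hold floats and the count is a multiple of 4):

#include <immintrin.h>

for (int i = 0; i < 100000; i += 4)
{
    __m128 b = _mm_loadu_ps(&B[i]);           // load 4 floats from B
    __m128 c = _mm_loadu_ps(&C[i]);           // load 4 floats from C
    _mm_storeu_ps(&A[i], _mm_add_ps(b, c));   // A[i..i+3] = B + C

    __m128 e = _mm_loadu_ps(&E[i]);
    __m128 f = _mm_loadu_ps(&F[i]);
    _mm_storeu_ps(&D[i], _mm_mul_ps(e, f));   // D[i..i+3] = E * F
}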
Due to MMX's drawbacks (like the reuse of the ST registers and the lack of floating-point support), when extending the SIMD instruction set with Streaming SIMD Extensions (SSE), Intel decided to give it a completely new set of registers named XMM, which are twice as long (128 bits), so now we can operate on 16 bytes at once. SSE also supports multiple floating-point operations at once. Then Intel lengthened XMM to the 256-bit YMM in Advanced Vector Extensions (AVX), and doubled the length once again in AVX-512 (which also increased the number of registers to 32 in 64-bit mode). Now you can work on sixteen 32-bit integers at a time.
From the above you can see the second and most important role of those registers: operating on multiple pieces of data in parallel with a single instruction. For example, SSE4.2 introduced a set of instructions for working on C strings. Now you can count string length, find sub-strings... much faster by checking multiple bytes at once. You can also copy or compare memory a lot faster. Modern memcpy implementations move 16, 32 or 64 bytes at a time, depending on the largest available register width, instead of byte-by-byte as in the simplest C solution.
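As a sketch of the string case, here's a strlen that scans 16 bytes per iteration with SSE2 (it assumes s is 16-byte aligned so the aligned loads can't cross into an unmapped page, and __builtin_ctz is GCC/Clang-specific):

#include <immintrin.h>
#include <stddef.h>

size_t simd_strlen(const char *s)
{
    const __m128i zero = _mm_setzero_si128();
    for (size_t i = 0; ; i += 16)
    {
        __m128i chunk = _mm_load_si128((const __m128i *)(s + i));
        int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, zero)); // 1 bit per zero byte
        if (mask)
            return i + __builtin_ctz(mask);  // index of the first zero byte
    }
}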
Unfortunately compilers are still not great at converting scalar code into parallel code, so most of the time we have to help them, although auto-vectorization keeps getting better and smarter.
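One common way to help is to promise the compiler that arrays don't overlap, for example with C's restrict qualifier (hypothetical function and array names):

// restrict tells the compiler that A, B and C never alias, which lets
// it auto-vectorize the loop safely (compile with optimizations, e.g. -O3).
void add_arrays(float *restrict A, const float *restrict B,
                const float *restrict C, int n)
{
    for (int i = 0; i < n; ++i)
        A[i] = B[i] + C[i];
}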
Due to the importance of SIMD, pretty much every high-performance architecture nowadays has its own version of it, like AltiVec on PowerPC, the V extension on RISC-V, or Neon/SVE on ARM.
¹ Some examples: