Tags: c, assembly, floating-point, sse, x87

Why do modern compilers prefer SSE over the x87 FPU for scalar floating-point operations?


I recently tried to read the assembly of my compiled code and found that many floating-point operations are performed using XMM registers and SSE instructions. For example, the following code:

float square(float a) {
    float b = a + (a * a);
    return b;
} 

is compiled (without optimization) into

push    rbp
mov     rbp, rsp
movss   DWORD PTR [rbp-20], xmm0
movss   xmm0, DWORD PTR [rbp-20]
mulss   xmm0, xmm0
movss   xmm1, DWORD PTR [rbp-20]
addss   xmm0, xmm1
movss   DWORD PTR [rbp-4], xmm0
movss   xmm0, DWORD PTR [rbp-4]
pop     rbp
ret

The result is similar with other compilers: https://godbolt.org/z/G988PGo6j

And with the -O3 flag:

movaps  xmm1, xmm0
mulss   xmm0, xmm0
addss   xmm0, xmm1
ret
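
For comparison, the same function compiled for the x87 FPU (for example with -m32 -mfpmath=387 -O3; the exact output varies by compiler and flags, so take this as a rough sketch) looks something like

fld     DWORD PTR [esp+4]
fld     st(0)
fmul    st, st(0)
faddp   st(1), st
ret

where the argument is loaded from the stack, squared on a copy, and the sum is left in st(0) as the return value.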

Does this mean that operations using SIMD registers and instructions are usually faster than the equivalent operations done with the x87 FPU and its registers?

Also, I'm curious about specific cases where the compiler's decision to use SSE might fail.


Solution

  • SSE was developed as a replacement for the x87 FPU because the x87's design is rather idiosyncratic and hard to generate good code for. The main issues are:

    • the registers are organised as a stack (st(0)–st(7)) rather than a flat register file, which makes register allocation and instruction scheduling awkward;
    • arithmetic is performed in 80-bit extended precision by default, so results can change depending on when the compiler spills intermediate values to memory;
    • changing the rounding mode or truncating to integer requires rewriting the FPU control word, which is slow;
    • the classic comparison instructions only set the FPU status word, which then has to be transferred to the CPU flags before branching.

    I recommend using the x87 FPU only if code size is an issue or if you need the 80-bit floating-point format (see the sketch below). Otherwise, stick with SSE or (on recent processors) AVX.
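
    As an illustration of the 80-bit case, here is a minimal sketch, assuming GCC or Clang on an x86-64 Unix ABI where long double is the x87 80-bit extended format: the long double arithmetic is still compiled to x87 instructions, while the float arithmetic goes through SSE. The values and printf formats are just for illustration.

#include <float.h>
#include <stdio.h>

int main(void) {
    /* long double is the 80-bit x87 extended format here (LDBL_MANT_DIG == 64),
       so this computation is done with x87 instructions (fld/fmul/faddp)... */
    long double a = 1.0000000001L;
    long double b = a + a * a;

    /* ...while plain float math is compiled to SSE (mulss/addss). */
    float x = 1.0000000001f;
    float y = x + x * x;

    printf("long double (%d-bit mantissa): %.21Lg\n", LDBL_MANT_DIG, b);
    printf("float       (%d-bit mantissa): %.9g\n", FLT_MANT_DIG, (double)y);
    return 0;
}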