.net assembly simd sse x87

Why does .NET use SIMD and not x87 for math operations not intrinsic to SIMD?


This is a question of curiosity more than anything else. I was looking at this disassembly (C#, 64-bit, Release mode, VS 2012 RC):

            double a = 10d * Math.Log(20d, 2d);
000000c8  movsd       xmm1,mmword ptr [00000138h] 
000000d0  movsd       xmm0,mmword ptr [00000140h] 
000000d8  call        000000005EDC7F50 
000000dd  movsd       mmword ptr [rsp+58h],xmm0 
000000e3  movsd       xmm0,mmword ptr [rsp+58h] 
000000e9  mulsd       xmm0,mmword ptr [00000148h] 
000000f1  movsd       mmword ptr [rsp+30h],xmm0 
            a = Math.Pow(a, 6d);
000000f7  movsd       xmm1,mmword ptr [00000150h] 
000000ff  movsd       xmm0,mmword ptr [rsp+30h] 
00000105  call        000000005F758220 
0000010a  movsd       mmword ptr [rsp+60h],xmm0 
00000110  movsd       xmm0,mmword ptr [rsp+60h] 
00000116  movsd       mmword ptr [rsp+30h],xmm0 

... and found it odd that the compiler isn't using x87 instructions for the logarithms here (Math.Pow uses logarithms). Of course, I have no idea what code is at the call targets, but I know that SSE has no logarithm instruction, which makes this choice all the more odd. Further, nothing is parallelized here, so why SIMD and not plain x87?

On a lesser note, I also found it odd that the x87 FYL2X instruction isn't being used, which is designed specifically for the case shown in the first line of code.

Can anyone shed any light on this?


Solution

  • There are two separate questions here: first, why the compiler uses SSE registers rather than the x87 floating-point stack for function arguments, and second, why it doesn't just use the single instruction that can compute a logarithm.

    Not using the logarithm instruction is the easiest to explain. The x87 logarithm instruction (FYL2X) is defined to be accurate to 80 bits, whereas you are using a double, which is only 64 bits. Computing a logarithm to 64 bits rather than 80 bits of precision is much faster, and the speed increase more than makes up for having to do it in software rather than in silicon.

    The use of SSE registers is more difficult to explain in a satisfactory way. The simple answer is that the Windows x64 calling convention requires the first four floating-point arguments to a function to be passed in xmm0 through xmm3, with a floating-point result returned in xmm0. That is exactly the pattern around each call in your disassembly.

    The next question is, of course, why the calling convention tells you to do this rather than use the floating-point stack. The answer is that native x64 code rarely uses the x87 FPU at all, using SSE instead. This is because multiplication and division are faster in SSE (the 80-bit vs 64-bit issue again) and because the SSE registers are faster to manipulate: in the FPU you can only access the top of the stack, and rotating the FPU stack is often the slowest operation on a modern processor. In fact, some processors have an extra pipeline stage solely for this purpose.
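
    To make the "software rather than silicon" point concrete, here is a minimal sketch of what FYL2X computes, y * log2(x), done entirely in 64-bit doubles through the runtime's math routines. The class and method names (Fyl2xSketch, ScaledLog2) are made up for illustration. The comparison against a ratio of natural logarithms rests on the assumption that the two-argument Math.Log reduces to that ratio, so the check uses a tolerance rather than exact equality.

```csharp
using System;

static class Fyl2xSketch
{
    // Hypothetical helper: the same mathematical result as x87 FYL2X,
    // y * log2(x), but computed at double (64-bit) precision by the
    // runtime's math routines instead of the 80-bit FPU instruction.
    public static double ScaledLog2(double y, double x)
    {
        return y * Math.Log(x, 2d);
    }

    static void Main()
    {
        // First line of the question: 10d * Math.Log(20d, 2d)
        double a = ScaledLog2(10d, 20d);

        // Assumption: Math.Log(x, b) behaves like Math.Log(x) / Math.Log(b),
        // so the two results should agree to within rounding error.
        double viaRatio = 10d * (Math.Log(20d) / Math.Log(2d));
        Console.WriteLine(a);
        Console.WriteLine(Math.Abs(a - viaRatio) < 1e-12);

        // Second line of the question.
        a = Math.Pow(a, 6d);
        Console.WriteLine(a);
    }
}
```

    Both arguments to ScaledLog2 are doubles, so on x64 you would expect the JIT to hand them over in xmm0 and xmm1, just as in the disassembly above.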