Extended (80-bit) double floating point in x87, not SSE2 - we don't miss it?

I was reading today about researchers discovering that NVIDIA's PhysX libraries use x87 floating point rather than SSE2. Obviously this is suboptimal for parallel datasets where speed trumps precision. However, the article's author goes on to say:

Intel started discouraging the use of x87 with the introduction of the P4 in late 2000. AMD deprecated x87 since the K8 in 2003, as x86-64 is defined with SSE2 support; VIA’s C7 has supported SSE2 since 2005. In 64-bit versions of Windows, x87 is deprecated for user-mode, and prohibited entirely in kernel-mode. Pretty much everyone in the industry has recommended SSE over x87 since 2005 and there are no reasons to use x87, unless software has to run on an embedded Pentium or 486.

I wondered about this. I know that x87 uses 80-bit extended doubles internally to compute values, and SSE2 doesn't. Does this not matter to anyone? It seems surprising to me. I know when I do computations on points, lines and polygons in a plane, values can be surprisingly wrong when doing subtractions, and areas can collapse and lines alias one another due to lack of precision. Using 80-bit values vs. 64-bit values could help, I would imagine.

Is this incorrect? If not, what can we use to perform extended double FP operations if x87 is phased out?

Solution

The biggest problem with x87 is that all register operations are done in 80 bits, whereas most of the time people only use 64-bit floats (i.e. double-precision floats). You load a 64-bit float onto the x87 stack and it gets converted to 80 bits. You do some operations on it in 80 bits, then store it back to memory, which rounds it to 64 bits again. The result can differ from what you would get if every operation had been done in 64 bits, and with an optimizing compiler it is very hard to predict how many conversions a value will go through, so it's hard to verify that you're getting the "correct" answer when doing regression tests.
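To make that concrete, here is a minimal sketch in C. It uses long double explicitly to imitate a value that stays in an 80-bit register, and a volatile double to imitate rounding to 64 bits after each step. The function names are made up for illustration, and the sketch assumes a compiler where long double is the 80-bit x87 format (typical for GCC and Clang on x86 Linux, but not for MSVC, where long double is the same as double).

```c
#include <assert.h>
#include <float.h>
#include <stdio.h>

/* Rounds to a 64-bit double after every operation, as SSE2 scalar code does.
   The volatile store forces the sum out to memory as a double, which drops
   any extra precision the hardware may have carried. */
double round_each_step(double tiny)
{
    volatile double sum = 1.0 + tiny;
    return sum - 1.0;
}

/* Keeps the intermediate in long double and rounds only the final result.
   This imitates a value that stays on the x87 register stack throughout. */
double keep_extended(double tiny)
{
    long double sum = 1.0L + tiny;
    return (double)(sum - 1.0L);
}

/* Prints both results for a term too small to survive in a double sum. */
void show_difference(void)
{
    const double tiny = 0x1p-60; /* 2^-60, below half an ulp of 1.0 */

    /* The double-only path always loses the term. */
    assert(round_each_step(tiny) == 0.0);

    printf("FLT_EVAL_METHOD      = %d\n", (int)FLT_EVAL_METHOD);
    printf("long double mantissa = %d bits\n", LDBL_MANT_DIG);
    printf("rounded each step    = %g\n", round_each_step(tiny));
    printf("kept extended        = %g\n", keep_extended(tiny));
}
```

Call show_difference() from main() to try it. Where long double has a 64-bit mantissa, the second result is 2^-60 instead of 0, so the same source expression gives two different doubles depending on where the rounding happens. FLT_EVAL_METHOD (from <float.h>) reports whether the compiler itself evaluates plain double expressions in a wider format.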

The other problem, which only matters to someone writing assembly (or indirectly writing assembly, in the case of someone writing a code generator for a compiler), is that x87 uses a register stack, whereas SSE uses individually accessible registers. With x87 you have a bunch of extra instructions to manipulate the stack, and I imagine Intel and AMD would rather make their processors run fast with SSE code than try to make those extra stack-manipulation x87 instructions run fast.
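For a feel of the SSE2 register model from C, here is a rough sketch using the intrinsics from <emmintrin.h>. The function name multiply_add is hypothetical, and the code assumes a compiler that defines __SSE2__ or _M_X64 when SSE2 is available; anywhere else it falls back to a plain scalar loop.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define HAVE_SSE2 1
#else
#define HAVE_SSE2 0
#endif

/* Computes out[i] = a[i] * b[i] + c[i] for i in [0, n). */
void multiply_add(const double *a, const double *b, const double *c,
                  double *out, size_t n)
{
    size_t i = 0;

    assert(a != NULL && b != NULL && c != NULL && out != NULL);

#if HAVE_SSE2
    /* Each __m128d value lives in one 128-bit XMM register holding two
       doubles. The compiler can assign any free XMM register to it, so
       no stack shuffling is needed between operations. */
    for (; i + 2 <= n; i += 2) {
        __m128d va = _mm_loadu_pd(a + i);
        __m128d vb = _mm_loadu_pd(b + i);
        __m128d vc = _mm_loadu_pd(c + i);
        _mm_storeu_pd(out + i, _mm_add_pd(_mm_mul_pd(va, vb), vc));
    }
#endif

    /* Scalar tail (and the whole loop when SSE2 is not available). */
    for (; i < n; ++i)
        out[i] = a[i] * b[i] + c[i];
}
```

Every operation here rounds to a 64-bit double, and the packed path handles two elements per instruction, which is the kind of parallel workload the question mentions.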

BTW if you are having problems with inaccuracy, you will want to take a look at the article "What every programmer should know about floating-point arithmetic", and then maybe use an arbitrary precision math library (e.g. GMP) instead.
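Here is a small sketch of the kind of collapse the question describes: twice the signed area of a thin triangle with one corner at the origin. The function names and the coordinates are made up for illustration, and the extended version again assumes that long double is wider than double on your compiler.

```c
#include <assert.h>
#include <float.h>
#include <stdio.h>

/* Twice the signed area of the triangle (0,0), (bx,by), (cx,cy), with each
   product rounded to a 64-bit double before the subtraction. The volatile
   stores prevent wider intermediates or fused multiply-add. */
double cross_double(double bx, double by, double cx, double cy)
{
    volatile double p = bx * cy;
    volatile double q = by * cx;
    return p - q;
}

/* The same formula with the products and the subtraction in long double. */
double cross_extended(double bx, double by, double cx, double cy)
{
    long double p = (long double)bx * cy;
    long double q = (long double)by * cx;
    return (double)(p - q);
}

/* Prints both results for a thin triangle whose exact cross product is 1. */
void show_collapse(void)
{
    /* b = (2^27 + 1, 2^27), c = (2^27 + 2, 2^27 + 1).
       Exactly: (2^27 + 1)^2 - 2^27 * (2^27 + 2) = 1. */
    const double bx = 134217729.0, by = 134217728.0;
    const double cx = 134217730.0, cy = 134217729.0;

    /* (2^27 + 1)^2 needs 55 bits, so the double product loses the final 1. */
    assert(cross_double(bx, by, cx, cy) == 0.0);

    printf("double   : %g\n", cross_double(bx, by, cx, cy));
    printf("extended : %g\n", cross_extended(bx, by, cx, cy));
}
```

In double the triangle collapses to zero area, while a 64-bit mantissa is enough to recover the exact value 1. Note that extended precision only moves the threshold: with somewhat larger coordinates the 80-bit version fails in the same way, which is why an arbitrary precision library is the more robust fix.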