Tags: c++, x86, floating-point, long-double, quadruple-precision

Why is my C++ program so slow when switching from long double to float128?


I program on Unix, using the g++ 4.8.2 compiler. I currently need to convert my C++ program, which at this point uses long double (with a 64-bit significand in my case), into one that uses the __float128 type (with a 113-bit significand). I used the libquadmath0 package and the Boost library to do that, but the resulting program is 10 to 20 times slower than with long double.

This is confusing, since the size of the significand is not that much larger, and I did not observe such a difference when switching from double to long double. Is this timing difference normal, and if not, how can I fix it?

The code:

#include <stdlib.h>
#include <stdint.h>
#include <stdio.h>
#include <math.h>
#include <complex.h>
extern "C" {
#include <quadmath.h>
}
#include <gmp.h>
#include <iomanip>
#include <cfloat>
#include <boost/multiprecision/float128.hpp>


using namespace boost::multiprecision;
using namespace std;

typedef __float128 long_double_t;

int main()
{
...
}

The compile command:

g++ --std=c++11 main.cc -o main -lgmp -lquadmath -Ofast -m64
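
A minimal self-contained example of the kind of switch I mean (a sketch, not my real program) is shown below; the only quad-specific parts are the quadmath.h header and quadmath_snprintf, since printf has no conversion for __float128. It builds with the same kind of command, e.g. g++ -std=c++11 example.cc -o example -lquadmath.

#include <cstdio>
extern "C" {
#include <quadmath.h>
}

typedef __float128 long_double_t;   // same typedef as in the real program

int main()
{
    long_double_t x = 1;
    for (int i = 1; i <= 20; ++i)
        x = x / 3 + i;              // arbitrary work in the chosen type

    // printf cannot format __float128; quadmath_snprintf with the 'Q'
    // length modifier does it instead.
    char buf[128];
    quadmath_snprintf(buf, sizeof buf, "%.33Qg", x);
    std::printf("x = %s\n", buf);
    return 0;
}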

Solution

  • This is confusing, since the size of the significand is not that much larger, and I did not observe such a difference when switching from double to long double

    Take a simple example: use a 12-digit pocket calculator to add two 8-digit numbers and then two 11-digit numbers. Do you see the difference? Now use that calculator to add two 23-digit numbers. Which one do you think will be slower? Obviously the last one needs a lot more operations (and also more space, since you have to write intermediate results down on paper).

    On x86 you have hardware support for IEEE-754 single precision, double precision and the 80-bit extended-precision long double, so operations on those types are done completely in hardware, typically in a single instruction. double + double is no different from long double + long double: in x87 both map to the same FADD instruction. If you use SSE, double will be a bit faster than long double thanks to the newer SIMD registers and instructions.
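
    A quick way to see this is to look at the code the compiler actually emits. The two one-liners below are my own illustration: compile them with something like g++ -O2 -S compare.cc (or paste them into an online compiler explorer) and inspect the assembly. The long double sum typically comes out as a single x87 fadd/faddp instruction, while the __float128 sum is usually lowered to a call into the soft-float runtime (a routine named __addtf3 on typical x86-64 GCC setups).

        // Illustration only: compare the generated assembly of these two functions.
        long double add_ld(long double a, long double b) { return a + b; } // one x87 fadd
        __float128  add_qd(__float128  a, __float128  b) { return a + b; } // call to __addtf3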

    When you use __float128, however, there is no hardware support, so the compiler has to fall back on software emulation, which is far slower. You can't add two __float128 values with a single instruction; everything has to be done manually:

      • unpack the sign, exponent and significand of each operand, and detect special values such as zero, infinity and NaN
      • compare the exponents and shift the significand of the smaller operand so that both are aligned
      • add or subtract the aligned significands, depending on the signs
      • normalize the result, round it correctly and pack it back into the 128-bit format
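
    To make those steps concrete, here is a deliberately simplified sketch of my own (not the code libquadmath actually uses): it adds two positive, normal single-precision floats using only integer operations and ignores rounding and all special cases. Quad precision has to do the same kind of work on 128-bit values, with correct rounding and full special-case handling on top.

        #include <cstdint>
        #include <cstdio>
        #include <cstring>
        #include <utility>

        // Simplified soft-float addition for two positive, normal floats.
        // Real library code must also handle signs, zeros, subnormals,
        // infinities, NaNs and IEEE-correct rounding.
        static float soft_add(float fa, float fb)
        {
            uint32_t a, b;
            std::memcpy(&a, &fa, sizeof a);                      // 1. reinterpret the bit patterns
            std::memcpy(&b, &fb, sizeof b);

            int      ea = int((a >> 23) & 0xFF);                 // 2. unpack the exponents...
            int      eb = int((b >> 23) & 0xFF);
            uint32_t ma = (a & 0x7FFFFFu) | 0x800000u;           //    ...and the significands
            uint32_t mb = (b & 0x7FFFFFu) | 0x800000u;           //    (restore the hidden bit)

            if (ea < eb) { std::swap(ea, eb); std::swap(ma, mb); }
            int shift = ea - eb;                                 // 3. align the smaller operand
            mb = shift < 32 ? mb >> shift : 0u;

            uint32_t m = ma + mb;                                // 4. add the significands
            int      e = ea;
            if (m & 0x1000000u) { m >>= 1; ++e; }                // 5. renormalize on carry-out

            uint32_t r = (uint32_t(e) << 23) | (m & 0x7FFFFFu);  // 6. repack (truncated, not rounded)
            float    out;
            std::memcpy(&out, &r, sizeof out);
            return out;
        }

        int main()
        {
            std::printf("%g\n", soft_add(1.5f, 2.25f));          // prints 3.75
        }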

    Those steps involve several branches (which may be mispredicted), memory loads and stores (because x86 doesn't have many registers), and more, which finally adds up to at least tens of instructions per operation. Being only about 10 times slower at such a complex task is already a great achievement. And we haven't even reached multiplication yet, which is roughly 4 times as much work when the significand width is doubled. Division, square root, exponentiation, trigonometric functions... are far more complicated still and will be significantly slower.
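
    If you want to measure the effect yourself, a rough microbenchmark along the following lines (my own sketch, not code from the question) will reproduce it. Build it with something like g++ -std=c++11 -O2 bench.cc -o bench; the exact ratio depends on the CPU, compiler version and flags, but the long double loop stays in hardware while the __float128 loop spends its time in soft-float library calls.

        #include <chrono>
        #include <cstdio>

        // Run the same multiply-add chain in type T and return the elapsed time in seconds.
        template <typename T>
        static double run(int n)
        {
            T acc = 1;
            auto t0 = std::chrono::steady_clock::now();
            for (int i = 1; i <= n; ++i)
                acc = acc * T(1.0000001) + T(i);   // one multiply and one add per iteration
            auto t1 = std::chrono::steady_clock::now();
            volatile double sink = double(acc);    // keep the loop from being optimized away
            (void)sink;
            return std::chrono::duration<double>(t1 - t0).count();
        }

        int main()
        {
            const int n = 10000000;
            double t_ld = run<long double>(n);
            double t_qd = run<__float128>(n);
            std::printf("long double: %.3f s\n", t_ld);
            std::printf("__float128 : %.3f s (%.1fx slower)\n", t_qd, t_qd / t_ld);
        }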