I program on Unix, using the g++ 4.8.2 compiler. I currently need to convert my C++ program, which at this point uses long double (with a 64-bit significand in my case), to a program that uses the __float128 type (with a 113-bit significand). I used the libquadmath0 package and the Boost library to do that, but the resulting program is 10 to 20 times slower than with long double.
This is confusing, since the significand is not that much larger, and I did not observe such a difference when switching from double to long double. Is this timing difference normal, and if not, how can I fix it?
The code:
#include <stdlib.h>
#include <stdint.h>
#include <stdio.h>
#include <math.h>
#include <complex.h>
extern "C" {
#include <quadmath.h>
}
#include <gmp.h>
#include <iomanip>
#include <cfloat>
#include <boost/multiprecision/float128.hpp>
using namespace boost::multiprecision;
using namespace std;
typedef __float128 long_double_t;
int main()
{
...
}
The compiling instructions:
g++ --std=c++11 main.cc -o main -lgmp -lquadmath -Ofast -m64
This is confusing since the size of the significand is not much higher, and I did not observe such a difference when switching from double to long double
Take a simple example: use a 12-digit pocket calculator to add two 8-digit numbers, and then to add two 11-digit numbers. Do you see any difference? Now use that calculator to add two 23-digit numbers. Which do you think will be slower? Obviously the last one needs many more operations (and also more space, since you have to write intermediate results down on paper).
On x86 you have hardware support for IEEE-754 single precision, double precision and the 80-bit extended-precision long double, so operations on those types are done completely in hardware, typically with just a single instruction. double + double is no different from long double + long double: both use the same FADD instruction in x87. If you use SSE, then double will be a bit faster than long double, because it uses the newer SIMD registers and instructions.
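If you want to check which formats your own platform gives you, a minimal sketch like the following prints the significand width of each built-in type. The helper names significand_bits and print_significand_widths are made up for this example. On x86 Linux with GCC, long double normally reports 64, but other platforms may report 53 or 113, so treat that number as an assumption about your setup.

```cpp
#include <cstdio>
#include <limits>

// Number of significand bits of a floating-point type, counting the
// leading bit even when the format stores it implicitly.
template <typename T>
constexpr int significand_bits() {
    return std::numeric_limits<T>::digits;
}

// Print the widths for the three built-in floating-point types.
// float and double are 24 and 53 on IEEE-754 platforms; the value for
// long double depends on the platform and compiler.
void print_significand_widths() {
    std::printf("float:       %d bits\n", significand_bits<float>());
    std::printf("double:      %d bits\n", significand_bits<double>());
    std::printf("long double: %d bits\n", significand_bits<long double>());
}
```

For __float128 the corresponding constant is FLT128_MANT_DIG from quadmath.h, which is 113.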
When you use __float128, however, the compiler needs to use software emulation, which is far slower. You can't add two __float128 values with just two instructions. You need to do everything manually:
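As a rough illustration of that manual work, here is a toy sketch of a software floating-point addition. It is not libquadmath's actual code: ToyFloat, toy_add and toy_to_double are hypothetical names, the format handles positive normalized numbers only, and it truncates instead of rounding properly.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <utility>

// Hypothetical toy format: value = sig * 2^exp, positive numbers only,
// with sig normalized so that its top bit (bit 63) is always set.
struct ToyFloat {
    std::uint64_t sig;
    int exp;
};

const std::uint64_t kTopBit = std::uint64_t{1} << 63;

ToyFloat toy_add(ToyFloat a, ToyFloat b) {
    // Both inputs must be normalized.
    assert((a.sig & kTopBit) != 0 && (b.sig & kTopBit) != 0);

    // 1. Order the operands so that a has the larger exponent.
    if (a.exp < b.exp) std::swap(a, b);

    // 2. Align: shift the smaller operand right by the exponent difference.
    //    Bits shifted out are dropped (truncation, no proper rounding).
    const int shift = a.exp - b.exp;
    const std::uint64_t aligned = (shift >= 64) ? 0 : (b.sig >> shift);

    // 3. Add the significands; unsigned wrap-around signals a carry out.
    std::uint64_t sum = a.sig + aligned;
    int exp = a.exp;
    if (sum < a.sig) {
        // 4. Renormalize: shift right by one and put the carry back on top.
        sum = (sum >> 1) | kTopBit;
        ++exp;
    }
    return ToyFloat{sum, exp};
}

// Convert to double for inspection (low significand bits may be lost).
double toy_to_double(ToyFloat x) {
    return std::ldexp(static_cast<double>(x.sig), x.exp);
}
```

A real implementation also has to deal with signs, correct rounding, subnormals, infinities and NaNs, and with a 113-bit significand that does not fit in a single 64-bit register.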
Those steps include several branches (which may result in branch mispredictions), memory loads and stores (because x86 doesn't have many registers) and many other things that add up to at least tens of instructions. Doing such complex tasks only 10 times slower is already a great achievement. And we haven't even come to multiplication yet, which is 4 times as difficult when the significand width is doubled. Division, square root, exponentiation, trigonometry and so on are far more complicated and will be significantly slower.
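To see where the factor of 4 for multiplication comes from, here is a sketch using plain integers at half the scale: a full 64x64-bit product built only from 32x32-bit multiplications needs four partial products instead of one. A significand that spans two machine words has the same shape. The names U128 and mul_64x64 are invented for this illustration, and the code does not claim to mirror how libquadmath multiplies.

```cpp
#include <cassert>
#include <cstdint>

// 128-bit result stored as two 64-bit halves.
struct U128 {
    std::uint64_t hi;
    std::uint64_t lo;
};

// Full 64x64 -> 128-bit product using only 32x32 -> 64-bit multiplications
// (schoolbook multiplication on two 32-bit "digits" per operand).
U128 mul_64x64(std::uint64_t a, std::uint64_t b) {
    const std::uint64_t mask = 0xFFFFFFFFu;
    const std::uint64_t a_lo = a & mask, a_hi = a >> 32;
    const std::uint64_t b_lo = b & mask, b_hi = b >> 32;

    // Four partial products where a single-word multiply needs one.
    const std::uint64_t ll = a_lo * b_lo;
    const std::uint64_t lh = a_lo * b_hi;
    const std::uint64_t hl = a_hi * b_lo;
    const std::uint64_t hh = a_hi * b_hi;

    // Sum the middle column; it cannot overflow 64 bits because each
    // of the three terms is below 2^32.
    const std::uint64_t mid = (ll >> 32) + (lh & mask) + (hl & mask);

    U128 r;
    r.lo = (mid << 32) | (ll & mask);
    r.hi = hh + (lh >> 32) + (hl >> 32) + (mid >> 32);
    return r;
}
```

On top of the four multiplications come the extra additions and carry handling, and a floating-point multiply still has to normalize and round the result afterwards.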