performance c++11 x86 floating-point division

Why is float division faster than integer division in C++?


Consider the following code snippets in C++ (Visual Studio 2015).

First Block

const int size = 500000000;
int sum = 0;
int *num1 = new int[size]; // initialized between 1-250
int *num2 = new int[size]; // initialized between 1-250
for (int i = 0; i < size; i++)
{
    sum += (num1[i] / num2[i]);
}

Second Block

const int size = 500000000;
int sum = 0;
float *num1 = new float[size]; // initialized between 1-250
float *num2 = new float[size]; // initialized between 1-250
for (int i = 0; i < size; i++)
{
    sum += (num1[i] / num2[i]);
}

I expected the first block to run faster because it uses an integer operation, but the second block is considerably faster, even though it uses a floating-point operation. Here are the results of my benchmark:

Division:

Type    Time
uint8   879.5ms
uint16  885.284ms
int     982.195ms
float   654.654ms

Floating-point multiplication is also faster than integer multiplication. Here are the results of my benchmark:

Multiplication:

Type    Time
uint8   166.339ms
uint16  524.045ms
int     432.041ms
float   402.109ms

My system specs: CPU Core i7-7700, RAM 64 GB, Visual Studio 2015


Solution

  • Floating-point division is faster than integer division because of the exponent part of the floating-point representation: dividing the exponent parts of two floating-point values requires only a relatively cheap fixed-cost subtraction.

    int32_t division requires a full division of 31-bit magnitudes, whereas float division requires a division of only 24-bit mantissas (the leading one of the mantissa is implied and not stored in the floating-point number) plus a much cheaper subtraction of the 8-bit exponents.

    The hardware divider computes its result iteratively, with a fixed upper bound on the number of iterations for any given operand bit-width, working from the low to the high significant bits of the numerator. When the intermediate value in that loop becomes trivial, the loop can terminate early, hopefully well before that fixed upper bound. This is why CPU division instructions are specified with a latency range rather than a fixed latency.

    See an excellent, detailed explanation of how division is performed in the CPU.


    Your benchmark code snippet is a plain load-modify-store loop, which is the ideal best-case scenario for compilers to vectorize with SIMD instructions. Keep in mind, therefore, that your time measurements don't extrapolate or generalize to more involved loops.