I've written a simple sine-estimation function which uses a lookup table. Out of curiosity, I tried both float and double versions, expecting float to perform a bit better because the smaller table packs into the cache more easily. However, I'm seeing the opposite: the float-based code executes in 7.715 seconds, whereas the double-based code executes in 6.411 seconds. The absolute times aren't relevant, just that there is a significant and repeatable difference. Here is the lookup function:
double fastSinD(double x) {
    int sinIndex = x * (0.5f * SINE_TABLE_SIZE / M_PI);
    double delta = x - sinIndex * (2.f * M_PI / SINE_TABLE_SIZE);
    int cosIndex = sinIndex + SINE_TABLE_SIZE/4;
    sinIndex &= SINE_TABLE_SIZE-1;
    cosIndex &= SINE_TABLE_SIZE-1;
    double sinVal = sineTable_4q[sinIndex];
    double cosVal = sineTable_4q[cosIndex];
    return sinVal + (cosVal - 0.5f*sinVal*delta) * delta;
}
The float-based lookup function is identical except that it uses the float lookup table and a float version of pi. I thought perhaps there were some hidden float-to-double conversions happening, but the assembly shows there aren't any. Here is the asm for the double version:
fastSinD:
.seh_endprologue
vmulsd .LC4(%rip), %xmm0, %xmm1
leaq sineTable_4q(%rip), %rcx
vcvttsd2sil %xmm1, %eax
vxorps %xmm1, %xmm1, %xmm1
leal 128(%rax), %edx
vcvtsi2sdl %eax, %xmm1, %xmm1
andl $511, %eax
vfnmadd132sd .LC5(%rip), %xmm0, %xmm1
vmovsd (%rcx,%rax,8), %xmm2
movl %edx, %eax
vmulsd .LC6(%rip), %xmm2, %xmm0
andl $511, %eax
vfnmadd213sd (%rcx,%rax,8), %xmm1, %xmm0
vfmadd132sd %xmm1, %xmm2, %xmm0
ret
And the float version:
fastSinF:
.seh_endprologue
vmulss .LC7(%rip), %xmm0, %xmm1
leaq f_sineTable_4q(%rip), %rcx
vcvttss2sil %xmm1, %eax
vxorps %xmm1, %xmm1, %xmm1
leal 128(%rax), %edx
vcvtsi2ssl %eax, %xmm1, %xmm1
andl $511, %eax
vfnmadd132ss .LC8(%rip), %xmm0, %xmm1
vmovss (%rcx,%rax,4), %xmm2
movl %edx, %eax
vmulss .LC9(%rip), %xmm2, %xmm0
andl $511, %eax
vfnmadd213ss (%rcx,%rax,4), %xmm1, %xmm0
vfmadd132ss %xmm1, %xmm2, %xmm0
ret
Apart from the expected ss/sd suffixes and operand sizes, the two are instruction-for-instruction identical, which leads me to believe that the double and float instructions somehow don't execute in the same number of cycles. That seems pretty unlikely, though, so I'm stumped. Any ideas why this difference might exist? My CPU is a Ryzen 5 3600X, if that's relevant.
Edit:
Per request in the comments, I decided to create some more benchmarks which include a simplified version of my original calling function. In doing so, I determined that most likely the entire slowdown was due to the M_PI constant defined in math.h being a double literal. Here is my expanded benchmark, which includes the calling function code:
void benchmarkCallingFunc() {
    struct timespec start, finish;
    int data[10000];
    float samples[10000];
    for (int i = 0; i < 10000; i++) {
        data[i] = i;                             // initialise data so the printed checksum is well defined
        samples[i] = ((i*91) % 100000) / 1000.f;
    }

    clock_gettime(CLOCK_REALTIME, &start);
    double result = 0;
    for (int n = 0; n < 10; n++) {
        for (int k = 0; k < 10000; k++) {
            double sum = 0;
            for (int i = 0; i < 4095; i++) {
                double dt = samples[(k+n)%10000]*10 - i;
                float sinc = dt == 0 ? 1 : fastSinD(M_PI * dt) / (M_PI * dt);
                sum += (double)data[(i*k+n) % 10000] * sinc;
            }
            result += sum;
        }
    }
    printf("%f\n", result);
    clock_gettime(CLOCK_REALTIME, &finish);
    printf("Double fast sin took %ld milliseconds.\n",
           (long)((finish.tv_sec - start.tv_sec)*1000 + (finish.tv_nsec - start.tv_nsec) / 1000000L));

    clock_gettime(CLOCK_REALTIME, &start);
    result = 0;
    for (int n = 0; n < 10; n++) {
        for (int k = 0; k < 10000; k++) {
            double sum = 0;
            for (int i = 0; i < 4095; i++) {
                float dt = samples[(k+n)%10000]*10 - i;
                float sinc = dt == 0 ? 1 : fastSinF(M_PI * dt) / (M_PI * dt);
                sum += (float)data[(i*k+n) % 10000] * sinc;
            }
            result += sum;
        }
    }
    printf("%f\n", result);
    clock_gettime(CLOCK_REALTIME, &finish);
    printf("Float fast sin took %ld milliseconds.\n",
           (long)((finish.tv_sec - start.tv_sec)*1000 + (finish.tv_nsec - start.tv_nsec) / 1000000L));
}
With the above code as is, I get ~1450 ms for the doubles and ~1870 ms for the floats. However, if I simply replace M_PI in the float section with a float literal (3.14159265f), the float time drops to 1242 ms. I probably should have noticed that M_PI issue in my original calling code, but it's still interesting and surprising to me that it has such a massive performance impact. It's something like a 50% speedup just from switching M_PI to 3.141592f.
Edit2:
Providing my compilation command here for additional context:
gcc -fdiagnostics-color=always -O3 -march=native benchmark.c -o benchmark.exe
The question of float vs double performance is indeed interesting. On paper they execute in exactly the same time for valid, in-range floating-point values, but in practice there are other considerations that mean floats sometimes end up being slower in real-world computation. Most of the time the effect is a subtle edge effect at the extremes of the dynamic range. The slowdown can be catastrophic if denormals end up persisting in the computation.
I chose a moderately high-order polynomial as the example: enough work that it is easy to benchmark reliably, and long enough to be outside the zone where the compiler does sneaky set-piece code transformations.
The test harness is as simple as I could make it, and it is switchable to be either float-friendly or double-friendly; the timing curves do swap around as expected. It is offered as a work in progress for investigating float vs double performance issues, and suggestions for improvements are welcome. There may be a small remaining bias in favour of doubles.
The 34 coefficients used for exp(x) are completely bonkers: they were deliberately optimised for a range of 0 to 24 at 240-bit precision, to make alternate terms change sign further out. The polynomial still gives pretty good answers, in line with the target equal-ripple precision, for 0 < x < 2 (when evaluated in double precision with double coefficients).
The IEEE 754 floating-point formats determine the number of bits allocated to exponent and mantissa. The crucial point is that roughly 1/256 of float bit patterns are NaNs versus 1/2048 of doubles, and, perhaps more importantly for performance, the same proportions of each are denormalised numbers (numbers without the implicit leading 1 of normal FP). On most modern architectures denormals are either much slower than ordinary floating-point numbers or, depending on settings, optionally forced to zero.
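To put numbers on where the denormal ranges begin, here is a small sketch (assuming a C11 <float.h>, which provides FLT_TRUE_MIN and DBL_TRUE_MIN):

#include <stdio.h>
#include <float.h>

int main(void)
{
    /* anything between zero and the "smallest normal" is a denormal */
    printf("float : smallest normal %g, smallest denormal %g\n",
           FLT_MIN, FLT_TRUE_MIN);    /* about 1.2e-38 and 1.4e-45 */
    printf("double: smallest normal %g, smallest denormal %g\n",
           DBL_MIN, DBL_TRUE_MIN);    /* about 2.2e-308 and 4.9e-324 */
    return 0;
}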
It is common for nominally-float function implementations to work internally in double precision and then truncate the answer to float at the end. That avoids having to worry about denormal float inputs: the 6th power of the smallest possible float denormal is still easily representable inside the normal numerical range of a double.
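A sketch of that pattern, purely illustrative, with exp() standing in for whatever polynomial the real implementation would use:

#include <math.h>

/* Illustrative only: widen at the boundary, do the work in double,
   narrow once at the end. A denormal float input becomes a perfectly
   ordinary normal double here. */
float expf_via_double(float xf)
{
    double x = (double)xf;
    double result = exp(x);      /* stand-in for the real polynomial evaluation */
    return (float)result;
}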
What really hurts floating-point performance is when the highest-power coefficient in a polynomial expansion is small enough that a[n]*x^n is a denormal number (or some other term is, when summing an alternating series). This may not happen much in practice, but when it does the slowdown is very evident - like performance falling off a cliff. These are the graphs of performance vs N, the number of coefficients used (NPOLY):
The performance curves for computing a polynomial of length N are of the form t = overheads + a*N + b*N^2, with typical values 0.38, 0.05 and 0.0033 respectively for normal operation.
The sample coefficients are chosen such that evalpoly_float32 enters denormal hell for NPOLY = 34, so I have only included it in the first graph. I'm a bit puzzled by the size of the quadratic term, which is present to the same extent for both float and double. Set NPOLY to 33 instead and the slowdown vanishes.
The notable difference between the graphs is that when the test harness is set to favour float operations they are quicker, and vice versa for doubles. However, there is a slight time asymmetry: at NPOLY = 33 (where it is largest) the double version is 0.5 s faster than float, whereas the float gain is less than 0.4 s. This small difference is as yet unexplained (I suspect an error on my part), but it is above measurement noise.
Here is the test harness code.
#include <iostream>
#include <stdio.h>     // printf
#include <math.h>      // exp, isinf, isnan
#include <time.h>
#include <immintrin.h>
// these are the experimental parameters for testing
//#define TEST32 // define for float argument and results from myfun
//#define DENORM // define to cause denorm problems for float implementation
//#define BAD // define to torment code with worst case smallest denorm (extremely slow!)
#define NPOLY (34) // alter to reproduce graphs
#ifdef TEST32
#define mydouble float
#else
#define mydouble double
#endif
// coefficients of exp(x) optimised for 0 < x < 24 (only really convergent for modest values of x). Pathological example to highlight differences in implementation
const double p33_64[] = {0.999999103990271660, 1.000084861598846, 0.49866165515625655, 0.17508665464152611, 0.013416217760831450, 0.0669365720171555626, -0.0808178401983054652, 0.0829888211308184115, -0.06243990003233263068, 0.0364451206935418890218,
-0.01684238787668379099, 0.00628454669399201021, -0.00192184480535264070, 0.000487501052042129623, -0.000103569990566160962, 1.85712682561810737e-05, -2.82769484712378898e-06, 3.672997686748407122e-07, -4.083754859966059115e-08, 3.8947467903391254762e-09,
-3.189318662795476291e-10, 2.24188877398102594e-11, -1.350740685008809765e-12, 6.9550165609442109660e-14, -3.0464734359053801115e-15, 1.127693385725939962188e-16, -3.49515434460312717e-18, 8.95528999573909913e-20, -1.863381152930554034e-21, 3.0695198755360081883e-23,
-3.852964569561991321e-25, 3.46396211035506872e-27, -1.987923483445060738e-29, 5.4772346513326323929e-32 };
const float p33_32[] = { 0.999999104f, 1.00008486f, 0.498661655f, 0.17508665f, 0.0134162177f, 0.0669365720f, -0.080817840f, 0.082988821f, -0.0624399000f, 0.0364451207f,
-0.0168423878f, 0.006284546694f, -0.001921844805f, 0.00048750105f, -0.0001035699906f, 1.85712683e-05f, -2.82769485e-06f, 3.6729976867e-07f, -4.08375486e-08f, 3.894746790e-09f,
-3.18931866e-10f, 2.241888774e-11f, -1.350740685e-12f, 6.955016561e-14f, -3.04647344e-15f, 1.12769339e-16f, -3.49515434e-18f, 8.955289996e-20f, -1.863381153e-21f, 3.069519875e-23f,
-3.852964570e-25f, 3.463962110e-27f, -1.987923483e-29f, 5.477234651e-32f };
mydouble null(mydouble x, const double* a)
{
    return x;
}

mydouble evalpoly32_naive(mydouble x, const double* a)
{
    int i;
    float xn, sum = a[0];
    xn = 1.0f;
    for (i = 1; i < NPOLY; i++)
    {
        xn *= x; // underflow and overflow both issues for higher powers
        sum += xn * a[i];
    }
    return (mydouble)sum;
}

mydouble evalpoly_loop(mydouble x, const double* a)
{
    int i = NPOLY;
    double sum = a[--i];
    do
        sum = sum * x + a[--i];
    while (i);
    return (mydouble) sum;
}

mydouble evalpoly32_loop(mydouble xin, const double* a)
{
    int i = NPOLY;
    float x, sum = p33_32[--i]; // direct access to 32 bit coeffs
    x = (float) xin;
    do
        sum = sum * x + p33_32[--i];
    while (i);
    return (mydouble) sum;
}

mydouble evalpoly64_loop(mydouble x, const double* a)
{
    int i = NPOLY;
    double sum = p33_64[--i]; // direct access to 64 bit coeffs
    do
        sum = sum * x + p33_64[--i];
    while (i);
    return (mydouble) sum;
}
float timefun(mydouble (*myfun)(mydouble, const double*))
{
    mydouble sum;
    float x, dx, ox;
    clock_t tstart, tend;
    sum = 0.0f;
#ifdef DENORM
#ifdef BAD
    dx = 0x0.000004p-127f; // smallest float denorm **WARNING some routines take minutes to complete!
#else
    dx = 0x1.000000p-64; // all higher powers of x are denorm
#endif
#else
    dx = 0x1.0p-54; // Goldilocks misbehaviour OK for NPOLY == 33 !
#endif
    tstart = clock();
    x = ox = 0.0f;
    while (x <= 64.0f) // decrease this for faster runtimes and fewer overflows
    {
        sum += (*myfun)((mydouble) x, p33_64);
        if (ox == x) dx += dx;
        ox = x;
        x += dx;
    }
    tend = clock();
    printf(" %g", (float)(tend - tstart) / CLOCKS_PER_SEC);
    if (isinf(sum)) printf(" timing sum infinite!\n");
    if (isnan(sum)) printf(" timing sum is NaN\n");
    printf("\n"); // to make clear space in list
    return (float)(tend - tstart) / CLOCKS_PER_SEC;
}
void testfun(const char *name, mydouble (*myfun)(mydouble, const double*))
{
    mydouble x, dx, ox, err, ref, res, xmax, xmin, minerr, maxerr;
    static bool doheader = true;
    if (doheader)
    {
        printf("Routine name \t Description \t max_err \t x_max \t min_err \t x_min \t Time \n");
        doheader = false;
    }
    xmin = xmax = minerr = maxerr = x = 0.0f;
    ox = -1.0;
    dx = 0x1.0p-24; // safe value
    {
        x = 0.0;
        while (x <= 2.0f)
        {
            ref = exp(x);
            res = (*myfun)(x, p33_64);
            err = ref - res;
            if (err > maxerr)
            {
                maxerr = err;
                xmax = x;
            }
            if (err < minerr)
            {
                minerr = err;
                xmin = x;
            }
            if (ox == x) dx += dx;
            ox = x;
            x += dx;
        }
    }
    printf("%-32s %-12g @ %-11g %12g @ %11g ", name, maxerr, xmax, minerr, xmin);
    timefun(myfun);
}
int main()
{
    if (sizeof(mydouble) == 4) printf("float benchmarks "); else printf("double benchmarks ");
    printf(" NPOLY = %i\n", NPOLY);
    testfun("null return x overheads ", null);
    testfun("evalpoly64_loop double & loop ", evalpoly64_loop);
    testfun("evalpoly_loop double & loop ", evalpoly_loop);
    testfun("evalpoly32_loop float & loop ", evalpoly32_loop);
    testfun("evalpoly32_naive double & float", evalpoly32_naive);
    testfun("null return x overheads ", null);
}
I also included the worst possible way of computing a polynomial, evalpoly32_naive, as an example where it is obvious how things go horribly wrong. You might want to comment it out if you are doing serious tests. A better, more comprehensive test harness for probing float functions was described by NJuffa in his post about erfc(x). I was really only concerned here with testing the functionality over a restricted range of valid real values, and timing.
I also learned of some similarities and some very curious distinctions between compilers. With these test data both produce hard-inlined code, but Intel makes a better fist of it here.
mydouble evalpoly_loop(mydouble x, const double* a)
{
int i = NPOLY;
double sum = a[--i];
00007FF67ABA1320 C5 FB 10 8A 08 01 00 00 vmovsd xmm1,qword ptr [rdx+108h]
do
sum = sum * x + a[--i];
00007FF67ABA1328 C5 FA 5A C0 vcvtss2sd xmm0,xmm0,xmm0
00007FF67ABA132C C4 E2 F9 A9 8A 00 01 00 00 vfmadd213sd xmm1,xmm0,mmword ptr [rdx+100h]
00007FF67ABA1335 C4 E2 F9 A9 8A F8 00 00 00 vfmadd213sd xmm1,xmm0,mmword ptr [rdx+0F8h]
00007FF67ABA133E C4 E2 F9 A9 8A F0 00 00 00 vfmadd213sd xmm1,xmm0,mmword ptr [rdx+0F0h]
00007FF67ABA1347 C4 E2 F9 A9 8A E8 00 00 00 vfmadd213sd xmm1,xmm0,mmword ptr [rdx+0E8h]
00007FF67ABA1350 C4 E2 F9 A9 8A E0 00 00 00 vfmadd213sd xmm1,xmm0,mmword ptr [rdx+0E0h]
00007FF67ABA1359 C4 E2 F9 A9 8A D8 00 00 00 vfmadd213sd xmm1,xmm0,mmword ptr [rdx+0D8h]
00007FF67ABA1362 C4 E2 F9 A9 8A D0 00 00 00 vfmadd213sd xmm1,xmm0,mmword ptr [rdx+0D0h]
00007FF67ABA136B C4 E2 F9 A9 8A C8 00 00 00 vfmadd213sd xmm1,xmm0,mmword ptr [rdx+0C8h]
00007FF67ABA1374 C4 E2 F9 A9 8A C0 00 00 00 vfmadd213sd xmm1,xmm0,mmword ptr [rdx+0C0h]
00007FF67ABA137D C4 E2 F9 A9 8A B8 00 00 00 vfmadd213sd xmm1,xmm0,mmword ptr [rdx+0B8h]
00007FF67ABA1386 C4 E2 F9 A9 8A B0 00 00 00 vfmadd213sd xmm1,xmm0,mmword ptr [rdx+0B0h]
00007FF67ABA138F C4 E2 F9 A9 8A A8 00 00 00 vfmadd213sd xmm1,xmm0,mmword ptr [rdx+0A8h]
00007FF67ABA1398 C4 E2 F9 A9 8A A0 00 00 00 vfmadd213sd xmm1,xmm0,mmword ptr [rdx+0A0h]
00007FF67ABA13A1 C4 E2 F9 A9 8A 98 00 00 00 vfmadd213sd xmm1,xmm0,mmword ptr [rdx+98h]
00007FF67ABA13AA C4 E2 F9 A9 8A 90 00 00 00 vfmadd213sd xmm1,xmm0,mmword ptr [rdx+90h]
00007FF67ABA13B3 C4 E2 F9 A9 8A 88 00 00 00 vfmadd213sd xmm1,xmm0,mmword ptr [rdx+88h]
00007FF67ABA13BC C4 E2 F9 A9 8A 80 00 00 00 vfmadd213sd xmm1,xmm0,mmword ptr [rdx+80h]
00007FF67ABA13C5 C4 E2 F9 A9 4A 78 vfmadd213sd xmm1,xmm0,mmword ptr [rdx+78h]
00007FF67ABA13CB C4 E2 F9 A9 4A 70 vfmadd213sd xmm1,xmm0,mmword ptr [rdx+70h]
00007FF67ABA13D1 C4 E2 F9 A9 4A 68 vfmadd213sd xmm1,xmm0,mmword ptr [rdx+68h]
00007FF67ABA13D7 C4 E2 F9 A9 4A 60 vfmadd213sd xmm1,xmm0,mmword ptr [rdx+60h]
00007FF67ABA13DD C4 E2 F9 A9 4A 58 vfmadd213sd xmm1,xmm0,mmword ptr [rdx+58h]
00007FF67ABA13E3 C4 E2 F9 A9 4A 50 vfmadd213sd xmm1,xmm0,mmword ptr [rdx+50h]
00007FF67ABA13E9 C4 E2 F9 A9 4A 48 vfmadd213sd xmm1,xmm0,mmword ptr [rdx+48h]
00007FF67ABA13EF C4 E2 F9 A9 4A 40 vfmadd213sd xmm1,xmm0,mmword ptr [rdx+40h]
00007FF67ABA13F5 C4 E2 F9 A9 4A 38 vfmadd213sd xmm1,xmm0,mmword ptr [rdx+38h]
00007FF67ABA13FB C4 E2 F9 A9 4A 30 vfmadd213sd xmm1,xmm0,mmword ptr [rdx+30h]
00007FF67ABA1401 C4 E2 F9 A9 4A 28 vfmadd213sd xmm1,xmm0,mmword ptr [rdx+28h]
00007FF67ABA1407 C4 E2 F9 A9 4A 20 vfmadd213sd xmm1,xmm0,mmword ptr [rdx+20h]
00007FF67ABA140D C4 E2 F9 A9 4A 18 vfmadd213sd xmm1,xmm0,mmword ptr [rdx+18h]
00007FF67ABA1413 C4 E2 F9 A9 4A 10 vfmadd213sd xmm1,xmm0,mmword ptr [rdx+10h]
00007FF67ABA1419 C4 E2 F9 A9 4A 08 vfmadd213sd xmm1,xmm0,mmword ptr [rdx+8]
00007FF67ABA141F C4 E2 F9 A9 0A vfmadd213sd xmm1,xmm0,mmword ptr [rdx]
while (i);
return (mydouble) sum;
00007FF67ABA1424 C5 F3 5A C1 vcvtsd2ss xmm0,xmm1,xmm1
}
The corresponding code generated by MSVC uses vmulsd and vaddsd instead.
Intel ICX, which I also use, has DAZ (Denormals Are Zero) enabled by default, so the default execution times there are identical even in the cases where denormal numbers slow processing under Microsoft C. I haven't tried it yet on GCC; I'd be interested to know what the results are.
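For a like-for-like comparison, the same behaviour can be requested explicitly under MSVC or GCC via the MXCSR control macros that <immintrin.h> already pulls in. A minimal sketch (call it once per thread, before the timing runs):

#include <immintrin.h>

/* Flush-to-zero squashes denormal results, denormals-are-zero squashes
   denormal inputs; together they roughly match the ICX default. */
static void enable_ftz_daz(void)
{
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
}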
I will need to obfuscate the compile-time value of NPOLY to make it really generate a loop! (As such, the routines are somewhat misnamed, since they have all been inlined and fully unrolled.)
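Something along these lines should do it if added to the harness above; a sketch in which a volatile copy of the trip count hides NPOLY from constant propagation (evalpoly_loop_rt is just an illustrative name):

/* Sketch: hide NPOLY behind a volatile so constant propagation cannot
   see the trip count and unroll the loop; the arithmetic is unchanged
   from evalpoly_loop. */
static volatile int npoly_runtime = NPOLY;

mydouble evalpoly_loop_rt(mydouble x, const double* a)
{
    int i = npoly_runtime;
    double sum = a[--i];
    do
        sum = sum * x + a[--i];
    while (i);
    return (mydouble) sum;
}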
Tests were conducted on MSVC 17.1 & ICX 2024.1. Maximum optimisation but no function inlining.
Sample timing results (in seconds) from my i5-12600 with TEST32 & DENORM set:

| function         | MSVC 17.1 | ICX 2024.1 |
|------------------|-----------|------------|
| null             | 0.53      | 1.27       |
| evalpoly64_loop  | 6.01      | 2.67       |
| evalpoly_loop    | 6.03      | 2.65       |
| evalpoly32_loop  | 18.3      | 2.65       |
| evalpoly32_naive | 48.7      | 21.1       |
Set the denormal option to BAD as well and the last one can take minutes to complete!
I will post a few questions arising from my exploration of this theme, and about compiler code generation, in due course; some of what I found whilst trying to piece together this float vs double performance test is rather odd. I am now convinced that float and double arithmetic timings for +, - and * are absolutely identical for valid IEEE 754 numbers.
Division is another matter though: that is where float really does have a speed edge over double.
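A minimal sketch of one way to see this for yourself: a chain of dependent divisions so that latency dominates. The iteration count and the clock()-based timing are only illustrative, not a rigorous benchmark.

#include <stdio.h>
#include <time.h>

#define DIV_ITERS 200000000L

int main(void)
{
    volatile float  f = 1.0f;   /* volatile keeps the loops from being optimised away */
    volatile double d = 1.0;
    clock_t t0 = clock();
    for (long i = 0; i < DIV_ITERS; i++) f = f / 1.000001f + 0.000001f;
    clock_t t1 = clock();
    for (long i = 0; i < DIV_ITERS; i++) d = d / 1.000001  + 0.000001;
    clock_t t2 = clock();
    printf("float  division loop: %.2f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
    printf("double division loop: %.2f s\n", (double)(t2 - t1) / CLOCKS_PER_SEC);
    return 0;
}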