cgcclibm

Why is atan first call much slower than the next ones?


The following code demonstrates that atan computation time can vary a lot:

#include <cstdio>
#include <cstdlib>
#include <cmath>

#include <sys/time.h>
#include <sys/resource.h>
#include <unistd.h>

double get_time()
{
    struct timeval t;
    struct timezone tzp;
    gettimeofday(&t, &tzp);
    return t.tv_sec + t.tv_usec*1e-6;
}

int main() {
    double worst_time = 0.0;
    double best_time = 1e6;

    volatile double x0 = -M_PI/2.0;
    volatile double foo = atan(x0); // SLOW CALL HERE
    volatile double sum = 0.0; // volatile to avoid having tan() call optimized away
    for (double x = x0; x < M_PI/3.0; x += 0.1) {
        volatile double y = x;
        const double start = get_time();
        asm volatile ("":::"memory"); // avoid reordering in -O3
        const double value = atan(y);
        asm volatile ("":::"memory"); // avoid reordering
        const double end = get_time();
        sum += value;

        const double delta = end - start;
        if (delta > worst_time) {
            worst_time = delta;
        }
        if (delta < best_time) {
            best_time = delta;
        }
        printf("* %f (value: %f)\n", delta, y);
    }

    printf("%f / %f\n", worst_time, best_time);

    printf("%f\n", foo);
}

From my machine worst time is around 15us whereas the best time is 0 (too small to be measured).

The average time (not displayed here) on my machine is around 1 or 2 us.

I tried different compilation flags (-O3, linking statically to libm, etc.) but I cannot find what causes the worst time to be much slower. Any idea?

edit: I am using Ubuntu 14.04 - gcc 4.8.4

edit2: replace atan2 by atan. I am not interested by the fact that atan2 is defined piece-wise and different branches may take different times. I am interested in eliminating the outliers which can appear even if atan is called instead of atan2.

edit3:

* 0.000015 (value: -1.570796)
* 0.000000 (value: -1.470796)
* 0.000001 (value: -1.370796)
* 0.000001 (value: -1.270796)
* 0.000000 (value: -1.170796)
* 0.000002 (value: -1.070796)
* 0.000000 (value: -0.970796)
* 0.000001 (value: -0.870796)
* 0.000000 (value: -0.770796)
* 0.000000 (value: -0.670796)
* 0.000001 (value: -0.570796)
* 0.000000 (value: -0.470796)
* 0.000003 (value: -0.370796)
* 0.000001 (value: -0.270796)
* 0.000000 (value: -0.170796)
* 0.000000 (value: -0.070796)
* 0.000001 (value: 0.029204)
* 0.000000 (value: 0.129204)
* 0.000002 (value: 0.229204)
* 0.000001 (value: 0.329204)
* 0.000000 (value: 0.429204)
* 0.000001 (value: 0.529204)
* 0.000001 (value: 0.629204)
* 0.000001 (value: 0.729204)
* 0.000001 (value: 0.829204)
* 0.000001 (value: 0.929204)
* 0.000000 (value: 1.029204)
0.000015 / 0.000000 / 0.000001

edit4:

It appears that the first call is the culprit! The call outside the loop was optimized away by the compiler, if we force atan to be evaluated outside of the loop for x0, all the calls are reasonably fast...

* 0.000000 (value: -1.570796)
* 0.000001 (value: -1.470796)
* 0.000000 (value: -1.370796)
* 0.000002 (value: -1.270796)
* 0.000001 (value: -1.170796)
* 0.000001 (value: -1.070796)
* 0.000000 (value: -0.970796)
* 0.000000 (value: -0.870796)
* 0.000000 (value: -0.770796)
* 0.000001 (value: -0.670796)
* 0.000000 (value: -0.570796)
* 0.000000 (value: -0.470796)
* 0.000006 (value: -0.370796)
* 0.000001 (value: -0.270796)
* 0.000002 (value: -0.170796)
* 0.000001 (value: -0.070796)
* 0.000000 (value: 0.029204)
* 0.000001 (value: 0.129204)
* 0.000003 (value: 0.229204)
* 0.000000 (value: 0.329204)
* 0.000000 (value: 0.429204)
* 0.000000 (value: 0.529204)
* 0.000001 (value: 0.629204)
* 0.000000 (value: 0.729204)
* 0.000000 (value: 0.829204)
* 0.000000 (value: 0.929204)
* 0.000000 (value: 1.029204)
0.000006 / 0.000000

https://ideone.com/vtUuE6


Solution

  • The timing difference is actually caused by pages faults (!). The first time the function is called, the page containing atan2 code is accessed and a page fault occurs. Using mlockall() should improve the situation.