cgccoptimizationlikely-unlikely

learning sample of likely() and unlikely() compiler hints


How can I demonstrate for students the usability of likely and unlikely compiler hints (__builtin_expect)?

Can you write an sample code, which will be several times faster with these hints comparing the code without hints.


Solution

  • Here is the one I use, a really inefficient implementation of the Fibonacci numbers:

    #include <stdio.h>
    #include <inttypes.h>
    #include <time.h>
    #include <assert.h>
    
    #define likely(x) __builtin_expect((x),1)
    #define unlikely(x) __builtin_expect((x),0)
    
    uint64_t fib(uint64_t n)
    {
        if (opt(n == 0 || n == 1)) {
            return n;
        } else {
            return fib(n - 2) + fib(n - 1);
        }
    }
    
    int main(int argc, char **argv)
    {
        int i, max = 45;
        clock_t tm;
    
        if (argc == 2) {
            max = atoi(argv[1]);
            assert(max > 0);
        } else {
            assert(argc == 1);
        }
    
        tm = -clock();
        for (i = 0; i <= max; ++i)
            printf("fib(%d) = %" PRIu64 "\n", i, fib(i));
        tm += clock();
    
        printf("Time elapsed: %.3fs\n", (double)tm / CLOCKS_PER_SEC);
        return 0;
    }
    

    To demonstrate, using GCC:

    ~% gcc -O2 -Dopt= -o test-nrm test.c
    ~% ./test-nrm
    ...
    fib(45) = 1134903170
    Time elapsed: 34.290s
    
    ~% gcc -O2 -Dopt=unlikely -o test-opt test.c
    ~% ./test-opt
    ...
    fib(45) = 1134903170
    Time elapsed: 33.530s
    

    A few hundred milliseconds less. This gain is due to the programmer-aided branch prediction.

    But now, for what the programmer should really be doing instead:

    ~% gcc -O2 -Dopt= -fprofile-generate -o test.prof test.c
    ~% ./test.prof 
    ...
    fib(45) = 1134903170
    Time elapsed: 77.530s  /this run is slowed down by profile generation.
    
    ~% gcc -O2 -Dopt= -fprofile-use -o test.good test.c
    ~% ./test.good
    fib(45) = 1134903170
    Time elapsed: 17.760s
    

    With compiler-aided runtime profiling, we managed to reduce from the original 34.290s to 17.760s. Much better than with programmer-aided branch prediction!