c++gccassemblybenchmarkingcallgrind

Benchmark a single function


How do you benchmark a function? Looking at results from callgrind, I have found that my program spends a lot of time in pow. Since I do not need full working precision, I thought that I could create a look-up-table and use linear interpolation between the points in the table. To be able to evaluate the look-up-table approach, I need to measure time. So I did this:

#ifdef __WAND__
target[name[test2.exe] type[application] platform[;Windows]]
target[name[test2] type[application]]
#endif

#include <herbs/main/main.h>
#include <herbs/tictoc/tictoc.h>
#include <herbs/array_fixedsize/array_fixedsize.h>
#include <random>
#include <cstdio>
#include <cmath>

class GetRand
    {
    public:
        GetRand(double min,double max):U(min,max){}

        bool operator()(double* val,size_t n,size_t N)
            {
            *val=U(randsource);
            return 1;
            }

    private:
        std::mt19937 randsource;
        std::uniform_real_distribution<double> U;
    };

int MAIN(int argc,charsys_t* argv[])
    {
    Herbs::ArrayFixedsize<double> vals(1024*1024*128,GetRand(-4,4));

    const size_t N=16;
    auto n=N;
    while(n)
        {
        double start=0;
        auto ptr=vals.begin();
            {
            Herbs::TicToc timestamp(start);
            while(ptr!=vals.end())
                {
                pow(2,*ptr);
                ++ptr;
                }
            }
    //  I have set cpu-freq to 1.6 GHz using cpufreq-set
        printf("%.15g\t",1.6e9*start/vals.length());
        --n;
        }
    return 0;
    }

When running this program The output is about 2.25 cycles per iteration. This seems very low, since the implementation of pow seems to be (it callgrind gave me __ieee754_pow).

The benchmark loop in assembly looks like this when compiling for GNU/Linux on x86-64:

    call    _ZN5Herbs6TicTocC1ERd@PLT
    movq    %r14, %rbx
    .p2align 4,,10
    .p2align 3
.L28:
    vmovsd  (%rbx), %xmm1
    vucomisd    .LC6(%rip), %xmm1
    jb  .L25
    vmovsd  .LC7(%rip), %xmm0
    call    pow@PLT
.L25:
    addq    $8, %rbx
    cmpq    %r12, %rbx
    jne .L28
    movq    %rbp, %rdi
    call    _ZN5Herbs6TicTocD1Ev@PLT

At least powis called. Can I trust the output or is there some black magic that eliminates things.


Solution

  • There are few things you need to consider when benchmarking functions.

    1) Make sure that cache misses doesn't influence the results significantly. In your case you iterate through a large array of data where you get tons of cache misses. Use a smaller array instead which easily fits in L1 cache and loop through it several times.

    2) Make sure that you have side effects from function calls you are profiling that compiler can't optimize these calls out. In your case compiler doesn't do a good job optimizing since pow() calls are not optimized out even though there are no side effects. Prefer using integer side effects to avoid anomalies in floating point performance (e.g. raw-cast float to uint32 and add them up instead of doing the addition with floats).

    3) Unroll your loops few times to reduce the overhead of looping. Currently you perform only single pow per loop where the loop adds relatively large overhead for this simple function call.

    4) Profile with full optimizations and inlining enabled.

    5) Run profiling through multiple times to ensure that other processes doesn't influence your results. Pick the best result for comparison (i.e. least amount of interference from other processes).