How do you benchmark a function? Looking at results from callgrind, I have found that my program spends a lot of time in pow
. Since I do not need full working precision, I thought that I could create a look-up table and use linear interpolation between the points in the table. To be able to evaluate the look-up-table approach, I need to measure time, so I did this:
#ifdef __WAND__
target[name[test2.exe] type[application] platform[;Windows]]
target[name[test2] type[application]]
#endif
#include <herbs/main/main.h>
#include <herbs/tictoc/tictoc.h>
#include <herbs/array_fixedsize/array_fixedsize.h>
#include <random>
#include <cstdio>
#include <cmath>
class GetRand
{
public:
GetRand(double min,double max):U(min,max){}
bool operator()(double* val,size_t n,size_t N)
{
*val=U(randsource);
return 1;
}
private:
std::mt19937 randsource;
std::uniform_real_distribution<double> U;
};
int MAIN(int argc,charsys_t* argv[])
{
Herbs::ArrayFixedsize<double> vals(1024*1024*128,GetRand(-4,4));
const size_t N=16;
auto n=N;
while(n)
{
double start=0;
auto ptr=vals.begin();
{
Herbs::TicToc timestamp(start);
while(ptr!=vals.end())
{
pow(2,*ptr);
++ptr;
}
}
// I have set cpu-freq to 1.6 GHz using cpufreq-set
printf("%.15g\t",1.6e9*start/vals.length());
--n;
}
return 0;
}
When running this program, the output is about 2.25 cycles per iteration. This seems very low, since the implementation of pow
does not look trivial (callgrind
gave me __ieee754_pow
).
The benchmark loop in assembly looks like this when compiling for GNU/Linux on x86-64:
call _ZN5Herbs6TicTocC1ERd@PLT
movq %r14, %rbx
.p2align 4,,10
.p2align 3
.L28:
vmovsd (%rbx), %xmm1
vucomisd .LC6(%rip), %xmm1
jb .L25
vmovsd .LC7(%rip), %xmm0
call pow@PLT
.L25:
addq $8, %rbx
cmpq %r12, %rbx
jne .L28
movq %rbp, %rdi
call _ZN5Herbs6TicTocD1Ev@PLT
At least pow
is called. Can I trust the output, or is there some black magic that eliminates things?
There are a few things you need to consider when benchmarking functions.
1) Make sure that cache misses don't influence the results significantly. In your case you iterate through a large array of data, which gives you tons of cache misses. Use a smaller array that easily fits in the L1 cache instead, and loop through it several times.
2) Make sure that the function calls you are profiling have side effects, so that the compiler can't optimize the calls away. In your case the compiler doesn't do a good job of optimizing, since the pow()
calls are not optimized out even though they have no side effects. Prefer integer side effects to avoid anomalies in floating-point performance (e.g. raw-cast float to uint32 and add them up instead of doing the addition with floats).
3) Unroll your loops a few times to reduce the overhead of looping. Currently you perform only a single pow per iteration, so the loop adds relatively large overhead to this simple function call.
4) Profile with full optimizations and inlining enabled.
5) Run the profiling several times to ensure that other processes don't influence your results. Pick the best result for comparison (i.e. the one with the least interference from other processes).