I'm benchmarking the overhead of GCC Profile-Guided Optimization instrumentation on the SPEC benchmarks, and I'm getting some odd results: two of the benchmarks actually run faster when instrumented.
The normal executable is compiled with: -g -O2 -march=native
The instrumented executable is compiled with: -g -O2 -march=native -fprofile-generate -fno-vpt
I'm using GCC 4.7 (the Google branch, to be precise). The machine running the benchmarks has an Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz.
bwaves is a Fortran benchmark and libquantum is written in C.
Here are the results (execution times; lower is better):

bwaves-normal: 712.14
bwaves-instrumented: 697.22
=> instrumented is ~2% faster

libquantum-normal: 463.88
libquantum-instrumented: 449.05
=> instrumented is ~3.2% faster
I ran the benchmarks several times, thinking it might be a problem with my machine, but the results were confirmed on every run.
I could understand a very small overhead on some programs, but I don't see any reason for an improvement.
So my question is: how can the GCC-instrumented executable be faster than the normally optimized one?
Thanks
I can think of two possibilities, both relating to cache.
One is that the counter increments "warm" some important cache lines. The second is that the counter structures added by instrumentation shift the data layout, so some heavily used arrays or variables fall onto different cache lines, which can by accident reduce conflict misses compared to the normal build.
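Here is a minimal sketch of the second effect. The counter array is a made-up stand-in for gcov's real counter block, and the actual placement of globals depends on the linker; the point is only that inserting extra data shifts the starting cache line of everything placed after it:

#include <stdio.h>
#include <stdint.h>

double hot_a[1024];          /* heavily used array                     */
uint64_t fake_counters[32];  /* hypothetical stand-in for profile data */
double hot_b[1024];          /* its start address shifts by 256 bytes  */

int main(void)
{
    /* Print the 64-byte cache line each object starts on; without the
     * counter block (the "normal" build), hot_b would start on a
     * different line, changing which lines conflict in the
     * set-associative cache. */
    printf("hot_a starts on line %lu\n",
           (unsigned long)((uintptr_t)(void *)hot_a / 64));
    printf("hot_b starts on line %lu\n",
           (unsigned long)((uintptr_t)(void *)hot_b / 64));
    return 0;
}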
Another point is that the profile counter doesn't have to be bumped on every iteration of a for loop: if the loop contains no 'break' or 'return', the compiler is allowed to hoist the increment out of the loop and add the trip count to the counter once, so the per-iteration overhead can disappear entirely.
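To illustrate the hoisting, here is a minimal sketch in C. The counter name is invented for illustration; the real counters are gcov internals:

#include <stdint.h>

uint64_t edge_counter;  /* hypothetical stand-in for a gcov edge counter */

/* Naive instrumentation: the counter is bumped on every iteration. */
void scale_naive(double *a, int n)
{
    for (int i = 0; i < n; i++) {
        a[i] *= 2.0;
        edge_counter++;               /* executed n times */
    }
}

/* Equivalent code after hoisting: the loop body stays clean and the
 * counter is updated once with the trip count. */
void scale_hoisted(double *a, int n)
{
    for (int i = 0; i < n; i++)
        a[i] *= 2.0;
    if (n > 0)
        edge_counter += (uint64_t)n;  /* one update for the whole loop */
}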