I run valgrind --tool=callgrind ./executable
on the executable file generated by the following code:
#include <cstdlib>
#include <stdio.h>
using namespace std;
class XYZ{
public:
int Count() const {return count;}
void Count(int val){count = val;}
private:
int count;
};
int main() {
XYZ xyz;
xyz.Count(10000);
int sum = 0;
for(int i = 0; i < xyz.Count(); i++){
//My interest is to see how the compiler optimizes the xyz.Count() call
sum += i;
}
printf("Sum is %d\n", sum);
return 0;
}
I make a debug
build with the following options: -fPIC -fno-strict-aliasing -fexceptions -g -std=c++14
. The release
build is with the following options: -fPIC -fno-strict-aliasing -fexceptions -g -O2 -std=c++14
.
Running valgrind generates two dump files. When these files (one file for debug executable, the other for release executable) are viewed in KCachegrind, the debug build is understandable as shown below:
As expected, the function XYZ::Count() const
is called 10001 times. However, the optimized release build is much harder to decipher and it is not clear how many times the function is called at all. I am aware that the function call might be inlined
. But how does one figure out that it has infact been inlined? The callgraph for the release build is as shown below:
There seems to be no indication of function XYZ::Count() const
at all from main()
.
My questions are:
(1)Without looking at the assembly language code generated by the debug/release builds, and by using using KCachegrind, how can one figure out how many times a particular function, (in this case XYZ::Count() const
) is called? In the release build call graph above, the function is not even called once.
(2)Is there a way to understand the callgraph and other details provided by KCachegrind for release/optimized builds? I have already looked at the KCachegrind manual available at https://docs.kde.org/trunk5/en/kdesdk/kcachegrind/kcachegrind.pdf, but I was wondering if there are some useful hacks/rules of thumb that one should look for in release builds.
The output of valgrind is easy to understand: As valgrind+kcachegrind are telling you, this function was not called at all in the release build.
The question is, what do you mean by called? If a function is inlined, is it still "called"? Actually, the situation is more complex, as it seems at the first sight and your example isn't that trivial.
Was Count()
inlined in the release build? Sure, kind of. The code transformation during the optimization is often quite remarkable, like in your case - and the best way to judge, is to look into the resulting assembler (here for clang):
main: # @main
pushq %rax
leaq .L.str(%rip), %rdi
movl $49995000, %esi # imm = 0x2FADCF8
xorl %eax, %eax
callq printf@PLT
xorl %eax, %eax
popq %rcx
retq
.L.str:
.asciz "Sum is %d\n"
You can see, that the main
doesn't execute the for-loop at all, but just prints the result (49995000
), which is calculated during the optimization because the number of iterations is known during the compile-time.
So was Count()
inlined? Yes, somewhere during the first steps of optimization, but then the code became something completely different - there is no place where Count()
was inlined in the final assembler.
So what happens, when we "hide" the number of iteration from compiler? E.g. pass it via the command line:
...
int main(int argc, char* argv[]) {
XYZ xyz;
xyz.Count(atoi(argv[1]));
...
In the resulting assembler, we still don't encounter a for-loop, because the optimizer can figure out, that the call of Count()
doesn't have side-effect and optimizes the whole thing out:
main: # @main
pushq %rbx
movq 8(%rsi), %rdi
xorl %ebx, %ebx
xorl %esi, %esi
movl $10, %edx
callq strtol@PLT
testl %eax, %eax
jle .LBB0_2
leal -1(%rax), %ecx
leal -2(%rax), %edx
imulq %rcx, %rdx
shrq %rdx
leal -1(%rax,%rdx), %ebx
.LBB0_2:
leaq .L.str(%rip), %rdi
xorl %eax, %eax
movl %ebx, %esi
callq printf@PLT
xorl %eax, %eax
popq %rbx
retq
.L.str:
.asciz "Sum is %d\n"
The optimizer came up with the formula (n-1)*(n-2)/2
for the sum i=0..n-1
!
Let's now hide the definition of Count()
in an separate translation unit class.cpp
, so the optimizer cannot see it's definition:
class XYZ{
public:
int Count() const;//definition in separate translation unit
...
Now we get our for-loop and a call to Count()
in every iteration, the most important part of the assembler is:
.L6:
addl %ebx, %ebp
addl $1, %ebx
.L3:
movq %r12, %rdi
call XYZ::Count() const@PLT
cmpl %eax, %ebx
jl .L6
The result of the Count()
(in %rax
) is compared to the current counter (in %ebx
) in every iteration step. Now, if we run it with valgrind we can see in the list of callees, that XYZ::Count()
was called 10001
times.
However, for modern tool-chains it is not enough to see the assembler of the single translation units - there is a thing called link-time-optimization
. We can use it by building somewhere along these lines:
gcc -fPIC -g -O2 -flto -o class.o -c class.cpp
gcc -fPIC -g -O2 -flto -o test.o -c test.cpp
gcc -g -O2 -flto -o test_r class.o test.o
And running the resulting executable with valgrind we once again see, that Count()
was not called!
However looking into the machine code (here I used gcc, my clang-installation seems to have an issue with lto):
00000000004004a0 <main>:
4004a0: 48 83 ec 08 sub $0x8,%rsp
4004a4: 48 8b 7e 08 mov 0x8(%rsi),%rdi
4004a8: ba 0a 00 00 00 mov $0xa,%edx
4004ad: 31 f6 xor %esi,%esi
4004af: e8 bc ff ff ff callq 400470 <strtol@plt>
4004b4: 85 c0 test %eax,%eax
4004b6: 7e 2b jle 4004e3 <main+0x43>
4004b8: 89 c1 mov %eax,%ecx
4004ba: 31 d2 xor %edx,%edx
4004bc: 31 c0 xor %eax,%eax
4004be: 66 90 xchg %ax,%ax
4004c0: 01 c2 add %eax,%edx
4004c2: 83 c0 01 add $0x1,%eax
4004c5: 39 c8 cmp %ecx,%eax
4004c7: 75 f7 jne 4004c0 <main+0x20>
4004c9: 48 8d 35 a4 01 00 00 lea 0x1a4(%rip),%rsi # 400674 <_IO_stdin_used+0x4>
4004d0: bf 01 00 00 00 mov $0x1,%edi
4004d5: 31 c0 xor %eax,%eax
4004d7: e8 a4 ff ff ff callq 400480 <__printf_chk@plt>
4004dc: 31 c0 xor %eax,%eax
4004de: 48 83 c4 08 add $0x8,%rsp
4004e2: c3 retq
4004e3: 31 d2 xor %edx,%edx
4004e5: eb e2 jmp 4004c9 <main+0x29>
4004e7: 66 0f 1f 84 00 00 00 nopw 0x0(%rax,%rax,1)
We can see, that the call to the function Count()
was inlined but - there is still a for-loop (I guess this is a gcc vs clang thing).
But what is of most interest to you: the function Count()
is "called" only once - its value is saved to register %ecx
and the loop is actually only:
4004c0: 01 c2 add %eax,%edx
4004c2: 83 c0 01 add $0x1,%eax
4004c5: 39 c8 cmp %ecx,%eax
4004c7: 75 f7 jne 4004c0 <main+0x20>
This all you could also see with help of Kcachegrid, if valgrind were run with option `--dump-instr=yes.