c++optimizationvalgrindkcachegrind

KCachegrind output for optimized vs unoptimized builds


I run valgrind --tool=callgrind ./executable on the executable file generated by the following code:

#include <cstdlib>
#include <stdio.h>
using namespace std;

class XYZ{
public:
    int Count() const {return count;}
    void Count(int val){count = val;}
private:
    int count;
};

int main() {
    XYZ xyz;
    xyz.Count(10000);
    int sum = 0;
    for(int i = 0; i < xyz.Count(); i++){
//My interest is to see how the compiler optimizes the xyz.Count() call
        sum += i;
    }
    printf("Sum is %d\n", sum);
    return 0;
}

I make a debug build with the following options: -fPIC -fno-strict-aliasing -fexceptions -g -std=c++14. The release build is with the following options: -fPIC -fno-strict-aliasing -fexceptions -g -O2 -std=c++14.

Running valgrind generates two dump files. When these files (one file for debug executable, the other for release executable) are viewed in KCachegrind, the debug build is understandable as shown below:

Debug Build

As expected, the function XYZ::Count() const is called 10001 times. However, the optimized release build is much harder to decipher and it is not clear how many times the function is called at all. I am aware that the function call might be inlined. But how does one figure out that it has infact been inlined? The callgraph for the release build is as shown below:

Release Build

There seems to be no indication of function XYZ::Count() const at all from main().

My questions are:

(1)Without looking at the assembly language code generated by the debug/release builds, and by using using KCachegrind, how can one figure out how many times a particular function, (in this case XYZ::Count() const) is called? In the release build call graph above, the function is not even called once.

(2)Is there a way to understand the callgraph and other details provided by KCachegrind for release/optimized builds? I have already looked at the KCachegrind manual available at https://docs.kde.org/trunk5/en/kdesdk/kcachegrind/kcachegrind.pdf, but I was wondering if there are some useful hacks/rules of thumb that one should look for in release builds.


Solution

  • The output of valgrind is easy to understand: As valgrind+kcachegrind are telling you, this function was not called at all in the release build.

    The question is, what do you mean by called? If a function is inlined, is it still "called"? Actually, the situation is more complex, as it seems at the first sight and your example isn't that trivial.

    Was Count() inlined in the release build? Sure, kind of. The code transformation during the optimization is often quite remarkable, like in your case - and the best way to judge, is to look into the resulting assembler (here for clang):

    main:                                   # @main
            pushq   %rax
            leaq    .L.str(%rip), %rdi
            movl    $49995000, %esi         # imm = 0x2FADCF8
            xorl    %eax, %eax
            callq   printf@PLT
            xorl    %eax, %eax
            popq    %rcx
            retq
    .L.str:
            .asciz  "Sum is %d\n"
    

    You can see, that the main doesn't execute the for-loop at all, but just prints the result (49995000), which is calculated during the optimization because the number of iterations is known during the compile-time.

    So was Count() inlined? Yes, somewhere during the first steps of optimization, but then the code became something completely different - there is no place where Count() was inlined in the final assembler.

    So what happens, when we "hide" the number of iteration from compiler? E.g. pass it via the command line:

    ...
    int main(int argc,  char* argv[]) {
       XYZ xyz;
       xyz.Count(atoi(argv[1]));
    ...
    

    In the resulting assembler, we still don't encounter a for-loop, because the optimizer can figure out, that the call of Count() doesn't have side-effect and optimizes the whole thing out:

    main:                                   # @main
            pushq   %rbx
            movq    8(%rsi), %rdi
            xorl    %ebx, %ebx
            xorl    %esi, %esi
            movl    $10, %edx
            callq   strtol@PLT
            testl   %eax, %eax
            jle     .LBB0_2
            leal    -1(%rax), %ecx
            leal    -2(%rax), %edx
            imulq   %rcx, %rdx
            shrq    %rdx
            leal    -1(%rax,%rdx), %ebx
    .LBB0_2:
            leaq    .L.str(%rip), %rdi
            xorl    %eax, %eax
            movl    %ebx, %esi
            callq   printf@PLT
            xorl    %eax, %eax
            popq    %rbx
            retq
    .L.str:
            .asciz  "Sum is %d\n"
    

    The optimizer came up with the formula (n-1)*(n-2)/2 for the sum i=0..n-1!

    Let's now hide the definition of Count() in an separate translation unit class.cpp, so the optimizer cannot see it's definition:

    class XYZ{
    public:
        int Count() const;//definition in separate translation unit
    ...
    

    Now we get our for-loop and a call to Count() in every iteration, the most important part of the assembler is:

    .L6:
            addl    %ebx, %ebp
            addl    $1, %ebx
    .L3:
            movq    %r12, %rdi
            call    XYZ::Count() const@PLT
            cmpl    %eax, %ebx
            jl      .L6
    

    The result of the Count() (in %rax) is compared to the current counter (in %ebx) in every iteration step. Now, if we run it with valgrind we can see in the list of callees, that XYZ::Count() was called 10001 times.

    However, for modern tool-chains it is not enough to see the assembler of the single translation units - there is a thing called link-time-optimization. We can use it by building somewhere along these lines:

    gcc -fPIC -g -O2 -flto -o class.o -c class.cpp
    gcc -fPIC -g -O2 -flto -o test.o  -c test.cpp
    gcc -g -O2 -flto -o test_r class.o test.o
    

    And running the resulting executable with valgrind we once again see, that Count() was not called!

    However looking into the machine code (here I used gcc, my clang-installation seems to have an issue with lto):

    00000000004004a0 <main>:
      4004a0:   48 83 ec 08             sub    $0x8,%rsp
      4004a4:   48 8b 7e 08             mov    0x8(%rsi),%rdi
      4004a8:   ba 0a 00 00 00          mov    $0xa,%edx
      4004ad:   31 f6                   xor    %esi,%esi
      4004af:   e8 bc ff ff ff          callq  400470 <strtol@plt>
      4004b4:   85 c0                   test   %eax,%eax
      4004b6:   7e 2b                   jle    4004e3 <main+0x43>
      4004b8:   89 c1                   mov    %eax,%ecx
      4004ba:   31 d2                   xor    %edx,%edx
      4004bc:   31 c0                   xor    %eax,%eax
      4004be:   66 90                   xchg   %ax,%ax
      4004c0:   01 c2                   add    %eax,%edx
      4004c2:   83 c0 01                add    $0x1,%eax
      4004c5:   39 c8                   cmp    %ecx,%eax
      4004c7:   75 f7                   jne    4004c0 <main+0x20>
      4004c9:   48 8d 35 a4 01 00 00    lea    0x1a4(%rip),%rsi        # 400674 <_IO_stdin_used+0x4>
      4004d0:   bf 01 00 00 00          mov    $0x1,%edi
      4004d5:   31 c0                   xor    %eax,%eax
      4004d7:   e8 a4 ff ff ff          callq  400480 <__printf_chk@plt>
      4004dc:   31 c0                   xor    %eax,%eax
      4004de:   48 83 c4 08             add    $0x8,%rsp
      4004e2:   c3                      retq   
      4004e3:   31 d2                   xor    %edx,%edx
      4004e5:   eb e2                   jmp    4004c9 <main+0x29>
      4004e7:   66 0f 1f 84 00 00 00    nopw   0x0(%rax,%rax,1)
    

    We can see, that the call to the function Count() was inlined but - there is still a for-loop (I guess this is a gcc vs clang thing).

    But what is of most interest to you: the function Count() is "called" only once - its value is saved to register %ecx and the loop is actually only:

      4004c0:   01 c2                   add    %eax,%edx
      4004c2:   83 c0 01                add    $0x1,%eax
      4004c5:   39 c8                   cmp    %ecx,%eax
      4004c7:   75 f7                   jne    4004c0 <main+0x20>
    

    This all you could also see with help of Kcachegrid, if valgrind were run with option `--dump-instr=yes.