c++ docker perf libstdc++ flamegraph

How can you get frame-pointer perf call stacks/flamegraphs involving the C++ standard library?


I like the fp method for collecting call stacks with perf record since it's lightweight and less complex than dwarf. However, when I look at the call stacks/flamegraphs I get when a program uses the C++ standard library, they are not correct.

Here is a test program:

#include <algorithm>
#include <iomanip>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

int __attribute__((noinline)) stupid_factorial(int x) {
    std::vector<std::string> xs;
    // Need to convert numbers to strings or it will all get inlined
    for (int i = 0; i < x; ++i) {
        std::stringstream ss;
        ss << std::setw(4) << std::setfill('0') << i;
        xs.push_back(ss.str());
    }
    int res = 1;
    while(std::next_permutation(xs.begin(), xs.end())) {
        res += 1;
    };
    return res;
}

int main() {
    std::cout << stupid_factorial(11) << "\n";
}

And here is the flame graph:

[Image: flame graph of the test program]

It was generated by the following steps on Ubuntu 20.04 in a Docker container:

g++ -Wall -O3 -g -fno-omit-frame-pointer program.cpp -o 6_stl.bin
# Make sure you have libc6-prof and libstdc++6-9-dbg installed
env LD_LIBRARY_PATH=/lib/libc6-prof/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu/debug:${LD_LIBRARY_PATH} perf record -F 1000 --call-graph fp -- ./6_stl.bin
# Make sure you have https://github.com/jonhoo/inferno installed
perf script | inferno-collapse-perf | inferno-flamegraph > flamegraph.svg

The main thing that's wrong with this is that not all functions are children of stupid_factorial, e.g. __memcmp_avx2_movbe. With dwarf, they are. In more complex programs, I have even seen functions like these being outside main. __dynamic_cast for instance is one that often has no parent.

In gdb, I always see correct backtraces, including for the functions that do not appear correctly here. Is it possible to get correct fp call stacks with libstdc++ without compiling it myself (which seems like a lot of work)?

There are also other oddities, though I couldn't reproduce them in Ubuntu 18.04 (outside the Docker container):


Solution

  • With your code on Ubuntu 20.04 x86_64, perf record --call-graph fp (with and without -e cycles:u) gives me a similar flamegraph when viewed with https://speedscope.app (prepare the data with perf script > out.txt and select out.txt in the web app).

    Is it possible to get correct fp call stacks with libstdc++ without compiling it myself (which seems like a lot of work)?

    No. The 'fp' call-graph method is implemented in the Linux kernel in a very simple way: https://elixir.bootlin.com/linux/v5.4/C/ident/perf_callchain_user - https://elixir.bootlin.com/linux/v5.4/source/arch/x86/events/core.c#L2464

    perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs)
    { 
        ...
        fp = (unsigned long __user *)regs->bp;
        perf_callchain_store(entry, regs->ip);
        ...
        // where max_stack is probably around 127 = PERF_MAX_STACK_DEPTH     https://elixir.bootlin.com/linux/v5.4/source/include/uapi/linux/perf_event.h#L1021
        while (entry->nr < entry->max_stack) {
            ...
            if (!valid_user_frame(fp, sizeof(frame)))
                break;
            bytes = __copy_from_user_nmi(&frame.next_frame, fp, sizeof(*fp));
            bytes = __copy_from_user_nmi(&frame.return_address, fp + 1, sizeof(*fp));
    
            perf_callchain_store(entry, frame.return_address);
            fp = (void __user *)frame.next_frame;
        }
    }
    

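    It can't find correct frames for code that does not maintain a frame pointer: anything compiled with -fomit-frame-pointer (the default for optimized GCC builds on x86_64), and hand-written assembly such as __memcmp_avx2_movbe that has no frame-pointer prologue. When a sample lands in such a function, rbp still points at the frame of the last function that set one up, so the first return address the kernel reads points into that function's caller, and the frames in between are missing from the chain.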

    For the incorrect call stacks such as main -> __memcmp_avx2_movbe, the perf.data file contains only the call chain generated by the kernel, with no copy of a user stack fragment and no register data:

    setarch x86_64 -R env LD_LIBRARY_PATH=/lib/libc6-prof/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu/debug:${LD_LIBRARY_PATH} perf record -F 1000 --call-graph fp  -- ./6_stl.bin
    perf script -D | less
    
    869122666352078 0xae0 [0x58]: PERF_RECORD_SAMPLE(IP, 0x4002): 12267/12267: 0x7ffff7d51670 period: 2332683 addr: 0
    ... FP chain: nr:5
    .....  0: fffffffffffffe00
    .....  1: 00007ffff7d51670
    .....  2: 0000555555556452
    .....  3: 00007ffff7be90fb
    .....  4: 00005555555564de
     ... thread: 6_stl.bin:12267
     ...... dso: /usr/lib/libc6-prof/x86_64-linux-gnu/libc-2.31.so
    6_stl.bin 12267 869122.666352:    2332683 cycles: 
                7ffff7d51670 __memcmp_avx2_movbe+0x140 (/usr/lib/libc6-prof/x86_64-linux-gnu/libc-2.31.so)
                555555556452 main+0x12 (/home/user/so/68259699/6_stl.bin)
                7ffff7be90fb __libc_start_main+0x10b (/usr/lib/libc6-prof/x86_64-linux-gnu/libc-2.31.so)
                5555555564de _start+0x2e (/home/user/so/68259699/6_stl.bin)
    

    So with this method the user-space perf tool has no additional information it could use to fix the call stack. With the dwarf method, every sample event carries the registers and a partial dump of the user stack.

    Gdb has full access to the live process and can use any information: all registers, any amount of the process's stack, and additional debug info for the program and its libraries. A thorough, slow backtrace in gdb is not limited by time, security, or an uninterruptible context. The Linux kernel has to record a perf sample quickly; it can't access swapped-out data, debug sections, or debug info files, and it should not do complex parsing (which could have bugs).

    A debug version of libstdc++ may help (sudo apt install libstdc++6-9-dbg), but it is slow. It also did not help me recover the lost backtrace of the asm-implemented __memcmp_avx2_movbe (libc: sysdeps/x86_64/multiarch/memcmp-avx2-movbe.S).

    If you want full backtraces, I think you will have to find a way to recompile the world (or at least all libraries used by your target application). That is probably easier with something like Gentoo, Arch, or Alpine than with Ubuntu.

    If you are only interested in performance, why do you need the flamegraph? A flat profile will catch most of the performance data, and a non-ideal flamegraph can be useful too.