I used the following command to extract backtraces leading to user-level L3 misses in a simple evince benchmark:
sudo perf record -d --call-graph dwarf -c 10000 -e mem_load_uops_retired.l3_miss:uppp /opt/evince-3.28.4/bin/evince
As is clear from the command, the sampling period is quite large (10000 events between consecutive samples). For this experiment, the output of perf script had some samples similar to this one:
EvJobScheduler 27529 26441.375932: 10000 mem_load_uops_retired.l3_miss:uppp: 7fffcd5d8ec0 5080022 N/A|SNP N/A|TLB N/A|LCK N/A
7ffff17bec7f bits_image_fetch_separable_convolution_affine+0x2df (inlined)
7ffff17bec7f bits_image_fetch_separable_convolution_affine_pad_x8r8g8b8+0x2df (/usr/lib/x86_64-linux-gnu/libpixman-1.so.0.34.0)
7ffff17d1fd1 general_composite_rect+0x301 (/usr/lib/x86_64-linux-gnu/libpixman-1.so.0.34.0)
ffffffffffffffff [unknown] ([unknown])
At the bottom of the backtrace there is a frame shown as [unknown], which seems OK. But the frame directly above it is a line in general_composite_rect(). Is this backtrace OK?
AFAIK, the first caller in the backtrace should be something like _start() or __GI___clone(). But the backtrace is not in this form. What is wrong?
Is there any way to resolve the issue? Are the truncated (parts of) backtraces reliable?
TL;DR: perf's backtracing may stop at some function if no frame pointer is saved on the stack (fp method) or if no CFI tables are available (dwarf method). Recompile the libraries with -fno-omit-frame-pointer or with -g, or install debuginfo. With release binaries and libraries, perf will often stop the backtrace early, with no chance of reaching top-level functions such as main(), _start or clone()/start_thread().
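To see how many of your samples are affected, the text output of perf script can be scanned for stacks whose outermost frame is not a known root function. The following is a minimal sketch, not part of perf: the helper names (split_samples, is_truncated, count_truncated) and the ROOT_SYMBOLS list are hypothetical, and it assumes the default perf script layout (unindented sample header, indented frame lines, blank line between samples).

```python
import re

# Symbols expected at the root of a complete user-space stack.
# This list is an assumption for illustration; adjust it for your workload.
ROOT_SYMBOLS = {"_start", "__libc_start_main", "main",
                "clone", "__clone", "__GI___clone", "start_thread"}

# One frame line: leading whitespace, hex address, symbol(+offset), "(dso)".
FRAME_RE = re.compile(r"^\s+[0-9a-f]+\s+(.*?)\s+\((.*)\)\s*$")


def split_samples(text):
    """Yield (header, symbols) per sample; symbols run innermost to outermost."""
    header, symbols = None, []
    for line in text.splitlines():
        if not line.strip():              # blank line ends the current sample
            if header is not None:
                yield header, symbols
            header, symbols = None, []
        elif line[0].isspace():           # indented line: one stack frame
            match = FRAME_RE.match(line)
            if match and header is not None:
                symbols.append(match.group(1).split("+0x")[0])
        else:                             # unindented line: new sample header
            if header is not None:
                yield header, symbols
            header, symbols = line, []
    if header is not None:
        yield header, symbols


def is_truncated(symbols):
    """A stack counts as truncated if its outermost frame is not a known root."""
    return not symbols or symbols[-1] not in ROOT_SYMBOLS


def count_truncated(text):
    """Return (truncated, total) over all samples in the given output."""
    stacks = [symbols for _, symbols in split_samples(text)]
    return sum(is_truncated(s) for s in stacks), len(stacks)
```

For example, after perf script > out.txt, calling count_truncated(open("out.txt").read()) gives a rough idea of how often unwinding stopped early. A stack ending in [unknown], like the one in the question, is counted as truncated.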
The perf profiling tool in Linux is a statistical sampling profiler (no binary instrumentation): it programs a software timer, an event source, or the hardware performance monitoring unit (PMU) to generate periodic interrupts. In your example, -c 10000 -e mem_load_uops_retired.l3_miss:uppp selects the hardware PMU of x86_64 in some kind of PEBS mode (https://easyperf.net/blog/2018/06/08/Advanced-profiling-topics-PEBS-and-LBR) to generate an interrupt after every 10000 mem_load_uops_retired events (with the l3_miss mask). The interrupt is handled by the Linux kernel (perf_events subsystem, kernel/events and arch/x86/events). In this handler the PMU is reset (reprogrammed) to generate the next interrupt after 10000 more events, and a sample is generated. Sample data is saved to the perf.data file by the perf record command, and each wakeup of the tool can save thousands of samples; the samples can then be read with perf script or perf script -D.
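Because one sample is taken per 10000 events, each sample stands for roughly 10000 L3-miss loads. A small sketch of that arithmetic follows; the function name and its input (a list with one symbol per sample) are made up for illustration, and the result is a statistical estimate, not an exact count.

```python
from collections import Counter


def estimate_events_by_symbol(sampled_symbols, period=10000):
    """Scale per-symbol sample counts by the sampling period (-c value)."""
    counts = Counter(sampled_symbols)
    return {symbol: n * period for symbol, n in counts.items()}
```

With three samples, two of them in general_composite_rect, the estimate attributes about 20000 events to that function and about 10000 to the other one.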
The perf_events interrupt handler, somewhere near __perf_event_overflow in kernel/events/core.c, has full access to the registers of the current function and has some time for additional data retrieval to record the current time, pid, etc. Part of that process is call stack (https://en.wikipedia.org/wiki/Call_stack) data collection. But on x86_64 with -fomit-frame-pointer (often enabled for many system libraries of Debian/Ubuntu/others) there is no default place in registers or on the function stack where frame pointers are stored:
-fomit-frame-pointer
Don't keep the frame pointer in a register for functions that don't need one. This avoids the instructions to save, set up and restore frame pointers; it also makes an extra register available in many functions. It also makes debugging impossible on some machines.
Starting with GCC version 4.6, the default setting (when not optimizing for size) for 32-bit Linux x86 and 32-bit Darwin x86 targets has been changed to -fomit-frame-pointer. The default can be reverted to -fno-omit-frame-pointer by configuring GCC with the --enable-frame-pointer configure option.
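To make the effect of a missing frame pointer concrete, here is a toy model of frame-pointer unwinding. It is only a sketch of the idea, not the kernel code: the stack is a Python dict from address to value, all addresses are invented, and it assumes the usual x86_64 layout in which the saved caller rbp sits at [rbp] and the return address at [rbp + 8].

```python
def walk_frame_pointers(memory, ip, rbp, max_depth=64):
    """Follow the chain of saved rbp values and collect return addresses."""
    chain = [ip]                          # innermost entry: sampled instruction
    while len(chain) < max_depth:
        saved_rbp = memory.get(rbp)
        return_addr = memory.get(rbp + 8)
        if saved_rbp is None or return_addr is None:
            break                         # rbp is not a readable frame: stop
        chain.append(return_addr)
        if saved_rbp <= rbp:
            break                         # frames must move up; also ends at rbp == 0
        rbp = saved_rbp
    return chain


# Invented stack for _start -> main -> A -> B, every function keeping rbp.
STACK = {
    0x7000: 0x0,    0x7008: 0x400100,     # main's frame, returns into _start
    0x6000: 0x7000, 0x6008: 0x400200,     # A's frame, returns into main
    0x5000: 0x6000, 0x5008: 0x400300,     # B's frame, returns into A
}
```

Sampling inside B with rbp = 0x5000 yields the full chain back to _start. If B was built with -fomit-frame-pointer and uses rbp as a general-purpose register, the walker starts from a meaningless value (say 0xdeadbeef) and returns only the sampled address, which is the kind of early stop seen with the fp method.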
With frame pointers saved on the stack, backtracing/unwinding is easy. But for some functions modern gcc (and other compilers) may not generate a frame pointer. So backtracing code such as the perf_events handler will either stop the backtrace at such a function or need another method of recovering the caller's frame. The --call-graph method option of perf record (-g method in older perf versions) selects the method to be used. It is documented in man perf-record, http://man7.org/linux/man-pages/man1/perf-record.1.html:
--call-graph
Setup and enable call-graph (stack chain/backtrace) recording, implies -g. Default is "fp".
Allows specifying "fp" (frame pointer) or "dwarf" (DWARF's CFI - Call Frame Information) or "lbr" (Hardware Last Branch Record facility) as the method to collect the information used to show the call graphs.
In some systems, where binaries are build with gcc --fomit-frame-pointer, using the "fp" method will produce bogus call graphs, using "dwarf", if available (perf tools linked to the libunwind or libdw library) should be used instead. Using the "lbr" method doesn't require any compiler options. It will produce call graphs from the hardware LBR registers. The main limitation is that it is only available on new Intel platforms, such as Haswell. It can only get user call chain. It doesn't work with branch stack sampling at the same time.
When "dwarf" recording is used, perf also records (user) stack dump when sampled. Default size of the stack dump is 8192 (bytes). User can change the size by passing the size after comma like "--call-graph dwarf,4096".
So the dwarf method reuses CFI tables to find stack frame sizes and locate the caller's stack frame. I'm not sure whether CFI tables are stripped from release libraries by default, but debuginfo will probably have them. LBR will not help because it is a rather short hardware buffer. The dwarf method's split processing (the kernel handler saves part of the stack, and the perf user-space tool parses it later with libdw or libunwind) may lose some parts of the call stack, so also try increasing the size of the dwarf stack dumps, for example --call-graph dwarf,10240 or larger, up to the maximum perf accepts (65528 in the versions I know of; larger values such as 81920 are rejected).
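The effect of the dump size can be pictured with a simplified model: only frames that lie entirely inside the copied stack bytes can be unwound later. The function below is a hypothetical illustration with invented frame sizes; the real unwinder reads CFI tables and register state, so treat this only as a picture of why a larger dump can recover deeper stacks.

```python
def frames_within_dump(frame_sizes, dump_size=8192):
    """Count frames (innermost first) that fit inside the copied stack bytes."""
    used = 0
    recovered = 0
    for size in frame_sizes:
        if used + size > dump_size:
            break                         # this frame extends past the dump
        used += size
        recovered += 1
    return recovered
```

With frame sizes of 64, 4096, 8000 and 128 bytes, the default 8192-byte dump covers only the two innermost frames, while a 16384-byte dump covers all four.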
Backtracing is implemented in the arch-dependent part of perf_events, arch/x86/events/core.c:perf_callchain_user(), which is called from kernel/events/callchain.c:get_perf_callchain() <- perf_callchain <- perf_prepare_sample <- __perf_event_output <- *(event->overflow_handler) <- READ_ONCE(event->overflow_handler)(event, data, regs); of __perf_event_overflow.
Brendan Gregg warned about incomplete call stacks in perf: http://www.brendangregg.com/blog/2014-06-22/perf-cpu-sample.html
Incomplete stacks usually mean -fomit-frame-pointer was used – a compiler optimization that makes little positive difference in the real world, but breaks stack profilers. Always compile with -fno-omit-frame-pointer. More recent perf has a -g dwarf option, to use the alternate libunwind/dwarf method for retrieving stacks.
I also wrote about backtraces in perf, with some additional links, in "How does linux's perf utility understand stack traces?"