linuxprofilingwaitperf

How to use linux `perf` tool to generate "Off-CPU" profile


Brendan D. Gregg (author of DTrace book) has interesting variant of profiling: the "Off-CPU" profiling (and Off-CPU Flame Graph; slides 2013, p112-137) to see, where the thread or application were blocked (was not executed by CPU, but waiting for I/O, pagefault handler, or descheduled due short of CPU resources):

This time reveals which code-paths are blocked and waiting while off-CPU, and for how long exactly. This differs from traditional profiling which often samples the activity of threads at a given interval, and (usually) only examine threads if they are executing work on-CPU.

He also can combine Off-CPU profile data and On-CPU profile together: http://www.brendangregg.com/FlameGraphs/hotcoldflamegraphs.html

The examples given by Gregg are made using dtrace, which is not usually available in Linux OS. But there are some similar tools (ktap, systemtap, perf) and the perf as I think has widest installed base. Usually perf generated On-CPU profiles (which functions were executed more often on CPU).

PS: There is link to systemtap variant of Off-CPU flamegraphs in the slides from LISA13, p124: "Yichun Zhang created these, and has been using them on Linux with SystemTap to collect the profile data. See: • http://agentzh.org/misc/slides/off-cpu-flame-graphs.pdf"" (CloudFlare Beer Meeting on 23 August 2013)


Solution

  • The perf technique I published[1] was a high-overhead workaround, until perf has BPF support for doing this.

    Right now, the lowest cost way of generating an off-CPU flame graph on Linux is on a 4.6+ kernel (which has BPF stack trace support), and with bcc/BPF. I wrote a tool for it, offcputime[2], which can be run with a -f option for "folded output", suitable for feeding into flamegraph.pl. This offcputime tool does the timing and stack counting all in kernel content, and dumps a report that is then printed with symbols.

    One day, I expect that perf itself will be able to do this as well: run a BPF program that does the in-kernel counting, and dumping of a report.

    In the meantime, we can use bcc/BPF. If for some reason you can't use bcc, you can, right now, take that offcputime program and write it in C. A more complicated version is available in the Linux source, as samples/bpf/offwaketime*. With the new BPF features on Linux, if there's a will, there's a way.

    [1] http://www.brendangregg.com/blog/2015-02-26/linux-perf-off-cpu-flame-graph.html

    [2] https://github.com/iovisor/bcc/blob/master/tools/offcputime_example.txt