Tags: performance, optimization, perf, performance-counter

perf report: Understanding the first line of output


I was looking for an answer but found nothing definitive. How do I interpret the first line of perf report's output? It goes like this:

Samples: 173M of event 'cache-misses', Event count (approx.): 461731712088

Practically every tutorial I've seen goes over everything but this first line. This page explains the differences between samples and counts, but I want to be 100% sure I'm not drawing the wrong conclusions here: in my example, what do 173M and 461731712088 mean? From what I've read, I'm guessing the second one is the total number of cache misses that occurred during the run, and the first is the number of cache misses that were actually recorded and used to produce the displayed statistics. Is that right, or am I misinterpreting the output?


Solution

  • You're correct. For hardware events (as opposed to software events like page faults and context switches), perf record works by programming a hardware counter in the PMU to record a sample every n occurrences of the event. (Where "recording a sample" means writing to the PEBS buffer1 or just raising an interrupt on the spot.)

    n is chosen to give a sample frequency that doesn't cause so many interrupts that it hugely distorts performance, but still collects a reasonable number of samples for that event over a few seconds to minutes of running whatever you're profiling. That might mean adjusting n on the fly, or having different defaults for different events. (Like instructions, which typically occurs more than once per clock, vs. very rare events like machine_clears.count.)
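    The on-the-fly adjustment can be sketched like this (a hypothetical, much-simplified model of frequency mode; the function name and numbers are mine, not perf's actual algorithm):

```python
# Hypothetical sketch: pick a sample period n so that the observed event
# rate divided by n lands near a target sample frequency. Real perf adapts
# the period in the kernel; this just shows the arithmetic involved.
def adjust_period(events_last_tick: int, tick_hz: int, target_samples_per_sec: int) -> int:
    """Choose n such that events_per_sec / n is roughly the target frequency."""
    events_per_sec = events_last_tick * tick_hz
    # n must be at least 1; n == 0 would mean interrupting on every event.
    return max(1, events_per_sec // target_samples_per_sec)

# A hot event like 'instructions' (billions/sec) ends up with a large n,
# so interrupts stay cheap...
print(adjust_period(events_last_tick=4_000_000, tick_hz=1000, target_samples_per_sec=4000))
# ...while a rare event keeps n small so we still get enough samples.
print(adjust_period(events_last_tick=50, tick_hz=1000, target_samples_per_sec=4000))
```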

    The actual PMU hardware gets programmed with n, and counts down toward 0 (or maybe up towards n and compares).
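    The countdown behaviour itself is easy to model (a toy simulation, not how any real PMU is exposed to software):

```python
# Toy model of a down-counting PMU counter: preload with n, decrement on
# each event, and fire an "interrupt" (record one sample) on reaching 0,
# then re-arm with n.
def count_samples(total_events: int, n: int) -> int:
    counter, samples = n, 0
    for _ in range(total_events):
        counter -= 1
        if counter == 0:
            samples += 1   # overflow interrupt -> one recorded sample
            counter = n    # re-arm with the period
    return samples

print(count_samples(total_events=10_000, n=997))  # -> 10 samples
```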

    perf stat works by setting a huge n, as large as the hardware supports, so it only has to interrupt for counter rollover as infrequently as possible, and by collecting the final counter value at the end. (Software can read/write the exact count at any time; that's how the kernel virtualizes the counters across context switches.) This may gloss over some details, but AFAIK it's accurate in explaining why perf stat has essentially no overhead and can give very precise and repeatable counts.

    But perf record, with only a file of samples and the value(s) of n used to collect them, can only extrapolate the total count as samples * n, hence the "approx."
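    Plugging in the two numbers from the quoted report line, you can back out the average n that must have been in effect (a rough sanity check only, since "173M" is itself rounded):

```python
# Both inputs come straight from the quoted perf report line; the implied
# period is my own back-calculation, not something perf prints.
samples = 173_000_000            # "Samples: 173M" (rounded by perf)
event_count = 461_731_712_088    # "Event count (approx.)"

# Average events per recorded sample, i.e. the effective n.
implied_period = event_count / samples
print(round(implied_period))     # -> 2669
```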


    Footnote 1: The PEBS buffer is apparently usually very small, like 1 sample, so it doesn't save much (or anything) on interrupts, but it does attribute each sample precisely to an instruction rather than to one nearby: see the skid and PEBS sections in https://www.brendangregg.com/perf.html . Great for events like mem_load_retired.l3_miss. Events like cycles still have to pick one instruction to "blame", and that's usually the one waiting for a slow input, e.g. the instruction trying to use the result of a cache-miss load, not the load itself.

    BTW, the :u / :k filters to count only in user-space or kernel-space are something the hardware supports, so the kernel doesn't have to reprogram the counters on every interrupt. (And in fact it doesn't, so if you don't use :u or --all-user, your profile will include interrupt handlers that ran while your task was current on a core.)