c++ optimization x86 profiling perf

cpu_core vs cpu_atom in perf


I'm constructing an example that shows the effect of branch mispredictions. When using perf stat, I get the following results: [screenshot of perf stat output]

Here, I can see some metrics counted twice, once for cpu_atom, and once for cpu_core. What is the difference between these two?

I've read that cpu_core corresponds to ISA instructions, while cpu_atom corresponds to microarchitecture internals (in my case, x86 micro-ops). This is somewhat confusing, since I would then expect the numbers for cpu_atom to be bigger than those for cpu_core.

It's also a bit confusing how the cpu_core and cpu_atom metrics change relative to each other across runs: [screenshot of perf stat output] This is a very different ratio from the previous run.

There are also times when the cpu_atom metrics are not counted at all: [screenshot of perf stat output]

And there is this run: [screenshot of perf stat output] I assume the 191.02% is a bug. It is 110,238,856 / 57,711,854, i.e. cpu_core/branch-misses divided by cpu_atom/branches. If this is not a bug, I wonder why perf would divide a cpu_core metric by a cpu_atom metric.
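To make that arithmetic easy to check, here is a minimal sketch that recomputes the percentage from the two counts quoted above. The helper name percent_of is made up for illustration; it is not anything perf itself exposes.

```cpp
#include <cassert>

// Hypothetical helper: express numerator as a percentage of denominator.
double percent_of(unsigned long long numerator, unsigned long long denominator) {
    assert(denominator != 0);  // a zero count would make the ratio meaningless
    return 100.0 * static_cast<double>(numerator) / static_cast<double>(denominator);
}
```

Calling percent_of(110238856ULL, 57711854ULL) gives roughly 191.02, matching the figure in the screenshot. A misses/branches ratio taken from a single PMU should not exceed 100%, which is why this number looks like two different counters being mixed.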

Just for reference, here is the code of the executable I ran:

#include <benchmark/benchmark.h>

#include <algorithm>
#include <cstdlib>  // srand, rand
#include <vector>

void test(benchmark::State& s) {
    const auto N = s.range(0);
    std::vector<unsigned long> v1(N), v2(N), c(N);
    
    srand(1);
    std::generate(v1.begin(), v1.end(), [] { return rand(); });
    std::generate(v2.begin(), v2.end(), [] { return rand(); });
#ifdef HIT
    std::generate(c.begin(), c.end(), [] { return rand() >= 0; });
#else
    std::generate(c.begin(), c.end(), [] { return rand() & 1; });
#endif

    for (auto _ : s) {
        unsigned long result = 0;
        for (int i = 0; i < N; i++) {
            if (c[i]) {
                result += v1[i];
            } else {
                result *= v2[i];
            }
        }
        benchmark::DoNotOptimize(result);
        benchmark::ClobberMemory();
    }
}

BENCHMARK(test)->Arg(1 << 22);
BENCHMARK_MAIN();

Compiled as follows:

g++ branch_prediction.cpp -o miss -g3 -O3 -mavx2 -lbenchmark

Solution

  • cpu_atom counts come from the E-cores (efficiency cores); cpu_core counts come from the P-cores (performance cores). (https://superuser.com/questions/1677692/what-are-performance-and-efficiency-cores-in-intels-12th-generation-alder-lake/1677779#1677779)

    If you want only one or the other, use taskset -c 1 ./a.out to limit it to running on core #1, for example. Note that cpu_migrations is 11 in your first image, so it didn't run on the same core the whole time, and that includes moving between E and P cores.

    I've read that cpu_core corresponds to ISA instructions, while cpu_atom corresponds to microarchitecture internals (in my case, x86 micro-ops).

    No, completely wrong. The counters for micro-ops include uops_issued.any (front-end fused-domain issue/rename), uops_executed.thread (back-end execution ports, unfused domain), and uops_retired.retire_slots (back-end retirement, matches uops_issued.any if there was no mis-speculation).
    These events exist on my Skylake, presumably also on P-cores (cpu_core).
    Probably also on E-cores (cpu_atom) even though that's a very different microarchitecture (Gracemont).
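If you'd rather pin from inside the program instead of using taskset, something like the following sketch should work on Linux with glibc. The helper names pin_to_cpu and first_allowed_cpu are made up for illustration, and which CPU numbers are P-cores versus E-cores is machine-specific, so check your own topology before picking one.

```cpp
#include <cassert>
#include <sched.h>  // sched_setaffinity, sched_getaffinity, cpu_set_t, CPU_* macros

// Hypothetical helper: restrict the calling thread to a single CPU,
// much like `taskset -c <cpu>` does for a whole process.
bool pin_to_cpu(int cpu) {
    assert(cpu >= 0 && cpu < CPU_SETSIZE);
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set) == 0;  // pid 0 = calling thread
}

// Hypothetical helper: lowest-numbered CPU the calling thread may run on, or -1 on error.
int first_allowed_cpu() {
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0) return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (CPU_ISSET(cpu, &set)) return cpu;
    }
    return -1;
}
```

You would call pin_to_cpu(n) once at startup, before the benchmark loop runs. Threads created afterwards inherit the mask. With the process confined to one core, perf stat should only report non-zero counts for the PMU that core belongs to.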