I'm constructing an example that shows the effect of branch mispredictions. When using perf stat, I get the following results:
Here, I can see some metrics counted twice, once for cpu_atom and once for cpu_core. What is the difference between these two?
I've read that cpu_core corresponds to ISA instructions, while cpu_atom corresponds to microarchitecture internals (in my case, x86 micro-ops). This is somewhat confusing, since I would expect the numbers for cpu_atom to be bigger than the numbers for cpu_core.
It's also a bit confusing how the cpu_core and cpu_atom metrics differ relative to each other across multiple runs:
This is a much different fraction than the previous run.
There are also times where cpu_atom metrics are not counted:
And there is this run...
I assume the 191.02% is a bug. This is 110,238,856 / 57,711,854, which is cpu_core/branch-misses divided by cpu_atom/branches. If this is not a bug, I wonder why perf divides a metric from cpu_core by a metric from cpu_atom.
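To make the arithmetic explicit, here is a minimal sketch that reproduces the percentage from the two counts quoted above. The function name miss_rate_percent is made up for illustration; this is only the same division, not a claim about how perf computes the figure internally.

```cpp
#include <cstdint>

// Hypothetical helper: branch-miss rate as a percentage.
// A meaningful rate needs both counts to come from the same PMU.
double miss_rate_percent(std::uint64_t branch_misses, std::uint64_t branches) {
    return 100.0 * static_cast<double>(branch_misses) / static_cast<double>(branches);
}

// Counts taken from the run described above:
//   cpu_core branch-misses = 110,238,856
//   cpu_atom branches      =  57,711,854
// miss_rate_percent(110238856, 57711854) lands at roughly 191.02, i.e. above
// 100%, which cannot be a real miss rate because the numerator and the
// denominator were counted on different core types.
```

A rate above 100% is only possible here because the misses and the branches were counted on different PMUs.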
Just for reference, here is the code of the executable I ran:
#include <benchmark/benchmark.h>
#include <algorithm>
#include <cstdlib>  // srand, rand
#include <vector>

void test(benchmark::State& s) {
    const auto N = s.range(0);
    std::vector<unsigned long> v1(N), v2(N), c(N);
    srand(1);
    std::generate(v1.begin(), v1.end(), [] { return rand(); });
    std::generate(v2.begin(), v2.end(), [] { return rand(); });
#ifdef HIT
    // Always true: the branch below is perfectly predictable.
    std::generate(c.begin(), c.end(), [] { return rand() >= 0; });
#else
    // Random 0/1: the branch below is unpredictable.
    std::generate(c.begin(), c.end(), [] { return rand() & 1; });
#endif
    for (auto _ : s) {
        unsigned long result = 0;
        for (int i = 0; i < N; i++) {
            if (c[i]) {
                result += v1[i];
            } else {
                result *= v2[i];
            }
        }
        benchmark::DoNotOptimize(result);
        benchmark::ClobberMemory();
    }
}

BENCHMARK(test)->Arg(1 << 22);
BENCHMARK_MAIN();
Compiled as follows:
g++ branch_prediction.cpp -o miss -g3 -O3 -mavx2 -lbenchmark
cpu_atom is from the E-cores. cpu_core is from the P-cores. (https://superuser.com/questions/1677692/what-are-performance-and-efficiency-cores-in-intels-12th-generation-alder-lake/1677779#1677779)
If you want only one or the other, use taskset -c 1 ./a.out to limit it to running on core #1, for example. Note that cpu_migrations is 11 in your first image, so it didn't run on the same core the whole time, including moving between E and P cores.
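If you would rather pin from inside the program than via taskset, a rough sketch using the Linux sched_setaffinity call could look like the following. The helper names first_allowed_cpu and pin_to_cpu are made up for illustration, and which CPU indices are P-cores or E-cores is machine-specific, so the index you pass is an assumption you would need to check on your own system.

```cpp
#include <sched.h>  // Linux/glibc: sched_getaffinity, sched_setaffinity, CPU_* macros
                    // (g++ defines _GNU_SOURCE by default, which these macros need)

// Hypothetical helper: lowest-numbered CPU this thread may currently run on,
// or -1 if the affinity mask cannot be read.
int first_allowed_cpu() {
    cpu_set_t allowed;
    CPU_ZERO(&allowed);
    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0) return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (CPU_ISSET(cpu, &allowed)) return cpu;
    }
    return -1;
}

// Hypothetical helper: restrict the calling thread to a single CPU,
// similar in effect to launching the program with `taskset -c <cpu>`.
// Returns true on success.
bool pin_to_cpu(int cpu) {
    if (cpu < 0 || cpu >= CPU_SETSIZE) return false;
    cpu_set_t one;
    CPU_ZERO(&one);
    CPU_SET(cpu, &one);
    return sched_setaffinity(0, sizeof(one), &one) == 0;  // pid 0 = calling thread
}
```

Calling pin_to_cpu(1) before the benchmark loop starts would keep that thread on CPU 1, so only one of cpu_core or cpu_atom should accumulate counts for it. Note that taskset applies to the whole process from the start, while this only affects the calling thread from the point of the call onward.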
I've read that cpu_core corresponds to ISA instructions, while cpu_atom corresponds to microarchitecture internals (in my case, x86 micro-ops).
No, completely wrong. The counters for micro-ops include uops_issued.any (front-end fused-domain issue/rename), uops_executed.thread (back-end execution ports, unfused domain), and uops_retired.retire_slots (back-end retirement, matches uops_issued.any if there was no mis-speculation).
These events exist on my Skylake, presumably also on P-cores (cpu_core). Probably also on E-cores (cpu_atom), even though that's a very different microarchitecture (Gracemont).