While trying to measure the impact of branch miss prediction, I've noticed that there is no penalty at all to branch miss prediction.
Based on the famous stack overflow question : Why is processing a sorted array faster than processing an unsorted array?
I wrote a simple piece of code to measure the penalty of branch prediction.
After running the code I got pretty much the same results for both measurements.
Tested on:
Then I took the original code from the question I've linked above, and still didn't got any improvement for the sorted array. Why is that ? where is the benefit of branch prediction ?
#include <iostream>
#include <vector>
#include <algorithm>
#include <random>
#include <chrono>
int main()
{
// Step 1: Allocate a vector of size 1 million
std::vector<int> vec(100'000'000);
// Step 2: Fill it with random numbers between 0 and 10
std::random_device rd;
std::mt19937 gen(rd());
std::uniform_int_distribution<> dis(0, 10);
for (auto& val : vec)
{
val = dis(gen);
}
// Step 3: Count numbers above 5 (and measure time)
auto start = std::chrono::high_resolution_clock::now();
int count_above_5 = 0;
for (size_t i = 0; i < vec.size(); i++)
{
if (vec[i] < 5)
{
++count_above_5;
}
}
auto end = std::chrono::high_resolution_clock::now();
auto duration_before_sort = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
std::cout << "Count of numbers above 5 (before sorting): " << count_above_5 << std::endl;
std::cout << "Time taken (before sorting): " << duration_before_sort << " ns" << std::endl;
// Step 4: Sort the array
std::sort(vec.begin(), vec.end());
// Step 5: Count numbers above 5 in the sorted array (and measure time)
start = std::chrono::high_resolution_clock::now();
count_above_5 = 0;
for (size_t i = 0; i < vec.size(); i++)
{
if (vec[i] < 5)
{
++count_above_5;
}
}
end = std::chrono::high_resolution_clock::now();
auto duration_after_sort = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
std::cout << "Count of numbers above 5 (after sorting): " << count_above_5 << std::endl;
std::cout << "Time taken (after sorting): " << duration_after_sort << " ns" << std::endl;
return 0;
}
TL;DR: GCC, MSVC and Clang all generate a branch-less assembly code so there is no actual branch hence no impact of branch (miss)prediction.
On Linux, with GCC 14.2.0, on my machine, there is no impact due to branch prediction because there is actually no branches. Indeed, GCC generates a branch-less assembly code with SIMD instructions:
180:
movdqu xmm2,XMMWORD PTR [rax] ; Load a block of items from the array `vec`
movdqa xmm1,XMMWORD PTR [rsp] ; Reload xmm1 from memory
add rax,0x10 ; Move the `rax` pointer to the next block
pcmpgtd xmm1,xmm2 ; Compare items of the block with 5 and move the mask in `xmm1`
psubd xmm0,xmm1 ; Increment the number of item found in `xmm0`
cmp rax,rbp ; Loop until we reach the last block
jne 180
On Godbolt, we can see that both MSVC and Clang also generate a branch-less code. Here is the code produced by MSVC (do not use SIMD instructions but cmovge
instead which should be less efficient):
$LL7@main:
lea eax, DWORD PTR [rdi+1]
cmp DWORD PTR [rsi+rcx*4], 5
cmovge eax, edi
mov edi, eax
inc rcx
cmp rcx, r14
jb SHORT $LL7@main
Here the code produce by Clang (uses SIMD instruction and unroll the loop 4 times):
.LBB0_10:
movdqu xmm0, xmmword ptr [rbx + 4*r12 - 48]
movdqu xmm1, xmmword ptr [rbx + 4*r12 - 32]
movdqu xmm2, xmmword ptr [rbx + 4*r12 - 16]
movdqu xmm3, xmmword ptr [rbx + 4*r12]
movdqa xmm4, xmm5
pcmpgtd xmm4, xmm0
psubd xmm6, xmm4
movdqa xmm0, xmm5
pcmpgtd xmm0, xmm1
psubd xmm7, xmm0
movdqa xmm0, xmm5
pcmpgtd xmm0, xmm2
psubd xmm6, xmm0
movdqa xmm0, xmm5
pcmpgtd xmm0, xmm3
psubd xmm7, xmm0
add r12, 16
cmp r12, 100012
jne .LBB0_10