Why branch (miss)prediction doesn't impact performance (C++)?

While trying to measure the impact of branch miss prediction, I've noticed that there is no penalty at all to branch miss prediction.

Based on the famous stack overflow question : Why is processing a sorted array faster than processing an unsorted array?

I wrote a simple piece of code to measure the penalty of branch prediction.

Fill an array with random numbers
Count the numbers above 5 (should be many miss predictions) - measure it
Sort the array
Count the numbers above 5 (should be few miss predictions) - measure it

After running the code I got pretty much the same results for both measurements.

Tested on:

Visual studio 2017, release (Maximum Optimization (Favor Speed) (/O2)), windows.
Linux, g++ -Ofast

Then I took the original code from the question I've linked above, and still didn't got any improvement for the sorted array. Why is that ? where is the benefit of branch prediction ?

#include <iostream>
#include <vector>
#include <algorithm>
#include <random>
#include <chrono>

int main()
{
    // Step 1: Allocate a vector of size 1 million
    std::vector<int> vec(100'000'000);

    // Step 2: Fill it with random numbers between 0 and 10
    std::random_device rd;
    std::mt19937 gen(rd());
    std::uniform_int_distribution<> dis(0, 10);

    for (auto& val : vec)
    {
        val = dis(gen);
    }

    // Step 3: Count numbers above 5 (and measure time)
    auto start = std::chrono::high_resolution_clock::now();
    int count_above_5 = 0;
    for (size_t i = 0; i < vec.size(); i++)
    {
        if (vec[i] < 5)
        {
            ++count_above_5;
        }
    }

    auto end = std::chrono::high_resolution_clock::now();

    auto duration_before_sort = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();

    std::cout << "Count of numbers above 5 (before sorting): " << count_above_5 << std::endl;
    std::cout << "Time taken (before sorting): " << duration_before_sort << " ns" << std::endl;

    // Step 4: Sort the array
    std::sort(vec.begin(), vec.end());

    // Step 5: Count numbers above 5 in the sorted array (and measure time)

    start = std::chrono::high_resolution_clock::now();
    count_above_5 = 0;
    for (size_t i = 0; i < vec.size(); i++)
    {
        if (vec[i] < 5)
        {
            ++count_above_5;
        }
    }
    end = std::chrono::high_resolution_clock::now();


    auto duration_after_sort = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();

    std::cout << "Count of numbers above 5 (after sorting): " << count_above_5 << std::endl;
    std::cout << "Time taken (after sorting):  " << duration_after_sort << " ns" << std::endl;

    return 0;
}

Solution

TL;DR: GCC, MSVC and Clang all generate a branch-less assembly code so there is no actual branch hence no impact of branch (miss)prediction.

On Linux, with GCC 14.2.0, on my machine, there is no impact due to branch prediction because there is actually no branches. Indeed, GCC generates a branch-less assembly code with SIMD instructions:

180:   
    movdqu  xmm2,XMMWORD PTR [rax]   ; Load a block of items from the array `vec`
    movdqa  xmm1,XMMWORD PTR [rsp]   ; Reload xmm1 from memory
    add     rax,0x10                 ; Move the `rax` pointer to the next block
    pcmpgtd xmm1,xmm2                ; Compare items of the block with 5 and move the mask in `xmm1`
    psubd   xmm0,xmm1                ; Increment the number of item found in `xmm0`
    cmp     rax,rbp                  ; Loop until we reach the last block
    jne     180

On Godbolt, we can see that both MSVC and Clang also generate a branch-less code. Here is the code produced by MSVC (do not use SIMD instructions but cmovge instead which should be less efficient):

$LL7@main:
    lea     eax, DWORD PTR [rdi+1]
    cmp     DWORD PTR [rsi+rcx*4], 5
    cmovge  eax, edi
    mov     edi, eax
    inc     rcx
    cmp     rcx, r14
    jb      SHORT $LL7@main

Here the code produce by Clang (uses SIMD instruction and unroll the loop 4 times):

.LBB0_10:
    movdqu  xmm0, xmmword ptr [rbx + 4*r12 - 48]
    movdqu  xmm1, xmmword ptr [rbx + 4*r12 - 32]
    movdqu  xmm2, xmmword ptr [rbx + 4*r12 - 16]
    movdqu  xmm3, xmmword ptr [rbx + 4*r12]
    movdqa  xmm4, xmm5
    pcmpgtd xmm4, xmm0
    psubd   xmm6, xmm4
    movdqa  xmm0, xmm5
    pcmpgtd xmm0, xmm1
    psubd   xmm7, xmm0
    movdqa  xmm0, xmm5
    pcmpgtd xmm0, xmm2
    psubd   xmm6, xmm0
    movdqa  xmm0, xmm5
    pcmpgtd xmm0, xmm3
    psubd   xmm7, xmm0
    add     r12, 16
    cmp     r12, 100012
    jne     .LBB0_10