c++optimizationbranch-prediction

Why branch (miss)prediction doesn't impact performance (C++)?


While trying to measure the impact of branch miss prediction, I've noticed that there is no penalty at all to branch miss prediction.

Based on the famous stack overflow question : Why is processing a sorted array faster than processing an unsorted array?

I wrote a simple piece of code to measure the penalty of branch prediction.

After running the code I got pretty much the same results for both measurements.

Tested on:

  1. Visual studio 2017, release (Maximum Optimization (Favor Speed) (/O2)), windows.
  2. Linux, g++ -Ofast

Then I took the original code from the question I've linked above, and still didn't got any improvement for the sorted array. Why is that ? where is the benefit of branch prediction ?

#include <iostream>
#include <vector>
#include <algorithm>
#include <random>
#include <chrono>

int main()
{
    // Step 1: Allocate a vector of size 1 million
    std::vector<int> vec(100'000'000);

    // Step 2: Fill it with random numbers between 0 and 10
    std::random_device rd;
    std::mt19937 gen(rd());
    std::uniform_int_distribution<> dis(0, 10);

    for (auto& val : vec)
    {
        val = dis(gen);
    }

    // Step 3: Count numbers above 5 (and measure time)
    auto start = std::chrono::high_resolution_clock::now();
    int count_above_5 = 0;
    for (size_t i = 0; i < vec.size(); i++)
    {
        if (vec[i] < 5)
        {
            ++count_above_5;
        }
    }

    auto end = std::chrono::high_resolution_clock::now();

    auto duration_before_sort = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();

    std::cout << "Count of numbers above 5 (before sorting): " << count_above_5 << std::endl;
    std::cout << "Time taken (before sorting): " << duration_before_sort << " ns" << std::endl;

    // Step 4: Sort the array
    std::sort(vec.begin(), vec.end());

    // Step 5: Count numbers above 5 in the sorted array (and measure time)

    start = std::chrono::high_resolution_clock::now();
    count_above_5 = 0;
    for (size_t i = 0; i < vec.size(); i++)
    {
        if (vec[i] < 5)
        {
            ++count_above_5;
        }
    }
    end = std::chrono::high_resolution_clock::now();


    auto duration_after_sort = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();

    std::cout << "Count of numbers above 5 (after sorting): " << count_above_5 << std::endl;
    std::cout << "Time taken (after sorting):  " << duration_after_sort << " ns" << std::endl;

    return 0;
}

Solution

  • TL;DR: GCC, MSVC and Clang all generate a branch-less assembly code so there is no actual branch hence no impact of branch (miss)prediction.


    On Linux, with GCC 14.2.0, on my machine, there is no impact due to branch prediction because there is actually no branches. Indeed, GCC generates a branch-less assembly code with SIMD instructions:

    180:   
        movdqu  xmm2,XMMWORD PTR [rax]   ; Load a block of items from the array `vec`
        movdqa  xmm1,XMMWORD PTR [rsp]   ; Reload xmm1 from memory
        add     rax,0x10                 ; Move the `rax` pointer to the next block
        pcmpgtd xmm1,xmm2                ; Compare items of the block with 5 and move the mask in `xmm1`
        psubd   xmm0,xmm1                ; Increment the number of item found in `xmm0`
        cmp     rax,rbp                  ; Loop until we reach the last block
        jne     180            
    

    On Godbolt, we can see that both MSVC and Clang also generate a branch-less code. Here is the code produced by MSVC (do not use SIMD instructions but cmovge instead which should be less efficient):

    $LL7@main:
        lea     eax, DWORD PTR [rdi+1]
        cmp     DWORD PTR [rsi+rcx*4], 5
        cmovge  eax, edi
        mov     edi, eax
        inc     rcx
        cmp     rcx, r14
        jb      SHORT $LL7@main
    

    Here the code produce by Clang (uses SIMD instruction and unroll the loop 4 times):

    .LBB0_10:
        movdqu  xmm0, xmmword ptr [rbx + 4*r12 - 48]
        movdqu  xmm1, xmmword ptr [rbx + 4*r12 - 32]
        movdqu  xmm2, xmmword ptr [rbx + 4*r12 - 16]
        movdqu  xmm3, xmmword ptr [rbx + 4*r12]
        movdqa  xmm4, xmm5
        pcmpgtd xmm4, xmm0
        psubd   xmm6, xmm4
        movdqa  xmm0, xmm5
        pcmpgtd xmm0, xmm1
        psubd   xmm7, xmm0
        movdqa  xmm0, xmm5
        pcmpgtd xmm0, xmm2
        psubd   xmm6, xmm0
        movdqa  xmm0, xmm5
        pcmpgtd xmm0, xmm3
        psubd   xmm7, xmm0
        add     r12, 16
        cmp     r12, 100012
        jne     .LBB0_10