c++gcc optimization branch-prediction likely-unlikely

Can I improve branch prediction with my code?

This is a naive general question open to any platform, language, or compiler. Though I am most curious about Aarch64, C++, GCC.

When coding an unavoidable branch in program flow dependent on I/O state (compiler cannot predict), and I know that one state is much more likely than another, how do I indicate that to the compiler?

Is this better

if(true == get(gpioVal))
    unlikelyFunction();
else
    likelyFunction();

than this?

if(true == get(gpioVal))
    likelyFunction(); // performance critical, fill prefetch caches from this branch
else
    unlikelyFunction(); // missed prediction not consequential on this branch

Does it help if the communication protocol makes the more likely or critical value true(high), or false(low)?

Solution

TL:DR: Yes, in C or C++ it's possible (but usually not recommended) to use a likely() macro, or C++20 [[likely]] to help the compiler make better asm. That's separate from influencing actual CPU branch-prediction, though. If writing in asm, lay out your code to minimize taken branches for the common case fast-path.

For most ISAs, there's no way in asm to hint the CPU whether a branch is likely to be taken or not. (Some exceptions include Pentium 4 (but not earlier or later x86), PowerPC, and some MIPS, which allow branch hints as part of conditional-branch asm instructions.)

Is it possible to tell the branch predictor how likely it is to follow the branch?
Update: x86 branch hints are back in 2024 with Meteor Lake P cores (Redwood Cove), see the Chips & Cheese article. The Granite Rapids Xeon version of the same core will also add software-prefetch for code. (Previously the best option was SW data prefetch into L2.)

But not-taken straight-line code is cheaper than taken, so hinting high-level language to lay out code with the fast-path contiguous doesn't help branch prediction accuracy, but can help (or hurt) performance. (I-cache locality, front-end bandwidth: remember code-fetch happens in contiguous 16 or 32-byte blocks, so a taken branch means a later part of that fetch block isn't useful. Also, branch prediction throughput; some CPUs like Intel Skylake for example can't handle a predicted-taken branch at more than 1 per 2 clocks, other than loop branches. That include unconditional branches like jmp or ret.)

Taken branches are hard; not-taken branches keep the CPU on its toes, but if the prediction is accurate it's just a normal instruction for an execution unit (verifying the prediction), with nothing special for the front-end. See also Modern Microprocessors A 90-Minute Guide! which has a section on branch prediction. (And is overall excellent.)

Many people misunderstand source-level branch hints as branch prediction hints. That could be one effect if compiling for a CPU that supports branch hints in asm, but for most the significant effect is in layout, and deciding whether to use branchless (cmov) or not; a [[likely]] condition also means it should predict well.

With some CPUs, especially older, layout of a branch did sometimes influence runtime prediction: if the CPU didn't remember anything about the branch in its dynamic predictors, the standard static prediction heuristic is that forward conditional branches are not-taken, backward conditional are assumed taken (because that's normally the bottom of a loop. See the BTFNT section in https://danluu.com/branch-prediction/.

A compiler can lay out an if(c) x else y; either way, either matching the source with jump over x if !c as the opening thing, or swap the if and else blocks and use the opposite branch condition. Or it can put one block out-of-line (e.g. after the ret at the end of the function) so the fast path has no taken branches conditional or otherwise, while the less likely path has to jump there and then jump back.

It's easy to do more harm than good with branch hints in high-level source, especially if surrounding code changes without paying attention to them, so profile-guided optimization is the best way for compilers to learn about branch predictability and likelihood. (e.g. gcc -O3 -fprofile-generate / run with some representative inputs that exercise code-paths in relevant ways / gcc -O3 -fprofile-use)

But there are ways to hint in some languages, like C++20 [[likely]] and [[unlikely]], which are the portable version of GNU C likely() / unlikely() macros around __builtin_expect.

https://en.cppreference.com/w/cpp/language/attributes/likely C++20 [[likely]]
How to use C++20's likely/unlikely attribute in if-else statement syntax help
Is there a compiler hint for GCC to force branch prediction to always go a certain way? (to the literal question, no. To what's actually wanted, branch hints to the compiler, yes.)
How do the likely/unlikely macros in the Linux kernel work and what is their benefit? The GNU C macros using __builtin_expect, same effect but different syntax than C++20 [[likely]]
What is the advantage of GCC's __builtin_expect in if else statements? example asm output. (Also see CiroSantilli's answers to some of the other questions where he made examples.)
Simple example where [[likely]] and [[unlikely]] affect program assembly?

I don't know of ways to annotate branches for languages other than GNU C / C++, and ISO C++20.

Absent any hints or profile data

Without that, optimizing compilers have to use heuristics to guess which side of a branch is more likely. If it's a loop branch, they normally assume that the loop will run multiple times. On an if, they have some heuristics based on the actual condition and maybe what's in the blocks being controlled; IDK I haven't looked into what gcc or clang do.

I have noticed that GCC does care about the condition, though. It's not as naive as assuming that int values are uniformly randomly distributed, although I think it normally assumes that if (x == 10) foo(); is somewhat unlikely.

JIT compilers like in a JVM have an advantage here: they can potentially instrument branches in the early stages of running, to collect branch-direction information before making final optimized asm. OTOH they need to compile fast because compile time is part of total run time, so they don't try as hard to make good asm, which is a major disadvantage in terms of code quality.