I'm trying to understand in detail what happens to instructions in the various stages of the Skylake CPU pipeline when a branch is mis-predicted, and how quickly instructions from the correct branch destination can start executing.
Let's label the two code paths as red (the one predicted, but not actually taken) and green (the one actually taken, but not predicted). My questions are:
1. How far through the pipeline does the branch have to get before red instructions start being discarded (and at what stage(s) of the pipeline are they discarded)?
2. How soon (in terms of the pipeline stage reached by the branch) can green instructions start executing?
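For concreteness, here's a minimal illustration of what I mean by red and green (placeholder function names, just to fix terminology):

```c
#include <stdio.h>

static void green_path(void) { puts("green: actually taken, but not predicted"); }
static void red_path(void)   { puts("red: predicted, but not actually taken"); }

/* The predictor guesses one side of the branch and the front-end fetches it
 * speculatively (red); executing the compare later shows that the other side
 * (green) is the one that should architecturally run. */
void example(long x) {
    if (x < 0)
        green_path();   /* suppose this side is actually taken         */
    else
        red_path();     /* ...but this is the side the predictor chose */
}

int main(void) { example(-1); return 0; }
```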
I've looked at Agner Fog's documents and numerous sets of lecture notes, but found no clarity on these points.
The branch execution units (on ports 0 and 6) are what actually check the FLAGS or indirect-branch target address for conditional or indirect branches. I think recovery begins as soon as an execution unit discovers the mispredict, without waiting for the branch to reach retirement. (Some of this is my best guess / understanding, not necessarily backed up by Intel's optimization manual.)
Branch prediction + speculative execution decouples data dependencies from control dependencies, but the branch uop itself does have a data dependency on EFLAGS or an indirect address input.
The branch unit on p0 can only run predicted-not-taken JCC uops (or macro-fused JCC uops), but those are common. The branch unit on p6 is the "main" one which handles taken branches.
For direct branches (jmp rel8/rel32 / call rel32), prediction can be checked on decode and re-steer the fetch stages, maybe stalling the front-end but I think never needing to trigger any kind of recovery in the back end. Uops from the wrong path would never be issued for direct unconditional branches. There are perf counters for pipeline re-steers.
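As an aside, here's my own sketch (not from Intel's docs) of the contrast: a direct call's target is encoded in the instruction bytes, so decode can verify or correct the fetch prediction, while an indirect call's target has to come from the BTB and is only verified when the branch uop executes. The perf event name in the comment is the one I believe Linux perf uses for front-end re-steers on Skylake.

```c
/* Sketch: direct vs. indirect branches (assumed workload, not a benchmark).
 * Front-end re-steers can be counted with something like
 *     perf stat -e baclears.any ./a.out
 * (event name taken from Linux perf's Skylake event list, as I understand it).
 * Compile with low optimization, or mark the helpers noinline, so the calls
 * stay actual calls. */
#include <stdio.h>

static long add1(long x) { return x + 1; }
static long sub1(long x) { return x - 1; }

int main(void) {
    long (*fns[2])(long) = { add1, sub1 };   /* indirect-call targets */
    long acc = 0;

    for (int i = 0; i < 1000000; i++) {
        acc = add1(acc);          /* direct call (call rel32): target is in the
                                     instruction bytes, checkable at decode     */
        acc = fns[acc & 1](acc);  /* indirect call: target predicted via the BTB,
                                     verified only when the call uop executes    */
    }
    printf("%ld\n", acc);
    return 0;
}
```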
Branch mispredicts have fast recovery with a branch-order buffer (BOB), unlike the usual rollback to retirement state on exceptions (When an interrupt occurs, what happens to instructions in the pipeline?). For more about how the pipeline treats everything as speculative until retirement, see Out-of-order execution vs. speculative execution.
According to David Kanter's Sandy Bridge microarchitecture writeup:
Nehalem enhanced the recovery from branch mispredictions, which has been carried over into Sandy Bridge. Once a branch misprediction is discovered, the core is able to restart decoding as soon as the correct path is known, at the same time that the out-of-order machine is clearing out uops from the wrongly speculated path. Previously, the decoding would not resume until the pipeline was fully flushed.
(Skylake's BOB has 48 entries, so up to 48 not-yet-executed branches can be in flight. I think BOB entries should be freed when a branch uop executes, not needing to stay allocated until retirement.)
This is the "fast recovery" enabled by a branch-order buffer that snapshots reg-renaming state on conditional and indirect branch instructions, which are expected to mispredict even in normal programs. But exceptions and memory-ordering machine clears are more expensive. They do happen (especially page faults), but they're rarer and harder to optimize for.
The key point of fast recovery is that uops from before the mispredicted branch which are already in the ROB + RS (scheduler) can keep executing while later uops are discarded and the front-end is re-steered to the correct address. So if the inputs to a JCC uop are ready early enough, most of the branch-miss penalty can be hidden if there's a long dependency chain the CPU can be chewing on while recovering: e.g. the mispredict on exit from a loop with a decent-length loop-carried dep chain, or any bottleneck other than total uop throughput or a port-6 bottleneck. See Avoid stalling pipeline by calculating conditional early.
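A rough sketch of that case (my own example, not taken from the linked Q&A): the loop-exit compare depends only on the counter, so the JCC's inputs are ready long before the loop-carried FP divide chain catches up, and the inevitable mispredict on the final iteration can overlap with the divides still in flight.

```c
#include <stdio.h>
#include <stddef.h>

/* The loop-exit branch depends only on i (a short dependency chain), while
 * the loop-carried chain through acc (a divide every iteration) is long.
 * The exit branch mispredicts once, on the last iteration, but its inputs
 * were ready long before the divides finished, so fast recovery can
 * re-steer the front-end while older divide uops from before the branch
 * keep executing in the ROB / scheduler. */
static double divide_chain(const double *a, size_t n) {
    double acc = 1.0;
    for (size_t i = 0; i < n; i++)      /* cmp/jcc on i: inputs ready early */
        acc = acc / a[i] + 1.0;         /* long-latency loop-carried chain  */
    return acc;
}

int main(void) {
    double a[1024];
    for (size_t i = 0; i < 1024; i++)
        a[i] = 1.0 + (double)i;
    printf("%f\n", divide_chain(a, 1024));
    return 0;
}
```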
Without fast recovery, I think all uops in the ROB would be discarded (i.e. all not-retired uops). There might be some middle ground here, like keeping already-executed uops from before the branch that were in the ROB but had left the scheduler. I don't know what Merom/Conroe did exactly.
Related: Characterizing the Branch Misprediction Penalty is an interesting paper about how branch misses and long cache misses interact with the ROB. It's based on a simplified pipeline model, but it looks to me like its findings probably apply to Skylake.