As noted in the Intel optimization manual:
The default predicted target for indirect branches and calls is the fall-through path. Fall-through prediction is overridden if and when a hardware prediction is available for that branch.
For indirect JMPs, the CPU simply speculatively executes the instructions that follow on the straight-line path. That default prediction mostly hurts performance (it can cause resource conflicts and slow down branch recovery) before a hardware prediction is available, and it introduces the Straight-Line Speculation (SLS) vulnerability.
Placing data immediately following an indirect branch can cause a performance problem. If the data consists of all zeros, it looks like a long stream of ADDs to memory destinations and this can cause resource conflicts and slow down branch recovery. Also, data immediately following indirect branches may appear as branches to the branch prediction hardware, which can branch off to execute other data pages. This can lead to subsequent self-modifying code problems.
The same applies to indirect calls, except that the speculatively decoded fall-through instructions can actually be reused, since they will execute once the call returns.
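To make the layout issue concrete, here is a hypothetical sketch (GAS/AT&T syntax, not from the manual) of what the fall-through speculation sees after an indirect jump, and the software-side mitigation being alluded to; compilers expose options along these lines (e.g. GCC/Clang's -mharden-sls=) to insert a trapping instruction there:

```asm
# Hypothetical sketch (AT&T syntax). Until the branch predictor has a target
# for this jmp, the front end keeps fetching and decoding the straight-line
# bytes that follow it, as if the jump "fell through".
    jmp     *%rax       # indirect jump; the real target is in %rax
    int3                # SLS-style hardening: the speculative fall-through
                        # path hits a trapping instruction instead of decoding
                        # whatever data or unrelated code happens to follow
```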
So why don't CPU designers just stop speculation in such a case, the way INT3/UD2 do?
It does sometimes help in code with jump tables optimized following Intel's recommendation to make one of the targets (preferably the most common one) the fall-through, which is often possible for switch statements (sketched below). (Not so much for indirect calls.)
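As a sketch of that recommendation (hypothetical function and labels, GAS/AT&T syntax, x86-64 SysV calling convention, non-PIE addressing for brevity):

```asm
    .text
    .globl  dispatch
dispatch:                       # int dispatch(unsigned selector) -- hypothetical
    cmpl    $4, %edi            # 4 cases; selector >= 4 takes the default
    jae     .Ldefault
    movl    %edi, %edi          # zero-extend the 32-bit selector into %rdi
    jmp     *.Ltable(,%rdi,8)   # indirect jump through the table
.Lcase_hot:                     # most common case, placed as the fall-through:
    movl    $1, %eax            # the default prediction is right for it even
    ret                         # before the BTB has ever seen this branch
.Lcase_a:
    movl    $2, %eax
    ret
.Lcase_b:
    movl    $3, %eax
    ret
.Lcase_c:
    movl    $4, %eax
    ret
.Ldefault:
    xorl    %eax, %eax
    ret

    .section .rodata            # keep the table out of the instruction stream,
.Ltable:                        # not right after the jmp (see the quoted advice)
    .quad   .Lcase_hot, .Lcase_a, .Lcase_b, .Lcase_c
```

Real compilers usually emit position-independent tables of relative offsets rather than absolute pointers, but the placement idea is the same.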
I think that behaviour was designed into P6 long before hyperthreading existed, so stealing execution resources from the other logical core wasn't a concern for a path of execution that's unlikely to be the correct one.
Oldest-ready-first uop scheduling makes this speculative path of execution unlikely to steal cycles from code before the jump. I think an in-flight div or load can get cancelled without waiting for it to complete, so that shouldn't be a big factor; do you have any data to support your concern about resource conflicts with earlier work from the known-good path of execution? I guess if a mis-speculated load used up an LFB (line-fill buffer) waiting for a useless cache line, that could delay progress on a useful load whose address wasn't ready until just after that. And it can of course pollute caches and TLBs.
Spectre was only conceived around 2017; before that, CPU architects weren't considering any kind of security threat from speculative execution that didn't affect the architectural state. If any Intel architects had had any conception of that kind of vulnerability back in the early 90s when P6 was being designed, Meltdown wouldn't have been a thing, nor would most of the MDS vulnerabilities.
If the CPU did stop fetching, something would need to restart it. I guess executing the indirect jmp / call that produces the correct address could trigger that, but it might need a special mechanism? (Or maybe not; by the time I finished writing this, I think probably not.)
ud2 / int3 trap if they reach retirement, which is a whole complicated thing that always involves restarting fetch from a new location, with the ROB (reorder buffer) and scheduler already empty, since those instructions always stop fetch. That's unlike an indirect call or jump, which in your proposed design would still speculate if a branch-target prediction was available.
So I suspect there's a benefit in simplicity of the CPU internals for the current design, with fewer special cases in different parts of the CPU. That might not be a big deal in terms of number of transistors needed these days, but it might have been significant in first-gen P6.
The branch-recovery mechanism is obviously highly optimized to keep branch-miss latency as low as possible. IDK if there's any obstacle to hooking into that mechanism for something that stopped fetch/decode and needs to restart it. Probably not; a mis-speculated int3 or ud2 could have stopped fetch, and executing the branch needs to restart fetch.
So an indirect jmp or call already needs to be capable of restarting fetch, so probably it's not a big deal.