hardwarecpu-architectureprocessors

Do any microprocessors today use Scoreboarding or Tomasulo's algorithm?


I've researched a bit and i found out about Intel pentium pro, AMD K7, IBM power PC but these are pretty old. I'm not able to find any info about current day processors that use these mechanisms for dynamic scheduling


Solution

  • Every modern OoO exec CPU uses Tomasulo's algorithm for register renaming. The basic idea of renaming onto more physical registers in a kind of SSA dependency analysis hasn't changed.

    Modern Intel CPUs like Skylake have evolved some since Pentium Pro (e.g. renaming onto a physical register file instead of holding data right in the ROB), but PPro and the P6 family is a direct ancestor of the Sandybridge-family. See https://www.realworldtech.com/sandy-bridge/ for some discussion of the first member of that new family. (And if you're curious about CPU internals, a much more in-depth look at it.) See also https://agner.org/optimize/ but Agner's microarch guide focuses more on how to optimize for it, e.g. that register renaming isn't a bottleneck on modern CPUs: rename width matches issue width, and the same register can be renamed 4 times in an issue group of 4 instructions.

    Advancements in managing the RAT include Nehalem introducing fast-recovery for branch misses: snapshot the RAT on branches so you can restore to there when you detect a branch miss, instead of draining earlier un-executed uops before starting recovery.

    Also mov-elimination and xor-zeroing elimination: they're handled at register-rename time instead of needing a back-end uop to write the register. (For xor-zeroing, presumably there's a physical zero register and zeroing idioms point the architectural register at that physical zero. What is the best way to set a register to zero in x86 assembly: xor, mov or and? and Can x86's MOV really be "free"? Why can't I reproduce this at all?)


    If you're going to do OoO exec at all, you might as well go all-in, so AFAIK nothing modern does just scoreboarding instead of register renaming. (Except for in-order cores that scoreboard loads, so cache-miss latency doesn't stall until a later instruction actually reads the load's target register.)

    There are still in-order execution cores that don't do either, leaving instruction scheduling / software-pipelining up to compilers / humans. aka statically scheduled. This is not rare; widely used budget smartphone chips use cores like ARM Cortex-A53. Most programs bottleneck on memory, and you can allow some memory-level parallelism in an in-order core, especially with a store buffer.

    Sometimes energy per computation is more important than performance.