cpu-architecture parallelism-amdahl

Processor Speedup Calculation Difference


I am trying to understand why Amdahl's law does not seem to apply to this situation.

Let's say we have two configs.

Config 1 has an L1 access latency of 1 cycle (hitDelay and missDelay are both 1 cycle). Config 2 has an L1 access latency of 7 cycles.

Assuming loads and stores account for 30% of the processor time in Config 2, we expect Config 1 to give a speedup of 1/(0.7 + 0.3/7) = 1.35.
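For reference, the same calculation written out as a small script. The function name `amdahl_speedup` is just for illustration; the 0.30 fraction and the factor of 7 are the assumptions stated above.

```python
def amdahl_speedup(fraction, factor):
    """Amdahl's law: overall speedup when `fraction` of the run time
    is made `factor` times faster and the rest is unchanged."""
    return 1.0 / ((1.0 - fraction) + fraction / factor)

# 30% of the time in loads/stores, L1 latency going from 7 cycles to 1 cycle
speedup = amdahl_speedup(0.30, 7)
print(f"{speedup:.2f}")  # prints 1.35
```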

However, when I run the two configurations on an actual simulator that models a ROB-based processor and calculate the speedup from cycle counts, the speedup is only 1.12.

Why is the speedup different from what was calculated using Amdahl's law? Is there some reason Amdahl's law does not apply here?


Solution

  • Amdahl's "law" assumes times for separate parts don't overlap.

    The whole point of a ROB for out-of-order execution is to find instruction-level parallelism and hide latency, so much of the L1 latency overlaps with other work instead of adding to the total. That's why the performance of a whole sequence of instructions is not the sum of a single "cost" number for each instruction separately, except on the simplest CPUs.

    See e.g. the question "What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?" (Modern x86 makes it even more complex, since each instruction can have a different front-end cost in uops, but even on a simpler RISC machine you still have back-end port pressure vs. possible latency bottlenecks.)
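One way to see how much latency is being hidden is to invert the Amdahl formula: given the measured speedup of 1.12 and the 7x latency ratio from the question, solve for the fraction of time that would have to be exposed to L1 latency under the non-overlapping model. This is only a rough sketch. The function name `exposed_fraction` is made up for illustration, and the result is an "effective" fraction under Amdahl's assumptions, not something the simulator reports.

```python
def exposed_fraction(observed_speedup, latency_ratio):
    """Invert Amdahl's law: find f such that
    1 / ((1 - f) + f / latency_ratio) == observed_speedup."""
    return (1.0 - 1.0 / observed_speedup) / (1.0 - 1.0 / latency_ratio)

# Measured speedup 1.12, L1 latency going from 7 cycles to 1 cycle
f = exposed_fraction(1.12, 7)
print(f"{f:.3f}")  # prints 0.125
```

So the measured result is what Amdahl's formula would predict if only about 12.5% of the time, rather than 30%, were actually waiting on L1 latency. That is consistent with the out-of-order core overlapping a good part of the load/store latency with independent work.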