Consider the following sequence of MIPS instructions
lw $4, 0($1)
addi $5, $4, 10
add $2, $2, $3
lw $1, 20($2)
add $1, $1, $5
(a) How many clock cycles will it take to fully execute these instructions on a pipelined CPU if no forwarding is used?
(b) How many clock cycles will it take to fully execute these instructions on a pipelined CPU if full forwarding is used?
(c) Show how we can reduce the number of cycles even further by using a delayed load.
This is not a homework question
The right answer should be
a)15 cycles
b)11 cycles
c)9 cycles
I don't understand how to get these answers, when I write out the chart with clock cycles and stage in the pipeline, I get 12, 5, and 10.
Without forwarding, a second instruction with a RAW hazard has to repeat in the ID stage until the first instruction is in the WB stage.
It is generally accepted (but doesn't necessarily have to be true) that the ID stage can read the WB stage's results in the same cycle, reasons are written up elsewhere here at SO.
cycle | IF | ID | EX | MEM | WB |
---|---|---|---|---|---|
1 | lw | ||||
2 | addi | lw | |||
3 | add | addi | lw | ||
4 | add* | addi* | lw | ||
5 | add* | addi* | lw | ||
6 | lw | add | addi | ||
7 | add | lw | add | addi | |
8 | add* | lw* | add | addi | |
9 | add* | lw* | add | ||
10 | add | lw | |||
11 | add* | lw | |||
12 | add* | lw | |||
13 | add | ||||
14 | add | ||||
15 | add |
The *'s indicate necessary repeats, and the blanks indicate where no useful work can be done. Both are effectively stalls, but for slightly different reasons.
The addi
instruction cannot leave the ID stage until the preceding lw
is in WB, that's when it will get the proper values via ID stage.
This show 15 cycles for that sequence, without forwarding.
For the sequence with full forwarding, I count 7 cycles. While ALU RAW hazards can be entirely mitigated with EX->EX forwarding, load-use RAW hazards cannot and these cost one cycle stall while still requiring MEM->EX forwarding, because the data has to travel 2 stages back in the pipeline (it is not the travelling itself but the timing of when finally available). This is because the target data of interest is produced in MEM stage, but needed by the EX stage. So, the use instruction of load-use has to wait or repeat in EX until the MEM stage for the load has fully completed, then EX can finally run for that use instruction in the next cycle.
By reordering the instruction sequence, we can cover one of the load-use stalls. In this short code sequence, there are not enough instructions to cover both load-use stalls, so covering one is the best we can do:
lw $4, 0($1)
add $2, $2, $3
addi $5, $4, 10
lw $1, 20($2)
add $1, $1, $5
Now it is 6 cycles, with full forwarding.