I am preparing for a test and have such example. Following code:
1: SLL $1, $1, 2
2: LW $2, 1000($1)
3: BEQL $2, $0, END
4: ADDI $3, $2, 1
5: MULT $3, $2
6: MFLO $4
END:
7: J QUIT
...
QUIT:
100: NOP
is executed on RISC processor (with quasi MIPS instruction set) with
My task is to understand how the Branch Delay Slot works in this situation and build the correct Pipeline Diagram.
I have an official solution and it gives the following diagram with no explanation:
1: SLL $1, $1, 2 IDEMW
2: LW $2, 1000($1) I---DEMW
3: BEQL $2, $0, END I---DEMW
4: ADDI $3, $2, 1 IDx
5: MULT $3, $2 IDEMW
6: MFLO $4 I---DEMW
As far as I understand, ADDI is executed in Branch Delay Slot and is stopped after processor understands, that branch is not taken, what leads us to wrong result. My Questions here are
The CPU keeps reading instructions sequentially, i.e. during execution (was already fetched, decoded and the remaining phases are now being processed, I don't know your exact phases, so this is just general description) of beql
it will get the other part of pipe free to fetch next instruction, but the branch was not finished yet, so the PC
is still pointing to the next instruction after branch -> that's the "branch delay slot".
On classic MIPS this next instruction is fetched, decoded, and executed, and meanwhile the branch may or may not modify the PC to the branch target, so the branch-delay slot instruction will get executed every time. The next instruction after it gets executed only when branching didn't happen, i.e. the PC
continues after the "branch delay slot" position sequentially. In case the branch did modify the PC
, the fetch+decode will take notice and decode the next instruction from new destination, so on classic MIPS the branch delay slot is only 1 instruction "big" (I have no idea if more complex MIPS CPUs can have more stages and more delay slots available, technically with 5 stage pipeline even 5 instructions delayed sounds HW possible, but it would be probably very difficult to use practically and sounds like it would create more problems than help).
The BEQL
is more complex instruction, killing the delay slot instruction halfway into execution if the branching condition fails.
See http://math-atlas.sourceforge.net/devel/assembly/mips-iv.pdf page 45 for detailed description of BEQL
.
So that "NullifyCurrentInstruction()" is probably that "x" in the diagram. Remaining things in diagram, I'm just guessing as I didn't study your 5 stage details, but second LW
after fetching and decoding(?) finds out it depends on $1
, so it waits in depend-stage for previous instruction W
phase. Etc...The ADDI
doesn't depend on anything, so it is executed almost parallel to BEQL
, and gets killed toward the end.
But I don't understand why there's no "I" phase every time the "I" stage gets freed, looks like the "I" waits for something and in the end you have like at most 2 instructions going on at the same time.
Anyway, this is quite undecipherable without studying the technical details of the CPU used in your question, and I don't want to study it, I'm not even sure what kind of CPU you have, and where to get it's technical documentation.
edit: I will try to extract relevant part of pdf also here, to make this answer not "just link", but copying pdf may be tricky...
BEQL
instruction docs of MIPS IV CPU:
Description: if (
rs
=rt
) then branch_likely
An 18-bit signed offset (the 16-bit offset field shifted left 2 bits) is added to the address of the instruction following the branch (not the branch itself), in the branch delay slot, to form a PC-relative effective target address.
If the contents of GPRrs
and GPRrt
are equal, branch to the target address after the instruction in the delay slot is executed. If the branch is not taken, the instruction in the delay slot is not executed.Operation:
I:
tgt_offset ← sign_extend(offset || 02)
condition ← (GPR[rs] = GPR[rt])
I+1:
if condition then
PC ← PC + tgt_offset
else
NullifyCurrentInstruction()
endif