I was writing some MIPS code for college to see how functions within functions work, and everything worked fine at first. I'm using the WinMIPS64 simulator.
Then, after I turned BTB on, everything broke (it got stuck in an infinite loop in the second function).
I was going crazy until I realized it was because of the BTB (there was a branch in one of the functions, and I wanted to reduce some of the Branch Taken Stalls that appeared as a result). When I switched it off, everything worked fine again.
I include some of the code below.
.data
tabla: .byte 1,4,5
res: .space 3
cont: .word 3
num: .word 0
.text
daddi $a0, $0, tabla # offset element table
daddi $a1, $0, res # offset results table
lb $a2, cont($0) # $a2 = 3 (array size)
daddi $sp, $0, 0x400 # $sp = 0x400
jal dobles # $ra = 0x14
sd $v1, num($0) # offset element count
halt
dobles: #first function
daddi $sp, $sp, -8 # make space in stack $sp = 0x3f8
sd $ra, 0($sp) # 0x3f8 = $ra (0x14)
loop:
lb $s0, 0($a0) # saving element from table in $s0
daddi $a0, $a0, 1 # add 1 byte displacement to $a0
daddi $sp, $sp, -8 # $sp = 0x3f0
sd $s0, 0($sp) # 0x3f0 = tabla element
jal multi # $ra = 0x38
sb $v0, 0($a1) # saving result to res
daddi $a1, $a1, 1 # displacement + 1 byte
daddi $a2, $a2, -1 # counter -1
bnez $a2, loop # loop till counter is 0
ld $ra, 0($sp) # load $ra from stack
daddi $sp, $sp, 8
jr $ra
multi: # second function
ld $t0, 0($sp) # load element from stack
daddi $sp, $sp, 8
daddi $v1, $v1, 1 # count number of elements
dadd $v0, $t0, $t0 # element * 2
jr $ra
Why does this happen? Does the call to a function have some sort of effect on the buffer (I thought it was just for branches)? Is it possible to have calls to functions within functions and not have problems if I have the BTB on? What do I need to change if I want to use BTB and function calls within functions?
This was not covered in our program, so I am asking here.
BTB is a branch-prediction structure. It has zero effect on correctness, only performance. It's not architecturally visible.
My guess was the same as Jester's: you actually (also?) enabled the architectural branch delay slot: the instruction after a bXX or j/jXX instruction runs whether or not the branch is taken, hiding branch latency on early MIPS (short in-order pipeline, not superscalar).
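To check which mode the simulator is really in, here is a minimal sketch (not taken from the question; the label done and the constants are made up for illustration). It relies only on the architectural meaning of a delay slot: the instruction right after the jump executes before control reaches the target.

```asm
        .text
        daddi $t0, $0, 1       # $t0 = 1
        j     done             # unconditional jump to done
        daddi $t0, $t0, 10     # delay slot: executes only if delay slots are enabled
        daddi $t0, $t0, 100    # never executed: the jump skips it in both modes
done:   halt
```

With delay slots disabled, $t0 should still be 1 at the halt; with them enabled, it should be 11. Toggling the BTB alone should not change the final value in either case, only the stall counts.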
But actually I don't see anything in your code that would break with or without a branch-delay slot. Jester tested and found that jal multi sets $ra to multi on the second execution; that's an emulator bug. No correct execution of your code can set $ra that way, with or without branch-delay slots.
According to the WinMIPS64 page:
A delay slot can be implemented if desired. With V1.30 a simple branch-target-buffer can also be simulated. A << in the code window beside a jump or branch instruction indicates that it is predicted as being taken.
Perhaps the GUI ties together the BTB and delay-slot options?
As always, single-step your code in the debugger to see how it executes.
If you're sure that WinMIPS64 is simulating with BTB but without branch-delay slots, then possibly you've found a bug in WinMIPS64 itself. Since the BTB (and branch prediction in general) is not architecturally visible (see footnote 1), your code must run the same with or without it.
(Unless you did something that the MIPS ISA allows to cause "unpredictable behaviour", like putting two branches back to back, or modifying the inputs of a mult instruction within a couple instructions after executing it. Or for classic MIPS I, using the result of a load too early, in the load delay slot.)
Footnote 1: outside of Spectre: using a side channel to make microarchitectural state architecturally visible.