Why does Branch Target Buffer affect return from function calls? [WinMIPS64]

I was writing some MIPS code for college to see how functions within functions work, and everything worked fine at first. I'm using the WinMIPS64 simulator.

Then, after I turned the BTB on, everything broke (it got stuck in an infinite loop in the second function).

I was going crazy until I realized it was because of the BTB (there was a b in one of the functions and I wanted to reduce some of the Branch Taken Stalls that appeared as a result). When I switched it off, everything worked fine again.

I include some of the code below.

.data
tabla:  .byte 1,4,5
res:    .space 3 
cont:   .word 3
num:    .word 0

.text

       daddi $a0, $0, tabla  # offset element table
       daddi $a1, $0, res    # offset results table
       lb $a2, cont($0)      # $a2 = 3 (array size)
       daddi $sp, $0, 0x400  # $sp = 0x400
       jal dobles            # $ra = 0x14
       sd $v1, num($0)       # store element count in num
       halt

dobles:                      #first function 

       daddi $sp, $sp, -8    # make space in stack $sp = 0x3f8
       sd $ra, 0($sp)        # 0x3f8 = $ra (0x14) 

loop:
       lb $s0, 0($a0)        # saving element from table in $s0
       daddi $a0, $a0, 1     # add 1 byte displacement to $a0  

       daddi $sp, $sp, -8    # $sp = 0x3f0
       sd $s0, 0($sp)        # 0x3f0 = tabla element 

       jal multi             # $ra = 0x38 
       sb $v0, 0($a1)        # saving result to res
       daddi $a1, $a1, 1     # displacement + 1 byte 

       daddi $a2, $a2, -1    # counter -1 
       bnez $a2, loop        # loop till counter is 0 

       ld $ra, 0($sp)        # load $ra from stack
       daddi $sp, $sp, 8     
       jr $ra 

multi:                       # second function
       ld $t0, 0($sp)        # load element from stack
       daddi $sp, $sp, 8     
       daddi $v1, $v1, 1     # count number of elements
       dadd $v0, $t0, $t0    # element * 2
       jr $ra

Why does this happen? Does the call to a function have some sort of effect on the buffer (I thought it was just for branches)? Is it possible to have calls to functions within functions and not have problems if I have the BTB on? What do I need to change if I want to use BTB and function calls within functions?

This was not covered in our program, so I am asking here.

Solution

BTB is a branch-prediction structure. It has zero effect on correctness, only performance. It's not architecturally visible.

My guess was the same as Jester's: you actually (also?) enabled the architectural branch delay slot, where the instruction after a bXX or j/jXX instruction runs whether or not the branch is taken. That design hid branch latency on early MIPS (short in-order pipeline, not superscalar).
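To make the delay-slot idea concrete, here is a minimal sketch built around the jal multi call from the question. The nop is an addition for illustration only and is not in the original code; it assumes WinMIPS64 accepts nop as other MIPS assemblers do. In a delay-slot configuration the instruction placed directly after jal runs before the callee, so a filler there keeps the sb after the return either way.

```
       jal multi             # call multi
       nop                   # filler: runs in the delay slot if one is enabled, otherwise right after the return
       sb $v0, 0($a1)        # executes only after multi has produced $v0
```

This only shows what the delay-slot option changes about instruction ordering. It does not account for $ra being set to the wrong value.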

But actually I don't see anything in your code that would break with or without a branch-delay slot. Jester tested and found that jal multi sets $ra to multi on the second execution; that's an emulator bug. No correct execution of your code can set $ra that way, with or without branch-delay slots.

According to the WinMIPS64 page:

A delay slot can be implemented if desired. With V1.30 a simple branch-target-buffer can also be simulated. A << in the code window beside a jump or branch instruction indicates that it is predicted as being taken.

Perhaps the GUI ties together the BTB and delay-slot options?

As always, single-step your code in the debugger to see how it executes.

If you're sure that WinMIPS64 is simulating with the BTB but without branch-delay slots, then possibly you've found a bug in WinMIPS64 itself. Since the BTB (and branch prediction in general) is not architecturally visible¹, your code must run the same with or without it.

(Unless you did something that the MIPS ISA allows to cause "unpredictable behaviour", like putting two branches back to back, or modifying the inputs of a mult instruction within a couple instructions after executing it. Or for classic MIPS I, using the result of a load too early, in the load delay slot.)
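As an illustration of the first case, the following hypothetical fragment (not taken from the question) places a jump directly after a branch. On a MIPS with branch delay slots, the second instruction sits in the first one's delay slot, which the ISA leaves unpredictable.

```
       bnez $a2, loop        # branch with a delay slot
       jr $ra                # a jump inside that delay slot: behaviour is unpredictable
```

The code in the question does not do this: every branch or jump there is followed by a load, a store, or halt.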

Footnote 1: outside of Spectre: using a side channel to make microarchitectural state architecturally visible.