Tags: assembly, mips, cpu-architecture, micro-optimization, loop-unrolling

How to unroll a loop of a dot product in MIPS after re-ordering instructions?


I got this question about loop unrolling in MIPS, but I could not figure out how to proceed once I reached the step I show below; I am not even sure about these steps. I am new to computer architecture. I was given this code snippet in assembly:

Loop: ld $f0, 0($s1) 
      ld $f4, 0($s2) 
      multd $f0, $f0, $f4 
      addd $f2, $f0, $f2 
      subi $s1, $s1, 8 
      subi $s2, $s2, 8 
      bneqz $s1, Loop

The following additional information is given:

    Instruction producing result | Instruction using result | Latency in clock cycles
    -----------------------------|--------------------------|------------------------
    FP ALU op                    | Another FP ALU op        | 3
    FP ALU op                    | Store double             | 2
    Load double                  | FP ALU op                | 1
    Load double                  | Store double             | 0

So what I did was insert the stalls and then try to reorder the instructions:

Loop: ld $f0, 0($s1) 
      ld $f4, 0($s2)
      STALL
      multd $f0, $f0, $f4
      STALL
      STALL
      STALL 
      addd $f2, $f0, $f2 
      subi $s1, $s1, 8 
      subi $s2, $s2, 8 
      bneqz $s1, Loop

I moved the addd instruction after the bneqz instruction, then got stuck. Can anyone help explain?


Solution

  • Without unrolling or doing anything tricky, you can hide all but one stall cycle.

    # $f0 = 0.0
    # $t0 = end_pointer = $t1+size 
    Loop: ld    $f6, 0($t1)
          ld    $f8, 0($t2)
          addiu $t1, $t1, 8         # fills the load-delay slot

          multd $f6, $f6, $f8
          addiu $t2, $t2, 8         # multd in flight, cycle 1

          bneq  $t1, $t0, Loop      # multd in flight, cycle 2
                                    # STALL: 1 cycle waiting for $f6
          addd  $f0, $f0, $f6       # in the branch-delay slot
    

    There are only 2 instructions between multd and addd. With software pipelining, we can avoid that stall without unrolling. You might also call this a loop rotation, since the instructions that make up one source iteration (the loads, multd, and addd for one element) no longer line up with the top and bottom of the loop body: each trip through the loop does the loads for the *next* element's multd. Loop rotation is what enables the software-pipelining (putting the loads in the shadow of the multd to set up for the next iteration).

          ld    $f6, 0($t1)    # partially peeled first iteration
          ld    $f8, 0($t2)
    Loop:
          # this iteration's loads are already done
          multd  $f4, $f6, $f8   # into a different reg so loads won't overwrite it
          ld     $f6, 8($t1)
          ld     $f8, 8($t2)     # offset=8 for next iter before the pointer increment
          addiu  $t1, $t1, 8
          addiu  $t2, $t2, 8
    
          bneq   $t1, $t0, Loop
          addd   $f0, $f0, $f4
    # loop ends with pointers pointing one-past-end of the arrays
    
          multd  $f4, $f6, $f8   # epilogue: finish the peeled last element
          addd   $f0, $f0, $f4   # stalls on the multd, but only once per dot product
    

    Having the multiply write to a temp register instead of overwriting one of the registers we load into is essentially avoiding a WAW hazard to make this manual reordering possible. Having lots of architectural registers enables plenty of software-pipelining even for longer dependency chains. Speaking of which:

    On higher-performance CPUs, that loop-carried dependency would become a bottleneck and you'd need to unroll with multiple accumulators to hide it. e.g. Intel Haswell can run two FMAs (Fused Multiply-Add) per clock with 5-cycle latency, so you need at least 10 accumulators (registers to sum into) to keep both FMA units busy. Or fewer in a dot product: each FMA needs two loads and Haswell can only do 2 loads per clock, so load throughput limits you to one FMA per clock anyway. (It's a superscalar out-of-order design running up to 4 front-end micro-ops per clock, and x86 can fold an FMA + load into one uop for the pipeline, leaving room for the pointer increments. Very different from a 1-IPC MIPS pipeline, where even without unrolling you don't hit any inter-iteration bottlenecks.)
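
    (A C sketch of my own to illustrate, not part of the original code; the name dot4 and the choice of 4 accumulators are arbitrary.) The multiple-accumulator pattern looks like this. A compiler would only invent this reassociation itself under something like -ffast-math, but you can write it by hand:

    #include <stddef.h>

    /* Dot product with 4 independent accumulators (count illustrative, not
       tuned for any specific CPU).  Each sN is its own dependency chain, so
       FP add / FMA latency can overlap across the chains. */
    double dot4(const double *a, const double *b, size_t n)
    {
        double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        size_t i = 0;
        for (; i + 4 <= n; i += 4) {
            s0 += a[i + 0] * b[i + 0];
            s1 += a[i + 1] * b[i + 1];
            s2 += a[i + 2] * b[i + 2];
            s3 += a[i + 3] * b[i + 3];
        }
        for (; i < n; i++)                /* tail: up to 3 leftover elements */
            s0 += a[i] * b[i];
        return (s0 + s1) + (s2 + s3);     /* reduce the accumulators once */
    }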


    With unrolling, you don't need software-pipelining on this pipeline:

    Loop: ld $f6, 0($t1)
          ld $f8, 0($t2)
          ld $f10, 8($t1)
          ld $f12, 8($t2)
    
          multd $f6, $f6, $f8           # 2 insn gap after loading $f8
          multd $f10, $f10, $f12        # 1 insn gap after loading $f12
          addiu $t1, $t1, 16
          addiu $t2, $t2, 16
    
          addd $f0, $f0, $f6            # 3 insn gap after multd into $f6
          bneq $t1, $t0, Loop       # 2 insn gap after pointer increment
          addd $f2, $f2, $f10           # 4 insn gap after multd into $f10
    
    # after the loop: reduce the two accumulators to one
          addd  $f0, $f0, $f2       # runs once per dot product, not per element
                                    # so a stall hardly matters if you can't schedule it after other post-loop work.
    

    We could almost hide the FP latency without multiple accumulators or software pipelining (avoiding that extra addd outside the loop) if we schedule right up to the limits of the load and FP latencies. Since we're interleaving two mul/add chains, I indented the instructions belonging to each chain differently.

    Loop:  ld $f6, 0($t1)
           ld $f8, 0($t2)
             ld $f10, 8($t1)
           multd $f6, $f6, $f8         # 1 insn gap after loading $f8
             ld $f12, 8($t2)
         addiu $t1, $t1, 16
    
             multd $f10, $f10, $f12       # 1 insn gap after loading $f12
           addd $f0, $f0, $f6          # 3 insn gap after multd into $f6
         addiu $t2, $t2, 16
    
         bneq $t1, $t0, Loop
             addd $f0, $f0, $f10          # STALL: 2 INSN GAP after addd into $f0
    

    So no, it turns out we can't hide all the latency this way without multiple accumulators or software pipelining.

    If you had to match the numeric results of purely serial code (like compiling without -ffast-math, strict-FP), you could software-pipeline this to hide that one stall cycle, rotating the loop so the loads are filling gaps after mults and adds.
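
    For reference (my sketch, not from the original answer), "purely serial" means the one-accumulator C loop below; strict FP semantics pin down exactly this association order, which is why the compiler can't split the sum on its own:

    #include <stddef.h>

    /* One-accumulator reference version: every product is added into the same
       running sum, in order.  Strict FP rules require reproducing exactly
       this sequence of roundings. */
    double dot_serial(const double *a, const double *b, size_t n)
    {
        double sum = 0.0;
        for (size_t i = 0; i < n; i++)
            sum += a[i] * b[i];
        return sum;
    }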

    But multiple accumulators are typically better numerically. Assuming you're summing uniformly distributed non-negative values, a single accumulator keeps adding same-size small numbers to an ever-larger total, with worse-and-worse rounding error; splitting the sum keeps each partial total smaller. See Simd matmul program gives different numerical results - different from sum += a[i] * b[i] doesn't mean worse; usually it's numerically more accurate. As in Is this way of processing the tail of an array with SSE overkill?, where the elements of SIMD vectors act as multiple accumulators, we see less error than with a simple serial sum.
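
    (Again a toy demo of my own, not from the linked answers:) summing the same data with one accumulator vs. two shows the effect; float makes the rounding error visible much sooner than double would:

    #include <stdio.h>

    int main(void)
    {
        enum { N = 1 << 20 };                    /* even, so the pairs below work out */
        static float a[N];
        for (int i = 0; i < N; i++)
            a[i] = (float)(i % 1000) / 1000.0f;  /* uniform-ish values in [0,1) */

        float serial = 0.0f;                     /* one accumulator: total keeps growing */
        for (int i = 0; i < N; i++)
            serial += a[i];

        float s0 = 0.0f, s1 = 0.0f;              /* two accumulators, half-size totals */
        for (int i = 0; i < N; i += 2) {
            s0 += a[i];
            s1 += a[i + 1];
        }

        printf("serial = %.1f, two accumulators = %.1f\n", serial, s0 + s1);
        return 0;
    }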

    Using two accumulators costs only one extra addd outside the loop (beyond whatever cleanup the unrolling itself needs), vs. loop rotation peeling a whole iteration of the loop body (except the pointer increments) into parts before and after the loop.


    I made some other changes to your loop for sanity: