assemblyoptimizationmicro-optimization68000branchless

68000 Assembly – Is branchless code faster for counting signed compare conditions?


Context

I have a fixed dataset of 10 signed words in A and B, plus two signed thresholds a and b. In one pass I need to

  1. store C[i] = A[i] + B[i]
  2. accumulate CONT1 += (A[i] > b)
  3. accumulate CONT2 += (B[i] < a)

The code below works and already uses DBF for exactly ten iterations and passes everything in/out on the stack. My question is purely about micro-optimising the inner loop: is there a 68k idiom that lets me avoid one of the two Bcc instructions and still keep the counts correct?

Minimal working program (Easy68k)

          ORG    $8000
A        DC.W  2,2,1,0,4,4,2,5,-1,2
B        DC.W  2,3,1,2,-7,4,-3,-2,0,2
C        DS.W  10
aVal     DC.W  -3
bVal     DC.W   5
CONT1    DS.W  1
CONT2    DS.W  1

START:
    MOVE.W  bVal,-(A7)      ; push b (value)
    MOVE.W  aVal,-(A7)      ; push a
    PEA.L   C               ; push &C
    PEA.L   B               ; push &B
    PEA.L   A               ; push &A
    BSR.S   Process10
    ADDA.L  #12,A7          ; pop params
    SIMHALT

; -------- Process10(&A,&B,&C,a,b) --------
; 4(A7)=&A  8(A7)=&B  12(A7)=&C  16=a  18=b
; OUT: CONT1 (A>D b), CONT2 (B<a)
Process10:
    MOVE.W 16(A7),D2        ; a
    MOVE.W 18(A7),D3        ; b
    MOVEA.L 4(A7),A0        ; A
    MOVEA.L 8(A7),A1        ; B
    MOVEA.L 12(A7),A2       ; C
    MOVEQ   #9,D7           ; DBF count (10 words)
    CLR.W   D4              ; CONT1
    CLR.W   D5              ; CONT2
Loop10:
    MOVE.W  (A0)+,D0        ; A[i]
    CMP.W   D3,D0           ; A > b ?
    BLE.S   SkipA
    ADDQ.W  #1,D4
SkipA:
    MOVE.W  (A1)+,D1        ; B[i]
    CMP.W   D2,D1           ; B < a ?
    BGE.S   SkipB
    ADDQ.W  #1,D5
SkipB:
    ADD.W   D1,D0
    MOVE.W  D0,(A2)+
    DBF     D7,Loop10
    MOVE.W  D4,CONT1
    MOVE.W  D5,CONT2
    RTS
          END  START

Question

Inside Loop10 I use two conditional branches (BLE.S and BGE.S) to update CONT1 and CONT2. On 68k is there a way to:

(If the answer is simply “your pair of branches is already the best trade-off”, that’s fine—I just want to be sure I’m not missing a classic 68k idiom.)


Edit — MOVEQ #0 instead of CLR.L, plus minor tidy-ups

Following @Peter Cordes’ timing notes:

          ORG    $8000
; ---------- data ----------
VectorA   DC.W  2,2,1,0,4,4,2,5,-1,2
VectorB   DC.W  2,3,1,2,-7,4,-3,-2,0,2
C         DS.W  10
aVal      DC.W  -3
bVal      DC.W   5
CONT1     DS.W   1
CONT2     DS.W   1

; ---------- main ----------
START:
    MOVE.W  bVal,-(A7)
    MOVE.W  aVal,-(A7)
    PEA.L   C
    PEA.L   VectorB
    PEA.L   VectorA
    BSR.S   Process10
    ADDA.L  #12,A7
    MOVE.W  D4,CONT1
    MOVE.W  D5,CONT2
    SIMHALT

; ------------------------------------------------------------
; Process10(&A,&B,&C,a,b)  — branch-free counters
; ------------------------------------------------------------
Process10:
    MOVE.W  16(A7),D2          ; a
    MOVE.W  18(A7),D3          ; b
    MOVEA.L  4(A7),A0          ; A
    MOVEA.L  8(A7),A1          ; B
    MOVEA.L 12(A7),A2          ; C
    MOVEQ   #9,D7              ; 10 iterations
    MOVEQ   #0,D4              ; CONT1
    MOVEQ   #0,D5              ; CONT2
Loop10:
    MOVE.W  (A0)+,D0
    CMP.W   D3,D0              ; A > b ?
    SGT     D6                 ; $00 / $FF
    SUB.B   D6,D4              ; CONT1 += 1/0

    MOVE.W  (A1)+,D1
    CMP.W   D2,D1              ; B < a ?
    SLT     D6                 ; $00 / $FF
    SUB.B   D6,D5              ; CONT2 += 1/0

    ADD.W   D1,D0              ; C[i] = A + B
    MOVE.W  D0,(A2)+

    DBF     D7,Loop10
    RTS
          END  START

Now both counters initialise in the fastest possible way (MOVEQ #0) and the inner loop still contains only one branch (DBF).


Solution

  • ... passes everything in/out on the stack.

    The information about the number of array elements is still hardcoded in the subroutine! And since normally the caller has no knowledge about how the subroutine is going to operate, you would need to pass the actual number of elements (so not a decremented-by-one value as would be fit for using dbf).

    ADDA.L  #12,A7          ; pop params
    SIMHALT
    

    This is wrong of course! You have pushed 2 words (is 4 bytes) and 3 longwords (is 12 bytes), so the correct instruction here would be ADDA.L #16,A7. Because of the immediately following SIMHALT, there's no problem in this little demo but in a regular program, who will tell?

    Now both counters initialize in the fastest possible way (MOVEQ #0)

    Both CONT1 and CONT2 are word-sized variables. Clearing the low words of the corresponding D4 and D5 registers will execute just as fast with the CLR.W D4 and CLR.W D5 instructions (and saves you 6 keypresses!).

    START:
        MOVE.W  #10,-(A7)
        MOVE.W  bVal,-(A7)
        MOVE.W  aVal,-(A7)
        PEA.L   C
        PEA.L   VectorB
        PEA.L   VectorA
        BSR.S   Process10    ; -> D4 D5 (A0 A1 A2 D0 D1 D2 D3 D6 D7)
        ADDA.L  #18,A7
        MOVE.W  D4,CONT1
        MOVE.W  D5,CONT2
        SIMHALT
    
    ; ------------------------------------------------------------
    ; Process10(&A,&B,&C,a,b,n)  — branch-free counters
    ; ------------------------------------------------------------
    Process10:
        MOVE.W  20(A7),D7          ; n
        SUBQ.W  #1,D7              ; b/c DBRA
        MOVE.W  18(A7),D3          ; b
        MOVE.W  16(A7),D2          ; a
        MOVEA.L 12(A7),A2          ; C
        MOVEA.L 8(A7),A1           ; B
        MOVEA.L 4(A7),A0           ; A
        CLR.W   D4                 ; CONT1
        CLR.W   D5                 ; CONT2
    Loop10:
        MOVE.W  (A0)+,D0
        CMP.W   D3,D0              ; A > b ?
        SGT     D6                 ; $00 / $FF
        SUB.B   D6,D4              ; CONT1 += 1/0
    
        MOVE.W  (A1)+,D1
        CMP.W   D2,D1              ; B < a ?
        SLT     D6                 ; $00 / $FF
        SUB.B   D6,D5              ; CONT2 += 1/0
    
        ADD.W   D1,D0              ; C[i] = A + B
        MOVE.W  D0,(A2)+
    
        DBRA    D7,Loop10
        RTS
    

    The question about whether branchless code is faster, is easiest answered by looking at the timings:

                            cc = FALSE    cc = TRUE
                            ==========    =========
        BLE.S   SkipA          8             10
        ADDQ.W  #1,D4          4
                              --             --
                              12             10
    
        SGT     D6             4              6
        SUB.B   D6,D4          4              4
                              --             --
                               8             10
    

    So, taking the branch is not worse than the branchless solution. Only when the branch would not have to be taken and consequently that extra ADDQ instruction would run, will the branchless solution be a bit faster.

    Only don't go overboard. There's also the surrounding code to consider, as well as the probability of the condition code being false. The total gain might not be all that spectacular...

    And don't forget that changing towards very large arrays is dead easy in the branchy version, not to mention that it requires one data register less.