I have a fixed dataset of 10 signed words in A and B, plus two signed thresholds a and b. In one pass I need to
The code below works and already uses DBF
for exactly ten iterations and passes everything in/out on the stack.
My question is purely about micro-optimising the inner loop: is there a 68k idiom that lets me avoid one of the two Bcc
instructions and still keep the counts correct?
ORG $8000
A DC.W 2,2,1,0,4,4,2,5,-1,2
B DC.W 2,3,1,2,-7,4,-3,-2,0,2
C DS.W 10
aVal DC.W -3
bVal DC.W 5
CONT1 DS.W 1
CONT2 DS.W 1
START:
MOVE.W bVal,-(A7) ; push b (value)
MOVE.W aVal,-(A7) ; push a
PEA.L C ; push &C
PEA.L B ; push &B
PEA.L A ; push &A
BSR.S Process10
ADDA.L #12,A7 ; pop params
SIMHALT
; -------- Process10(&A,&B,&C,a,b) --------
; 4(A7)=&A 8(A7)=&B 12(A7)=&C 16=a 18=b
; OUT: CONT1 (A>D b), CONT2 (B<a)
Process10:
MOVE.W 16(A7),D2 ; a
MOVE.W 18(A7),D3 ; b
MOVEA.L 4(A7),A0 ; A
MOVEA.L 8(A7),A1 ; B
MOVEA.L 12(A7),A2 ; C
MOVEQ #9,D7 ; DBF count (10 words)
CLR.W D4 ; CONT1
CLR.W D5 ; CONT2
Loop10:
MOVE.W (A0)+,D0 ; A[i]
CMP.W D3,D0 ; A > b ?
BLE.S SkipA
ADDQ.W #1,D4
SkipA:
MOVE.W (A1)+,D1 ; B[i]
CMP.W D2,D1 ; B < a ?
BGE.S SkipB
ADDQ.W #1,D5
SkipB:
ADD.W D1,D0
MOVE.W D0,(A2)+
DBF D7,Loop10
MOVE.W D4,CONT1
MOVE.W D5,CONT2
RTS
END START
Inside Loop10 I use two conditional branches (BLE.S
and BGE.S
) to update CONT1
and CONT2
.
On 68k is there a way to:
Scc
+ ADDQ
or some other trick to fold each compare/update into one instruction,(If the answer is simply “your pair of branches is already the best trade-off”, that’s fine—I just want to be sure I’m not missing a classic 68k idiom.)
MOVEQ #0
instead of CLR.L
, plus minor tidy-upsFollowing @Peter Cordes’ timing notes:
MOVEQ #0,Dn
(4 cycles) is the quickest way to zero a 32-bit register, so CONT1/2
are now initialised with MOVEQ
.Scc / SUB.B
logic stays, because it’s already the cheapest “-1 or 0 ➜ +1 or 0” in a single instruction after the Scc
. ORG $8000
; ---------- data ----------
VectorA DC.W 2,2,1,0,4,4,2,5,-1,2
VectorB DC.W 2,3,1,2,-7,4,-3,-2,0,2
C DS.W 10
aVal DC.W -3
bVal DC.W 5
CONT1 DS.W 1
CONT2 DS.W 1
; ---------- main ----------
START:
MOVE.W bVal,-(A7)
MOVE.W aVal,-(A7)
PEA.L C
PEA.L VectorB
PEA.L VectorA
BSR.S Process10
ADDA.L #12,A7
MOVE.W D4,CONT1
MOVE.W D5,CONT2
SIMHALT
; ------------------------------------------------------------
; Process10(&A,&B,&C,a,b) — branch-free counters
; ------------------------------------------------------------
Process10:
MOVE.W 16(A7),D2 ; a
MOVE.W 18(A7),D3 ; b
MOVEA.L 4(A7),A0 ; A
MOVEA.L 8(A7),A1 ; B
MOVEA.L 12(A7),A2 ; C
MOVEQ #9,D7 ; 10 iterations
MOVEQ #0,D4 ; CONT1
MOVEQ #0,D5 ; CONT2
Loop10:
MOVE.W (A0)+,D0
CMP.W D3,D0 ; A > b ?
SGT D6 ; $00 / $FF
SUB.B D6,D4 ; CONT1 += 1/0
MOVE.W (A1)+,D1
CMP.W D2,D1 ; B < a ?
SLT D6 ; $00 / $FF
SUB.B D6,D5 ; CONT2 += 1/0
ADD.W D1,D0 ; C[i] = A + B
MOVE.W D0,(A2)+
DBF D7,Loop10
RTS
END START
Now both counters initialise in the fastest possible way (MOVEQ #0
) and the inner loop still contains only one branch (DBF
).
... passes everything in/out on the stack.
The information about the number of array elements is still hardcoded in the subroutine! And since normally the caller has no knowledge about how the subroutine is going to operate, you would need to pass the actual number of elements (so not a decremented-by-one value as would be fit for using dbf
).
ADDA.L #12,A7 ; pop params SIMHALT
This is wrong of course! You have pushed 2 words (is 4 bytes) and 3 longwords (is 12 bytes), so the correct instruction here would be ADDA.L #16,A7
. Because of the immediately following SIMHALT
, there's no problem in this little demo but in a regular program, who will tell?
Now both counters initialize in the fastest possible way (
MOVEQ #0
)
Both CONT1 and CONT2 are word-sized variables. Clearing the low words of the corresponding D4 and D5 registers will execute just as fast with the CLR.W D4
and CLR.W D5
instructions (and saves you 6 keypresses!).
START:
MOVE.W #10,-(A7)
MOVE.W bVal,-(A7)
MOVE.W aVal,-(A7)
PEA.L C
PEA.L VectorB
PEA.L VectorA
BSR.S Process10 ; -> D4 D5 (A0 A1 A2 D0 D1 D2 D3 D6 D7)
ADDA.L #18,A7
MOVE.W D4,CONT1
MOVE.W D5,CONT2
SIMHALT
; ------------------------------------------------------------
; Process10(&A,&B,&C,a,b,n) — branch-free counters
; ------------------------------------------------------------
Process10:
MOVE.W 20(A7),D7 ; n
SUBQ.W #1,D7 ; b/c DBRA
MOVE.W 18(A7),D3 ; b
MOVE.W 16(A7),D2 ; a
MOVEA.L 12(A7),A2 ; C
MOVEA.L 8(A7),A1 ; B
MOVEA.L 4(A7),A0 ; A
CLR.W D4 ; CONT1
CLR.W D5 ; CONT2
Loop10:
MOVE.W (A0)+,D0
CMP.W D3,D0 ; A > b ?
SGT D6 ; $00 / $FF
SUB.B D6,D4 ; CONT1 += 1/0
MOVE.W (A1)+,D1
CMP.W D2,D1 ; B < a ?
SLT D6 ; $00 / $FF
SUB.B D6,D5 ; CONT2 += 1/0
ADD.W D1,D0 ; C[i] = A + B
MOVE.W D0,(A2)+
DBRA D7,Loop10
RTS
The question about whether branchless code is faster, is easiest answered by looking at the timings:
cc = FALSE cc = TRUE
========== =========
BLE.S SkipA 8 10
ADDQ.W #1,D4 4
-- --
12 10
SGT D6 4 6
SUB.B D6,D4 4 4
-- --
8 10
So, taking the branch is not worse than the branchless solution. Only when the branch would not have to be taken and consequently that extra ADDQ
instruction would run, will the branchless solution be a bit faster.
Only don't go overboard. There's also the surrounding code to consider, as well as the probability of the condition code being false. The total gain might not be all that spectacular...
And don't forget that changing towards very large arrays is dead easy in the branchy version, not to mention that it requires one data register less.