I’m practising fixed-size loops on a Motorola 68000. Given two 10-word signed arrays A and B, I need to
All parameters must be passed on the stack.
ORG $8000
VectorA DC.W 2,2,1,0,4,4,2,5,-1,2
VectorB DC.W 2,3,1,2,-7,4,-3,-2,0,2
C DS.W 10
LEN EQU 10 ; array size
START:
MOVE.W #LEN,-(A7) ; push n (word)
PEA.L C ; push &C
PEA.L VectorB ; push &B
PEA.L VectorA ; push &A
BSR.S SwapAndSumN
ADDA.L #14,A7 ; 3 longs (12) + 1 word (2) = 14
SIMHALT
; SwapAndSumN(&A,&B,&C,n) — one pass, branch-free counters
; 4(A7)=&A 8(A7)=&B 12(A7)=&C 16(A7)=n (word)
SwapAndSumN:
MOVE.W 16(A7),D7 ; D7 = n
SUBQ.W #1,D7 ; DBRA wants n–1
MOVEA.L 4(A7),A0 ; A0 = A
MOVEA.L 8(A7),A1 ; A1 = B
MOVEA.L 12(A7),A2 ; A2 = C
Loop:
MOVE.W (A0),D0 ; D0 = A[i]
MOVE.W (A1),D1 ; D1 = B[i]
MOVE.W D1,(A0)+ ; B → A
MOVE.W D0,(A1)+ ; old A → B
ADD.W D1,D0 ; sum
MOVE.W D0,(A2)+ ; C[i]
DBRA D7,Loop
RTS
END START
Swapping two words needs a temporary register, so I already read both elements once.
Is the two-loads / two-stores sequence above the best the 68000 can do, or is there a classic trick (e.g. EXG
, MOVE.L
on aligned data, MOVEM
, etc.) that would shave a cycle or a byte off this loop while still keeping everything word-aligned and using only the registers I have?
ORG $8000
; =============================================================
; 68000 — Pair-wise swap A[i]↔B[i] **and** store C[i] = A[i]+B[i]
; using Peter Cordes’ swap/move.l trick:
; add.w / swap / swap / add.w / swap / move.l
; ▸ one MOVE.L store per pair
; ▸ no extra MOVE.W, same cycle budget as 4× MOVE.W
; Arrays word-aligned, LEN even.
; =============================================================
VectorA DC.W 2,2,1,0,4,4,2,5,-1,2
VectorB DC.W 2,3,1,2,-7,4,-3,-2,0,2
C DS.W 10
LEN EQU 10
; ---------- main ----------
START:
MOVE.W #LEN,-(A7) ; n (word)
PEA.L C ; &C
PEA.L VectorB ; &B
PEA.L VectorA ; &A
BSR.S SwapSum_LongPairs ; do the work
ADDA.L #14,A7 ; pop 3 longs + 1 word
SIMHALT
; ------------------------------------------------------------
; SwapSum_LongPairs(&A,&B,&C,n)
; 4(A7)=&A 8(A7)=&B 12(A7)=&C 16(A7)=n (word, even)
; Registers: A0,A1,A2 / D0,D1,D7
; Inner sequence: LdA LdB | add.w | swap | swap | add.w | swap | move.l
; ------------------------------------------------------------
SwapSum_LongPairs:
MOVE.W 16(A7),D7 ; D7 = n
LSR.W #1,D7 ; pair-count
SUBQ.W #1,D7 ; DBRA wants count-1
MOVEA.L 4(A7),A0 ; A
MOVEA.L 8(A7),A1 ; B
MOVEA.L 12(A7),A2 ; C
PairLp:
MOVE.L (A0),D0 ; D0 = A_hi|A_lo
MOVE.L (A1),D1 ; D1 = B_hi|B_lo
MOVE.L D1,(A0)+ ; B → A
MOVE.L D0,(A1)+ ; A → B
ADD.W D1,D0 ; low-word sum → D0.low
SWAP D0 ; D0: sum_lo | A_hi
SWAP D1 ; D1: B_lo | B_hi
ADD.W D1,D0 ; low-word: A_hi + B_hi → sum_hi
SWAP D0 ; D0: sum_hi | sum_lo (big-endian order)
MOVE.L D0,(A2)+ ; write both words of C pair
DBRA D7,PairLp
RTS
END START
Thus the inner loop is now load-swap-add-swap-add-swap-store, no wasted moves, one DBRA
branch, and minimal memory traffic.
A lot of good advice in the comments! I especially found enticing the suggestion to use MOVEM
as a way to unroll this loop.
So, in order to find out just by how much each of the proposed solutions differ, I wrote the five following code snippets that all can accept any unsigned word-sized value for the number of array elements, nothing excluded. There should never be any problem with alignment because, as these array elements are words, they should already reside at even addresses. (.l
loads still only need 2-byte alignment; 68000 only has a 16-bit external data bus.)
Since it is not terribly interesting to want to optimize the code for a very short array, I have set the number of array elements at 60 so that the core loop at least can run several times. And 60 is also the smallest multiple of those elements-per-iteration values.
The numbers in the n=
columns are total cycle counts for calling the function with that many elements.
bytes | elem/iter | n=60 | % | n=61 | % | |
---|---|---|---|---|---|---|
38 | 1 | 3332 | 100.00 | 3386 | 100.00 | |
+SWAP | 62 | 2 | 2810 | 84.23 | 2852 | 84.23 |
+MOVEM | 84 | 5 | 2476 | 74.31 | 2532 | 74.78 |
+MOVEM | 90 | 6 | 2416 | 72.51 | 2472 | 73.01 |
+SWAP+MOVEM | 152 | 12 | 2454 | 73.65 | 2510 | 74.13 |
It does not pay to combine MOVEM
with SWAP
because an address register is not 'swappable'. Also, the codesize seems to get a bit out of hand.
One element per iteration:
n=60 n=61
SwapAndSum: ==== ====
movea.l 4(a7),a4 16 16 16
movea.l 8(a7),a5 16 16 16
movea.l 12(a7),a6 16 16 16
move.w 16(a7),d7 12 12 12
subq.w #1,d7 4 4 4
bcs.s .done 10/8 8 8
.loop:
move.w (a4),d0 8 480 488
move.w (a5),d1 8 480 488
move.w d1,(a4)+ 8 480 488
move.w d0,(a5)+ 8 480 488
add.w d1,d0 4 240 244
move.w d0,(a6)+ 8 480 488
dbra d7,.loop 10/14 604 614
.done:
rts 16 16 16
---- ----
3332 3386
Two elements per iteration:
n=60 n=61
SwapAndSum: ==== ====
movea.l 4(a7),a4 16 16 16
movea.l 8(a7),a5 16 16 16
movea.l 12(a7),a6 16 16 16
move.w 16(a7),d7 12 12 12
lsr.w #1,d7 8 8 8
bcc.s .neven 10/8 10 8
.nodd:
move.w (a4),d0 8 8
move.w (a5),d1 8 8
move.w d1,(a4)+ 8 8
move.w d0,(a5)+ 8 8
add.w d1,d0 4 4
move.w d0,(a6)+ 8 8
.neven:
subq.w #1,d7 4 4 4
bcs.s .done 10/8 8 8
.loop:
move.l (a4),d0 12 360 360
move.l (a5),d1 12 360 360
move.l d1,(a4)+ 12 360 360
move.l d0,(a5)+ 12 360 360
add.w d1,d0 4 120 120
swap d0 4 120 120
swap d1 4 120 120
add.w d1,d0 4 120 120
swap d0 4 120 120
move.l d0,(a6)+ 12 360 360
dbra d7,.loop 10/14 304 304
.done:
rts 16 16 16
---- ----
2810 2852
Five elements per iteration:
It is important to choose the registers carefully; adding A0 to D2 is cheaper than adding D2 to A0 (same for A1 and A2).
n=60 n=61
SwapAndSum: ==== ====
movea.l 4(a7),a4 16 16 16
movea.l 8(a7),a5 16 16 16
movea.l 12(a7),a6 16 16 16
move.w 16(a7),d7 12 12 12
subq.w #5,d7 4 4 4
bcs.s .final 10/8 8 8
.loopA:
movem.w (a4)+,d0-d4 32 384 384
movem.w (a5)+,d5-d6/a0-a2 32 384 384
movem.w d5-d6/a0-a2,-10(a4) 32 384 384
movem.w d0-d4,-10(a5) 32 384 384
add.w d5,d0 4 48 48
add.w d6,d1 4 48 48
add.w a0,d2 4 48 48
add.w a1,d3 4 48 48
add.w a2,d4 4 48 48
movem.w d0-d4,(a6) 28 336 336
lea 10(a6),a6 8 96 96
subq.w #5,d7 4 48 48
bcc.s .loopA 10/8 118 118
.final:
addq.w #4,d7 4 4 4
bmi.s .done 10/8 10 8
.loopB:
move.w (a4),d0 8 8
move.w (a5),d6 8 8
move.w d6,(a4)+ 8 8
move.w d0,(a5)+ 8 8
add.w d6,d0 4 4
move.w d0,(a6)+ 8 8
dbra d7,.loopB 10/14 14
.done:
rts 16 16 16
---- ----
2476 2532
Six elements per iteration:
It is important to choose the registers carefully; adding A0 to D2 is cheaper than adding D2 to A0 (same for A1, A2, and A3).
Because all 16 registers are taken, I use a memory-based counter. It is however better to not decrement the count at 16(a7)
, but instead push it anew so that using the cheaper (a7)
addressing is possible.
n=60 n=61
SwapAndSum: ==== ====
movea.l 4(a7),a4 16 16 16
movea.l 8(a7),a5 16 16 16
movea.l 12(a7),a6 16 16 16
move.w 16(a7),d7 12 12 12
subq.w #6,d7 4 4 4
bcs.s .final 10/8 8 8
move.w d7,-(a7) 8 8 8
.loopA:
movem.w (a4)+,d0-d5 36 360 360
movem.w (a5)+,d6-d7/a0-a3 36 360 360
movem.w d6-d7/a0-a3,-12(a4) 36 360 360
movem.w d0-d5,-12(a5) 36 360 360
add.w d6,d0 4 40 40
add.w d7,d1 4 40 40
add.w a0,d2 4 40 40
add.w a1,d3 4 40 40
add.w a2,d4 4 40 40
add.w a3,d5 4 40 40
movem.w d0-d5,(a6) 32 320 320
lea 12(a6),a6 8 80 80
subq.w #6,(a7) 12 120 120
bcc.s .loopA 10/8 98 98
move.w (a7)+,d7 8 8 8
.final:
addq.w #5,d7 4 4 4
bmi.s .done 10/8 10 8
.loopB:
move.w (a4),d0 8 8
move.w (a5),d6 8 8
move.w d6,(a4)+ 8 8
move.w d0,(a5)+ 8 8
add.w d6,d0 4 4
move.w d0,(a6)+ 8 8
dbra d7,.loopB 10/14 14
.done:
rts 16 16 16
---- ----
2416 2472
Twelve elements per iteration:
Because all 16 registers are taken, I use a memory-based counter. It is however better to not decrement the count at 16(a7)
, but instead push it anew so that using the cheaper (a7)
addressing is possible.
Because none of A0, A1, A2, or A3 is 'swappable', I first copy them to the by then available D6 register.
n=60 n=61
SwapAndSum: ==== ====
movea.l 4(a7),a4 16 16 16
movea.l 8(a7),a5 16 16 16
movea.l 12(a7),a6 16 16 16
move.w 16(a7),d7 12 12 12
subi.w #12,d7 8 8 8
bcs.s .final 10/8 8 8
move.w d7,-(a7) 8 8 8
.loopA:
movem.l (a4)+,d0-d5 60 300 300
movem.l (a5)+,d6-d7/a0-a3 60 300 300
movem.l d6-d7/a0-a3,-24(a4) 60 300 300
movem.l d0-d5,-24(a5) 60 300 300
add.w d6,d0 4 20 20
swap d0 4 20 20
swap d6 4 20 20
add.w d6,d0 4 20 20
swap d0 4 20 20
add.w d7,d1 4 20 20
swap d1 4 20 20
swap d7 4 20 20
add.w d7,d1 4 20 20
swap d1 4 20 20
move.l a0,d6 4 20 20
add.w d6,d2 4 20 20
swap d2 4 20 20
swap d6 4 20 20
add.w d6,d2 4 20 20
swap d2 4 20 20
move.l a1,d6 4 20 20
add.w d6,d3 4 20 20
swap d3 4 20 20
swap d6 4 20 20
add.w d6,d3 4 20 20
swap d3 4 20 20
move.l a2,d6 4 20 20
add.w d6,d4 4 20 20
swap d4 4 20 20
swap d6 4 20 20
add.w d6,d4 4 20 20
swap d4 4 20 20
move.l a3,d6 4 20 20
add.w d6,d5 4 20 20
swap d5 4 20 20
swap d6 4 20 20
add.w d6,d5 4 20 20
swap d5 4 20 20
movem.l d0-d5,(a6) 56 280 280
lea 24(a6),a6 8 40 40
subi.w #12,(a7) 16 80 80
bcc.s .loopA 10/8 48 48
move.w (a7)+,d7 8 8 8
.final:
addi.w #11,d7 8 8 8
bmi.s .done 10/8 10 8
.loopB:
move.w (a4),d0 8 8
move.w (a5),d6 8 8
move.w d6,(a4)+ 8 8
move.w d0,(a5)+ 8 8
add.w d6,d0 4 4
move.w d0,(a6)+ 8 8
dbra d7,.loopB 10/14 14
.done:
rts 16 16 16
---- ----
2454 2510