assembly optimization micro-optimization motorola 68000

68000 Assembly – Reverse Array A into B via Stack Parameters

Task

Develop a Motorola 68000 program that

Reverses the order of elements in an input array A

Copies the result into a second array B

All logic must live in one subroutine (or more) with parameters passed via the stack.

I’ve applied everything I learned from my previous works (parameter passing, register usage, minimal stack frames, loop optimisations, etc.). Here’s my current solution:

          ORG    $8000

; ---------- data ----------
VectorA   DC.B  1,2,3,4,5,6    ; example input
LenA      EQU   6             ; number of elements
VectorB   DS.B  LenA          ; destination buffer

; ---------- entry point ----------
START:
    PEA.L  VectorB            ; push &B  (long)
    MOVE.L #LenA,-(A7)        ; push length (long)
    PEA.L  VectorA            ; push &A  (long)
    BSR.S  ReverseCopy        ; reverse into B
    ADDA.L #12,A7             ; clean up 3 longs
    SIMHALT                  ; halt simulator

; --------------------------------------------------------
; ReverseCopy(&A, length, &B)
;  
; IN (stack): 4(A7)=&A, 8(A7)=length, 12(A7)=&B
; OUT: none  (but B filled)
; MOD: A0, A1, D0, D1
; --------------------------------------------------------
ReverseCopy:
    MOVEA.L  4(A7),A0        ; A0 = address of A
    MOVE.L   8(A7),D0        ; D0 = length (32-bit)
    MOVEA.L 12(A7),A1        ; A1 = address of B

    BEQ.S   .Done            ; if length=0, nothing to do

    ADD.L   D0,A0            ; point A0 just past the last element
    MOVE.W  D0,D1            ; D1 = length as 16-bit (for DBF)
    SUBQ.W  #1,D1            ; DBF will run exactly length times

.Loop:
    MOVE.B  -(A0),(A1)+      ; copy last A into next B
    DBF     D1,.Loop         ; decrement D1, loop if >=0

.Done:
    RTS                      ; return to START

          END   START

Edit — full dispatching solution with all variants

I’ve added a dispatch routine that lets you choose between three implementations:

32-bit single-byte loop (no truncation)
68010 MOVE+DBF single-byte loop (tight loop buffer)
68000 word-swap loop (2 bytes at a time via ROR.W #8)

; --------------------------------------------------------
; Entrypoint for dispatch (invoked by BSR.S ReverseAllVariants)
; Uncomment the variant you want to use:
ReverseAllVariants:
    ; JMP ReverseSingleByte
    ; JMP ReverseDBF
    JMP ReverseSwap

; -- Variant 1: 32-bit single-byte loop --------------------
ReverseSingleByte:
    MOVE.L  8(A7),D0            ; D0 = length
    BEQ.S   .Done1              ; skip if zero
    MOVEA.L 4(A7),A0            ; A0 = &A
    MOVEA.L 12(A7),A1           ; A1 = &B
    ADD.L   D0,A0               ; A0 → end of A
1Loop:
    MOVE.B  -(A0),(A1)+         ; copy last byte
    SUBQ.L  #1,D0               ; decrement counter
    BNE.S   1Loop               ; until D0 == 0
.Done1:
    RTS

; -- Variant 2: 68010 MOVE+DBF single-byte loop ----------
ReverseDBF:
    MOVE.L  8(A7),D0            ; D0 = length
    BEQ.S   .Done2              ; skip if zero
    MOVEA.L 4(A7),A0            ; A0 = &A
    MOVEA.L 12(A7),A1           ; A1 = &B
    ADD.L   D0,A0               ; A0 → end of A
    MOVE.W  D0,D0               ; D0.W = length
    SUBQ.W  #1,D0               ; prepare for DBF
DBFLoop:
    MOVE.B  -(A0),(A1)+         ; copy last byte
    DBF     D0,DBFLoop          ; until D0.W == -1
.Done2:
    RTS

; -- Variant 3: 68000 word-swap loop ----------------------
ReverseSwap:
    MOVE.L   8(A7),D0           ; D0 = length
    BEQ.S    .Done3             ; skip if zero
    MOVE.L   D0,D2              ; D2 = length copy
    ANDI.L   #1,D2              ; D2 = odd remainder
    MOVE.L   D0,D1
    LSR.L    #1,D1              ; D1 = word_count = len/2
    MOVEA.L  4(A7),A0           ; A0 = &A
    MOVEA.L 12(A7),A1           ; A1 = &B
    TST.L    D2                 ; if odd, handle trailing byte
    BEQ.S    .WordLoopStart
    MOVE.B   -(A0),(A1)+
.WordLoopStart:
    MOVE.W   -(A0),D0           ; load two bytes
    ROR.W    #8,D0              ; swap bytes
    MOVE.W   D0,(A1)+           ; store swapped word
    DBF      D1,.WordLoopStart  ; repeat
.Done3:
    RTS

Let me know if anything still needs tweaking :-).

Solution

MOVEA.L  4(A7),A0        ; A0 = address of A
MOVE.L   8(A7),D0        ; D0 = length (32-bit)
MOVEA.L 12(A7),A1        ; A1 = address of B
BEQ.S   .Done            ; if length=0, nothing to do

Technically this is correct because MOVEA does not change any flags, but it will make more sense if you keep the branch instruction close to the relevant load instruction:

MOVE.L  8(A7),D0         ; D0 = length (32-bit)
BEQ.S   .Done            ; if length=0, nothing to do
MOVEA.L 4(A7),A0         ; A0 = address of A
MOVEA.L 12(A7),A1        ; A1 = address of B

ADD.L   D0,A0            ; point A0 just past the last element

The ADD instruction should not have allowed adding to an address register. Use ADDA instead. Typo maybe?

MOVE.W  D0,D1            ; D1 = length as 16-bit (for DBF)
SUBQ.W  #1,D1            ; DBF will run exactly length times

There is no reason to first copy D0 into D1. The DBF D0,... instruction will use the low word of D0 and don't care about what's in the high word of D0. Anyway, if there would be high contents in D0 then the whole task would have failed, wouldn't it?

Reverses the order of elements in an input array A

Copies the result into a second array B

It is possible to interpret this task description as requiring you to reverse the A array in situ, and only then make a copy of the result.
Next is a version of the program that does exactly that:

          ORG    $8000

; ---------- data ----------
VectorA   DC.B  1,2,3,4,5,6   ; example input
LenA      EQU   6             ; number of elements
VectorB   DS.B  LenA          ; destination buffer

; ---------- entry point ----------
START:
    PEA.L  VectorB            ; push &B (long)
    MOVE.L #LenA,-(A7)        ; push length (long)
    PEA.L  VectorA            ; push &A (long)
    BSR.S  ReverseAndCopy     ; reverse A and copy to B
    ADDA.L #12,A7             ; clean up 3 longs
    SIMHALT                   ; halt simulator

; --------------------------------------------------------
; ReverseAndCopy(&A, length, &B)
;  
; IN (stack): 4(A7)=&A, 8(A7)=length, 12(A7)=&B
; OUT: A reversed, B copy of A
; MOD: A0, A1, D0, D1
; --------------------------------------------------------
ReverseAndCopy:
    MOVE.L  8(A7),D0         ; D0 = length (32-bit)
    BEQ.S   .Done            ; if length=0, nothing to do
    MOVEA.L 4(A7),A0         ; A0 = address of first element A
    LEA.L   (A0,D0.L),A1     ; A1 = address after last element A
    LSR.L   #1,D0            ; D0 = number of pairs of elements
    BEQ.S   .Copy            ; length is 1

; 1. Reverses the order of elements in input array A
.Loop1:
    MOVE.B  -(A1),D1
    MOVE.B  (A0),(A1)
    MOVE.B  D1,(A0)+
    SUBQ.L  #1,D0
    BNE.S   .Loop1

; 2. Copies the result into the array B
.Copy:
    MOVEA.L 4(A7),A0         ; A0 = address of A
    MOVEA.L 12(A7),A1        ; A1 = address of B
    MOVE.L  8(A7),D0         ; D0 = length (32-bit)

.Loop2:
    MOVE.B  (A0)+,(A1)+
    SUBQ.L  #1,D0
    BNE.S   .Loop2

.Done:
    RTS

          END   START

Tip 1. If the length is known to fit in 16 bits, then it would be better to replace SUBQ BNE by DBRA. Remember to decrement the counter beforehand because DBRA runs until -1.
Tip 2. If the addresses of the strings are known to be word-aligned, then it would be better to copy by word, or even by longword.

Concerning the latest addition of variants

The Variant 2: "MOVE+DBF single-byte loop" is not specifically 68010. Works fine on 68000. Note that there still is that silly MOVE.W D0,D0 present.
The Variant 3: "word-swap loop" will have trouble with an odd length because words in memory must reside on an even address. The only possible good outcome (for an odd length) is if array A has an even address and array B has an odd address. And I find that unlikely in practice... Also the loop will always be running one time too many! See Tip 1 above. Additionally, in case of length=1, the loop shouldn't even run at all!

If you want to move to better performing variants, then imposing some sensible restrictions is needed. e.g. both arrays begin at an even address and the array length is even (at least for byte-sized elements).