assemblyoptimizationmicro-optimizationmotorola68000

68000 Assembly – one-pass swap-and-sum of two word vectors (can it be done better?)


Task

I’m practising fixed-size loops on a Motorola 68000. Given two 10-word signed arrays A and B, I need to

  1. swap each pair (A[i] ↔ B[i]) in place, and
  2. build a third array C where C[i] = A[i] + B[i] (after the swap the sum is the same).

All parameters must be passed on the stack.


Code that works in Easy68k

          ORG    $8000
VectorA   DC.W  2,2,1,0,4,4,2,5,-1,2
VectorB   DC.W  2,3,1,2,-7,4,-3,-2,0,2
C         DS.W  10
LEN       EQU   10            ; array size

START:
    MOVE.W  #LEN,-(A7)        ; push n (word)
    PEA.L   C                 ; push &C
    PEA.L   VectorB           ; push &B
    PEA.L   VectorA           ; push &A
    BSR.S   SwapAndSumN
    ADDA.L  #14,A7            ; 3 longs (12) + 1 word (2) = 14
    SIMHALT

; SwapAndSumN(&A,&B,&C,n)  — one pass, branch-free counters
; 4(A7)=&A  8(A7)=&B  12(A7)=&C  16(A7)=n (word)
SwapAndSumN:
    MOVE.W 16(A7),D7          ; D7 = n
    SUBQ.W #1,D7              ; DBRA wants n–1
    MOVEA.L 4(A7),A0          ; A0 = A
    MOVEA.L 8(A7),A1          ; A1 = B
    MOVEA.L 12(A7),A2         ; A2 = C
Loop:
    MOVE.W  (A0),D0           ; D0 = A[i]
    MOVE.W  (A1),D1           ; D1 = B[i]
    MOVE.W  D1,(A0)+          ; B → A
    MOVE.W  D0,(A1)+          ; old A → B
    ADD.W   D1,D0             ; sum
    MOVE.W  D0,(A2)+          ; C[i]
    DBRA    D7,Loop
    RTS
          END  START

Question (single, specific)

Swapping two words needs a temporary register, so I already read both elements once. Is the two-loads / two-stores sequence above the best the 68000 can do, or is there a classic trick (e.g. EXG, MOVE.L on aligned data, MOVEM, etc.) that would shave a cycle or a byte off this loop while still keeping everything word-aligned and using only the registers I have?


Edit

          ORG    $8000
; =============================================================
; 68000  —  Pair-wise swap A[i]↔B[i]  **and** store C[i] = A[i]+B[i]
;           using Peter Cordes’ swap/move.l trick:
;             add.w / swap / swap / add.w / swap / move.l
;           ▸ one MOVE.L store per pair
;           ▸ no extra MOVE.W, same cycle budget as 4× MOVE.W
;           Arrays word-aligned, LEN even.
; =============================================================

VectorA   DC.W  2,2,1,0,4,4,2,5,-1,2
VectorB   DC.W  2,3,1,2,-7,4,-3,-2,0,2
C         DS.W  10
LEN       EQU    10

; ---------- main ----------
START:
    MOVE.W  #LEN,-(A7)                    ; n  (word)
    PEA.L   C                             ; &C
    PEA.L   VectorB                       ; &B
    PEA.L   VectorA                       ; &A
    BSR.S   SwapSum_LongPairs             ; do the work
    ADDA.L  #14,A7                        ; pop 3 longs + 1 word
    SIMHALT

; ------------------------------------------------------------
; SwapSum_LongPairs(&A,&B,&C,n)
;  4(A7)=&A  8(A7)=&B  12(A7)=&C  16(A7)=n (word, even)
;  Registers: A0,A1,A2  /  D0,D1,D7
;  Inner sequence:  LdA LdB  | add.w | swap | swap | add.w | swap | move.l
; ------------------------------------------------------------
SwapSum_LongPairs:
    MOVE.W   16(A7),D7          ; D7 = n
    LSR.W    #1,D7              ; pair-count
    SUBQ.W   #1,D7              ; DBRA wants count-1
    MOVEA.L   4(A7),A0          ; A
    MOVEA.L   8(A7),A1          ; B
    MOVEA.L  12(A7),A2          ; C

PairLp:
    MOVE.L   (A0),D0            ; D0 = A_hi|A_lo
    MOVE.L   (A1),D1            ; D1 = B_hi|B_lo
    MOVE.L   D1,(A0)+           ; B → A
    MOVE.L   D0,(A1)+           ; A → B

    ADD.W    D1,D0              ; low-word sum  → D0.low
    SWAP     D0                 ; D0: sum_lo | A_hi
    SWAP     D1                 ; D1: B_lo   | B_hi
    ADD.W    D1,D0              ; low-word: A_hi + B_hi → sum_hi
    SWAP     D0                 ; D0: sum_hi | sum_lo  (big-endian order)

    MOVE.L   D0,(A2)+           ; write both words of C pair
    DBRA     D7,PairLp
    RTS
          END  START

Thus the inner loop is now load-swap-add-swap-add-swap-store, no wasted moves, one DBRA branch, and minimal memory traffic.


Solution

  • Processing more elements per main iteration is the way to go

    A lot of good advice in the comments! I especially found enticing the suggestion to use MOVEM as a way to unroll this loop.
    So, in order to find out just by how much each of the proposed solutions differ, I wrote the five following code snippets that all can accept any unsigned word-sized value for the number of array elements, nothing excluded. There should never be any problem with alignment because, as these array elements are words, they should already reside at even addresses. (.l loads still only need 2-byte alignment; 68000 only has a 16-bit external data bus.)

    Since it is not terribly interesting to want to optimize the code for a very short array, I have set the number of array elements at 60 so that the core loop at least can run several times. And 60 is also the smallest multiple of those elements-per-iteration values.

    The numbers in the n= columns are total cycle counts for calling the function with that many elements.

    bytes elem/iter n=60 % n=61 %
    38 1 3332 100.00 3386 100.00
    +SWAP 62 2 2810 84.23 2852 84.23
    +MOVEM 84 5 2476 74.31 2532 74.78
    +MOVEM 90 6 2416 72.51 2472 73.01
    +SWAP+MOVEM 152 12 2454 73.65 2510 74.13

    It does not pay to combine MOVEM with SWAP because an address register is not 'swappable'. Also, the codesize seems to get a bit out of hand.


    One element per iteration:

                                          n=60    n=61
    SwapAndSum:                           ====    ====
      movea.l 4(a7),a4            16        16      16
      movea.l 8(a7),a5            16        16      16
      movea.l 12(a7),a6           16        16      16
      move.w  16(a7),d7           12        12      12
    
      subq.w  #1,d7                4         4       4
      bcs.s   .done               10/8       8       8
    .loop:
      move.w  (a4),d0              8       480     488
      move.w  (a5),d1              8       480     488
      move.w  d1,(a4)+             8       480     488
      move.w  d0,(a5)+             8       480     488
      add.w   d1,d0                4       240     244
      move.w  d0,(a6)+             8       480     488
      dbra    d7,.loop            10/14    604     614
    
    .done:
      rts                         16        16      16
                                          ----    ----
                                          3332    3386
    

    Two elements per iteration:

                                          n=60    n=61
    SwapAndSum:                           ====    ====
      movea.l 4(a7),a4            16        16      16
      movea.l 8(a7),a5            16        16      16
      movea.l 12(a7),a6           16        16      16
      move.w  16(a7),d7           12        12      12
    
      lsr.w   #1,d7                8         8       8
      bcc.s   .neven              10/8      10       8
    .nodd:
      move.w  (a4),d0              8                 8
      move.w  (a5),d1              8                 8
      move.w  d1,(a4)+             8                 8
      move.w  d0,(a5)+             8                 8
      add.w   d1,d0                4                 4
      move.w  d0,(a6)+             8                 8
    
    .neven:
      subq.w  #1,d7                4         4       4
      bcs.s   .done               10/8       8       8
    .loop:
      move.l  (a4),d0             12       360     360
      move.l  (a5),d1             12       360     360
      move.l  d1,(a4)+            12       360     360
      move.l  d0,(a5)+            12       360     360
      add.w   d1,d0                4       120     120
      swap    d0                   4       120     120
      swap    d1                   4       120     120
      add.w   d1,d0                4       120     120
      swap    d0                   4       120     120
      move.l  d0,(a6)+            12       360     360
      dbra    d7,.loop            10/14    304     304
    
    .done:
      rts                         16        16      16
                                          ----    ----
                                          2810    2852
    

    Five elements per iteration:

    It is important to choose the registers carefully; adding A0 to D2 is cheaper than adding D2 to A0 (same for A1 and A2).

                                          n=60    n=61
    SwapAndSum:                           ====    ====
      movea.l 4(a7),a4            16        16      16
      movea.l 8(a7),a5            16        16      16
      movea.l 12(a7),a6           16        16      16
      move.w  16(a7),d7           12        12      12
    
      subq.w  #5,d7                4         4       4
      bcs.s   .final              10/8       8       8
    
    .loopA:
      movem.w (a4)+,d0-d4         32       384     384
      movem.w (a5)+,d5-d6/a0-a2   32       384     384
      movem.w d5-d6/a0-a2,-10(a4) 32       384     384
      movem.w d0-d4,-10(a5)       32       384     384
      add.w   d5,d0                4        48      48
      add.w   d6,d1                4        48      48
      add.w   a0,d2                4        48      48
      add.w   a1,d3                4        48      48
      add.w   a2,d4                4        48      48
      movem.w d0-d4,(a6)          28       336     336
      lea     10(a6),a6            8        96      96
      subq.w  #5,d7                4        48      48
      bcc.s   .loopA              10/8     118     118
    
    .final:
      addq.w  #4,d7                4         4       4
      bmi.s   .done               10/8      10       8
    .loopB:
      move.w  (a4),d0              8                 8
      move.w  (a5),d6              8                 8
      move.w  d6,(a4)+             8                 8
      move.w  d0,(a5)+             8                 8
      add.w   d6,d0                4                 4
      move.w  d0,(a6)+             8                 8
      dbra    d7,.loopB           10/14             14
    .done:
      rts                         16        16      16
                                          ----    ----
                                          2476    2532
    

    Six elements per iteration:

    It is important to choose the registers carefully; adding A0 to D2 is cheaper than adding D2 to A0 (same for A1, A2, and A3).
    Because all 16 registers are taken, I use a memory-based counter. It is however better to not decrement the count at 16(a7), but instead push it anew so that using the cheaper (a7) addressing is possible.

                                          n=60    n=61
    SwapAndSum:                           ====    ====
      movea.l 4(a7),a4            16        16      16
      movea.l 8(a7),a5            16        16      16
      movea.l 12(a7),a6           16        16      16
      move.w  16(a7),d7           12        12      12
    
      subq.w  #6,d7                4         4       4
      bcs.s   .final              10/8       8       8
    
      move.w  d7,-(a7)             8         8       8
    .loopA:
      movem.w (a4)+,d0-d5         36       360     360
      movem.w (a5)+,d6-d7/a0-a3   36       360     360
      movem.w d6-d7/a0-a3,-12(a4) 36       360     360
      movem.w d0-d5,-12(a5)       36       360     360
      add.w   d6,d0                4        40      40
      add.w   d7,d1                4        40      40
      add.w   a0,d2                4        40      40
      add.w   a1,d3                4        40      40
      add.w   a2,d4                4        40      40
      add.w   a3,d5                4        40      40
      movem.w d0-d5,(a6)          32       320     320
      lea     12(a6),a6            8        80      80
      subq.w  #6,(a7)             12       120     120
      bcc.s   .loopA              10/8      98      98
      move.w  (a7)+,d7             8         8       8
    
    .final:
      addq.w  #5,d7                4         4       4
      bmi.s   .done               10/8      10       8
    .loopB:
      move.w  (a4),d0              8                 8
      move.w  (a5),d6              8                 8
      move.w  d6,(a4)+             8                 8
      move.w  d0,(a5)+             8                 8
      add.w   d6,d0                4                 4
      move.w  d0,(a6)+             8                 8
      dbra    d7,.loopB           10/14             14
    .done:
      rts                         16        16      16
                                          ----    ----
                                          2416    2472
    

    Twelve elements per iteration:

    Because all 16 registers are taken, I use a memory-based counter. It is however better to not decrement the count at 16(a7), but instead push it anew so that using the cheaper (a7) addressing is possible.
    Because none of A0, A1, A2, or A3 is 'swappable', I first copy them to the by then available D6 register.

                                          n=60    n=61
    SwapAndSum:                           ====    ====
      movea.l 4(a7),a4            16        16      16
      movea.l 8(a7),a5            16        16      16
      movea.l 12(a7),a6           16        16      16
      move.w  16(a7),d7           12        12      12
    
      subi.w  #12,d7               8         8       8
      bcs.s   .final              10/8       8       8
    
      move.w  d7,-(a7)             8         8       8
    .loopA:
      movem.l (a4)+,d0-d5         60       300     300
      movem.l (a5)+,d6-d7/a0-a3   60       300     300
      movem.l d6-d7/a0-a3,-24(a4) 60       300     300
      movem.l d0-d5,-24(a5)       60       300     300
    
      add.w   d6,d0                4        20      20
      swap    d0                   4        20      20
      swap    d6                   4        20      20
      add.w   d6,d0                4        20      20
      swap    d0                   4        20      20
    
      add.w   d7,d1                4        20      20
      swap    d1                   4        20      20
      swap    d7                   4        20      20
      add.w   d7,d1                4        20      20
      swap    d1                   4        20      20
    
      move.l  a0,d6                4        20      20
      add.w   d6,d2                4        20      20
      swap    d2                   4        20      20
      swap    d6                   4        20      20
      add.w   d6,d2                4        20      20
      swap    d2                   4        20      20
    
      move.l  a1,d6                4        20      20
      add.w   d6,d3                4        20      20
      swap    d3                   4        20      20
      swap    d6                   4        20      20
      add.w   d6,d3                4        20      20
      swap    d3                   4        20      20
    
      move.l  a2,d6                4        20      20
      add.w   d6,d4                4        20      20
      swap    d4                   4        20      20
      swap    d6                   4        20      20
      add.w   d6,d4                4        20      20
      swap    d4                   4        20      20
    
      move.l  a3,d6                4        20      20
      add.w   d6,d5                4        20      20
      swap    d5                   4        20      20
      swap    d6                   4        20      20
      add.w   d6,d5                4        20      20
      swap    d5                   4        20      20
    
      movem.l d0-d5,(a6)          56       280     280
      lea     24(a6),a6            8        40      40
      subi.w  #12,(a7)            16        80      80
      bcc.s   .loopA              10/8      48      48
      move.w  (a7)+,d7             8         8       8
    
    .final:
      addi.w  #11,d7               8         8       8
      bmi.s   .done               10/8      10       8
    .loopB:
      move.w  (a4),d0              8                 8
      move.w  (a5),d6              8                 8
      move.w  d6,(a4)+             8                 8
      move.w  d0,(a5)+             8                 8
      add.w   d6,d0                4                 4
      move.w  d0,(a6)+             8                 8
      dbra    d7,.loopB           10/14             14
    .done:
      rts                         16        16      16
                                          ----    ----
                                          2454    2510