assembly optimization micro-optimization motorola 68000

68000 Assembly – Build a String from Characters not Present in Another & Return Its Length (stack-passed params)

Task

Develop a program for the Motorola 68000 that

Creates a string C containing every character from A that does not appear anywhere in B.

Computes the length of this new string C.

All logic must live in one or more sub-routines with the parameters passed via the stack.

I leaned heavily on the code from my previous question (where I concatenated / compared strings with lots of help from users :-) ) and collected all the optimization tips from the comments here. Below is the new routine that re-uses that calling convention and incorporates every suggestion.

          ORG   $8000

; ---------- data ----------
StringA   DC.B 'I say Hello',0
StringB   DC.B 'I say World',0
StringC   DS.B 256          ; caller must guarantee size
LenRes    DS.W 0            ; length of C (result)

; ---------- main ----------
START:
    PEA.L  StringC          ; &C
    PEA.L  StringB          ; &B
    PEA.L  StringA          ; &A
    BSR.S  FilterAndLength  ; build C, return len in D0
    MOVE.W D0,LenRes        ; store length
    ADDA.L #12,A7           ; pop params
    SIMHALT

; ---------- FilterAndLength(&A,&B,&C) ----------
; OUT: D0 = length of C
; MOD: D1-D2, A0-A3
FilterAndLength:
    MOVEA.L 4(A7),A0        ; A0 -> A
    MOVEA.L 8(A7),A1        ; A1 -> B (base)
    MOVEA.L 12(A7),A2       ; A2 -> C
    CLR.W   D0              ; len = 0

NextCharA:
    MOVE.B  (A0)+,D1        ; next char from A
    TST.B   D1
    BEQ     Finish          ; end of A
    MOVEA.L A1,A3           ; restart scan of B

ScanB:
    MOVE.B  (A3)+,D2
    TST.B   D2
    BEQ     NotFound        ; ran off end of B
    CMP.B   D2,D1
    BEQ     NextCharA       ; found in B, skip char
    BRA     ScanB

NotFound:                   ; char not in B → keep it
    MOVE.B  D1,(A2)+
    ADDQ.W  #1,D0
    BRA     NextCharA

Finish:
    CLR.B   (A2)+           ; zero-terminate C
    RTS

          END   START

Questions

Can this be shortened or further optimized?
Did I interpret the assignment correctly? (skip any character in A that appears at least once in B, build C, return its length)
Extras worth adding? e.g. overflow checks, faster word/long moves if buffers are aligned, or splitting the “char-in-string” test into its own subroutine?

Thanks for any feedback!

What I didn't enter (voluntarily) and explain why [hoping it's the right reason]

CMPM byte-parallel compare While CMPM.B (A0)+,(A3)+ can compare two memory streams with post-increment, here i need a nested scan of B for each A character, so a simple CMP.B inside a loop made more sense, right?
Word/long moves & compares If A, B and C were all word-aligned and the length were known, one could use MOVE.L/CMP.L to process 4 bytes at a time, then handle the final 0-3 bytes separately.
Bitmap lookup table (256-bit) For O(1) membership tests, i could build a 256-bit bitmap in data for B once, then filter A against that table instead of rescanning B each time. That's faster but uses extra RAM, right?
Separate sub-routine for “char in string” I might factor out the inner scan (ScanB) into its own CharInString(&ch,&B) function, still passing parameters on the stack. This improves modularity but adds call/return overhead.

If any of these extensions would be useful, let me know and I can integrate them!

Edit — @chtz’s latest suggestions

This is the version that integrates all the improvements listed in the answer @chtz wrote to me [I hope I understood correctly]:

          ORG    $8000

; ---------- data ----------
StringA   DC.B  'I say Hello',0
StringB   DC.B  'I say World',0
StringC   DS.B  256                 ; caller guarantees size
LenRes    DS.W  0                   ; |C|

; ---------- main ----------
START:
    PEA.L  StringC                  ; &C
    PEA.L  StringB                  ; &B
    PEA.L  StringA                  ; &A
    BSR.S  FilterLen_Bitmap         ; build C, length → D0
    MOVE.W D0,LenRes
    ADDA.L #12,A7
    SIMHALT

; ---------------------------------------------------------------
; FilterLen_Bitmap(&A,&B,&C)  –  D0 = length of C
; Stack layout after bitmap push:
;   +32 return  +36 &A  +40 &B  +44 &C
; ---------------------------------------------------------------
; Reg-use: A0/A1, D0-D2.  No other registers need saving.
; ---------------------------------------------------------------
FilterLen_Bitmap:

    ; ---- reserve & clear 32-byte bitmap (8×CLR.L -(A7)) ----
    MOVEQ   #7,D1
ClrBmp: CLR.L  -(A7)
        DBF    D1,ClrBmp           ; A7 → bitmap base

    ; ---- build bitmap from StringB ----
    CLR.W   D0                     ; upper bytes = 0
    MOVEA.L 40(A7),A0              ; A0 = &B
    MOVE.B  (A0)+,D0
    BEQ.S   BuildDone              ; empty B → skip
Bloop: MOVE.W  D0,D1
       LSR.W   #3,D1               ; byte offset 0-31
       BSET    D0,(A7,D1.W)        ; set bit
       MOVE.B  (A0)+,D0
       BNE.S   Bloop
BuildDone:

    ; ---- filter A into C ----
    MOVEA.L 36(A7),A0              ; A0 = &A
    MOVEA.L 44(A7),A1              ; A1 = &C
    CLR.W   D0                     ; length counter = 0
    CLR.W   D1
    MOVE.B  (A0)+,D1
    BEQ.S   Finish
Aloop: MOVE.W  D1,D2
       LSR.W   #3,D2               ; bitmap byte offset
       BTST    D1,(A7,D2.W)        ; char present in B?
       BNE.S   Skip
       MOVE.B  D1,(A1)+            ; keep char
       ADDQ.W  #1,D0
Skip:  MOVE.B  (A0)+,D1
       BNE.S   Aloop

Finish:
    CLR.B   (A1)                   ; null-terminate C
    ADDA.W  #32,A7                 ; pop bitmap
    RTS

          END   START

Is the code okay now or is there still something that can be improved or did I miss something? If there are any errors, I will try to fix them.

Solution

Further improvements relative to your edited question (Revision 2, everything untested):

For clearing 32 bytes on the stack, instead of:

    SUBA.L  #32,A7            ; reserve 32 bytes for bitmap
    LEA     (A7),A4           ; A4 = bitmap base

    ; ---- clear bitmap (8×CLR.L = 32 B) ----
    MOVEQ   #7,D3
ClrBmp: CLR.L  (A4)+
        DBF    D3,ClrBmp
    LEA     (A7),A4           ; restore A4 = base

Do it the other way around directly by decrementing the stackpointer:

        MOVEQ   #7,D3
ClrBmp: CLR.L  -(A7)
        DBF    D3,ClrBmp

That loop implementation needs 8 bytes, you could unroll it to 8 instructions taking 16 bytes:

    CLR.L  -(A7)
    CLR.L  -(A7)
    CLR.L  -(A7)
    CLR.L  -(A7)
    CLR.L  -(A7)
    CLR.L  -(A7)
    CLR.L  -(A7)
    CLR.L  -(A7)

EDIT: You can actually save some cycles using MOVEM.L for clearing the memory, e.g.

    MOVEQ   #0,D0
    MOVEQ   #0,D1     ; this also makes a later CLR.W D1 redundant
    MOVEA.L D0,A0     ; Clear any 4 registers you don't mind getting destroyed
    MOVEA.L D0,A1     ;
    MOVEM.L D0-D1/A0-A1,-(A7)  ; Clear 4*4=16 bytes of memory
    MOVEM.L D0-D1/A0-A1,-(A7)  ; Clear next 16 bytes of memory

This technique would also allow to set memory to 0xff, using MOVEQ #-1,D0, etc.

Since you don't change A7 afterwards, there is no need to copy it into A4

For the B-loop, instead of:

    ; ---- populate bitmap with bytes of B ----
    MOVEA.L 12+32(A7),A1      ; A1 = &B
Bloop: MOVE.B  (A1)+,D1
       BEQ     Bdone
       MOVE    D1,D2
       LSR.W   #3,D2          ; byte offset 0-31
       MOVE    D1,D3
       ANDI.B  #7,D3          ; bit index 0-7
       BTST    D3,(A4,D2.W)   ; already set? (optional)
       BSET    D3,(A4,D2.W)   ; set bit
       BRA     Bloop
Bdone:

There is one mistake, that you need to make sure that the upper byte of D2.W needs to be cleared before you copy D1.B into it (or it must be cleared in D1.W). Than, BTST is of course redundant here (you don't even use the result). Also the ANDI.B #7,D3 is not necessary, since BSET only considers the lowest 3 bits of D3 anyways. This also makes copying to D3 not needed.

Another small improvement, is to move the condition of the loop to the end, having just one unconditional jump at the beginning, instead of once per loop:

    CLR.W   D1                ; unless you have a register which was already cleared
    MOVEA.L 12+32(A7),A1      ; A1 = &B
    BRA.S   Bstart
Bloop: MOVE.W  D1,D2
       LSR.W   #3,D2          ; byte offset 0-31
       BSET    D1,(A7,D2.W)   ; set bit
Bstart: MOVE.B  (A1)+,D1
       BNE.S   Bloop

With one instruction (2 bytes) more, you could also copy the end of the loop to the beginning, saving the unconditional jump as well. Also using D0/D1 instead of D1/D2 can save an instruction later:

    CLR.W   D0                ; unless you have a register which was already cleared
    MOVEA.L 12+32(A7),A1      ; A1 = &B
    MOVE.B  (A1)+,D0
    BEQ.S   Bend              ;;;; Actually this would be a special case, 
                              ;;;; if B is empty you could just copy A
Bloop: MOVE.W  D0,D1
       LSR.W   #3,D1          ; byte offset 0-31
       BSET    D0,(A7,D1.W)   ; set bit
       MOVE.B  (A1)+,D0
       BNE.S   Bloop
Bend:  ; Note that D0.W and the upper byte of D1.W are 0 now.

Similar optimizations can be done for the A-loop:

    ; ---- filter A against bitmap, build C ----
    MOVEA.L 16+32(A7),A0      ; A0 = &A
    MOVEA.L 20+32(A7),A2      ; A2 = &C   <--- Here you could use A1 instead
    ;; CLR.W   D0             ; length = 0. Already the case after previous loop
    MOVE.B  (A0)+,D1          ; upper byte of D1.W was already cleared
    BEQ.S   Finish
Aloop:
    MOVE.W  D1,D2
    LSR.W   #3,D2             ; byte offset
    BTST    D1,(A7,D2.W)
    BNE.S   NextA             ; present in B → skip
    MOVE.B  D1,(A2)+          ; keep char
    ADDQ.W  #1,D0
NextA:
    MOVE.B  (A0)+,D1
    BNE.S   Aloop

Finish:
    CLR.B   (A2)              ; null-terminate C
    ADDA.W  #32,A7            ; drop bitmap 
         ;; ADDA.W extends the argument to 32bit before adding, alternative:
         ;  LEA   32(A7),A7
    MOVEM.L (A7)+,D2-D3/A2-A4 ; restore regs
    RTS

With my modifications you don't need to store/restore the same registers (if only 1 register needs saving, this also does not require a MOVEM.L.

    MOVE.W  D3,-(A7)   ;; If only D3.W needs (re)storing
; ... rest of code
    MOVE.W  (A7)+,D3   ;; If only D3.W needs (re)storing
    RTS

68000 Assembly – Build a String from Characters *not* Present in Another & Return Its Length (stack-passed params)

Task

Questions

What I didn't enter (voluntarily) and explain why [hoping it's the right reason]

Edit — @chtz’s latest suggestions

68000 Assembly – Build a String from Characters not Present in Another & Return Its Length (stack-passed params)