Develop a program for the Motorola 68000 that
- Creates a string C containing every character from A that does not appear anywhere in B.
- Computes the length of this new string C.
All logic must live in one or more sub-routines with the parameters passed via the stack.
I leaned heavily on the code from my previous question (where I concatenated / compared strings with lots of help from users :-) ) and collected all the optimization tips from the comments here. Below is the new routine that re-uses that calling convention and incorporates every suggestion.
ORG $8000
; ---------- data ----------
StringA DC.B 'I say Hello',0
StringB DC.B 'I say World',0
StringC DS.B 256 ; caller must guarantee size
LenRes DS.W 0 ; length of C (result)
; ---------- main ----------
START:
PEA.L StringC ; &C
PEA.L StringB ; &B
PEA.L StringA ; &A
BSR.S FilterAndLength ; build C, return len in D0
MOVE.W D0,LenRes ; store length
ADDA.L #12,A7 ; pop params
SIMHALT
; ---------- FilterAndLength(&A,&B,&C) ----------
; OUT: D0 = length of C
; MOD: D1-D2, A0-A3
FilterAndLength:
MOVEA.L 4(A7),A0 ; A0 -> A
MOVEA.L 8(A7),A1 ; A1 -> B (base)
MOVEA.L 12(A7),A2 ; A2 -> C
CLR.W D0 ; len = 0
NextCharA:
MOVE.B (A0)+,D1 ; next char from A
TST.B D1
BEQ Finish ; end of A
MOVEA.L A1,A3 ; restart scan of B
ScanB:
MOVE.B (A3)+,D2
TST.B D2
BEQ NotFound ; ran off end of B
CMP.B D2,D1
BEQ NextCharA ; found in B, skip char
BRA ScanB
NotFound: ; char not in B → keep it
MOVE.B D1,(A2)+
ADDQ.W #1,D0
BRA NextCharA
Finish:
CLR.B (A2)+ ; zero-terminate C
RTS
END START
Thanks for any feedback!
CMPM
byte-parallel compare
While CMPM.B (A0)+,(A3)+
can compare two memory streams with post-increment, here i need a nested scan of B for each A character, so a simple CMP.B
inside a loop made more sense, right?
Word/long moves & compares
If A, B and C were all word-aligned and the length were known, one could use MOVE.L
/CMP.L
to process 4 bytes at a time, then handle the final 0-3 bytes separately.
Bitmap lookup table (256-bit) For O(1) membership tests, i could build a 256-bit bitmap in data for B once, then filter A against that table instead of rescanning B each time. That's faster but uses extra RAM, right?
Separate sub-routine for “char in string”
I might factor out the inner scan (ScanB
) into its own CharInString(&ch,&B)
function, still passing parameters on the stack. This improves modularity but adds call/return overhead.
If any of these extensions would be useful, let me know and I can integrate them!
This is the version that integrates all the improvements listed in the answer @chtz wrote to me [I hope I understood correctly]:
ORG $8000
; ---------- data ----------
StringA DC.B 'I say Hello',0
StringB DC.B 'I say World',0
StringC DS.B 256 ; caller guarantees size
LenRes DS.W 0 ; |C|
; ---------- main ----------
START:
PEA.L StringC ; &C
PEA.L StringB ; &B
PEA.L StringA ; &A
BSR.S FilterLen_Bitmap ; build C, length → D0
MOVE.W D0,LenRes
ADDA.L #12,A7
SIMHALT
; ---------------------------------------------------------------
; FilterLen_Bitmap(&A,&B,&C) – D0 = length of C
; Stack layout after bitmap push:
; +32 return +36 &A +40 &B +44 &C
; ---------------------------------------------------------------
; Reg-use: A0/A1, D0-D2. No other registers need saving.
; ---------------------------------------------------------------
FilterLen_Bitmap:
; ---- reserve & clear 32-byte bitmap (8×CLR.L -(A7)) ----
MOVEQ #7,D1
ClrBmp: CLR.L -(A7)
DBF D1,ClrBmp ; A7 → bitmap base
; ---- build bitmap from StringB ----
CLR.W D0 ; upper bytes = 0
MOVEA.L 40(A7),A0 ; A0 = &B
MOVE.B (A0)+,D0
BEQ.S BuildDone ; empty B → skip
Bloop: MOVE.W D0,D1
LSR.W #3,D1 ; byte offset 0-31
BSET D0,(A7,D1.W) ; set bit
MOVE.B (A0)+,D0
BNE.S Bloop
BuildDone:
; ---- filter A into C ----
MOVEA.L 36(A7),A0 ; A0 = &A
MOVEA.L 44(A7),A1 ; A1 = &C
CLR.W D0 ; length counter = 0
CLR.W D1
MOVE.B (A0)+,D1
BEQ.S Finish
Aloop: MOVE.W D1,D2
LSR.W #3,D2 ; bitmap byte offset
BTST D1,(A7,D2.W) ; char present in B?
BNE.S Skip
MOVE.B D1,(A1)+ ; keep char
ADDQ.W #1,D0
Skip: MOVE.B (A0)+,D1
BNE.S Aloop
Finish:
CLR.B (A1) ; null-terminate C
ADDA.W #32,A7 ; pop bitmap
RTS
END START
Is the code okay now or is there still something that can be improved or did I miss something? If there are any errors, I will try to fix them.
Further improvements relative to your edited question (Revision 2, everything untested):
For clearing 32 bytes on the stack, instead of:
SUBA.L #32,A7 ; reserve 32 bytes for bitmap
LEA (A7),A4 ; A4 = bitmap base
; ---- clear bitmap (8×CLR.L = 32 B) ----
MOVEQ #7,D3
ClrBmp: CLR.L (A4)+
DBF D3,ClrBmp
LEA (A7),A4 ; restore A4 = base
Do it the other way around directly by decrementing the stackpointer:
MOVEQ #7,D3
ClrBmp: CLR.L -(A7)
DBF D3,ClrBmp
That loop implementation needs 8 bytes, you could unroll it to 8 instructions taking 16 bytes:
CLR.L -(A7)
CLR.L -(A7)
CLR.L -(A7)
CLR.L -(A7)
CLR.L -(A7)
CLR.L -(A7)
CLR.L -(A7)
CLR.L -(A7)
EDIT:
You can actually save some cycles using MOVEM.L
for clearing the memory, e.g.
MOVEQ #0,D0
MOVEQ #0,D1 ; this also makes a later CLR.W D1 redundant
MOVEA.L D0,A0 ; Clear any 4 registers you don't mind getting destroyed
MOVEA.L D0,A1 ;
MOVEM.L D0-D1/A0-A1,-(A7) ; Clear 4*4=16 bytes of memory
MOVEM.L D0-D1/A0-A1,-(A7) ; Clear next 16 bytes of memory
This technique would also allow to set memory to 0xff
, using MOVEQ #-1,D0
, etc.
Since you don't change A7
afterwards, there is no need to copy it into A4
For the B-loop, instead of:
; ---- populate bitmap with bytes of B ----
MOVEA.L 12+32(A7),A1 ; A1 = &B
Bloop: MOVE.B (A1)+,D1
BEQ Bdone
MOVE D1,D2
LSR.W #3,D2 ; byte offset 0-31
MOVE D1,D3
ANDI.B #7,D3 ; bit index 0-7
BTST D3,(A4,D2.W) ; already set? (optional)
BSET D3,(A4,D2.W) ; set bit
BRA Bloop
Bdone:
There is one mistake, that you need to make sure that the upper byte of D2.W
needs to be cleared before you copy D1.B
into it (or it must be cleared in D1.W
).
Than, BTST
is of course redundant here (you don't even use the result). Also the ANDI.B #7,D3
is not necessary, since BSET
only considers the lowest 3 bits of D3
anyways. This also makes copying to D3
not needed.
Another small improvement, is to move the condition of the loop to the end, having just one unconditional jump at the beginning, instead of once per loop:
CLR.W D1 ; unless you have a register which was already cleared
MOVEA.L 12+32(A7),A1 ; A1 = &B
BRA.S Bstart
Bloop: MOVE.W D1,D2
LSR.W #3,D2 ; byte offset 0-31
BSET D1,(A7,D2.W) ; set bit
Bstart: MOVE.B (A1)+,D1
BNE.S Bloop
With one instruction (2 bytes) more, you could also copy the end of the loop to the beginning, saving the unconditional jump as well. Also using D0/D1 instead of D1/D2 can save an instruction later:
CLR.W D0 ; unless you have a register which was already cleared
MOVEA.L 12+32(A7),A1 ; A1 = &B
MOVE.B (A1)+,D0
BEQ.S Bend ;;;; Actually this would be a special case,
;;;; if B is empty you could just copy A
Bloop: MOVE.W D0,D1
LSR.W #3,D1 ; byte offset 0-31
BSET D0,(A7,D1.W) ; set bit
MOVE.B (A1)+,D0
BNE.S Bloop
Bend: ; Note that D0.W and the upper byte of D1.W are 0 now.
Similar optimizations can be done for the A-loop:
; ---- filter A against bitmap, build C ----
MOVEA.L 16+32(A7),A0 ; A0 = &A
MOVEA.L 20+32(A7),A2 ; A2 = &C <--- Here you could use A1 instead
;; CLR.W D0 ; length = 0. Already the case after previous loop
MOVE.B (A0)+,D1 ; upper byte of D1.W was already cleared
BEQ.S Finish
Aloop:
MOVE.W D1,D2
LSR.W #3,D2 ; byte offset
BTST D1,(A7,D2.W)
BNE.S NextA ; present in B → skip
MOVE.B D1,(A2)+ ; keep char
ADDQ.W #1,D0
NextA:
MOVE.B (A0)+,D1
BNE.S Aloop
Finish:
CLR.B (A2) ; null-terminate C
ADDA.W #32,A7 ; drop bitmap
;; ADDA.W extends the argument to 32bit before adding, alternative:
; LEA 32(A7),A7
MOVEM.L (A7)+,D2-D3/A2-A4 ; restore regs
RTS
With my modifications you don't need to store/restore the same registers (if only 1 register needs saving, this also does not require a MOVEM.L
.
MOVE.W D3,-(A7) ;; If only D3.W needs (re)storing
; ... rest of code
MOVE.W (A7)+,D3 ;; If only D3.W needs (re)storing
RTS