Infinite Loop Issue in Scalar Product Calculation of Two Word Arrays in Easy68k

I’m writing an M68k assembly program in Easy68k to calculate the scalar product (dot product) of two arrays of 16-bit words (A and B). However, my LOOP becomes infinite, and I can’t figure out why.

Here’s the code:

           ORG $8000

START:
      clr.l d0
      lea.l A,a0
      lea.l B,a1 
      moveq #len,d1   

LOOP:
      move.w (a0)+,d5  
      move.w (a1)+,d6  
      mulu.w d5,d6  
      add.l d6,d0  
      subq.l #1,d1    
      bne.s LOOP 

      move.l d0,scalar_prod   
      SIMHALT

    ; DATA
A DC.W 10,20,30,40 
B DC.W 50,60,70,80  
len EQU 4
scalar_prod DC.L 0  
    
    END START

The Issue

I expected the loop to stop once d1 has been decremented to zero, but it doesn’t terminate. Can anyone help me understand what is going wrong here?

Edit: Fixed by using move.b #len,d1

Optimization Suggestions

I’d also appreciate any suggestions on how to optimize this code. Thanks! :-)

Solution

Fixed by using move.b #len,d1

That's not entirely correct. Because you decrement and test the counter with:

subq.l #1,d1    
bne.s LOOP

the zero flag reflects the whole of D1, of which bits 8 through 31 could contain garbage, because move.b only loads the lowest byte. The moveq #len,d1 from your final edit always loads the full longword (the 8-bit immediate is sign-extended to 32 bits), and it does so even without a size specifier.
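
To make the difference concrete, here is a small sketch that can be single-stepped in the simulator. The value $12345600 is only an assumed stand-in for whatever happened to be in D1 beforehand; it is not taken from your program.

```asm
        move.l  #$12345600,d1   ; assumed leftover contents of d1
        move.b  #4,d1           ; only bits 0-7 change: d1 = $12345604
        subq.l  #1,d1           ; long-sized decrement: d1 = $12345603, Z clear
                                ; a subq.l/bne loop would run $12345604 times
        moveq   #4,d1           ; sign-extended to 32 bits: d1 = $00000004
```

With the upper bits set like this, the loop is not strictly infinite, but it runs far past the end of both arrays before D1 reaches zero.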

I’d also appreciate any suggestions on how to optimize this code. Thanks! :-)

Combine operations. This executes faster and often clobbers one register less.

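move.w (a0)+,d5     ; 8 clocks
move.w (a1)+,d6     ; 8 clocks
mulu.w d5,d6        ; 70 clocks (max)

There's no real need to load D5 first and then multiply it into D6. MULU accepts a memory source operand, so you can write it as follows, for 82 clocks instead of 86 and with D5 left free:

move.w (a1)+,d6     ; 8 clocks
mulu.w (a0)+,d6     ; 74 clocks (max)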

Use the smallest size that gets the job done.

subq.l #1,d1        ; 8 clocks

Both the byte and the word form take only half these clocks, so as long as the count fits in a byte (at most 255), better write:

subq.b #1,d1        ; 4 clocks

Putting both suggestions together:

START:
      lea.l   A, a0
      lea.l   B, a1 
      clr.l   d0
      moveq.l #len, d1   
LOOP:
      move.w  (a1)+, d2
      mulu.w  (a0)+, d2
      add.l   d2, d0
      subq.b  #1, d1    
      bne.s   LOOP
      move.l  d0, scalar_prod   
      SIMHALT
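
As a further option that goes beyond the two suggestions above, the counter update and the branch can be merged into a single DBRA. On a 68000, DBRA takes 10 clocks while looping, versus 14 for subq.b plus a taken bne.s. The following is a sketch under a few assumptions: len is at least 1, len is defined before its first use (my own rearrangement to avoid a forward reference), and the data is the same as in the question.

```asm
len     EQU     4                   ; element count (assumed >= 1)

        ORG     $8000
START:
        lea.l   A,a0
        lea.l   B,a1
        clr.l   d0                  ; running sum
        moveq   #len-1,d1           ; DBRA counts down to -1, so start at len-1
LOOP:
        move.w  (a1)+,d2            ; next element of B
        mulu.w  (a0)+,d2            ; unsigned 16x16 -> 32-bit product
        add.l   d2,d0               ; accumulate
        dbra    d1,LOOP             ; decrement d1.w, loop until it reaches -1
        move.l  d0,scalar_prod      ; 7000 ($1B58) for the data below
        SIMHALT

A           DC.W 10,20,30,40
B           DC.W 50,60,70,80
scalar_prod DC.L 0

        END     START
```

DBRA tests only the low word of D1, so the count can go up to 65536 elements, and the moveq still initializes the whole register.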