Why is dbra so fast for a very large loop count in Motorola 68k?

I'm learning Motorola 68k assembly, and I wrote the following time-wasting loop:

    move.l #0x0fffffff,%d0
    bsr timewaster
    rts

timewaster:
    dbra %d0,timewaster
    rts

This time-wasting loop finishes almost immediately. I stepped through the code in a debugger to make sure that it actually subtracts d0 down to 0 (which it does). However, this other time-wasting loop takes forever to finish:

    move.l #0x0fffffff,%d0
    bsr timewaster
    rts

timewaster:
    sub.l #1,%d0
    bne timewaster
    rts

So why is the code using dbra so much faster?

I ran these in a TI-89 simulator.

Solution

A real processor would see some improvement from fewer instruction fetches, but the reason for such a big difference in timing is that the two methods operate on different operand sizes.

From the Programmer's Reference Manual, on the page for DBcc:

If the termination condition is not true, the low-order 16 bits of the counter data register decrement by one. If the result is -1, execution continues with the next instruction. If the result is not equal to -1, execution continues at the location indicated by the current value of the program counter plus the sign-extended 16-bit displacement.

So the DBcc instruction only manipulates and checks the lower word of the loop count register. With your starting value, the SUB.L/BNE version will therefore take roughly 4000 times longer than the DBcc one. If you use SUB.W instead of SUB.L, I'd expect the run times to be much closer.

The DBcc instruction will execute 0x10000 times, while the BNE instruction will execute 0x0FFFFFFF times.

Note that the high-order word of the loop counter is not affected by DBcc, so your loop should exit with 0x0FFFFFFF in D0. The SUB.L/BNE version should exit with 0 in D0.
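If you want to check those numbers without a 68k to hand, here is a small Python model of the two loops. It is only a sketch under my own assumptions: `run_dbra_loop` and `sub_l_bne_iterations` are names I made up for this illustration, the model only covers DBRA (condition always false), and it counts executions rather than cycles.

```python
def run_dbra_loop(d0):
    """Model `timewaster: dbra %d0,timewaster` on a 32-bit register.

    Returns (number of times DBRA executes, final 32-bit value of D0).
    """
    executions = 0
    while True:
        executions += 1
        low = ((d0 & 0xFFFF) - 1) & 0xFFFF   # only the low word is decremented
        d0 = (d0 & 0xFFFF0000) | low         # the high word is left alone
        if low == 0xFFFF:                    # 16-bit result is -1: fall through
            return executions, d0


def sub_l_bne_iterations(d0):
    """Iterations of `sub.l #1,%d0` / `bne` for a non-zero 32-bit start value."""
    # The full 32 bits count down, so the loop runs once per unit of the start value.
    return d0 & 0xFFFFFFFF


dbra_count, dbra_d0 = run_dbra_loop(0x0FFFFFFF)
subl_count = sub_l_bne_iterations(0x0FFFFFFF)

print(hex(dbra_count))           # 0x10000
print(hex(dbra_d0))              # 0xfffffff
print(subl_count / dbra_count)   # just under 4096
```

The ratio on the last line is where the "roughly 4000 times" figure comes from.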

This isn't particularly related to the question, but reading through the manuals, there seems to be a slight disagreement in some places on the exact behaviour of the DBcc instruction, specifically when the loop counter is 0 and the condition is true. Both descriptions result in the branch not being taken, but they disagree on the final value in the loop count register.

The Programmer's Reference Manual, Revision 1 (M68000PM/AD, REV. 1) indicates that the condition being true takes precedence, and the decremented value of the loop counter is not stored back, leaving 0 in the register. The following is from the manual:

If Condition False
    Then (Dn - 1 -> Dn; If Dn != -1 Then PC + d_n -> PC)

The M68000 Microprocessors User’s Manual, Ninth Edition (MC68000UM), Appendix A (MC68010 Loop Mode Operation), says that the subtraction-by-one result takes precedence, and the result being -1 causes the result to be stored back, leaving -1 in the register. The following is constructed from the description in the manual:

If Dn - 1 == -1
    Then Dn - 1 -> Dn
Else
    If Condition False
        Then (Dn - 1 -> Dn; PC + d_n -> PC)

Normally, an exit due to the count would leave -1, while a condition exit would leave a different value (assuming that the counter didn't start at 0xFFFF). The two sources disagree on the value left in the register when both exit conditions hold at once, that is, when the condition is true and the counter is 0.
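To convince myself that this really is the only case where the two descriptions differ, I transcribed both pieces of pseudocode above into Python and compared them over every 16-bit counter value. Treat this as a sketch: the function names are mine, and it models my reading of the manuals, not verified hardware behaviour.

```python
def dbcc_prm(condition_true, dn):
    """DBcc per the Programmer's Reference Manual pseudocode.

    Takes the 16-bit counter and returns (new counter, branch taken).
    """
    if not condition_true:
        dn = (dn - 1) & 0xFFFF
        return dn, dn != 0xFFFF      # branch unless the result is -1
    return dn, False                 # condition true: counter untouched


def dbcc_um_loop_mode(condition_true, dn):
    """DBcc per my reading of the User's Manual loop mode appendix."""
    decremented = (dn - 1) & 0xFFFF
    if decremented == 0xFFFF:
        return decremented, False    # -1 is stored back, no branch
    if not condition_true:
        return decremented, True     # normal loop iteration
    return dn, False                 # condition exit: counter untouched


differences = [
    (cond, dn)
    for cond in (False, True)
    for dn in range(0x10000)
    if dbcc_prm(cond, dn) != dbcc_um_loop_mode(cond, dn)
]

print(differences)                   # [(True, 0)]
```

For that one input the PRM version leaves 0 in the counter and the UM version leaves 0xFFFF; neither takes the branch.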

I'd assume that the PRM is correct, since it is the authoritative source for the behaviour and matches the description earlier in the UM. The UM's appendix might, however, be hinting at how the instruction is actually implemented, at least on the MC68010.