arm, cortex-m, micro-optimization, execution-time, thumb

How do I reduce execution time and number of cycles for a factorial loop? And/or code-size?


Basically, I'm having a hard time getting the execution time any lower than it is, or reducing the number of clock cycles and the memory size. Does anyone have an idea how I can do this? The code works fine; I just want to change it a bit.

I have working code and don't want to break it, but I also don't know what changes to make.

; Calculation of a factorial value using a simple loop

; set up the exception addresses
        THUMB
        AREA RESET, CODE, READONLY
        EXPORT  __Vectors
        EXPORT Reset_Handler
__Vectors
        DCD 0x00180000     ; top of the stack
        DCD Reset_Handler  ; reset vector - where the program starts

        AREA 2a_Code, CODE, READONLY
Reset_Handler
        ENTRY
start
        MOV r1,#0    ; count the number of multiplications performed
        MOV r2,#3    ; the final value in the factorial calculation
        MOV r3,#1    ; the factorial result will be stored here

; loop r2 times forming the product
fact
        ADD r1,r1,#1  ; find the next multiplicand
        MUL r3,r1,r3  ; form the next product - note that MUL r3,r3,r1 gives unpredictable output
        CMP r1,r2     ; check if the final value has been reached
        BMI fact      ; continue if all products have not been formed

exit    ; stay in an endless loop
        B exit
        END

The current results are: Memory Size: 0x00000024, Clock Cycles: 22, Total Execution Time: 1.1 microseconds.

We are working with the Cortex M3

I just need any of these to be reduced, the changes to the code can be minor as long as it produces different results.


Solution

  • Often code-size and performance are a tradeoff. Unrolling a loop often helps performance (for large inputs at least), but requires extra logic outside the loop to handle the cleanup and so on.


    Most of this answer was assuming a higher-performance CPU like Cortex-A9 or Cortex-A53 where software pipelining to create instruction-level parallelism would be helpful. Cortex M3 is scalar and has a single-cycle multiply instruction, making it much simpler to optimize for.

    (The original question didn't specify a core, and I was expecting that even low-end CPUs would have multi-cycle mul latency. I only found Cortex-M3 numbers after writing it.)

    Your code will probably bottleneck on the latency of integer multiply. Unlike add, where the result will be ready the next cycle, mul is complex and takes multiple cycles to produce a result.

    (Except on some very slowly-clocked chips: apparently Cortex-M3 has a 1-cycle mul instruction. But Cortex-M0/M0+/M23 are available with a choice of 1-cycle or 32-cycle performance for that instruction! Slow iterative = smaller silicon.)


    The multiply execution unit itself is often pipelined so multiple independent multiplies can be in flight at once, but your factorial loop needs each multiply result as an input to the next iteration. (Only for higher-performance cores, not Cortex-M series. The 32-cycle multiply on slow cortex-M chips is iterative and presumably not pipelined, so another multiply couldn't start while it's running, and there'd be no benefit to exposing any instruction-level parallelism beyond reducing loop overhead.)

    Notice that multiplication is commutative and associative: 1 * 2 * 3 = 3 * 2 * 1, so we can count down from n, as @ensc's answer points out. Or (1*2) * (3*4) = 1*2*3*4.

    We could instead do 1 * 2 * ... * (n/2) in parallel with (n/2+1) * (n/2+2) * (n/2+3) * ... * n, interleaving work on those two dependency chains. Or we could interleave 1 * 3 * 5 * ... * n with 2 * 4 * 6 * ... * (n-1), in a loop that did n -= 2 and calculates n+1 from that. (Then at the end, you multiply those 2 products.)

    This is obviously going to require more code-size, but could help performance a lot.
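
    As a rough illustration of the two-chain idea, here is a minimal C sketch of the first variant (lower half and upper half of the factors). The function name fact_two_chains is made up for this example, and the sketch assumes 0 <= n <= 12 so the result fits in 32 bits. The two multiplies in the loop body don't depend on each other, which is what a pipelined multiplier could exploit.

```c
#include <stdint.h>

/* Hypothetical helper: n! as two independent products.
   Assumes 0 <= n <= 12 so the result fits in 32 bits. */
uint32_t fact_two_chains(uint32_t n)
{
    uint32_t half = n / 2;
    uint32_t lo_prod = 1;   /* 1 * 2 * ... * half            */
    uint32_t hi_prod = 1;   /* (half+1) * ... * (2*half)     */

    for (uint32_t i = 1; i <= half; i++) {
        lo_prod *= i;           /* chain 1 */
        hi_prod *= half + i;    /* chain 2, independent of chain 1 */
    }
    if (n & 1u)
        hi_prod *= n;           /* odd n: one leftover factor */

    return lo_prod * hi_prod;   /* combine the chains once at the end */
}
```

    For n = 5 this forms 1*2 and 3*4, multiplies the leftover 5 into the second chain, and combines them into 120.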


    Of course, a lookup table is another workaround. If you only care about inputs that don't overflow a 32-bit result, that's a pretty small table. But that has a significant size cost.
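
    To put a number on the table size: 12! = 479001600 is the largest factorial below 2^32, so an unsigned 32-bit table needs 13 entries (0! to 12!), i.e. 52 bytes of constants. A minimal C sketch, where the names fact_table and fact_lookup and the "return 0 when out of range" convention are assumptions made for this example:

```c
#include <stdint.h>

/* 0! .. 12!; 13! no longer fits in 32 bits. */
static const uint32_t fact_table[13] = {
    1u, 1u, 2u, 6u, 24u, 120u, 720u, 5040u, 40320u,
    362880u, 3628800u, 39916800u, 479001600u
};

/* Hypothetical helper: returns n!, or 0 if n! would overflow 32 bits. */
uint32_t fact_lookup(uint32_t n)
{
    return (n <= 12u) ? fact_table[n] : 0u;
}
```

    In assembly this would be one range check plus one indexed load, but the 52 bytes of data alone are more than the 0x24 bytes of the whole original program.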


    Even on an in-order CPU (where instruction execution has to start in program order), long-running instructions like cache-miss loads, or multiplies, may be allowed to complete out of order, so e.g. some add instructions could run after starting a mul but before the mul result was written back. Or even starting another independent mul instruction in the shadow of an earlier mul's latency.

    I googled some ARM performance numbers to maybe get a feel for what's typical.

    For example, Cortex-A9 is an older fairly common high-end ARMv7 CPU that is superscalar (multiple instructions per cycle) with out-of-order execution.

    mul "takes" 2 cycles, and has 4 cycle result latency. They don't explain what they mean by the non-latency cost. Perhaps that's the reciprocal throughput of the execution unit, like how often you can start a new independent operation. It's an out-of-order CPU so it doesn't make sense for it to stall other instructions for 2 cycles. In the NEON SIMD instruction section, they explain what looks like the same "cycles" number:

    This is the number of issue cycles the particular instruction consumes, and is the absolute minimum number of cycles per instruction if no operand interlocks are present.

    (operand interlocks = waiting for an input operand to be ready, if an earlier instruction hasn't produced a result yet).

    (Cortex-A9 does support packed integer multiplication, so for large factorials you could look at doing 4 multiplies in parallel starting one vector per 4 cycles, using vmul.32 q1, q1, q2. Or 2 per 2 cycles with 64-bit d registers, but then you'd need more vadd instructions and unlike multiply, vadd.32 is just as fast with 128-bit q regs as with 64-bit vectors. So SIMD can give you twice the multiply throughput of scalar on Cortex-A9, if you use enough registers to hide the large latency. But SIMD would probably only be useful with n so large that n! overflows a 32-bit integer, so you get a result modulo 2^32.)


    Lower latency ARM multiply instructions:

    mul is a 32x32 => 32-bit multiply. On Cortex-A9, it has 2c throughput and 4c latency.

    (muls is a 16-bit instruction in thumb mode, and should be preferred unless you need to not clobber the flags. mul in Thumb mode is only available in ARMv6T2 and later.)

    smulbb is a 16x16 => 32-bit signed multiply that only reads the low half of its inputs, but has 1c throughput and 3c latency on A9. (BB = bottom, bottom. The other combinations are also available, along with multiply-accumulate and various funky things.)

    There is no 2-byte Thumb version of smulxy, so this is worse for code-size than muls.

    Unfortunately smulxy isn't available in an unsigned version, so that limits the range of inputs we can use it with to positive int16_t, not uint16_t.

    But if we only care about the case where the final 32-bit result doesn't overflow, we can arrange our order of operations so the last multiply has 2 inputs of similar magnitude (both large-ish 16-bit numbers). i.e. as close to sqrt(n!) as possible. So e.g. the product of odds and evens would be reasonable, but (n-1)! * n would be the worst case because that would require (n-1)! to fit in 16 bits. Actually the worst case would be counting down from n so the last one is a multiply by 3 then 2. We could special case the multiply by 2 to a left shift...
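
    To make the 16-bit input limit concrete, here is a small C model of what smulbb computes (the helper names are invented for this sketch; it only models the arithmetic, not timing). Each operand is reduced to its low 16 bits and sign-extended before a full 32-bit multiply.

```c
#include <stdint.h>

/* Sign-extend the low 16 bits of x, as SMULBB does with each operand. */
static int32_t low_half_signed(int32_t x)
{
    int32_t lo = x & 0xFFFF;
    return (lo & 0x8000) ? lo - 0x10000 : lo;
}

/* Model of SMULBB Rd, Rn, Rm: signed 16x16 => 32-bit multiply. */
int32_t smulbb_model(int32_t rn, int32_t rm)
{
    return low_half_signed(rn) * low_half_signed(rm);
}
```

    Taking 9! = 362880 as an example: the split into odds and evens ends with 945 * 384, and both inputs fit in a positive int16_t, so smulbb_model(945, 384) gives the right answer. Forming it as 8! * 9 does not work, because 8! = 40320 is above 32767 and gets reinterpreted as a negative 16-bit value.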


    Putting these pieces together, notice that multiplying by 1 is a no-op (except with smulbb where it truncates the input to 16 bit). So we can unroll in a way that stops after a multiply by 1 or 2 depending on the input being odd or even.

    So instead of knowing which is odd and which is even, we just have lo (starting with n-1) and hi (starting with n).

    ;; UNTESTED, but it does assemble with the GNU assembler, after sed -i 's/;/@/' arm-fact.S
    ;; and replacing THUMB with
    ; .thumb
    ; .syntax unified
    THUMB
    
    ;; Input: n in r0.   (n is signed positive, otherwise we return n.)
    ;; Output: n! in r0.
    ;; clobbers: r1, r2, r3
    ;; pre-conditions: n! < 2^31.  Or maybe slightly lower.
    fact:
        subs   r3, r0, #3   ; r3 = lo = n-3  (first multiplier for loprod)
        bls   .Ltiny_input
        subs   r2, r0, #2   ; r2 = hi = n-2  (first multiplier for hiprod)
        subs   r1, r0, #1   ; r1 = loprod = n-1
                            ; r0 = hiprod = n
    
    .Lloop:                 ; do {
        smulbb  r0,r0, r2      ; hiprod *= hi
        subs    r2, #2         ; hi -= 2 for next iter
        smulbb  r1,r1, r3      ; loprod *= lo
        subs    r3, #2         ; lo -= 2 for next iter
        bgt     .Lloop       ; while((lo-=2) > 0);  signed condition
        ; r3 = 0 or -1, r2 = 1 or 0.  The last multiplies were:
        ;       hiprod *= 2 and loprod *= 1  for even n
        ;   or  hiprod *= 3 and loprod *= 2  for odd n
    
        ; muls  r0, r1
        smulbb  r0,r0, r1      ; return  hiprod *= loprod
    
        bx lr    ; or inline this
    
    .Ltiny_input:   ; alternate return path for tiny inputs
        ; r0 = n.   flags still set from  n - 3
        IT eq                  ; GAS insists on explicit IT for thumb mode
        moveq   r0, #6         ; 3! = 6, else n! = n for smaller n=1 or 2.
                               ; 0! = 1 case is not handled, nor are negative inputs
        bx lr
    

    (.L in a label name makes it a local label that doesn't show up in the object file, at least in GAS syntax. Maybe not in ARMASM, if you're using that assembler.)

    ARM assembly lets you leave out the destination when it's the same as the first source, for some instructions like subs but not smulbb. You could write it out like subs r2, r2, #2 every time if you want.
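
    For checking the loop logic by hand, the listing above can be mirrored in C roughly like this. It is only a sketch: fact_model is a made-up name, the variables follow the register comments in the assembly, and plain * is used instead of smulbb, so it does not reproduce the 16-bit input truncation. It assumes n >= 0.

```c
#include <stdint.h>

/* Hypothetical C mirror of the two-accumulator assembly (assumes n >= 0). */
int32_t fact_model(int32_t n)          /* r0 = n on entry */
{
    int32_t lo = n - 3;                /* r3: first multiplier for loprod */
    if ((uint32_t)n <= 3u)             /* bls .Ltiny_input (unsigned n <= 3) */
        return (n == 3) ? 6 : n;       /* moveq r0, #6, else return n */

    int32_t hi = n - 2;                /* r2: first multiplier for hiprod */
    int32_t loprod = n - 1;            /* r1 */
    int32_t hiprod = n;                /* r0 */

    do {
        hiprod *= hi;  hi -= 2;        /* smulbb r0 / subs r2 */
        loprod *= lo;  lo -= 2;        /* smulbb r1 / subs r3 */
    } while (lo > 0);                  /* bgt .Lloop */

    return hiprod * loprod;            /* final combine */
}
```

    Tracing n = 5: the loop runs once (hiprod = 5*3 = 15, loprod = 4*2 = 8) and the final multiply gives 120. Like the assembly, this returns 0 for n = 0 rather than 1.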

    You might use muls r0, r1 for the final product, because the final hiprod is a bit higher than loprod. The product might not overflow even if hiprod > max int16_t. That would save 2 bytes of code-size, too, but add 1 cycle of latency on Cortex-A9. (BTW, ARMv6 fixed the "unpredictable result" with mul d,d, src weirdness, and your code used 32-bit Thumb2 instructions, thus it only works on ARMv6T2 and above anyway.)


    With 2 accumulators for the products, this can possibly run at 2 multiplies per 3 cycles on Cortex-A9, depending greatly on the CPU micro-architecture and whether its front-end can keep up. On an in-order ARM, I'd be worried about it being able to start other instructions before a multiply finished.

    It might be better to spend 2 extra bytes on sub instead of subs so we can compute the flags a couple instructions ahead of the branch, maybe reducing branch mispredict penalty and avoiding stalls on in-order CPUs. smulbb doesn't touch flags, so we can do loprod first and have the hi stuff not touch flags.

    .loop:                  ; do {
        smulbb  r1, r1, r3   ; loprod *= lo
        subs    r3, #2       ; lo -= 2 for next iter, and set flags
        smulbb  r0, r0, r2   ; hiprod *= hi
        sub     r2, #2       ; hi -= 2 for next iter (no flags)
        bgt     .loop       ; while((lo-=2) > 0);
    

    Note that we're modifying r3 and r2 right after smulbb reads them, avoiding creating a stall for the data dependency on in-order chips.


    You're using Thumb mode and optimizing for code-size, so it's important to know which forms of which instructions can use a 2-byte / 16-bit encoding and which are only available as 32-bit Thumb2 encodings.

    subs Rd, Rn, #imm can be encoded as a 16-bit Thumb instruction for imm=0..7 (3-bit immediate). Or with the same register as src and destination, for imm=0..255. So my copy-and-sub instructions are compact.

    Non-flag-setting sub can't be a 16-bit instruction except inside an IT block, or with SP as the operand.

    Predicated instructions in Thumb mode, like moveq r0, #6, require the assembler to use an IT instruction to introduce predication for the next up-to-4 instructions. In ARM mode, the top 4 bits of every instruction signals predication. (If you don't use a suffix, the assembler encodes it as ALways, i.e. not predicated.)

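    We could handle the n==0 case with another 4 or 6 bytes, with cmp r0,#0 / moveq r0, #1. Maybe getting it down to 4 bytes if we put the cmp and the fix-up inside the same IT block. IT doesn't snapshot the actual flag condition, it snapshots which predicate, so flag-setting instructions inside an IT block can have an effect on later instructions in the same block. (I think this is right, but I'm not 100% sure.) The catch is that for n==3 the cmpne doesn't run, so Z is still set when we reach the last instruction, and a plain moveq r0, #1 there would overwrite the 6. Loading 5 and then adding 1 on eq gives 6 for n==3 and 1 for n==0:

    tiny_input:    ; r0 = n,  flags set according to n-3.  UNTESTED
        ITET EQ
        moveq  r0, #5          ; n == 3: r0 = 5, becomes 6 below
        cmpne  r0, #0          ; n < 3: set Z if n == 0
        addeq  r0, r0, #1      ; n == 3: 5+1 = 6.  n == 0: 0+1 = 1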
    

    Or there's 16-bit cbnz to conditionally jump over a mov r0, #1. But the branch target must be from 4 to 130 bytes after the cbnz, so we can't jump over just a single 16-bit instruction, apparently!


    Code-size for my version:

    $ arm-none-eabi-gcc -g -c -mcpu=cortex-a9 arm-fact.S
    $ arm-none-eabi-objdump -drwC arm-fact.o 
    
    arm-fact.o:     file format elf32-littlearm
    
    
    Disassembly of section .text:
    
    00000000 <fact>:
       0:   1ec3            subs    r3, r0, #3
       2:   d90b            bls.n   1c <.tiny_input>
       4:   1e82            subs    r2, r0, #2
       6:   1e41            subs    r1, r0, #1
    
    00000008 <.loop>:
       8:   fb10 f002       smulbb  r0, r0, r2
       c:   3a02            subs    r2, #2
       e:   fb11 f103       smulbb  r1, r1, r3
      12:   3b02            subs    r3, #2
      14:   dcf8            bgt.n   8 <.loop>
      16:   fb10 f001       smulbb  r0, r0, r1
      1a:   4770            bx      lr
    
    0000001c <.tiny_input>:
      1c:   bf08            it      eq
      1e:   2006            moveq   r0, #6
      20:   4770            bx      lr
    

    So it's 0x22 bytes for this function. (Or 0x26 if we want to handle 0! = 1.)

    It's larger than your version (your byte count includes some constants in memory, and the mov instructions to produce input), but in theory maybe better than twice as fast for large input on CPUs with pipelined multipliers. And maybe much faster for inputs from 1 to 3, where it just branches once and produces the result.


    You probably don't have anything like a Cortex-A9, because your 1.1 microseconds = 22 clock cycles means a 20MHz clock speed, while Cortex-A9 was available in 0.8 to 2GHz.

    So maybe you have a much simpler in-order core like Cortex M3? M3 does support the mul instruction, and Thumb2 mode. And wikipedia says its multiply is 1 cycle! So that's weird, I'm surprised it has that efficient a multiplier. Or just that it clocks so slowly that there's time for a lot of gate delays in 1 stage, and it's only a 3-stage pipeline.


    Cortex-M3 version:

    subs and muls are single-cycle on Cortex-M3. I haven't found perf numbers on branches, but they're common so I'm assuming it's probably 1 cycle and doesn't cause a big fetch bubble (if correctly predicted...). The Cortex-M3 HTML manual has a section on Branch target forwarding which appears to be about reducing the fetch bubble.

    Its instruction timing table shows b<cond> costs 1 cycle for not-taken, or 2 cycles for taken. (1 for the branch, 1 for the pipeline reload after an immediate displacement.) So taken branches are slow compared to sub/mul and unrolling would be valuable, so my code above should still work well. (But multiple product accumulators are not necessary, so it can be simplified.)

    Optimizing for code-size:

    ;; UNTESTED
    THUMB
    
    ;; Input: n in r0.   (n is signed positive, otherwise we return n.)
    ;; Output: n! in r0.
    ;; clobbers: r1
    fact:
        subs   r1, r0, #1     ; i = n-1
        bls   .Ltiny_input    ; jump if n<=1
    
    .Lloop:                 ; do {
        muls    r0, r1         ; prod *= i
        subs    r1, #1         ; --i
        bgt     .Lloop      ; while(--i > 0);  signed condition
        ; r1 = 0, r0 = n! 
        ; last multiply was a redundant prod *= 1 but avoiding that would take a cmp
    .Ltiny_input:   ; alternate return path for tiny inputs
        ; 0! = 1 case is not handled, nor are negative inputs
    
    
        bx lr    ; or inline this
    

    I think that's the smallest we can manage. The loop has 3 instructions, and probably costs 4 cycles per iteration (1 + 1 + 2, the taken branch costing 2 cycles).

    00000000 <fact>:
       0:   1e41            subs    r1, r0, #1
       2:   d902            bls.n   a <fact+0xa>
       4:   4348            muls    r0, r1
       6:   3901            subs    r1, #1
       8:   dcfc            bgt.n   4 <fact+0x4>
       a:   4770            bx      lr           # don't count this if inlining
    

    So this is 0xa = 10 bytes, not counting the bx lr return instruction.
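
    To sanity-check both the result and the per-iteration cycle estimate, here is a small C model of this loop that tallies cycles as it goes. It is a sketch under the timing assumptions quoted above (1 cycle for subs and muls, 2 for a taken branch, 1 for a not-taken branch); the struct and function names are invented, and bx lr is not counted.

```c
#include <stdint.h>

struct fact_run {
    uint32_t result;   /* value left in r0 */
    uint32_t cycles;   /* estimated cycles, excluding bx lr */
};

/* Hypothetical model of the code-size-optimized Cortex-M3 loop. */
struct fact_run fact_m3_model(uint32_t n)
{
    struct fact_run r = { n, 0u };
    uint32_t i = n - 1u;               /* subs r1, r0, #1 */
    r.cycles += 1u;
    if (n <= 1u) {                     /* bls taken: return n unchanged */
        r.cycles += 2u;
        return r;
    }
    r.cycles += 1u;                    /* bls not taken */
    for (;;) {
        r.result *= i;                 /* muls r0, r1 */
        i -= 1u;                       /* subs r1, #1 */
        r.cycles += 2u;
        if ((int32_t)i > 0) {          /* bgt taken */
            r.cycles += 2u;
        } else {                       /* bgt falls through */
            r.cycles += 1u;
            break;
        }
    }
    return r;
}
```

    For n = 3 the model gives 6 after an estimated 9 cycles: 2 for the entry, 4 for the first iteration, and 3 for the last one where the branch falls through.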

    We could handle the 0! = 1 case with an IT block after the first subs, before the branch, so we can still jump to right after the loop (instead of to a separate block like my Cortex-A9 version). You could use the same trick in the Cortex-A9 version, too.

        subs   r1, r0, #1     ; i = n-1
        it lt
        movlt  r0, #1         ; n = 1 for  n<1
        bls   .Ltiny_input    ; return n if n was <=1
    

    If we needed more range for the branch, we could use itt ls / movls r0, #1, so the branch was inside the IT block (where branch instructions can use an encoding that spends more bits on displacement and none on the predicate). But it's a short range in this case, so I chose to leave r0 unmodified in the r0 == 1 case. I don't know if there are any CPUs where it's more efficient or lower latency for a predicated instruction to be a NOP instead of running, but there might be.


    Without unrolling, putting a cmp in the loop to avoid the last *=1 iteration would cost us an extra cycle per iteration (5 cycles instead of 4, with the taken branch costing 2), so only pay for itself with n=2 or maybe n=3.

    Unrolling could help speed significantly for larger inputs, going from 1 mul per 4 cycles to asymptotically approaching 1 mul per 2 cycles (sub + mul + amortized loop overhead). I can't see any way to avoid an instruction like sub or mov to generate a separate input for each mul, except by hard-coding special case sequences for each n (like *2 * 4 = *8 = left shift by 3) when you could instead just hard-code the answer.
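
    As an illustration of what unrolling by 2 could look like with a single accumulator, here is a C sketch (the name fact_unrolled2 is made up, and it assumes 0 <= n <= 12). Each trip through the loop does two multiplies but pays for only one taken branch.

```c
#include <stdint.h>

/* Hypothetical unroll-by-2 factorial; assumes 0 <= n <= 12. */
uint32_t fact_unrolled2(uint32_t n)
{
    uint32_t prod = 1u;
    while (n >= 2u) {
        prod *= n;          /* first multiply of the pair */
        prod *= n - 1u;     /* second multiply needs its own sub */
        n -= 2u;            /* step down by two per iteration */
    }
    return prod;            /* n is now 0 or 1: nothing left to multiply */
}
```

    With the Cortex-M3 timings quoted earlier, one iteration would be roughly mul + sub + mul + sub + taken branch = 6 cycles for 2 multiplies, i.e. about 3 cycles per mul. That is a back-of-the-envelope estimate, not a measurement, and a real assembly version would also need a register for n - 1.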