execution-time stm32 cortex-m machine-instruction no-op

Processor Instruction Cycle Execution Time



Solution

  • ALL instructions require more than one clock cycle to execute: fetch, decode, execute. If you are running on an STM32 you are likely taking several clocks per fetch just due to the slowness of the flash (the program ROM); if running from RAM, who knows whether it keeps up at 168 MHz or is slower. The ARM buses generally take a number of clock cycles to do anything.

    Nobody talks about instruction cycles anymore because they are not deterministic. The answer is always "it depends".

    It may take X hours to build a single car, but if you start building a car, then 30 seconds later start building another, and keep starting another every 30 seconds, then after X hours a new car rolls off the line every 30 seconds. Does that mean it takes 30 seconds to make a car? Of course not. But it does mean that once up and running you can average a new car every 30 seconds on that production line.

    That is exactly how processors work. Each instruction takes a number of clocks to run, but the instructions are pipelined so that many are in the pipe at once. If the core is fed the right instructions, one per clock, it can complete them at an average of one per clock. With branching and slow memory/ROM, you can't even expect to get that.
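    To put numbers on the production-line analogy, here is a minimal sketch of an idealized pipeline in C. The function names are made up for illustration, and the model assumes no stalls, no branches and zero-wait-state memory, which real hardware will not give you.

```c
#include <stdint.h>

/* Idealized pipeline model (an assumption, not a description of any real core):
 * the first instruction needs `stages` clocks to get through the pipe,
 * and every later instruction completes one clock after the previous one. */
uint64_t pipeline_total_clocks(uint32_t stages, uint64_t instructions)
{
    if (stages == 0u || instructions == 0u) {
        return 0u;
    }
    return (uint64_t)stages + (instructions - 1u);
}

/* Average clocks per instruction for the same idealized model. */
double pipeline_average_clocks(uint32_t stages, uint64_t instructions)
{
    if (instructions == 0u) {
        return 0.0;
    }
    return (double)pipeline_total_clocks(stages, instructions)
         / (double)instructions;
}
```

    With an illustrative 3-stage pipe, one instruction alone costs 3 clocks, but 1000 back-to-back instructions cost 1002 clocks, an average of 1.002 clocks each. That is the "new car every 30 seconds" figure, and it is a best case.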

    If you want to do an experiment on your processor, then make a loop with a few hundred nops:

    beg = read timer
    load r0 = 100000
    top:
      nop
      nop
      nop
      nop
      nop
      nop
      ...
      nop
      nop
      nop
      r0 = r0 - 1
      bne top
    end = read timer
    

    If that loop completes in a fraction of a second, either add more nops or run an order of magnitude more loops. What you actually want is a significant number of timer ticks: not necessarily seconds or minutes on a wall clock, but a good-sized tick count, so that the timer's resolution does not dominate the result.
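    One way to automate "run an order of magnitude more loops" is sketched below. Everything here is hypothetical: `measure_fn` stands for whatever routine runs your nop loop and returns elapsed timer ticks, `min_ticks` is a threshold you pick yourself, and `fake_measure` is only a host-side stand-in so the logic can be tried without hardware.

```c
#include <stdint.h>

/* Hypothetical hook: runs the nop loop `loops` times and
 * returns the elapsed timer ticks for that run. */
typedef uint32_t (*measure_fn)(uint32_t loops);

/* Multiply the loop count by 10 until a run spans at least `min_ticks`
 * timer ticks, stopping before the count would overflow 32 bits. */
uint32_t grow_loop_count(measure_fn measure, uint32_t start_loops,
                         uint32_t min_ticks)
{
    uint32_t loops = (start_loops != 0u) ? start_loops : 1u;

    while (measure(loops) < min_ticks && loops <= UINT32_MAX / 10u) {
        loops *= 10u;
    }
    return loops;
}

/* Stand-in for real hardware: pretends each pass of the loop costs 3 ticks. */
uint32_t fake_measure(uint32_t loops)
{
    return loops * 3u;
}
```

    On a real board you would pass your own measurement routine instead of `fake_measure`, and keep the run short enough that the timer does not wrap more than once.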

    Then do the math and compute the average.
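    The math might look something like this sketch. It assumes a free-running 32-bit timer that counts up, and it counts each pass of the loop as the nops plus the subtract and the branch. The function names are invented for illustration. If your timer counts down, swap the two readings; if it is narrower than 32 bits, mask the difference to its width.

```c
#include <stdint.h>

/* Elapsed ticks on a free-running 32-bit up-counter.
 * Unsigned subtraction gives the right answer across a single wraparound. */
uint32_t elapsed_ticks(uint32_t beg, uint32_t end)
{
    return end - beg;
}

/* Average timer ticks per instruction for the test loop:
 * each pass executes `nops_per_loop` nops, one subtract and one branch. */
double ticks_per_instruction(uint32_t beg, uint32_t end,
                             uint32_t loops, uint32_t nops_per_loop)
{
    uint64_t instructions = (uint64_t)loops * ((uint64_t)nops_per_loop + 2u);

    if (instructions == 0u) {
        return 0.0;
    }
    return (double)elapsed_ticks(beg, end) / (double)instructions;
}
```

    The result is in timer ticks, not necessarily processor clocks; if the timer runs from a divided clock you have to scale by that divisor before calling it clocks per instruction.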

    Repeat the experiment with the program sitting in RAM instead of ROM.

    Slow the processor clock down to the fastest speed that does not require a flash divisor (wait states), then repeat running from flash.

    Being a Cortex-M4, turn the I cache on, repeat using flash, repeat using RAM (at 168 MHz).

    If you didn't get a range of different results from all of these experiments using the same test loop, you are probably doing something wrong.