[SOLVED] ARM ITCM interface and FLash access

So let's just try it. NUCLEO-F767ZI

Cortex-M7s in general:

Prefetch Unit
The Prefetch Unit (PFU) provides:
1.2.3
• 64-bit instruction fetch bandwidth.
• 4x64-bit pre-fetch queue to decouple instruction pre-fetch from DPU pipeline operation.
• A Branch Target Address Cache (BTAC) for the single-cycle turn-around of branch predictor state and target address.
• A static branch predictor when no BTAC is specified.
• Forwarding of flags for early resolution of direct branches in the decoder and first execution stages of the processor pipeline.

For this test the branch prediction gets in the way so turn that off:

Set ACTLR to 00003000 (hex, most numbers here are hex)

Don't see how to disable the PFU wouldn't expect to have control like that anyway.

So we expect the prefetch to read 64 bits at a time, 4 instructions on an aligned boundary.

From ST

The DBANK bit is set indicating a single bank

Instruction prefetch

In case of single bank mode (nDBANK option bit is set) 256 bits representing 8 instructions of 32 bits to 16 instructions of 16 bits according to the program launched. So, in the case of sequential code, at least 8 CPU cycles are needed to execute the previous instruction line read.

So ST is going to turn that into a 256 bit or 16 instructions

Using the systick timer. Am running at 16Mhz so flash is at zero wait states.

08000140 <inner>:
 8000140:   46c0        nop         ; (mov r8, r8)
 8000142:   46c0        nop         ; (mov r8, r8)
 8000144:   46c0        nop         ; (mov r8, r8)
 8000146:   46c0        nop         ; (mov r8, r8)
 8000148:   46c0        nop         ; (mov r8, r8)
 800014a:   46c0        nop         ; (mov r8, r8)
 800014c:   3901        subs    r1, #1
 800014e:   d1f7        bne.n   8000140 <inner>

00120002

So 12 clocks per loop. Two prefetches from ARM, the first one becomes a single ST fetch. Should be zero wait state. Note the address this is AXIM

If I reduce the number of nops it stays at 0x1200xx until here:

08000140 <inner>:
 8000140:   46c0        nop         ; (mov r8, r8)
 8000142:   46c0        nop         ; (mov r8, r8)
 8000144:   3901        subs    r1, #1
 8000146:   d1fb        bne.n   8000140 <inner>

00060003

One arm fetch instead of two. Time cut in half, so the prefetch is dominating our performance.

08000140 <inner>:
 8000140:   46c0        nop         ; (mov r8, r8)
 8000142:   46c0        nop         ; (mov r8, r8)
 8000144:   46c0        nop         ; (mov r8, r8)
 8000146:   46c0        nop         ; (mov r8, r8)
 8000148:   3901        subs    r1, #1
 800014a:   d1f9        bne.n   8000140 <inner>

000 (zero wait states)

00120002

001 (1 wait state)

00140002

002 (2 wait states)

00160002

202 (2 wait states enable ART)

0015FFF3

Why would that affect AXIM?

so each wait state adds 2 clocks per loop, there are two fetches per loop so perhaps each fetch causes st to do one of its 256 bit fetches, that seems broken though.

Switch to ITCM

00200140 <inner>:
  200140:   46c0        nop         ; (mov r8, r8)
  200142:   46c0        nop         ; (mov r8, r8)
  200144:   46c0        nop         ; (mov r8, r8)
  200146:   46c0        nop         ; (mov r8, r8)
  200148:   3901        subs    r1, #1
  20014a:   d1f9        bne.n   200140 <inner>

000

00070004

001

00080003

002

00090003

202

00070004

ram

00070003

So ITCM alone, zero wait state, ART off is 7 clocks per loop for a 6 instruction loop with a branch. seems reasonable. For this tiny test turning on ART with 2 wait states puts us back at 7 per loop.

Note that from ram this code runs at 7 per loop as well. Let's try another couple

I didn't look for other branch predictors other than the BTAC

First thing to note you don't want to ever run an MCU faster than you have to, burns power, many you need to add flash wait states, many the CPU and peripherals have different max clock speeds so there is a boundary where it becomes non-linear (takes X clock cycles at a slow clock rate, peripheral clock = CPU clock, there is a place where N times faster is NX clocks to do something, but one or more boundaries where it takes more than NX to do something when the CPU clock is N times faster). This particular part has this non-linear issue. If you are using libraries from ST to set the clock then you are possibly getting worst case flash wait states, where if you set it up and read the documentation you might be able to shave one or two/few.

The Cortex-M7 has optional L1 caches, didn't mess with it this time around but ST had this ART thing before these came out and I believe they defeat/disable the i cache at least, would it make it better or worse to have both? If it has it then that would make the first past slow then the remaining possibly faster even in AXIM space. You are welcome to try it. Seem to remember they did something tricky with a strap on the processor core, it wasn't easy to see how it was defeated, and that may not be this chip/core but was definitely ST. The M4 doesn't have a cache so it would have to be an M7 that I messed with (this one in particular).

So the short answer is the performance isn't that horrible if you leave off the ART and/or run out of AXIM. ST has implemented the flash such that the ITCM interface is faster than AXIM. We can see the effects of the ARMs fetch itself if you enable branch prediction you can see that as well if you turn it on.

It shouldn't be difficult to create a benchmark that defeats these features, just like you can make one that makes the L1 caches (or any other cache) hurt performance. The ART thing like any other cache makes performance less predictable and as you change your code, add a line, remove a line the performance can jump anywhere from no change to a lot as a result.

Depending on the processor and fetch sizes and alignments your code performance can vary by adding or removing code above the performance-sensitive part(s) of the project, but that depends on some factors that we rarely have visibility into.

Hard to tell looks like they are claiming that ART reduces power. I would expect it to increase power having those srams on/clocked. Don't see an obvious how much you save if you turn off the flash and run from ram. The M7 parts are not really meant to be the low powered parts like some STM32L parts where you can get to ones/tens of microamps (micro not milli, been there done that).

The small number of clocks 0x70004 instead of 0x70000 have to do with some of the fetching overhead be it ARM or ST or a combination of the two. To see memory/flash performance you need to disable as much of the features like branch prediction, caches that you can disable, etc. Otherwise, it's hard to measure performance and then make assumptions about what the flash/memory/bus is doing. I suspect there are still things I didn't turn off to make a clean measurement, and/or can't turn off. And simple nop loops (tried other non-nop instructions, didn't change it) won't tell you everything. Using the docs as a guide you can try to cache-thrash the ART or other and see what kind of hits that take.

For performance-critical code you can run from RAM and avoid all of these issues, I didn't search for it but assume that these parts SRAM can run as fast as the CPU. The answer isn't jumping out at me, you can figure it out.

Note my test actually looks like

    ldr r2,[r0]
inner:
    nop
    nop
    nop
    nop
    sub r1,#1
    bne inner
    ldr r3,[r0]
    sub r0,r2,r3
    bx lr

where sampling of systick is just in front of and back of. Before the branch. To measure ART you would want to sample the time before the branch for a memory range that has not been read it is not magically possible to read that faster the first read into the cache should be slower. If I move the time sampling further away I can see it go from 0x7000A to 0x70027 for 0 to 15 wait states with ART on. That is a noticeable performance hit for branches into code that has not been run/cached yet. Knowing the size of the art fetches, should be easy to make a test that hops a lot and the ART feature starts to not matter.

Short answer, the ITCM is a different bus interface on the ARM core, ST has implemented their design such that there is a performance gain. So even without ART enabled using ITCM is faster than AXIM (likely an ARM bus thing not ST flash thing). If you are running fast enough clock rates to have to add wait states to the flash then ART can mostly erase those.