STM32: Code execution seems to depend on its location in flash memory

I'm noticing a behavior that I cannot explain: the execution time of a function seems to depend on its location in the flash ROM. I am using an STM32F746NGH microcontroller (ARM Cortex-M7 based) with STM32CubeIDE (GCC compiler for ARM).

Here are my tests:

I initialized the SysTick counter to trigger an interrupt with a fixed period T = 1 ms. In the interrupt handler, I switch (like an RTOS) between two threads: let's name them Thread1 and Thread2.

Each thread simply increments a variable.

Here is the code of the two threads:


uint32_t ctr1, ctr2;

void thread1(void)
{
    while(1)
    {
        ctr1++;
    }
}


void thread2(void)
{
    while(1)
    {
        ctr2++;
    }
}

When monitoring these variables, I noticed that ctr2 is incremented a lot faster than ctr1.

With this code: thread1's address is 0x08000418 and thread2's address is 0x0800042C.

Then, I tried to put another function in memory before thread1: let's name it thread0.

So my new code is:


uint32_t ctr0, ctr1, ctr2;


void thread0(void)
{
    while(1)
    {
        ctr0++;
    }
}

void thread1(void)
{
    while(1)
    {
        ctr1++;
    }
}


void thread2(void)
{
    while(1)
    {
        ctr2++;
    }
}

With this new code, thread0's address is 0x08000418 (thread1's location with the previous code), thread1's address is 0x0800042C (thread2's location with the previous code), and thread2's address is 0x08000440.

I can see that ctr1 and ctr2 are incremented at the same rate, and ctr0 is incremented a lot slower than those two.

Finally, I've tried with 20 different threads. Each thread is incrementing a single variable (similar to the code shared above). I observe that the variables are incremented at two different rates: speed1 and speed2; speed1 being lower than speed2.

Thread	Address	Speed
Thread0	0x08000418	speed1
Thread1	0x0800042C	speed2
Thread2	0x08000440	speed2
Thread3	0x08000454	speed1
Thread4	0x08000468	speed2
Thread5	0x0800047C	speed2
Thread6	0x08000490	speed2
Thread7	0x080004A4	speed2
Thread8	0x080004B8	speed1
Thread9	0x080004CC	speed2
Thread10	0x080004E0	speed2
Thread11	0x080004F4	speed1
Thread12	0x08000508	speed2
Thread13	0x0800051C	speed2
Thread14	0x08000530	speed2
Thread15	0x08000544	speed2
Thread16	0x08000558	speed1
Thread17	0x0800056C	speed2
Thread18	0x08000580	speed2
Thread19	0x08000594	speed1

I've also checked in the assembly that all threads have a similar code (same code size, same instructions and same number of instructions); so it is not related to the code itself. Each thread has 10 instructions, so code size is 20 bytes (each instruction is 2 bytes wide). It corresponds to the increment (20 = 0x14) between each thread's memory address.

Here is the code of a thread (as said, other threads have a similar code):

task0:
08000418:   push    {r7}
0800041a:   add     r7, sp, #0
 21             task0_ctr += 1;
0800041c:   ldr     r3, [pc, #8]    ; (0x8000428 <task0+16>)
0800041e:   ldr     r3, [r3, #0]
08000420:   adds    r3, #1
08000422:   ldr     r2, [pc, #4]    ; (0x8000428 <task0+16>)
08000424:   str     r3, [r2, #0]
08000426:   b.n     0x800041c <task0+4>
08000428:   movs    r4, r3
0800042a:   movs    r0, #0

As you can see in the table, it seems that there is a pattern: One thread with speed1, two threads with speed2, one thread with speed1, 4 threads with speed2, and then restart the pattern.

I don't know if it is relevant/related, but in the Cortex M7 reference manual, I've found this section about the flash memory:

Instruction prefetch

Each flash read operation provides 256 bits representing 8 instructions of 32 bits to 16 instructions of 16 bits according to the program launched. So, in case of sequential code, at least 8 CPU cycles are needed to execute the previous instruction line read. The prefetch on ITCM bus allows to read the sequential next line of instructions in the flash while the current instruction line is requested by the CPU. The prefetch can be enabled by setting the PRFTEN bit of the FLASH_ACR register. This feature is useful if at least one Wait State is needed to access the flash. When the code is not sequential (branch), the instruction may not be present neither in the current instruction line used nor in the prefetched instruction line. In this case (miss), the penalty in term of number of cycles is at least equal to the number of Wait States.

But I've checked against the table: functions fully contained in one 256-bit block can have either speed1 or speed2, and the same goes for functions split across two 256-bit blocks.

I don't understand what may be the cause of this behavior.

EDIT 1: as requested, here is the thread scheduler code:

__attribute__((naked)) void SysTick_Handler(void)
{
    __asm("CPSID I");           // disable global interrupts, equivalent to __disable_irq();


    /* save current thread's context: save R4, R5, ..., R11 (xPSR, PC, LR, R12, R3, R2, R1, R0 are automatically pushed on the stack by the processor). */
    __asm("PUSH {R4-R11}");


    /* OS_Tick += 1 */
    __asm("LDR R0, =OS_Tick");      // R0 = &OS_Tick
    __asm("LDR R1, [R0]");          // R1 = OS_Tick
    __asm("ADD R1, #1");            // R1 += 1
    __asm("STR R1, [R0]");          // OS_Tick = R1 (the incremented value)

    /* Systick_Tick += 1 */
    __asm("LDR R0, =Systick_Tick");     // R0 = &Systick_Tick
    __asm("LDR R1, [R0]");              // R1 = Systick_Tick
    __asm("ADD R1, #1");                // R1 += 1
    __asm("STR R1, [R0]");              // Systick_Tick = R1 (the incremented value)



    /* Scheduler: switch thread */
    __asm("LDR R0, =os_kernel_threads_list");       // R0 = &os_kernel_threads_list
    __asm("LDR R1, [R0]");                          // R1 = current_thread
    __asm("STR SP, [R1,#4]");                       // stack_ptr = SP
    __asm("LDR R2, [R1]");                          // R2 = next_tcb
    __asm("STR R2, [R0]");                          // current_thread = next_tcb (new thread)
    __asm("LDR SP, [R2,#4]");                       // SP = stack_ptr (new thread)
    __asm("POP {R4-R11}");                          // restore context (new thread)


    __asm("CPSIE I");           // enable global interrupts, equivalent to __enable_irq();


    /* return from interrupt */
    __asm("BX LR");
}

OS_Tick and Systick_Tick are two uint32_t variables. os_kernel_threads_list is a tcb_list variable, see below:

/*
 * Thread Control Block (TCB) structure
 */
typedef struct tcb_
{
    struct tcb_ *next_tcb;                      // linked-list, pointer to the next thread
    int32_t *stack_ptr;                         // pointer to the top of the thread's stack (next item to pop / last value stacked)
    int32_t stack[THREAD_STACK_SIZE];           // thread's stack
} tcb_struct;


/*
 * Circular linked-list of threads.
 */
typedef struct
{
    tcb_struct *current_thread;                 // pointer to the current running thread
    tcb_struct threads[N_MAX_THREADS];          // array of threads
    int n_threads;                              // number of threads created
} tcb_list;

Threads are stored in an array, and connected in a circular-linked-list fashion.
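A minimal sketch of that linking, for readers who want it spelled out. The helper names `tcb_list_link` and `tcb_list_advance` and the values chosen for `THREAD_STACK_SIZE` and `N_MAX_THREADS` are assumptions for illustration only; the two structures are the ones shown above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define THREAD_STACK_SIZE 64   /* assumed value, not taken from the real kernel */
#define N_MAX_THREADS     20   /* assumed value, not taken from the real kernel */

typedef struct tcb_
{
    struct tcb_ *next_tcb;                 /* linked list, pointer to the next thread */
    int32_t *stack_ptr;                    /* saved stack pointer of the thread */
    int32_t stack[THREAD_STACK_SIZE];      /* thread's stack */
} tcb_struct;

typedef struct
{
    tcb_struct *current_thread;            /* currently running thread */
    tcb_struct threads[N_MAX_THREADS];     /* array of threads */
    int n_threads;                         /* number of threads created */
} tcb_list;

/* Hypothetical helper: link the first n array entries into a ring, start at entry 0. */
void tcb_list_link(tcb_list *list, int n)
{
    assert(n > 0 && n <= N_MAX_THREADS);
    for (int i = 0; i < n; i++)
    {
        list->threads[i].next_tcb = &list->threads[(i + 1) % n];
    }
    list->n_threads = n;
    list->current_thread = &list->threads[0];
}

/* Hypothetical helper: the pointer update the handler performs, without the stack switch. */
void tcb_list_advance(tcb_list *list)
{
    list->current_thread = list->current_thread->next_tcb;
}
```

Calling `tcb_list_advance` n times on a ring of n threads brings `current_thread` back to where it started, which is the round-robin order the handler relies on.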

EDIT 2: additional info: here are my clock settings:

PLL Source: crystal oscillator @25MHz

SYSCLK = PLL_CLK = 216MHz

Flash Wait States = 7WS as recommended in the STM32 datasheet.

Solution

You basically answered this question yourself. You will see this kind of thing on high-performance cores like ARM, MIPS and such. You will not be able to see it (not that it is not happening) on high-overhead cores like x86.

Now to be fair you may be seeing other effects and running in an RTOS with other overhead. But I can easily demonstrate this without any of those things, bare metal, no interrupts, etc. Just the core doing the one thing at a time.

We can all visualize what happens with caching: the first time through there may be a cache miss, which can cause a long delay (how long depends on the system; it is not necessarily huge on an MCU), and the second time through, assuming the code is now in the cache, it is faster. Likewise, if the loop is aligned near the end of a cache line, the loop itself or the next prefetch spills into a second cache line, making that first pass take even longer. The same thing happens with what I call fetch lines, but worse, because by themselves they do not get faster the second time. Some branch predictors will help, though.

Branch prediction is not necessarily super smart looking logic that tries to decode an instruction and look ahead at instructions/results that might cause that branch to happen and then as a result start a fetch. Instead it is more likely that there is a tiny cache of addresses, when it executes an address the first time and that address caused or could cause a fetch, they add that to this short list and as you approach that address (even if you self modify the code) it will toss out a prefetch. Now that can hurt if you have to happening. But the reality is that branch prediction simply starts a prefetch a few clocks earlier than it would have anyway (which is good, but in no way magic nor complicated logic).

I'm on the NUCLEO-F767ZI because it is what I have handy. It should be the same Cortex-M7 that is in your chip. We cannot guarantee the whole chip is the same (remember that ST makes the chip, not ARM), but having used more STM32 chips than I can count over many years, across the whole range, I expect the Cortex-M7 infrastructure to be more similar than different. With the Cortex-M7, ST has more flexibility: they still support their classic 0x08000000 address, but the ITCM address is 0x00200000, and that is what you should link for on these parts. Obviously you can and should try this on your chip. You should see similar results.

I am not using any code from ST or anyone else; it is all my code, written from the ARM and ST documents. I switch over to the on-board crystal to get a more reliable UART reference clock and set up the UART to print results. Newer parts are better, but for a long time, on many vendors' parts, zero wait states meant the flash was running at half the system rate. When you raise the system clock you used to, and may still, need to add wait states. Maxing out clocks does not necessarily make your system run faster: you are still bound by the flash speed limit (we have not been processor bound in a long time). The core can run instructions fast in bursts but then has to wait more clocks for instructions, or for peripherals if the peripherals have a limit of their own. I did not look up your chip's documentation; this one definitely has a table relating power, system clock and wait states. You are welcome to clock up, set the wait states properly and repeat these experiments. It should be the same as, or on par with, just running at say 8MHz and adding those wait states. Basically you are welcome to max out clocks, but you will still see the issue I am demonstrating. You may have additional issues, but it is trivial to demonstrate this one.

I think of this as a party trick that may not work at an actual party, but you can baffle or amuse your coworkers. Note that you need a fetch line for this: on the more primitive Cortex-Ms with a halfword or word sized fetch, you might not be able to see it. The M7, though, has one.

Prefetch Unit

The Prefetch Unit (PFU) provides:

64-bit instruction fetch bandwidth.

4x64-bit pre-fetch queue to decouple instruction pre-fetch from DPU pipeline operation.

A Branch Target Address Cache (BTAC) for single-cycle turn-around of branch predictor state and target address.

A static branch predictor when no BTAC is specified.

Forwarding of flags for early resolution of direct branches in the decoder and first execution stages of the processor pipeline.

The first code under test is

.balign 0x100
/* r0 count */
/* r1 timer address */
.thumb_func
.globl TEST
TEST:
    push {r4,r5}
    ldr r4,[r1]

loop:
    sub r0,#1
    bne loop

    ldr r5,[r1]
    sub r0,r4,r5
    pop {r4,r5}
    bx lr

    nop
    nop
    nop
    nop

Just a simple count loop (not unified syntax). Nicely aligned.

I am using the systick timer, I could go through a demonstration that the DWT timer gives the same results. True that some chip vendors put a divisor on the systick. Okay, okay, the st docs show a divide by 8 but I think that is a long standing typo...

Start with systick, check dwt, then back to systick if dwt is not 8 times better.

Starting at 0x08000000

08000100 <TEST>:
 8000100:   b430        push    {r4, r5}
 8000102:   680c        ldr r4, [r1, #0]


ra=TEST(0x1000,STK_CVR);  hexstring(ra&0x00FFFFFF);
ra=TEST(0x1000,STK_CVR);  hexstring(ra&0x00FFFFFF);
ra=TEST(0x1000,STK_CVR);  hexstring(ra&0x00FFFFFF);
ra=TEST(0x1000,STK_CVR);  hexstring(ra&0x00FFFFFF);


00001029 
00001006 
00001006 
00001006
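The `& 0x00FFFFFF` in those calls is there because SysTick is a 24-bit down-counter: TEST returns start minus end, and the mask throws away the borrow if the counter wrapped in between. A small sketch of the same arithmetic in C, assuming the reload value is 0x00FFFFFF and the counter wraps at most once during the measurement (the function name is made up for this example):

```c
#include <assert.h>
#include <stdint.h>

#define SYSTICK_MASK 0x00FFFFFFu   /* the SysTick current-value register is 24 bits wide */

/* Ticks elapsed between two reads of a 24-bit down-counter.
   Assumes a reload value of 0x00FFFFFF and at most one wrap between the reads. */
uint32_t systick_elapsed(uint32_t start, uint32_t end)
{
    assert((start & ~SYSTICK_MASK) == 0 && (end & ~SYSTICK_MASK) == 0);
    return (start - end) & SYSTICK_MASK;
}
```

The subtraction is done in 32 bits, so a wrap shows up as garbage in the top byte only, and the mask removes it.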

That looks like some caching. On other STM32 chips there is a flash cache you cannot turn off. On this part and other STM32 Cortex-M7s you can. In fact the doc says (FLASH_ACR register) that both the ART and prefetch are disabled. Interesting, those numbers look fishy, and if the caching is off, how is the first loop different? Is it the systick?

with dwt

The ARM docs talk about branch prediction and such, and IMO it is not well written (if you try to search your way through). It looks like the BTAC (branch target address cache, which caches some addresses and their destinations for prefetching) is enabled by default, and we can turn it off in the ACTLR register.

Much better.

00200100 <TEST>:
  200100:   b430        push    {r4, r5}
  200102:   680c        ldr r4, [r1, #0]

Well this chip does not see a performance difference between the two addresses and that is strange. Yet another experiment to figure out.

So 64 bit fetching is four 16 bit things or 2 32 bit things. One assumes the bus is either 32 or 64 bits wide. So for the above we assume that there is one fetch at 0x100 and it starts to run those through the pipe and then goes to fetch the next line after that to have it queued up.

.balign 0x100
nop
/* r0 count */
/* r1 timer address */

Put a nop or something here to change the alignment of our simple test.

00200102 <TEST>:
  200102:   b430        push    {r4, r5}
  200104:   680c        ldr r4, [r1, #0]


00005002 
00005002 
00005002 
00005002

and there you go. Exact same machine code, same chip, same system, same everything except that exact same machine code is on a different alignment.

Note with BTAC enabled you still get a different execution time.

What if we add another nop and put the two instructions in the middle of a fetch line.

00200104 <TEST>:
  200104:   b430        push    {r4, r5}
  200106:   680c        ldr r4, [r1, #0]

00004003 
00004003 
00004003 
00004003

Also:

00200106 <TEST>:
  200106:   b430        push    {r4, r5}
  200108:   680c        ldr r4, [r1, #0]

00004003 
00004003 
00004003 
00004003

Interesting. I am going to switch to sram, on a number of mcus the flash is slower than sram even at "zero wait states". Using sram lets me do some self-modifying-code, more than one test per run.

First number is the number of nops in front of the code under test, controlling the alignment.

00000000 00004003 
00000000 00004003 
00000000 00004003 
00000000 00004003 
00000001 00005002 
00000001 00005002 
00000001 00005002 
00000001 00005002 
00000002 00004003 
00000002 00004003 
00000002 00004003 
00000002 00004003 
00000003 00004003 
00000003 00004003 
00000003 00004003 
00000003 00004003 
00000004 00004003 
00000004 00004003 
00000004 00004003 
00000004 00004003 
00000005 00005002 
00000005 00005002 
00000005 00005002 
00000005 00005002 
00000006 00004003 
00000006 00004003 
00000006 00004003 
00000006 00004003 
00000007 00004003 
00000007 00004003 
00000007 00004003 
00000007 00004003

So the slow cases are going from 0x20002000 to 0x20002002 (one nop), and, 4x2=8 bytes further on, from 0x20002008 to 0x2000200A (five nops). 8 bytes is 64 bits. So messing with the fetch line clearly gives us two results for the same machine code. I was wrong above: the two loop instructions are 4 bytes, which is half of the 8-byte (64-bit) fetch line. You would think that if one halfword past a 64-bit alignment is slower, then other misalignments would also be slow. I will stop trying to analyze it; I do not have access to sim the core, and if I did I could not talk about it anyway.

We see, at least on this chip, that flash at zero wait states and SRAM give the same numbers; flash is not slower. Flashes in MCUs are getting better.

00200100 <TEST>:
  200100:   b430        push    {r4, r5}
  200102:   680c        ldr r4, [r1, #0]

00200104 <loop>:
  200104:   3801        subs    r0, #1
  200106:   d1fd        bne.n   200104 <loop>

The loop is not at 0x100; it starts at 0x104, and then we move it to 0x106, which puts the bne at 0x108 in the next fetch line. I have seen cores fetch the second line after a branch right away; this one might be waiting. I do not know, I do not have access to it.
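To make that bookkeeping concrete, here is a rough sketch that, for a given number of leading nops, works out where the two-instruction loop lands and whether it is split across two 64-bit fetch lines. The function names are invented for the example, and it assumes the base address is fetch-line aligned and that TEST starts with the 4 bytes of push + ldr shown above. It is only address arithmetic, not a model of the pipeline: it matches the first SRAM table (slow for 1 and 5 nops), but the runs with nops inside the loop do not follow such a simple rule.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FETCH_LINE_BYTES 8u   /* 64-bit fetch, per the PFU description quoted above */

/* True if the byte range [start, start + size) touches more than one fetch line. */
bool spans_two_lines(uint32_t start, uint32_t size)
{
    assert(size > 0);
    return (start / FETCH_LINE_BYTES) != ((start + size - 1u) / FETCH_LINE_BYTES);
}

/* Address of the subs/bne pair when `nops` 2-byte nops are placed in front of TEST.
   Assumes `base` is fetch-line aligned and TEST begins with push + ldr (4 bytes). */
uint32_t loop_address(uint32_t base, uint32_t nops)
{
    assert(base % FETCH_LINE_BYTES == 0);
    return base + 2u * nops + 4u;
}
```

For example, `spans_two_lines(loop_address(0x00200100, 1), 4)` is true (loop at 0x200106, bne at 0x200108), while zero, two or three nops keep both instructions in one line.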

Anyway.

00200100 <TEST>:
  200100:   b430        push    {r4, r5}
  200102:   680c        ldr r4, [r1, #0]

00200104 <loop>:
  200104:   46c0        nop         ; (mov r8, r8)
  200106:   3801        subs    r0, #1
  200108:   d1fc        bne.n   200104 <loop>

If I stick a nop in the loop

and that makes sense.

Putting the nop between them same result.

If we use sram and different numbers of alignments.

00000000 00005002 
00000000 00005002 
00000000 00005002 
00000001 00005002 
00000001 00005002 
00000001 00005002 
00000001 00005002 
00000002 00005002 
00000002 00005002 
00000002 00005002 
00000002 00005002 
00000003 00005002 
00000003 00005002 
00000003 00005002 
00000003 00005002 
00000004 00005002 
00000004 00005002 
00000004 00005002 
00000004 00005002 
00000005 00005002 
00000005 00005002 
00000005 00005002 
00000005 00005002 
00000006 00005002 
00000006 00005002 
00000006 00005002 
00000006 00005002 
00000007 00005002 
00000007 00005002 
00000007 00005002 
00000007 00005002

Not every loop has this issue.

Two nops in the loop

00000000 00005003 
00000000 00005003 
00000000 00005003 
00000000 00005003 
00000001 00006002 
00000001 00006002 
00000001 00006002 
00000001 00006002 
00000002 00005003 
00000002 00005003 
00000002 00005003 
00000002 00005003 
00000003 00005003 
00000003 00005003 
00000003 00005003 
00000003 00005003 
00000004 00005003 
00000004 00005003 
00000004 00005003 
00000004 00005003 
00000005 00006002 
00000005 00006002 
00000005 00006002 
00000005 00006002 
00000006 00005003 
00000006 00005003 
00000006 00005003 
00000006 00005003 
00000007 00005003 
00000007 00005003 
00000007 00005003 
00000007 00005003

(Think about the value of benchmarks if processors are or can be this sensitive).

You can and will see these performance differences in real applications. Adding or removing code in a completely unrelated function can have a cascade effect through the whole binary on where things land. Some loops are going to be position sensitive and some not. Loops wrapping loops, or loops with multiple loops inside, can cancel each other out or magnify the problem.

0x2004/0x1006 = 199.8 percent. More than the double digit I commented on somewhere.
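That percentage is just the ratio of two timer readings, and the same arithmetic can be applied to the other pairs in the tables. A small sketch, with made-up helper names; the iteration count of 0x1000 is taken from the TEST(0x1000, ...) calls shown earlier, and assuming the later runs used the same count is my assumption:

```c
#include <assert.h>
#include <stdint.h>

/* Slow reading relative to fast reading, in tenths of a percent (1000 means equal). */
uint32_t ratio_permille(uint32_t slow, uint32_t fast)
{
    assert(fast != 0);
    return (uint32_t)(((uint64_t)slow * 1000u) / fast);
}

/* Whole timer ticks per loop iteration; what is left over is fixed overhead around the loop. */
uint32_t ticks_per_iteration(uint32_t total, uint32_t iterations)
{
    assert(iterations != 0);
    return total / iterations;
}
```

With these, 0x2004 against 0x1006 comes out at 1998 (199.8 percent), and 0x5002 against 0x4003 at 1249, about 25 percent slower. Dividing by 0x1000 iterations, 0x4003 is 4 ticks per pass and 0x5002 is 5, so the slow alignment costs one extra tick per trip around the loop.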

00200100 <TEST>:
  200100:   b430        push    {r4, r5}
  200102:   680c        ldr r4, [r1, #0]

00200104 <loop>:
  200104:   3801        subs    r0, #1
  200106:   d1fd        bne.n   200104 <loop>

With BTAC enabled, we saw before.

PRFTEN in the FLASH_ACR register no change.

ART accelerator on, no change.

So I kept playing with the FLASH_ACR register. Looks like we are processor bound. Or there is some caching going on somewhere.

Now, when you move up to an RTOS, for example: not necessarily in this core, but adding an instruction cache will help things. Remember, though, that with or without a cache you are still fetching the same lines and hitting the same fetch line boundary issues. The backing memory may be faster at times, but alignment issues will still be there (on systems where you can detect them in the first place). As for an MMU, which we do not have on the Cortex-M: ARM defined memory regions so that you do not need one to tell the system what is non-cached peripheral space, what is instruction memory and what is data memory. An MMU adds its own performance hits, there is often more than one way to map a virtual address space to physical, and how that is mapped can and will have performance consequences. You might be experiencing more issues with your RTOS, but if you are strictly getting two specific numbers for the same machine code at different alignments, you are probably falling into a simple issue of the number of fetches per loop.

Short answer:

You basically have done all the work and found the answer.

The Cortex-M7 has a 64-bit fetch. Where you are in that fetch line, and in the pipe, when it determines it needs to branch backward and do that fetch affects how many total fetches there are per loop. If you keep the same exact machine code and move it through the address space, some loops can have an additional fetch per loop. Zero wait state does not mean zero clocks; fetches are not free, and it is not only system memory that is involved in a fetch. When the fetch arrives relative to when the pipe is ready to take it for the branch will also affect loop performance. How big the loop is in code size determines the extra fetch penalty: three fetches per loop sometimes being four, and ten sometimes being 11, are going to have a different relative hit (20%, 10%, etc.). It is not related to the alignment in ROM/flash specifically but in general; if you were to run that code in SRAM you should also be able to find two execution times.
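A quick way to cross-check this against the table in the question is to ask, for each thread, whether the loop body (not the whole function) is split across a line boundary. The sketch below does only that address arithmetic. The offsets are read off the task0 listing in the question (loop from +4 to +15, after the push and add), the line size is a parameter, and the function name is invented for the example; none of it is a model of what the core or the flash interface actually does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define THREAD_BASE   0x08000418u  /* Thread0 address from the question's table */
#define THREAD_STRIDE 0x14u        /* 20 bytes between consecutive threads */
#define LOOP_OFFSET   4u           /* loop starts after push + add (read off the listing) */
#define LOOP_BYTES    12u          /* ldr, ldr, adds, ldr, str, b.n at 2 bytes each */

/* True if thread k's loop body touches more than one line of `line_bytes` bytes. */
bool loop_crosses_line(uint32_t k, uint32_t line_bytes)
{
    assert(line_bytes != 0 && (line_bytes & (line_bytes - 1u)) == 0); /* power of two */
    uint32_t first = THREAD_BASE + k * THREAD_STRIDE + LOOP_OFFSET;
    uint32_t last  = first + LOOP_BYTES - 1u;
    return (first / line_bytes) != (last / line_bytes);
}
```

With `line_bytes` set to 32 (the 256-bit flash line from the manual excerpt in the question), the loops that cross a boundary are threads 0, 3, 8, 11, 16 and 19, which are exactly the speed1 rows. With `line_bytes` set to 8 every one of these 12-byte loops crosses a boundary, so that size alone does not separate the two groups in this particular table. This only says which boundary lines up with the measurements at 7 wait states, not which mechanism is responsible.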

Note: do not assume that any other core is exactly the same as the Cortex-M7. The other Cortex-Ms may not have enough of a fetch to see this, have different pipes, etc. Some may take more system clocks to do the same thing. The older cores are mated up with older flash technology, and those system designs may show, for example, that running the same code from flash and from SRAM gives different performance.

In the wide STM32 family there is also a flash cache and prefetch that you cannot disable on some chips, which makes performance analysis like this much harder. Other brands can buy the same cores, but note that there are compile-time options for the cores, and the ARM core is just one part of the chip. The rest of the logic is someone else's, no two companies can be assumed to have the same IP, and of course some of that IP is the chip company's own and not purchased. You should see this on all Cortex-M7 based chips, for example, but the exact difference may vary.

Running on an RTOS with RTOS effects can also cause performance issues. For this specific ARM core one can generate two different execution times for the same code, and I believe that you have simply found that. You may find other performance issues on top of this.

In your case the single loop time varies based on alignment, so instead of measuring X loops you are measuring how many loops in Y time, same deal more clocks per loop is less counts per second. You are not tightly measuring time like I am (Michael Abrash: Zen of assembly language) so your interrupt latencies and overhead can have an effect here. Personally I think from your results and from how this core works, etc the task switch time should be equal for both of these tasks for each compile at least. Guess I am just putting up a disclaimer that you may be seeing something else in addition to the above.
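Put as arithmetic: with a fixed time slice, the counter value is the number of core clocks in the slice divided by the clocks per trip around the loop. A sketch with an invented function name; the 216 MHz clock and 1 ms slice are from the question, while the 4 and 5 clocks per iteration in the note below are purely illustrative values, not measurements of the question's loop:

```c
#include <assert.h>
#include <stdint.h>

/* Loop iterations that fit in one time slice.
   Ignores interrupt latency and context-switch overhead. */
uint32_t iterations_per_slice(uint32_t core_hz, uint32_t slice_hz,
                              uint32_t cycles_per_iteration)
{
    assert(slice_hz != 0 && cycles_per_iteration != 0);
    return (core_hz / slice_hz) / cycles_per_iteration;
}
```

At 216 MHz and 1000 slices per second there are 216000 clocks per slice, so a loop that takes 4 clocks gets 54000 increments per slice and one that takes 5 gets 43200. One extra clock per loop is enough to make one counter visibly lag the other.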

To the question of what, if anything, can we do about this. Well this starts off with the answer of the premature optimization discussion. Once you have decided for some reason that you do want to do some optimization and you have isolated that code, then what do you do. You can sometimes just change the C code to help the compiler make something simpler or faster (remember fewer instructions does not mean faster...faster means faster...and faster is relative to the system so the same code on one system may be slower on another and vice versa).

If that does not work, a common solution is to let the compiler do the work first and then hand optimize. With gcc we know the toolchain passes through the assembler: the C code is compiled into assembly language, then the assembler is called to turn that into an object, then the linker is called to make the binary (the gcc binary itself is not a compiler; it just launches a number of other programs that actually do the work). I find that compiler output painful and would rather work from a disassembly, but your experience may be different. In any case your function is now assembly language and you manage the hand tuning that way. You could also just write the critical code from scratch in real assembly. That could work in this case. You can see, though, what maintenance looks like and where the premature optimization folks are coming from.

For this specific platform/code we could do some tricks though knowing what we know, it is still quite manual and can be a lot of maintenance.

volatile unsigned int ctr1,ctr2;

void thread1(void)
{
    while(1)
    {
        ctr1++;
    }
}

void thread2(void)
{
    while(1)
    {
        ctr2++;
    }
}

diss

Disassembly of section .text:

00200000 <reset-0x8>:
  200000:   20001000    andcs   r1, r0, r0
  200004:   00200009    eoreq   r0, r0, r9

00200008 <reset>:
  200008:   e7fe        b.n 200008 <reset>
    ...

0020000c <thread1>:
  20000c:   4a02        ldr r2, [pc, #8]    ; (200018 <thread1+0xc>)
  20000e:   6813        ldr r3, [r2, #0]
  200010:   3301        adds    r3, #1
  200012:   6013        str r3, [r2, #0]
  200014:   e7fb        b.n 20000e <thread1+0x2>
  200016:   bf00        nop
  200018:   00200030    eoreq   r0, r0, r0, lsr r0

0020001c <thread2>:
  20001c:   4a02        ldr r2, [pc, #8]    ; (200028 <thread2+0xc>)
  20001e:   6813        ldr r3, [r2, #0]
  200020:   3301        adds    r3, #1
  200022:   6013        str r3, [r2, #0]
  200024:   e7fb        b.n 20001e <thread2+0x2>
  200026:   bf00        nop
  200028:   0020002c    eoreq   r0, r0, ip, lsr #32

Disassembly of section .bss:

0020002c <ctr2>:
  20002c:   00000000    andeq   r0, r0, r0

00200030 <ctr1>:
  200030:   00000000    andeq   r0, r0, r0

That is linked with some bogus code to demonstrate something. If you disassemble the object it starts at 00000000 and the loops are at some alignment from there. And they should remain, relative to each other, static. Their overall alignment is affected by code that comes before. With -save-temps we can see the output of the compiler.

    .cpu cortex-m7
    ...
    .text
    .align  1
    .p2align 2,,3
    .global thread1
    .arch armv7e-m
    .syntax unified
    .thumb
    .thumb_func
    .fpu softvfp
    .type   thread1, %function
thread1:
    @ Volatile: function does not return.
    @ args = 0, pretend = 0, frame = 0
    @ frame_needed = 0, uses_anonymous_args = 0
    @ link register save eliminated.
    ldr r2, .L4
.L2:
    ldr r3, [r2]
    adds    r3, r3, #1
    str r3, [r2]
    b   .L2
.L5:
    .align  2
.L4:
    .word   ctr1
    .size   thread1, .-thread1
    .align  1
    .p2align 2,,3
    .global thread2
    .syntax unified

It has some .align and .p2align directives in there; you can look those up and play with them (sprinkle them in with some .byte directives and see how they affect the next thing, and whether they pad with nops or with something else).
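As I read the GNU assembler documentation, `.p2align 2,,3` means "align to 2^2 = 4 bytes, but only if that takes at most 3 bytes of padding", and on ARM `.align 1` is a power-of-two alignment as well (2 bytes). A sketch of that rule in C, so you can predict where the next symbol lands; the function name is made up for the example and this is my reading of the directive, not code from the toolchain:

```c
#include <assert.h>
#include <stdint.h>

/* Address after `.p2align pow2,,max_skip`: round up to a multiple of 2^pow2 bytes,
   unless that would need more than max_skip bytes of padding, in which case
   the directive does nothing. */
uint32_t p2align(uint32_t addr, unsigned pow2, uint32_t max_skip)
{
    assert(pow2 < 32u);
    uint32_t mask = (1u << pow2) - 1u;
    uint32_t pad  = ((mask + 1u) - (addr & mask)) & mask;  /* bytes needed to align */
    return (pad <= max_skip) ? addr + pad : addr;
}
```

For a function that would otherwise start at 0x20000a, `p2align(0x20000a, 2, 3)` gives 0x20000c, a word boundary reached with one 2-byte pad.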

00200000 <reset-0x8>:
  200000:   20001000    andcs   r1, r0, r0
  200004:   00200009    eoreq   r0, r0, r9

00200008 <reset>:
  200008:   e7fe        b.n 200008 <reset>
    ...

0020000c <thread1>:

The ... is likely a nop for padding to get thread1 to align on a one word boundary. Let's try something.

.thumb
.word 0x20001000
.word reset

.thumb_func
reset: b reset
nop

gives

00200000 <reset-0x8>:
  200000:   20001000    andcs   r1, r0, r0
  200004:   00200009    eoreq   r0, r0, r9

00200008 <reset>:
  200008:   e7fe        b.n 200008 <reset>
  20000a:   46c0        nop         ; (mov r8, r8)

0020000c <thread1>:

same alignment.

  20000e:   6813        ldr r3, [r2, #0]
  200010:   3301        adds    r3, #1
  200012:   6013        str r3, [r2, #0]
  200014:   e7fb        b.n 20000e <thread1+0x2>

  20001e:   6813        ldr r3, [r2, #0]
  200020:   3301        adds    r3, #1
  200022:   6013        str r3, [r2, #0]
  200024:   e7fb        b.n 20001e <thread2+0x2>

that was not intentional that the two loops have potentially the same alignment.

Maybe we want to push it another word.

volatile unsigned int ctr1,ctr2;

asm ("nop");

void thread1(void)
{

How awful is that, a certain percentage are cringing right now; another percentage are thinking "you can do that like that! wow!"

  20000a:   46c0        nop         ; (mov r8, r8)
  20000c:   bf00        nop
  20000e:   bf00        nop

00200010 <thread1>:
  200010:   4a02        ldr r2, [pc, #8]    ; (20001c <thread1+0xc>)
  200012:   6813        ldr r3, [r2, #0]
  200014:   3301        adds    r3, #1
  200016:   6013        str r3, [r2, #0]
  200018:   e7fb        b.n 200012 <thread1+0x2>
  20001a:   bf00        nop
  20001c:   00200034    eoreq   r0, r0, r4, lsr r0

00200020 <thread2>:
  200020:   4a02        ldr r2, [pc, #8]    ; (20002c <thread2+0xc>)
  200022:   6813        ldr r3, [r2, #0]
  200024:   3301        adds    r3, #1
  200026:   6013        str r3, [r2, #0]
  200028:   e7fb        b.n 200022 <thread2+0x2>
  20002a:   bf00        nop
  20002c:   00200030    eoreq   r0, r0, r0, lsr r0

worked though, pushed them a bit, but more than I wanted.

void thread1(void)
{
asm ("nop");
    while(1)
    {
        ctr1++;
    }
}

void thread2(void)
{
    while(1)
    {
        ctr2++;
    }
}

that worked.

0020000c <thread1>:
  20000c:   bf00        nop
  20000e:   4a02        ldr r2, [pc, #8]    ; (200018 <thread1+0xc>)
  200010:   6813        ldr r3, [r2, #0]
  200012:   3301        adds    r3, #1
  200014:   6013        str r3, [r2, #0]
  200016:   e7fb        b.n 200010 <thread1+0x4>
  200018:   00200030    eoreq   r0, r0, r0, lsr r0

0020001c <thread2>:
  20001c:   4a02        ldr r2, [pc, #8]    ; (200028 <thread2+0xc>)
  20001e:   6813        ldr r3, [r2, #0]
  200020:   3301        adds    r3, #1
  200022:   6013        str r3, [r2, #0]
  200024:   e7fb        b.n 20001e <thread2+0x2>
  200026:   bf00        nop
  200028:   0020002c    eoreq   r0, r0, ip, lsr #32

pushed it from 0x20000e to 0x200010, which is maybe the alignment I want, at least I can move it a little. It does add an extra instruction overall in the execution path, but...aligned one loop.

Do the same with the other and now:

  200010:   6813        ldr r3, [r2, #0]
  200012:   3301        adds    r3, #1
  200014:   6013        str r3, [r2, #0]
  200016:   e7fb        b.n 200010 <thread1+0x4>

  200020:   6813        ldr r3, [r2, #0]
  200022:   3301        adds    r3, #1
  200024:   6013        str r3, [r2, #0]
  200026:   e7fb        b.n 200020 <thread2+0x4>

I have changed their alignment. Ugly? Yes, but is it less ugly than writing asm from scratch or converting to asm? Debatable. But you can see that any code changes that are linked in front of these functions require these to be re-tuned. That gets VERY painful.

volatile unsigned int ctr1,ctr2;

asm (".balign 0x10; .word 0,0,0");
void thread1(void)
{
asm ("nop");
    while(1)
    {
        ctr1++;
    }
}

void thread2(void)
{
asm ("nop");
    while(1)
    {
        ctr2++;
    }
}

This though should work so long as we do not change these functions and do not change compilers or compiler options, basically if the compiler keeps generating the same machine code, then this hack will keep them aligned. Yes it is not elegant, but if it works is it stupid?

0020001c <thread1>:
  20001c:   bf00        nop
  20001e:   4a02        ldr r2, [pc, #8]    ; (200028 <thread1+0xc>)
  200020:   6813        ldr r3, [r2, #0]
  200022:   3301        adds    r3, #1
  200024:   6013        str r3, [r2, #0]
  200026:   e7fb        b.n 200020 <thread1+0x4>
  200028:   00200040    eoreq   r0, r0, r0, asr #32

0020002c <thread2>:
  20002c:   bf00        nop
  20002e:   4a02        ldr r2, [pc, #8]    ; (200038 <thread2+0xc>)
  200030:   6813        ldr r3, [r2, #0]
  200032:   3301        adds    r3, #1
  200034:   6013        str r3, [r2, #0]
  200036:   e7fb        b.n 200030 <thread2+0x4>
  200038:   0020003c    eoreq   r0, r0, ip, lsr r0
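The arithmetic behind that hack, as a sketch: `.balign 0x10` rounds up to a 16-byte boundary, the three words push the function to offset 12, and the nop plus the ldr put the loop at offset 16, which is the next 16-byte boundary. The function names are invented for the example, and it assumes the compiler keeps emitting nop + ldr (2 bytes each) ahead of the loop, as in the listing above.

```c
#include <assert.h>
#include <stdint.h>

/* Round addr up to a multiple of `align` (a power of two), like .balign. */
uint32_t balign(uint32_t addr, uint32_t align)
{
    assert(align != 0 && (align & (align - 1u)) == 0);
    return (addr + align - 1u) & ~(align - 1u);
}

/* Where thread1's loop lands after ".balign 0x10; .word 0,0,0" and the leading nop.
   Assumes the function body starts with nop + ldr, 2 bytes each, as in the listing. */
uint32_t thread1_loop_address(uint32_t addr_before_hack)
{
    uint32_t func = balign(addr_before_hack, 0x10u) + 3u * 4u;  /* three .word entries */
    return func + 2u + 2u;                                      /* skip the nop and the ldr */
}
```

Starting from 0x20000a this gives a loop at 0x200020, matching the disassembly. Each function here is 16 bytes (nop, ldr, four loop instructions, one literal word), so thread2's loop lands 16 bytes later at 0x200030 and keeps the same alignment.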

What was in the 1000 lines/characters I deleted? Basically that you are at the mercy of the compiler, or of hand tuning assembly language for your specific platform with some armor around it so that the tools do not mess up your alignment. If, for example, you are trying to do some count-the-idle-time thing, then you would want hand-tuned asm... or actually the "it has a pipeline, stupid" folks will come out of the woodwork and tell you to use a timer instead. And they are right.

Trying one size fits all with hand tuning...just going to fail.