Tags: c, assembly, arm, embedded, stack-pointer

How does the stack pointer work on STM32F103?


I have recently become interested in linker scripts and assembly coding for MCUs. I just discovered that the first thing we do in the reset_handler is set the stack pointer register (sp).

My question is: all the tutorials and example code from ST put the _estack symbol at the end of RAM, which is 0x20005000. However, flash starts at 0x08000000, and my compiled ELF file's vector table sits at the beginning of flash (0x08000000). We set the stack pointer to _estack (0x20005000), which is nowhere near the beginning of flash, yet the code works.

How can the stack pointer point to 0x20005000, while the instructions start at 0x08000000 (vector table), and it still works?

My basic linker script:

_estack = 0x20005000;

MEMORY
{
    FLASH (rx)      : ORIGIN = 0x08000000, LENGTH = 64K
    RAM (rxw)       : ORIGIN = 0x20000000, LENGTH = 20K
}

My assembly file's reset_handler (relevant part):

.type reset_handler, %function
reset_handler:
    LDR     r0, =_estack
    MOV     sp, r0

Solution

  • The traditional/textbook memory layout is that your program (instructions, machine code) starts at the lowest address in memory (think ram on a computer for now, not flash and ram separately like an mcu). Processors execute instructions linearly at increasing addresses until there is a branch/jump, and then continue to execute linearly from low address to high. The stack grows downward: as you add something to it, the address gets lower. That is the opposite of a physical stack, where you would write something on a note card and stack it on top of the rest of the note cards, and when done with it you remove the ones on the top and discard them. Think upside down: sticky notes sticking to the ceiling.

    Because the code runs linearly at increasing addresses, starting at the bottom, having the stack grow down from the top means it takes the longest possible time before you have a stack overflow (the name of the site), where the stack grows so far that it starts to write over other stuff, corrupting it.

    The missing piece is the heap, which you should really never have in an MCU; if you are doing mallocs, you need to ask yourself why... But going with this traditional textbook model:

    top of memory   stack grows down
    .
    .
    .
    .
    .
    bottom of heap, heap allocates upward
    top of .bss/.data global variables
    bottom of .bss/.data global variables
    top of binary/program
    .
    .
    bottom of memory program grows (is linked) upward
    

    The stack is dynamic while running: each function can consume some, and if it calls another function then that one continues to consume more. As you malloc more memory (assuming you do not free it, in a very simplistic model) the heap will grow upward.

    The very simple idea is that there is this chunk of ram that is not consumed by the program. To give both heap and stack the most space, to minimize the chance of a stackoverflow, you start the heap at the bottom of the remaining memory and the stack at the top of memory. One grows down, the other grows up.

    The stack is traditionally started at the top of RAM for this reason: it grows down, and starting at the top keeps it as far away as you can from everything else, which reduces the chance of a stack overflow.
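    A minimal host-side sketch of that idea, assuming the 20K of SRAM at 0x20000000 from the question's linker script. The names `RAM_BASE`, `RAM_SIZE`, `STACK_TOP` and `free_gap` are made up for illustration, and the end-of-globals address used in the note below is hypothetical:

```c
#include <stdint.h>

/* Assumed layout, taken from the question's linker script: 20K of SRAM. */
#define RAM_BASE  0x20000000u
#define RAM_SIZE  0x5000u                  /* 20K */
#define STACK_TOP (RAM_BASE + RAM_SIZE)    /* 0x20005000, the initial sp */

/* Bytes still unused between the top of the heap (growing up)
   and the stack pointer (growing down). Zero means they have met. */
uint32_t free_gap(uint32_t heap_top, uint32_t sp)
{
    return (sp > heap_top) ? (sp - heap_top) : 0u;
}
```

    With globals ending at a hypothetical 0x20000100 and nothing pushed yet, `free_gap(0x20000100u, STACK_TOP)` is 0x4F00 bytes. Every push shrinks that gap from the top and every malloc shrinks it from the bottom.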

    0x08000000 is the base of application FLASH, which is not ram; it is not technically read only, but it is considered to be read only. The stack has to live in ram, which is read/write, and which for STM32s typically starts at 0x20000000 (and for most cortex-ms, per ARM's rules).

    As also pointed out by others, if you read the arm docs, they tell you that a push or stmdb (store multiple, decrement before) decrements the stack pointer by the number of registers to store times 4 bytes (32 bits each), and then stores the registers in the list. So with sp at 0x20005000 or whatever, it is not going to write to that address; the first thing pushed goes to 0x20004FFC. The stack pointer, for this architecture, points at the lowest consumed address; it decrements before on store/push, and increments after on load/pop.

    With a program in an operating system, particularly after the time of MMUs, you can define some sort of traditional model with zero at the bottom and some value at the top, and the virtual address space can be from zero to N because the MMU can take any number of physical chunks and make them look linear. On an MCU we do not have an operating system that takes the program from some storage like a hard drive, loads it into ram (program, data, heap, stack) and runs it.

    In an MCU, you have non-volatile storage (flash, prom, rom, etc) and SRAM.

    0x200xxxxx
    top of SRAM stack grows down
    .
    .
    .
    .
    .
    bottom of heap, heap allocates upward
    top of .bss/.data global variables
    bottom of .bss/.data global variables bottom of sram.
    0x20000000
    
    0x080xxxxx
    top of flash
    . 
    .
    .
    unused space above
    special
    top of binary/program
    .
    .
    bottom of FLASH program grows (is linked) upward
    0x08000000
    

    The flash is considered read only. Non-const variables are read-write, so they have to be in sram. Take C code like this, with global variables:

    unsigned int x = 5;
    unsigned int y;
    

    When we have an operating system, lots of ram, and a file system, the memory location that the linker has chosen for x will get loaded with a 5 before main() is called, and the location for y will get zero. How do we do that in an MCU? This is why you see the > ram AT > rom in the linker scripts. You are telling the linker that x needs to be in the defined read/write memory space, but that is volatile, it goes away when power is off, so somewhere...in flash...we have to have a copy of what x is going to be. Then the linker script and the bootstrap have to get married and work together so that the startup code (bootstrap, the code that runs between reset and main()) copies a 5 from the location in flash that the linker chose to the location in ram that the linker chose (.data). For y we do not have to store zeros in the flash; we just store the start address and how much (or the start and end address; it is your linker script and bootstrap marriage, your choice) and the bootstrap just fills that with zeros.
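    The copy-.data and zero-.bss steps can be modeled on a host machine. This is only a sketch: plain arrays stand in for flash and SRAM, and the names `flash_data_image`, `sram`, `model_bootstrap`, `model_x` and `model_y` are invented for illustration. A real bootstrap would use addresses exported by the linker script instead.

```c
#include <stdint.h>
#include <string.h>

/* Host-side model only: arrays stand in for flash and SRAM. */
static const uint32_t flash_data_image[] = { 5u };  /* initial value of x, kept in "flash" */
static uint32_t sram[2];                            /* sram[0] = y (.bss), sram[1] = x (.data) */

/* What a bootstrap does between reset and main(), in spirit. */
void model_bootstrap(void)
{
    /* Pretend SRAM powers up holding garbage. */
    memset(sram, 0xA5, sizeof sram);

    /* Copy .data: initial values move from flash to their SRAM home. */
    memcpy(&sram[1], flash_data_image, sizeof flash_data_image);

    /* Zero .bss: no flash copy needed, only a start and a length. */
    memset(&sram[0], 0, sizeof sram[0]);
}

uint32_t model_x(void) { return sram[1]; }
uint32_t model_y(void) { return sram[0]; }
```

    After `model_bootstrap()` runs, `model_x()` returns 5 and `model_y()` returns 0, which is the state C code expects to find when main() starts.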

    So...how does the stack pointer work on an stm32f103? It is a cortex-m (a cortex-m3), and its stack pointer works like that of a great number of processors.

    Arm has, for the cortex-ms, push and pop instructions. Push is defined in the documentation as subtracting the number of registers times 4 bytes from the stack pointer, and then writing the registers in the instruction's list to that location: decrement first. Pop increments after: it is defined as reading the registers from the current stack pointer address, then adding 4 times the number of registers in the list to the stack pointer. The stack grows downward.
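    A small host-side model of those two rules, one word at a time. It is a sketch under assumptions: an array stands in for the question's 20K of SRAM, the names `model_push`, `model_pop` and `model_sp` are invented, and there is no bounds checking.

```c
#include <stdint.h>

#define RAM_BASE  0x20000000u
#define RAM_WORDS (0x5000u / 4u)

static uint32_t ram[RAM_WORDS];             /* stands in for the 20K of SRAM */
static uint32_t sp = RAM_BASE + 0x5000u;    /* _estack from the question */

/* Push one word: decrement first, then store. */
void model_push(uint32_t value)
{
    sp -= 4u;
    ram[(sp - RAM_BASE) / 4u] = value;
}

/* Pop one word: load first, then increment. */
uint32_t model_pop(void)
{
    uint32_t value = ram[(sp - RAM_BASE) / 4u];
    sp += 4u;
    return value;
}

uint32_t model_sp(void) { return sp; }
```

    Starting from 0x20005000, the first `model_push()` leaves sp at 0x20004FFC and the second at 0x20004FF8; nothing is ever written at 0x20005000 itself, which is why pointing sp one past the end of RAM is fine.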

    The rule for any MCU is that some part of the address space is flash backed and some is sram backed. Read-only stuff (.text, .rodata, etc) is in the flash address space and the read/write stuff (.data, .bss, etc) is in the sram space.

    The sram space contains the stack at the top of the unallocated space (unallocated by the linker) and the heap at the bottom of the unallocated space (if you feel the need for a heap, you really have to think about your system engineering). The global variables are at the bottom of sram, the heap starts just above that, and the stack is at the top.

    0x200xxxxx
    top of SRAM stack grows down
    .
    .
    .
    .
    .
    bottom of heap, heap allocates upward
    top of .bss/.data global variables
    bottom of .bss/.data global variables bottom of sram.
    0x20000000
    

    Programs execute up, forward, in increasing address order until acted on by a branch or jump, so you start your binary in low memory. The linker starts at the bottom with the program, machine code, .text (other than the booting rules for the processor: the vector table or reset handler are where the processor says they are). On top of that will typically sit the non-volatile copy of global stuff like .data and .bss (global offset table, etc), and then ideally you have flash left over because you did not push it to the limit, and that just sits unused.

    0x080xxxxx
    top of flash
    . 
    .
    .
    unused space above
    .
    .data/.bss/etc non-volatile copy start here
    top of binary/program, read-only variables
    .
    .
    bottom of FLASH program grows (is linked) upward
    0x08000000
    

    That is the textbook model and very often what you will see in the real world. You as the bare metal programmer are responsible for the memory space and how it is used. You might choose, for various reasons, to just take someone else's layout and reuse it (thus why "everyone uses _estack"...they do not, you just did not look everywhere). That is still a choice you make for controlling the address space: choosing to just use someone else's. You can certainly mix things around; you could have the stack grow down from the middle and the heap grow up. If you know your program is light on nesting and light on local variables (globals are very good for bare metal MCU work; maybe bad in a textbook sense, but very good for bare metal, and locals are bad), you might choose to set the stack pointer to 0x20002000 or even lower, so that when you copy this code from one mcu to another you do not have to keep tweaking that address in the linker scripts...Another choice.

    If you choose to use the other stack pointer, then you have another choice to make, because the two stacks cannot sit on top of each other. It is the same if you find yourself working in an ARM7 or ARM11 based bare metal environment (ARM7TDMI, for example), which has multiple stack pointers: you now have to think about how much memory to give to each, and leave some over for variables and for heap if you have heap. You still set each stack pointer to the top of the space you have defined for it so it can grow downward (decreasing addresses).
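    As a sketch of that budgeting for a cortex-m with both its main and process stack pointers in use, assuming the question's 20K of SRAM. The two stack sizes and all function names here are hypothetical, picked only to show the arithmetic:

```c
#include <stdint.h>

#define RAM_BASE 0x20000000u
#define RAM_SIZE 0x5000u

/* Hypothetical budget: sizes chosen for illustration only. */
#define MAIN_STACK_SIZE    0x400u
#define PROCESS_STACK_SIZE 0x800u

/* The main stack occupies the very top of RAM. */
uint32_t main_stack_top(void)    { return RAM_BASE + RAM_SIZE; }

/* The process stack starts where the main stack's region ends. */
uint32_t process_stack_top(void) { return main_stack_top() - MAIN_STACK_SIZE; }

/* Lowest address either stack may reach before running into globals/heap. */
uint32_t stacks_limit(void)      { return process_stack_top() - PROCESS_STACK_SIZE; }
```

    With these numbers the main stack starts at 0x20005000, the process stack at 0x20004C00, and everything below 0x20004400 is left for .data, .bss and any heap.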

    flash.s (reset handler and bootstrap)

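    .cpu cortex-m3
    .thumb
    
    @ vector table: first thing in .text, so it lands at 0x08000000
    .word 0x08000800    @ word 0: value the core loads into sp at reset
    .word reset         @ word 1: reset vector (bit 0 set, see .thumb_func)
    
    .globl reset
    .thumb_func         @ mark reset as a thumb function
    reset:
        bl notmain
        b .             @ hang here if notmain returns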
    

    notmain.c

    unsigned int x = 5;
    unsigned int y;
    const unsigned int z = 6;
    
    int some_global_function ( void )
    {
        static unsigned int x = 7;
        return(x);
    }
    int notmain ( void )
    {
        y = x;
        return x;
    }
    

    flash.ld

    MEMORY 
    {
      FLASH : ORIGIN = 0x08000000, LENGTH = 1K
      SRAM  : ORIGIN = 0x20000000, LENGTH = 1K
    }
    SECTIONS
    {
        .text   : { *(.text*)   } > FLASH
        .rodata : { *(.rodata*) } > FLASH
        .bss    : { *(.bss*)    } > SRAM AT >FLASH
        .data   : { *(.data*)   } > SRAM AT >FLASH
    }
    

    build

    arm-none-eabi-gcc -Wall -O2  -mthumb -c notmain.c -o notmain.o
    arm-none-eabi-objdump -D notmain.o > notmain.c.list
    arm-none-eabi-as --warn --fatal-warnings  flash.s -o flash.o
    arm-none-eabi-ld  -T flash.ld flash.o notmain.o -o notmain.elf
    arm-none-eabi-objdump -D notmain.elf > notmain.list
    arm-none-eabi-objcopy -O binary notmain.elf notmain.bin
    arm-none-eabi-objcopy --srec-forceS3 notmain.elf -O srec notmain.srec
    

    examine

    notmain.list

    Disassembly of section .text:
    
    08000000 <reset-0x8>:
     8000000:   08000800    stmdaeq r0, {fp}
     8000004:   08000009    stmdaeq r0, {r0, r3}
    
    08000008 <reset>:
     8000008:   f000 f804   bl  8000014 <notmain>
     800000c:   e7fe        b.n 800000c <reset+0x4>
        ...
    
    08000010 <some_global_function>:
     8000010:   2007        movs    r0, #7
     8000012:   4770        bx  lr
    
    08000014 <notmain>:
     8000014:   4b02        ldr r3, [pc, #8]    ; (8000020 <notmain+0xc>)
     8000016:   6818        ldr r0, [r3, #0]
     8000018:   4b02        ldr r3, [pc, #8]    ; (8000024 <notmain+0x10>)
     800001a:   6018        str r0, [r3, #0]
     800001c:   4770        bx  lr
     800001e:   46c0        nop         ; (mov r8, r8)
     8000020:   20000004    andcs   r0, r0, r4
     8000024:   20000000    andcs   r0, r0, r0
    
    Disassembly of section .rodata:
    
    08000028 <z>:
     8000028:   00000006    andeq   r0, r0, r6
    
    Disassembly of section .bss:
    
    20000000 <y>:
    20000000:   00000000    andeq   r0, r0, r0
    
    Disassembly of section .data:
    
    20000004 <x>:
    20000004:   00000005    andeq   r0, r0, r5
    

    notmain.srec

    S00F00006E6F746D61696E2E737265631F
    S31508000000000800080900000800F004F8FEE70000F0
    S3150800001007207047024B1868024B18607047C046A5
    S30D08000020040000200000002086
    S3090800002806000000C0
    S3090800002C05000000BD
    S70508000000F2
    

    reformatted

    S315 08000000 000800080900000800F004F8FEE70000
    S315 08000010 07207047024B1868024B18607047C046
    S30D 08000020 0400002000000020
    S309 08000028 06000000
    S309 0800002C 05000000
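
    Each record in the listing above can be checked mechanically. The following is a hedged sketch of a parser for S3 records only (two-character type, one count byte, a 32-bit big-endian address, data, and a checksum byte that makes the low byte of the sum 0xFF). The function names are made up for illustration.

```c
#include <stdint.h>
#include <stddef.h>

/* Convert one hex digit to its value; returns -1 for anything else. */
static int hex_digit(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/* Read the byte encoded by two hex digits at s; -1 on bad input. */
static int hex_byte(const char *s)
{
    int hi = hex_digit(s[0]);
    int lo = (hi < 0) ? -1 : hex_digit(s[1]);
    return (hi < 0 || lo < 0) ? -1 : ((hi << 4) | lo);
}

/* Parse one S3 record. Returns 0 on success and stores the load
   address and the number of data bytes; returns -1 on any error. */
int parse_s3(const char *rec, uint32_t *address, size_t *data_len)
{
    if (rec[0] != 'S' || rec[1] != '3') return -1;

    int count = hex_byte(rec + 2);   /* bytes that follow: address + data + checksum */
    if (count < 5) return -1;        /* at least 4 address bytes + 1 checksum */

    uint32_t addr = 0;
    unsigned sum = (unsigned)count;
    for (int i = 0; i < count; i++) {
        int b = hex_byte(rec + 4 + 2 * i);
        if (b < 0) return -1;
        if (i < 4) addr = (addr << 8) | (uint32_t)b;  /* address is big-endian */
        sum += (unsigned)b;                           /* includes the checksum byte */
    }
    /* count + address + data + checksum must end in 0xFF */
    if ((sum & 0xFFu) != 0xFFu) return -1;

    *address = addr;
    *data_len = (size_t)count - 5u;
    return 0;
}
```

    Fed the record `S3090800002C05000000BD` from notmain.srec, this reports address 0x0800002C with 4 data bytes, which is the flash copy of the value 5.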
    

    The global variables

    Disassembly of section .bss:
    
    20000000 <y>:
    20000000:   00000000    andeq   r0, r0, r0
    
    Disassembly of section .data:
    
    20000004 <x>:
    20000004:   00000005    andeq   r0, r0, r5
    

    The static local x within the function (a "local global") is private to that function, and because this code only ever reads it, the compiler treated it like a constant: the function just returns 7, and x did not get storage in .data as one would expect.

    08000010 <some_global_function>:
     8000010:   2007        movs    r0, #7
     8000012:   4770        bx  lr
    

    We will fix this in a minute.

    I told the linker where things go with the > SRAM AT > FLASH and the order I placed the items in the linker script, which puts .bss and then .data in the sram address space. You did it with the (rx) and (rxw) attributes, which I STRONGLY recommend against: you are just fighting yourself, and the linker chooses which part of your linker script to ignore...without warning. But we also see the initial value of x in the flash.

    S309 08000028 06000000  z, .rodata, as defined, after .text
    S309 0800002C 05000000  x, initial value for .data, as defined in the ld script after .rodata
    

    You can hexdump the binary as well and see it.

    I did not make a complete, functional linker script/bootstrap; it gets ugly to support .data and .bss. I wanted to show the construction of the binary, how the stack pointer gets up there and why, and what lives down below. For .bss you add linker script variables (start and end, or start and size) and then use those in the bootstrap code to zero that range (never chicken and egg it, always use assembly for your bootstrap).

    You can see the code grows up (increasing addresses) from 0x08000000.

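    The stack pointer, for a cortex-m, is the first word of the vector table, if you choose to use this feature. You do not have to: something has to be there and it will get loaded into sp, but you can also have code set sp in the bootstrap, as the reset_handler in the question does. Note that this example never touches the stack (reset only does a bl to notmain, which pushes nothing), so the 0x08000800 used here is harmless, but it is a flash address; a real stack needs an SRAM address, for example 0x20000400 for the 1K of SRAM in flash.ld.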

    08000000 <reset-0x8>:
     8000000:   08000800    logic loads sp with 0x08000800 before reset handler
     8000004:   08000009    reset vector (address of handler ORRed with 1)
    
    08000008 <reset>:       reset handler
    

    The logic dictates how the vector table works and where it is: technically at 0x00000000, but for stm32 and some but not all others that is mirrored to some other address, 0x08000000 for most stm32s. Some STM32s also have a faster bus at 0x00200000, such that 0x00000000, 0x00200000 and 0x08000000 all answer with the vector table plus some code. After the vector table it is up to me, the bare metal programmer, how the flash and ram are laid out, and up to me to make sure that if I use .data and .bss they are prepared before main. The three basic rules for a C bootstrap are: init the stack pointer, copy .data, zero .bss. There might be others based on the libraries, etc, linked in.
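    To make the first two vector table words concrete, here is a small sketch that decodes the first eight bytes of notmain.bin, copied from the S-record at 0x08000000. The helper names are invented for illustration.

```c
#include <stdint.h>

/* First eight bytes of notmain.bin, as seen in the S-record at 0x08000000. */
static const uint8_t image[8] = {
    0x00, 0x08, 0x00, 0x08,   /* word 0: initial stack pointer */
    0x09, 0x00, 0x00, 0x08    /* word 1: reset vector */
};

/* Assemble a little-endian 32-bit word from four bytes. */
static uint32_t le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

uint32_t initial_sp(void)   { return le32(&image[0]); }
uint32_t reset_vector(void) { return le32(&image[4]); }

/* Bit 0 is the thumb marker, not part of the handler's address. */
uint32_t reset_handler_address(void) { return reset_vector() & ~1u; }
```

    For this image `initial_sp()` is 0x08000800, `reset_vector()` is 0x08000009, and `reset_handler_address()` is 0x08000008, matching the `reset` label in the disassembly.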

    Stack grows down, to give the stack the most space before overflowing, you start at the top of ram, everything else you grow up from the bottom of ram. Fixed things first then heap which is dynamic last. If the mcu has 0x5000 bytes of ram and sram starts at 0x20000000 then you set the stack pointer for 0x20005000.

    Oh yeah I was going to fix the local global.

    int some_global_function ( void )
    {
        static unsigned int x = 7;
        return(x++);
    }
    
    08000010 <some_global_function>:
     8000010:   4b02        ldr r3, [pc, #8]    ; (800001c <some_global_function+0xc>)
     8000012:   6818        ldr r0, [r3, #0]
     8000014:   1c42        adds    r2, r0, #1
     8000016:   601a        str r2, [r3, #0]
     8000018:   4770        bx  lr
     800001a:   46c0        nop         ; (mov r8, r8)
     800001c:   20000004    andcs   r0, r0, r4
    
    
    Disassembly of section .bss:
    
    20000000 <y>:
    20000000:   00000000    andeq   r0, r0, r0
    
    Disassembly of section .data:
    
    20000004 <x.4144>:
    20000004:   00000007    andeq   r0, r0, r7
    
    20000008 <x>:
    20000008:   00000005    andeq   r0, r0, r5
    

    It lands in .data where it belongs. Which goes first, the static locals or the globals, is really the linker's choice/design.