I have recently become interested in linker scripts and assembly coding for MCUs. I just discovered that the first thing we do in the reset_handler is set the stack pointer register (sp).
My question is: all the tutorials and example code from ST show the _estack variable at the end of the RAM, which is 0x20005000. However, the flash starts at 0x08000000, and my compiled ELF file's vector table is at the beginning of flash (0x08000000), but we set the stack pointer to _estack (0x20005000), which is not the beginning of flash. Yet, the code works.
How can the stack pointer point to 0x20005000, while the instructions start at 0x08000000 (vector table), and it still works?
My basic linker script:
_estack = 0x20005000;
MEMORY
{
FLASH (rx) : ORIGIN = 0x08000000, LENGTH = 64K
RAM (rxw) : ORIGIN = 0x20000000, LENGTH = 20K
}
My assembly file's reset_handler (relevant part):
.type reset_handler, %function
reset_handler:
LDR r0, =_estack
MOV sp, r0
The traditional textbook memory layout has your program (instructions, machine code) start at the lowest address in memory (think RAM on a computer for now, not separate flash and RAM like an MCU). Processors execute instructions linearly in increasing address order until there is a branch/jump, and then continue to execute linearly from low address to high. The stack grows downward: as you add something to it, the address gets lower. That is the opposite of a physical stack, where you write something on a note card and stack it on top of the rest of the note cards, and when done with it you remove the ones on the top and discard them. Think upside down: sticky notes sticking to the ceiling.
The code runs linearly in increasing address order, starting at the bottom. If you have the stack grow down from the top, it takes the longest possible time before you have a stack overflow (the name of the site), where the stack grows so far that it starts to write over other stuff, corrupting it.
The missing piece is the heap, which you should really never have in an MCU; if you are doing mallocs, you need to ask yourself why. But going with this traditional textbook model:
top of memory stack grows down
.
.
.
.
.
bottom of heap, heap allocates upward
top of .bss/.data global variables
bottom of .bss/.data global variables
top of binary/program
.
.
bottom of memory program grows (is linked) upward
The stack is dynamic while running: each function can consume some, and if it calls another function then that one continues to consume more. As you malloc more memory (assuming you do not free it, in a very simplistic model) the heap will grow upward.
The very simple idea is that there is this chunk of RAM that is not consumed by the program. To give both heap and stack the most space, and to minimize the chance of a stack overflow, you start the heap at the bottom of the remaining memory and the stack at the top of memory. One grows down, the other grows up.
For this reason the stack is traditionally started at the top of RAM: it grows down, and starting at the top keeps it as far away as possible from everything else, which reduces the chance of a stack overflow.
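To make the "one grows up, the other grows down" idea concrete, here is a small host-side C model, not MCU code. The struct and function names (ram_model, free_bytes, heap_grow, stack_grow) are invented for this sketch. Real hardware performs no such check: a push that runs into the heap simply overwrites it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bookkeeping for the free region between heap and stack. */
struct ram_model {
    uint32_t heap_end;   /* first free address above the heap, grows up  */
    uint32_t sp;         /* lowest address used by the stack, grows down */
};

/* Bytes still free between the top of the heap and the stack pointer. */
static uint32_t free_bytes(const struct ram_model *m)
{
    assert(m->heap_end <= m->sp);
    return m->sp - m->heap_end;
}

/* Grow the heap upward; refuse if it would run into the stack. */
static bool heap_grow(struct ram_model *m, uint32_t nbytes)
{
    if (nbytes > free_bytes(m))
        return false;
    m->heap_end += nbytes;
    return true;
}

/* Grow the stack downward; refuse if it would run into the heap. */
static bool stack_grow(struct ram_model *m, uint32_t nbytes)
{
    if (nbytes > free_bytes(m))
        return false;
    m->sp -= nbytes;
    return true;
}
```

Starting the two at opposite ends means either one can use all of the free space if the other stays small, which is the whole point of the layout.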
0x08000000 is the base of application flash, which is not RAM; it is not technically read only, but it is considered to be read only. The stack has to live in RAM, which is read/write and which for STM32s typically starts at 0x20000000 (as it does for most Cortex-Ms, per ARM's rules).
As also pointed out by others, if you read the ARM docs, they tell you that a push or stmdb (store multiple, decrement before) first decrements the stack pointer by the number of registers to store times 4 bytes (32 bits each), and then stores the registers in the list. So with the stack pointer at 0x20005000, it is not going to write to that address; the first thing pushed goes to 0x20004FFC. The stack pointer, for this architecture, points at the lowest consumed address: it decrements before on store/push, and increments after on load/pop.
With a program in an operating system, particularly since the advent of MMUs, you can define some sort of traditional model with zero at the bottom and some value at the top, and the virtual address space can run from zero to N because the MMU can take any number of physical chunks and make them look linear. On an MCU we do not have an operating system that takes the program from some storage like a hard drive, loads it into RAM (program, data, heap, stack) and runs it.
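The decrement-before store and increment-after load can be sketched as a host-side C model, not MCU code. The names sram, push_words and pop_words are made up for this illustration, and the SRAM base and size are the ones from the question. regs[0] stands for the lowest-numbered register in the list, which ends up at the lowest address.

```c
#include <assert.h>
#include <stdint.h>

#define SRAM_BASE 0x20000000u   /* start of SRAM, from the question */
#define SRAM_SIZE 0x5000u       /* 20K, from the question */

static uint32_t sram[SRAM_SIZE / 4];   /* host-side stand-in for SRAM */

/* Model of push/stmdb: decrement sp by 4*n first, then store n words. */
static uint32_t push_words(uint32_t sp, const uint32_t *regs, unsigned n)
{
    assert(sp >= SRAM_BASE && sp <= SRAM_BASE + SRAM_SIZE);
    assert(sp - SRAM_BASE >= 4u * n);   /* otherwise the stack would overflow */
    sp -= 4u * n;
    for (unsigned i = 0; i < n; i++)
        sram[(sp - SRAM_BASE) / 4 + i] = regs[i];
    return sp;
}

/* Model of pop/ldmia: load n words from sp first, then increment sp by 4*n. */
static uint32_t pop_words(uint32_t sp, uint32_t *regs, unsigned n)
{
    assert(sp >= SRAM_BASE);
    assert(sp + 4u * n <= SRAM_BASE + SRAM_SIZE);   /* nothing above the top */
    for (unsigned i = 0; i < n; i++)
        regs[i] = sram[(sp - SRAM_BASE) / 4 + i];
    return sp + 4u * n;
}
```

Starting from sp = 0x20005000, pushing one word with this model returns 0x20004FFC, so nothing is ever written at 0x20005000 itself, which is one past the last byte of SRAM.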
In an MCU, you have non-volatile storage (flash, prom, rom, etc) and SRAM.
0x200xxxxx
top of SRAM stack grows down
.
.
.
.
.
bottom of heap, heap allocates upward
top of .bss/.data global variables
bottom of .bss/.data global variables bottom of sram.
0x20000000
0x080xxxxx
top of flash
.
.
.
unused space above
special
top of binary/program
.
.
bottom of FLASH program grows (is linked) upward
0x08000000
The flash is considered read only. Non-const variables are read-write, so they have to be in SRAM. Consider C code like this, with global variables:
unsigned int x = 5;
unsigned int y;
When we have an operating system, lots of RAM, and a file system, the memory location that the linker has chosen for x will get loaded with a 5 before main() is called, and the location for y will get zero. How do we do that in an MCU? This is why you see the ram AT > rom in the linker scripts. You are telling the linker that x needs to be in the defined read/write memory space, but that memory is volatile and goes away when power is off, so somewhere in flash we have to have a copy of what x is going to be. The linker script and the bootstrap then have to get married and work together: the startup code (the bootstrap, code that runs between reset and main()) copies a 5 from the location in flash that the linker chose to the location in RAM that the linker chose (.data). For y we do not have to store zeros in the flash; we just store the start address and how much (or the start and end address; it is your linker script and bootstrap marriage, your choice) and the bootstrap just fills that with zeros.
So, how does the stack pointer work on an STM32F103? It is a Cortex-M3, and in this respect it works like a great number of other processors.
Arm has, for the Cortex-Ms, push and pop instructions. Push is defined in the documentation as subtracting the number of registers times 4 bytes from the stack pointer, and then writing the registers in the instruction's list to that location: decrement first. A pop increments after: it is defined as reading the registers from the current stack pointer address, then adding 4 times the number of registers in the list to the stack pointer. The stack grows downward.
The rule for any MCU is that some part of the address space is flash backed and some is SRAM backed. Read-only stuff (.text, .rodata, etc.) is in the flash address space and the read/write stuff (.data, .bss, etc.) is in the SRAM space.
The SRAM space contains the stack at the top of the unallocated space (unallocated by the linker) and the heap at the bottom of the unallocated space (if you feel the need for a heap, you really have to think about your system engineering). The global variables are at the bottom of SRAM, the heap starts just above that, and the stack is at the top.
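The copy-.data and zero-.bss steps described above can be modeled on a host machine as follows. This is only a sketch of the logic: the arrays stand in for regions that a real linker script would describe with its own symbols, and every name here (data_load_image, data_ram, bss_ram, init_memory) is hypothetical. A real bootstrap would more likely do this in assembly before any C code runs.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Stand-ins for what a linker script would provide; all names are hypothetical. */
static const uint32_t data_load_image[] = { 5u };  /* flash copy of .data: x = 5 */
static uint32_t data_ram[1];                       /* run-time home of x in SRAM */
static uint32_t bss_ram[1];                        /* run-time home of y in SRAM */

/* What a bootstrap does between reset and main(): copy .data, zero .bss. */
static void init_memory(void)
{
    assert(sizeof data_ram == sizeof data_load_image);
    memcpy(data_ram, data_load_image, sizeof data_ram);   /* x gets its 5     */
    memset(bss_ram, 0, sizeof bss_ram);                   /* y gets its zero  */
}
```

Whatever the SRAM held at power-up, after init_memory() the word standing in for x holds 5 and the word standing in for y holds 0, which is the state C code expects on entry to main().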
0x200xxxxx
top of SRAM stack grows down
.
.
.
.
.
bottom of heap, heap allocates upward
top of .bss/.data global variables
bottom of .bss/.data global variables bottom of sram.
0x20000000
Programs execute up, forward, in increasing address order until acted on by a branch or jump, so you start your binary in low memory. The linker starts at the bottom with the program, machine code, .text (subject to the booting rules for the processor: the vector table or reset handler are where the processor says they are). On top of that will typically sit the non-volatile copy of global stuff like .data and .bss (global offset table, etc.), and then ideally you have flash left over because you didn't push it to the limit, and that just sits unused.
0x080xxxxx
top of flash
.
.
.
unused space above
.
.data/.bss/etc non-volatile copy start here
top of binary/program, read-only variables
.
.
bottom of FLASH program grows (is linked) upward
0x08000000
That is the textbook model and very often what you will see in the real world. You as the bare metal programmer are responsible for the memory space and how it is used. You might choose, for various reasons, to just take someone else's layout and reuse it (thus why "everyone uses _estack"; they do not, you just did not look everywhere). That is still a choice you make for controlling the address space: choosing to just use someone else's. You can certainly mix things around; you could have the stack grow down from the middle and the heap grow up. If you know your program is light on nesting and light on local variables (globals are very good for bare metal MCU work, maybe bad in a textbook sense but very good for bare metal; locals are bad), you might choose to set the stack pointer to 0x20002000 or even lower, so that when you copy this code from one MCU to another you do not have to keep tweaking that address in the linker scripts. Another choice.
If you choose to use the other stack pointer, then you have to make a choice, since the two stacks cannot sit on top of each other. It is the same if you find yourself working in an ARM7 or ARM11 based bare metal environment (ARM7TDMI, for example), which has multiple stack pointers: you now have to think about how much memory to give to each and leave some over for variables and heap, if you have a heap. You still set each stack pointer to the top of the space you have defined for it, so it can grow downward (decreasing addresses).
flash.s (reset handler and bootstrap)
.cpu cortex-m3
.thumb
.thumb_func
.word 0x08000800
.word reset
.globl reset
.thumb_func
reset:
bl notmain
b .
notmain.c
unsigned int x = 5;
unsigned int y;
const unsigned int z = 6;
int some_global_function ( void )
{
    static unsigned int x = 7;
    return(x);
}
int notmain ( void )
{
    y = x;
    return x;
}
flash.ld
MEMORY
{
    FLASH : ORIGIN = 0x08000000, LENGTH = 1K
    SRAM : ORIGIN = 0x20000000, LENGTH = 1K
}
SECTIONS
{
    .text : { *(.text*) } > FLASH
    .rodata : { *(.rodata*) } > FLASH
    .bss : { *(.bss*) } > SRAM AT >FLASH
    .data : { *(.data*) } > SRAM AT >FLASH
}
build
arm-none-eabi-gcc -Wall -O2 -mthumb -c notmain.c -o notmain.o
arm-none-eabi-objdump -D notmain.o > notmain.c.list
arm-none-eabi-as --warn --fatal-warnings flash.s -o flash.o
arm-none-eabi-ld -T flash.ld flash.o notmain.o -o notmain.elf
arm-none-eabi-objdump -D notmain.elf > notmain.list
arm-none-eabi-objcopy -O binary notmain.elf notmain.bin
arm-none-eabi-objcopy --srec-forceS3 notmain.elf -O srec notmain.srec
examine
notmain.list
Disassembly of section .text:
08000000 <reset-0x8>:
8000000: 08000800 stmdaeq r0, {fp}
8000004: 08000009 stmdaeq r0, {r0, r3}
08000008 <reset>:
8000008: f000 f804 bl 8000014 <notmain>
800000c: e7fe b.n 800000c <reset+0x4>
...
08000010 <some_global_function>:
8000010: 2007 movs r0, #7
8000012: 4770 bx lr
08000014 <notmain>:
8000014: 4b02 ldr r3, [pc, #8] ; (8000020 <notmain+0xc>)
8000016: 6818 ldr r0, [r3, #0]
8000018: 4b02 ldr r3, [pc, #8] ; (8000024 <notmain+0x10>)
800001a: 6018 str r0, [r3, #0]
800001c: 4770 bx lr
800001e: 46c0 nop ; (mov r8, r8)
8000020: 20000004 andcs r0, r0, r4
8000024: 20000000 andcs r0, r0, r0
Disassembly of section .rodata:
08000028 <z>:
8000028: 00000006 andeq r0, r0, r6
Disassembly of section .bss:
20000000 <y>:
20000000: 00000000 andeq r0, r0, r0
Disassembly of section .data:
20000004 <x>:
20000004: 00000005 andeq r0, r0, r5
notmain.srec
S00F00006E6F746D61696E2E737265631F
S31508000000000800080900000800F004F8FEE70000F0
S3150800001007207047024B1868024B18607047C046A5
S30D08000020040000200000002086
S3090800002806000000C0
S3090800002C05000000BD
S70508000000F2
reformatted
S315 08000000 000800080900000800F004F8FEE70000
S315 08000010 07207047024B1868024B18607047C046
S30D 08000020 0400002000000020
S309 08000028 06000000
S309 0800002C 05000000
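For readers who want to check the records above by hand, here is a rough host-side C sketch of decoding one S3 line. The function names hex_digit and decode_s3 are invented for this example. It relies on the usual S-record layout (a byte count, a 32-bit big-endian address, the data, then a checksum chosen so that all of those bytes sum to 0xFF in the low byte), and it handles S3 records only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Convert one hex digit; returns -1 if the character is not a hex digit. */
static int hex_digit(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/* Decode one S3 record. On success returns the number of data bytes copied
   into data[] and stores the load address in *addr. Returns -1 on malformed
   input, checksum mismatch, or if data[] (max bytes) is too small. */
static int decode_s3(const char *line, uint32_t *addr, uint8_t *data, size_t max)
{
    uint8_t bytes[256];
    unsigned sum = 0;
    size_t len, nbytes, ndata;

    assert(line != NULL && addr != NULL && data != NULL);
    len = strlen(line);
    if (len < 4 || line[0] != 'S' || line[1] != '3' || (len % 2) != 0)
        return -1;
    nbytes = (len - 2) / 2;          /* count + address + data + checksum */
    if (nbytes > sizeof bytes)
        return -1;
    for (size_t i = 0; i < nbytes; i++) {
        int hi = hex_digit(line[2 + 2 * i]);
        int lo = hex_digit(line[3 + 2 * i]);
        if (hi < 0 || lo < 0)
            return -1;
        bytes[i] = (uint8_t)(hi * 16 + lo);
        sum += bytes[i];
    }
    /* The count covers everything after itself: at least 4 address bytes + checksum. */
    if (bytes[0] < 5 || bytes[0] != nbytes - 1)
        return -1;
    if ((sum & 0xFFu) != 0xFFu)      /* all bytes, checksum included, sum to 0xFF */
        return -1;
    ndata = nbytes - 6;              /* minus count (1), address (4), checksum (1) */
    if (ndata > max)
        return -1;
    *addr = ((uint32_t)bytes[1] << 24) | ((uint32_t)bytes[2] << 16) |
            ((uint32_t)bytes[3] << 8) | (uint32_t)bytes[4];
    memcpy(data, bytes + 5, ndata);
    return (int)ndata;
}
```

Feeding it the last data record from the listing, S3090800002C05000000BD, yields address 0x0800002C and the four bytes 05 00 00 00, a little-endian 5 sitting in flash.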
The global variables
Disassembly of section .bss:
20000000 <y>:
20000000: 00000000 andeq r0, r0, r0
Disassembly of section .data:
20000004 <x>:
20000004: 00000005 andeq r0, r0, r5
The static local x within the function (a "local global") is private to that function, and because the code only ever reads it, the compiler treated it as a constant and it did not get storage in .data as one might expect.
08000010 <some_global_function>:
8000010: 2007 movs r0, #7
8000012: 4770 bx lr
We will fix this in a minute.
You told the linker where things go with the (rx) and (rxw) attributes, which I STRONGLY recommend against: you are just fighting yourself, and the linker chooses which part of your linker script to ignore, without warning. In my case I told it with > SRAM AT > FLASH and with the order in which I placed the items in the linker script, so that .bss and then .data land in the SRAM address space. But we also see x in the flash.
S309 08000028 06000000 z, .rodata, as defined, after .text
S309 0800002C 05000000 x, as defined in the ld script after .rodata
You can hexdump the binary as well and see it.
I did not make a complete, functional linker script/bootstrap; it gets ugly to support .data and .bss. I wanted to show the construction of the binary, how the stack pointer gets up there and why, and what lives down below. For .bss you add linker script variables and then use those in the bootstrap code to zero it (never chicken and egg it, always use assembly for your bootstrap).
You can see the code grows up (increasing addresses) from 0x08000000.
The stack pointer, for a Cortex-M, is the first word of the vector table, if you choose to use this feature (you do not have to; you have to have something there and it will get loaded, but you can also have code set the stack pointer in the bootstrap).
08000000 <reset-0x8>:
8000000: 08000800 logic loads sp with 0x08000800 before reset handler
8000004: 08000009 reset vector (address of handler ORRed with 1)
08000008 <reset>: reset handler
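The first two vector table words can be pulled out of the raw binary image with a few lines of host-side C. This is an illustrative sketch: the struct and function names (reset_info, read_le32, decode_vectors) are not from any ARM or ST library, and the byte values used to try it out are the first eight bytes of the srec dump shown earlier.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read one little-endian 32-bit word from a byte image. */
static uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

struct reset_info {
    uint32_t initial_sp;     /* word 0 of the vector table                 */
    uint32_t reset_vector;   /* word 1 as stored, with bit 0 set for thumb */
    uint32_t handler_addr;   /* word 1 with bit 0 cleared                  */
};

/* Decode the first two words of a Cortex-M vector table. */
static struct reset_info decode_vectors(const uint8_t *image, size_t len)
{
    struct reset_info r;
    assert(image != NULL && len >= 8);
    r.initial_sp   = read_le32(image);
    r.reset_vector = read_le32(image + 4);
    r.handler_addr = r.reset_vector & ~(uint32_t)1;
    return r;
}
```

With the bytes 00 08 00 08 09 00 00 08 this gives an initial stack pointer of 0x08000800, a reset vector of 0x08000009, and a handler address of 0x08000008, matching the disassembly.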
The logic dictates how the vector table works and where it is. Technically it is at 0x00000000, but for STM32 and some but not all others that is mirrored to some other address: 0x08000000 for most STM32s, and some STM32s also have a faster bus at 0x00200000 (such that 0x00000000, 0x00200000 and 0x08000000 all answer with the vector table plus some code). After the vector table it is up to me, the bare metal programmer, how the flash and RAM are laid out. It is also up to me to make sure that if I use .data and .bss, they are prepared before main. The three basic rules for a C bootstrap are: init the stack pointer, copy .data, zero .bss. There might be others based on the libraries, etc., linked in.
The stack grows down, so to give the stack the most space before overflowing you start it at the top of RAM; everything else grows up from the bottom of RAM, fixed things first, then the heap, which is dynamic, last. If the MCU has 0x5000 bytes of RAM and SRAM starts at 0x20000000, then you set the stack pointer to 0x20005000.
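The arithmetic behind that number is just origin plus length, which is also how the 20K in the question's MEMORY block turns into _estack = 0x20005000. A tiny C sketch of it follows; the function name stack_top is made up, and the 8-byte alignment check is an addition of this sketch based on the ARM procedure call standard's stack alignment rule, not something the text above requires.

```c
#include <assert.h>
#include <stdint.h>

/* Initial stack pointer for a stack that grows down from the top of SRAM:
   the address one past the last SRAM byte. */
static uint32_t stack_top(uint32_t sram_origin, uint32_t sram_length)
{
    uint32_t top = sram_origin + sram_length;
    assert(top >= sram_origin);    /* no 32-bit wrap-around                  */
    assert((top & 7u) == 0u);      /* assumption of this sketch: 8-byte aligned */
    return top;
}
```

For example, stack_top(0x20000000, 20 * 1024) evaluates to 0x20005000, since 20K is 0x5000 bytes.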
Oh yeah I was going to fix the local global.
int some_global_function ( void )
{
    static unsigned int x = 7;
    return(x++);
}
08000010 <some_global_function>:
8000010: 4b02 ldr r3, [pc, #8] ; (800001c <some_global_function+0xc>)
8000012: 6818 ldr r0, [r3, #0]
8000014: 1c42 adds r2, r0, #1
8000016: 601a str r2, [r3, #0]
8000018: 4770 bx lr
800001a: 46c0 nop ; (mov r8, r8)
800001c: 20000004 andcs r0, r0, r4
Disassembly of section .bss:
20000000 <y>:
20000000: 00000000 andeq r0, r0, r0
Disassembly of section .data:
20000004 <x.4144>:
20000004: 00000007 andeq r0, r0, r7
20000008 <x>:
20000008: 00000005 andeq r0, r0, r5
It lands in .data where it belongs. It is really the linker's choice/design as to which to put first, the static locals or the globals.