c assembly llvm arm64 bare-metal

How to build a bare metal app for aarch64 using LLVM?


I am trying to understand how I can compile and link a bare metal app using LLVM (macOS).

loader.s:

.global _reset
_reset:
    # Set up stack pointer
    LDR X2, =stack_top
    MOV SP, X2
    # Magic number
    MOV X13, #0x1337
    # Loop endlessly
    BL start

main.c:

void start() {
    for(;;) {

    }
}

With GNU GCC I would write a linker script, but it seems LLVM doesn't support that. So how can I tell llvm-link to join the object files in the right order and specify an offset for the code?

P.S. Unfortunately I am a newbie at assembly, so that causes me a lot of problems too. In short, I want to execute some minimal assembly code and jump to a C function that runs an infinite loop.

The assembly code above was copied, with small changes, from here


Solution

  • novectors.s

    .globl _reset
    _reset:
        mov sp,#0x2000
        bl notmain
        b .
    
    .data
    .word 0x12345678
    

    memmap:

    ENTRY(_reset)
    MEMORY
    {
        ram : ORIGIN = 0x1000, LENGTH = 0x1000
    }
    
    SECTIONS
    {
        .text   : { *(.text*)   } > ram
        .rodata : { *(.rodata*) } > ram
        .bss    : { *(.bss*)    } > ram
        .data   : { *(.data*)   } > ram
    }
    

    notmain.c:

    unsigned int x;
    unsigned int y=5;
    
    void notmain ( void )
    {
        x=y;
    }
    

    build

    clang  -c -Wall  -O2 novectors.s -o novectors.o
    clang  -c -Wall  -O2 -fomit-frame-pointer -fno-exceptions -fno-asynchronous-unwind-tables -fno-unwind-tables notmain.c -o notmain.o
    ld.lld -nostdlib -T memmap novectors.o notmain.o -o notmain.elf 
    llvm-objdump -D notmain.elf > notmain.list
    llvm-objcopy notmain.elf -O binary notmain.bin
    

    disassembly

    Disassembly of section .text:
    
    0000000000001000 <_reset>:
        1000: b27303ff      orr sp, xzr, #0x2000
        1004: 94000002      bl  0x100c <notmain>
        1008: 14000000      b   0x1008 <_reset+0x8>
    
    000000000000100c <notmain>:
        100c: 90000008      adrp    x8, 0x1000 <_reset>
        1010: 90000009      adrp    x9, 0x1000 <_reset>
        1014: b9402908      ldr w8, [x8, #0x28]
        1018: b9002128      str w8, [x9, #0x20]
        101c: d65f03c0      ret
    
    Disassembly of section .bss:
    
    0000000000001020 <x>:
    ...
    
    Disassembly of section .data:
    
    0000000000001024 <$d.1>:
        1024: 78 56 34 12   .word   0x12345678
    
    0000000000001028 <y>:
        1028: 05 00 00 00   .word   0x00000005
    

    hexdump -C notmain.bin

    00000000  ff 03 73 b2 02 00 00 94  00 00 00 14 08 00 00 90  |..s.............|
    00000010  09 00 00 90 08 29 40 b9  28 21 00 b9 c0 03 5f d6  |.....)@.(!...._.|
    00000020  00 00 00 00 78 56 34 12  05 00 00 00              |....xV4.....|
    0000002c
    

    You can complicate it from there. QEMU is happy with an ELF file, so you do not have to convert to binary. You do need to figure out the memory space of the machine you are using (cortex-a72 is the core, not the machine; it's the alphabet/language, not the book).

    There are ways to figure this out using the binary format. So far, none of the QEMU machine UARTs I have used require polling/waiting/init, so you can just jam characters into the tx buffer. So you can make a small position-independent program (no stack, etc.) that, provided you can get at the pc on aarch64, reads the pc and prints it on the UART (in hex or octal of course; never use printf in bare metal).

    I threw in all those command-line options because I was getting an eh_frame section. A little Googling turned them up and they worked; I did not spend any more time finding out which one mattered.

    Now, you are on macOS and I do not know what your clang/llvm toolchain looks like. There are pre-builts that can build for other targets to some extent, but my experience is that you get one linker for the host, not for the targets. So I build clang/llvm from sources specifically for my cross target (gnu style), a separate toolchain for each target. I fought llvm for years using the generic one with gnu binutils for assembling and linking, and had to start over on the makefiles with each major version. Quite painful. This is much, much less painful.

    Where do you find machine info for QEMU? Sadly, the best way is to get the source code and dig through it; that is the only reliable way I have found. If you find some not overly complicated examples for the same core and machine, you can maybe steal at least a working memory space from them, and then look at the QEMU sources for the UART tx register.

    QEMU is not a real machine: the more time you spend writing code for QEMU, the less likely it is to work on real hardware, even if the machine and core are supposed to match. It is a nice, or at least tolerable, way to learn and you won't brick any hardware, but some day you have to try real hardware and deal with it. You can get a 64-bit ARM-based Raspberry Pi for a few bucks, and the bare metal forum at their site is very good; it will get you similarly simple bare metal examples that can get you booted. (While technically you can brick a Pi, if I understand right, you have to actually work at it; in general you pop the SD card out and try again.)

    llvm claims gnu compatibility: clang as an assembler (ewww), clang versus gcc, their linker and such. It is close but not 100%. Beware that statements like that are generally going to be false, and again, the more code you write without checking, the bigger the failure.

    You need to get the ARM documentation for the core, cortex-whatever. It is called a Technical Reference Manual (TRM); get it directly from ARM, not from someone else. The TRM will tell you the architecture (ARMv8-A or something), and you will then need the Architecture Reference Manual (ARM ARM) for that architecture. You do NOT need the programmer's reference, which creates more confusion than it solves. Also do not look at some generic 64-bit ARM instruction reference on their site; again, more problems than solutions. You need the TRM and the ARM ARM for your core. The chip and other stuff is not ARM's, so you have to get that from the chip vendor, or in the case of QEMU from the QEMU sources.

    In the ARM ARM you will find how the aarch64 cores boot and how their vectors work, which is not how the 32-bit ones work and not how the cortex-ms work; they are three different solutions. Then adjust your bootstrap accordingly to place the handlers at the addresses the logic/hardware defines.

    I find the gnu tools easier, and you will find considerably more examples out there for gnu. Clang/llvm is not a drop-in replacement. Gnu is easier to come by IMO and/or easier to build (cmake sucks; if nothing else, it takes almost an order of magnitude longer to build).

    The above could be simpler, but this is not a bad starting point, then you can complicate it from there if you feel the need. Most folks grossly overcomplicate everything, find your ideal base/skeleton.


    Okay, the thing you linked (it is bad to put external links on this site; they are not guaranteed to remain up) covers all of this topic, so I am not sure why you are asking again here.

    bootstrap

    .globl _reset
    _reset:
    
        mov x0,#0x09000000
        mov w1,#0x55
    loop:
        strb w1,[x0]
        b loop
    

    linker script

    ENTRY(_reset)
    MEMORY
    {
        ram : ORIGIN = 0x40000000, LENGTH = 0x1000
    }
    
    SECTIONS
    {
        .text   : { *(.text*)   } > ram
        .rodata : { *(.rodata*) } > ram
    }
    

    build

    clang  -c -Wall  -O2 novectors.s -o novectors.o
    ld.lld -nostdlib -T memmap novectors.o -o notmain.elf 
    llvm-objdump -D notmain.elf > notmain.list
    

    Always check it first.

    Disassembly of section .text:
    
    0000000040000000 <_reset>:
    40000000: d2a12000      mov x0, #0x9000000
    40000004: 52800aa1      mov w1, #0x55
    
    0000000040000008 <loop>:
    40000008: 39000001      strb    w1, [x0]
    4000000c: 17ffffff      b   0x40000008 <loop>
    

    run it

    qemu-system-aarch64 -M virt -cpu cortex-a72 -m 128M -nographic -kernel notmain.elf
    

    and it spams the screen with the letter U (0x55).

    Press ctrl-a then x (at least on Linux) to kill it.

    With C.

    bootstrap

    .globl _reset
    _reset:
    
        mov x10,#0x48000000
        mov sp,x10
        bl notmain
        b .
    
    .globl write32
    write32:
        str w1,[x0]
        ret
    

    c code

    extern void write32 ( unsigned int, unsigned int );
    void notmain ( void )
    {
        unsigned int ra;
        for(ra=0x30;;ra++)
        {
            ra&=0x37;
            write32(0x09000000,ra);
        }
    }
    

    linker script

    ENTRY(_reset)
    MEMORY
    {
        ram : ORIGIN = 0x40000000, LENGTH = 0x1000
    }
    
    SECTIONS
    {
        .text   : { *(.text*)   } > ram
        .rodata : { *(.rodata*) } > ram
    }
    

    build

    clang  -c -Wall  -O2 novectors.s -o novectors.o
    clang  -c -Wall  -O2 -fomit-frame-pointer -fno-exceptions -fno-asynchronous-unwind-tables -fno-unwind-tables notmain.c -o notmain.o
    ld.lld -nostdlib -T memmap novectors.o notmain.o -o notmain.elf 
    llvm-objdump -D notmain.elf > notmain.list
    

    Run it the same way as above and it spews 0123456701234567 until you stop it.
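
To see why that loop produces the digits 0 through 7 over and over, here is a small host-side sketch (not bare metal). The helper name `first_uart_bytes` is hypothetical; it replaces the `write32` call with a store into a buffer so the sequence can be inspected on an ordinary machine.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical host-side stand-in for the loop in notmain(): instead of
   writing each value to the UART, store the first n values in out[].
   out must have room for n characters plus a terminating NUL. */
static void first_uart_bytes(char *out, size_t n)
{
    unsigned int ra = 0x30;          /* ASCII '0', same start as notmain() */

    assert(out != NULL);
    for (size_t i = 0; i < n; i++, ra++) {
        ra &= 0x37;                  /* 0x38 & 0x37 == 0x30, so '7' wraps to '0' */
        out[i] = (char)ra;
    }
    out[n] = '\0';
}
```

The mask keeps the high nibble at 0x3 and only the low three bits of the counter, which is why the output never goes past '7'.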

    Let's see where a -O binary would be placed...

    linker script

        ENTRY(_reset)
        MEMORY
        {
            ram : ORIGIN = 0x00000000, LENGTH = 0x1000
        }
        
        SECTIONS
        {
            .text   : { *(.text*)   } > ram
            .rodata : { *(.rodata*) } > ram
        }
        
    

    bootstrap

        .globl _reset
        _reset:
        
            mov x10,#0x48000000
            mov sp,x10
            bl notmain
            b .
        
        .globl write32
        write32:
            str w1,[x0]
            ret
        
        
        .globl getpc
        getpc:
            mov x1,lr
            bl here
        here:
            mov x0,lr
            mov lr,x1
            ret
        
    

    c code

        extern unsigned int getpc ( void );
        extern void write32 ( unsigned int, unsigned int );
        void notmain ( void )
        {
            unsigned int ra;
            unsigned int rb;
            unsigned int rc;
        
            ra=getpc();
            rb=32;
            while(1)
            {
                rb-=4;
                rc=(ra>>rb)&0xF;
                if(rc>9) rc+=0x37; else rc+=0x30;
                write32(0x09000000,rc);
                if(rb==0) break;
            }
            write32(0x09000000,0x0D);
            write32(0x09000000,0x0A);
        
        }
    

    build

        clang  -c -Wall  -O2 novectors.s -o novectors.o
        clang  -c -Wall  -O2 -fomit-frame-pointer -fno-exceptions -fno-asynchronous-unwind-tables -fno-unwind-tables notmain.c -o notmain.o
        ld.lld -nostdlib -T memmap novectors.o notmain.o -o notmain.elf 
        llvm-objdump -D notmain.elf > notmain.list
        llvm-objcopy notmain.elf -O binary notmain.bin
        
    

    run

        qemu-system-aarch64 -M virt -cpu cortex-a72 -m 128M -nographic -kernel notmain.bin
        40080020
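
The printed value can be turned into a load address with a little arithmetic. This is a host-side sketch, not part of the bare metal program. The helper names are hypothetical, and the 0x20 offset assumes the bootstrap above is linked first at ORIGIN 0, so that eight 4-byte instructions precede the `here` label.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed link-time address of `here` with ORIGIN = 0: the eight
   instructions before it in the bootstrap occupy 0x00..0x1f. */
#define HERE_LINK_ADDR 0x20u

/* Hypothetical helper: where was the image actually loaded? */
static uint32_t load_base(uint32_t runtime_pc, uint32_t link_addr)
{
    assert(runtime_pc >= link_addr);
    return runtime_pc - link_addr;
}

/* Same nibble-at-a-time hex conversion as notmain(), but into a buffer
   (8 digits plus NUL) instead of the UART. */
static void format_hex32(uint32_t value, char out[9])
{
    unsigned int shift = 32;

    for (int i = 0; i < 8; i++) {
        shift -= 4;
        unsigned int nibble = (value >> shift) & 0xF;
        out[i] = (char)(nibble > 9 ? nibble + 0x37 : nibble + 0x30);
    }
    out[8] = '\0';
}
```

With the value printed above, `load_base(0x40080020, HERE_LINK_ADDR)` gives 0x40080000, which is the ORIGIN used in the next linker script.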
    

    So if we switch to

    ENTRY(_reset)
    MEMORY
    {
        ram : ORIGIN = 0x40080000, LENGTH = 0x1000
        nada: ORIGIN = 0xFFFFFFF0, LENGTH = 0
    }
    SECTIONS
    {
        .text   : { *(.text*)   } > ram
        .rodata : { *(.rodata*) } > ram
        .bss    : { *(.bss*)    } > nada
        .data   : { *(.data*)   } > nada
    }
    

    You can build and run either the ELF or the .bin file and it should be just fine. (With the .bin you can trivially have .bss and .data; for the ELF you have to do some ugly linker script stuff and then some bootstrap.)

    If you do not let yourself use any .data and do not expect .bss to be zeros then your bootstrap and linker script can be trivial.

    ENTRY(_reset)
    MEMORY
    {
        ram : ORIGIN = 0x40080000, LENGTH = 0x1000
        nada: ORIGIN = 0xFFFFFFF0, LENGTH = 0
    }
    SECTIONS
    {
        .text   : { *(.text*)   } > ram
        .rodata : { *(.rodata*) } > ram
        .bss    : { *(.bss*)    } > ram
        .data   : { *(.data*)   } > nada
    }
    

    and

    mov x10,#0x48000000
    mov sp,x10
    bl notmain
    b .
    

    If you use the .bin file format, you can use the layout from the example at the top and get zeroed .bss and initialized .data trivially:

    ENTRY(_reset)
    MEMORY
    {
        ram : ORIGIN = 0x40080000, LENGTH = 0x1000
    }
    SECTIONS
    {
        .text   : { *(.text*)   } > ram
        .rodata : { *(.rodata*) } > ram
        .bss    : { *(.bss*)    } > ram
        .data   : { *(.data*)   } > ram
    }
    

    and

    .globl _reset
    _reset:
    
        mov x10,#0x48000000
        mov sp,x10
        bl notmain
        b .
    
    ...
    
    .data
    .word 0x12345678
    

    but not with the ELF format.
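
For the case where the bootstrap has to do that work itself, the usual shape is a zeroing loop and a copy loop. This is only a rough sketch of the bootstrap half. It assumes the linker script exports start/end symbols for each section, which none of the scripts shown here do, so the function names and the symbol names in the comments are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Zero a .bss-like range [start, end). In a real bootstrap the arguments
   would come from linker-script symbols (hypothetical names such as
   __bss_start and __bss_end). */
static void zero_bss(uint32_t *start, uint32_t *end)
{
    assert(start <= end);
    while (start < end)
        *start++ = 0;
}

/* Copy .data initial values from where they were loaded to where the
   program expects them (hypothetical symbols: __data_load, __data_start,
   __data_end). */
static void copy_data(const uint32_t *load, uint32_t *start, uint32_t *end)
{
    assert(start <= end);
    while (start < end)
        *start++ = *load++;
}
```

Both would be called from the bootstrap after the stack pointer is set and before `notmain`. Avoiding .data, and not depending on zeroed .bss, makes them unnecessary, which is the trivial route described above.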