linker embedded cpu-architecture riscv relocation

How can instructions such as the RISC-V auipc work when the FW image is to be placed at some random address?

The RISC-V instruction auipc computes rd = PC + (imm << 12), where rd is the destination register and imm is a 20-bit immediate.

The result of this instruction depends on the address at which the binary runs. Suppose a system uses a bootloader to boot a firmware image. In that case, the initial PC for the firmware image will be different from 0x0. The linker script reflects this with something like:

    .text :
    {
        _text = .;
        *(.text)
        _etext = .;
    } > FW_IMG

where FW_IMG is something like:

FW_IMG      (rx): ORIGIN = 2048,    LENGTH = 2304

My question is, how can this work?

I mean, suppose a 32-bit CPU, and that the 4th instruction the compiler generates is an auipc. Suppose the FW image is to be placed at address 0x20000000; then the PC for that instruction will be 0x20000000 + 12 (the 4th instruction, assuming 4-byte instructions). Will the compiler be aware of this so that it generates the right values for the auipc instruction?

EDIT

A good example of this is la. la is a pseudo-instruction that will be expanded to an auipc and an addi. If the compiler generates code to load a symbol, depending on where the image is to be located at runtime, the generated instructions will be different.

EDIT 2

I have tried building the same image with two completely different linker scripts, with an la as the first instruction. The generated auipc instructions are indeed different in each case, and they calculate the right address.

The only explanation I can find for this is that, somehow, the assembler generates auipc 'placeholders' and then the linker fills them in with the right values.

Solution

Let us ask the toolchain.

so.c

unsigned int x;
unsigned int y=5;
unsigned int more_fun ( unsigned int );
unsigned int fun ( unsigned int a )
{
    x=a+y;
    return(more_fun(x)+3);
}

start.s

.globl more_fun
more_fun:
    j .

so.ld

MEMORY
{
    mem0 : ORIGIN = 0x00003000, LENGTH = 0x1000
    mem1 : ORIGIN = 0x00004000, LENGTH = 0x1000
    mem2 : ORIGIN = 0x00005000, LENGTH = 0x1000
    mem3 : ORIGIN = 0x00006000, LENGTH = 0x1000
}
SECTIONS
{
    .text   : { *(.text*)   } > mem0
    .rodata : { *(.rodata*) } > mem1
    .bss    : { *(.bss*)    } > mem2
    .data   : { *(.data*)   } > mem3
    .got    : { *(.got*)    } > mem0
}

There is no need at this point for this to be an actually functioning program.

position dependent

Disassembly of section .text:

00003000 <more_fun>:
    3000:   0000006f            j   3000 <more_fun>

00003004 <fun>:
    3004:   6799                    lui x15,0x6
    3006:   0007a783            lw  x15,0(x15) # 6000 <y>
    300a:   1141                    addi    x2,x2,-16
    300c:   c606                    sw  x1,12(x2)
    300e:   953e                    add x10,x10,x15
    3010:   6795                    lui x15,0x5
    3012:   00a7a023            sw  x10,0(x15) # 5000 <x>
    3016:   37ed                    jal 3000 <more_fun>
    3018:   40b2                    lw  x1,12(x2)
    301a:   050d                    addi    x10,x10,3
    301c:   0141                    addi    x2,x2,16
    301e:   8082                    ret

Disassembly of section .sbss:

00005000 <x>:
    5000:   0000                    unimp
    ...

Disassembly of section .sdata:

00006000 <y>:
    6000:   0005                    c.nop   1

position independent

Disassembly of section .text:

00003000 <more_fun>:
    3000:   0000006f            j   3000 <more_fun>

00003004 <fun>:
    3004:   00000797            auipc   x15,0x0
    3008:   02c7a783            lw  x15,44(x15) # 3030 <_GLOBAL_OFFSET_TABLE_+0x8>
    300c:   439c                    lw  x15,0(x15)
    300e:   1141                    addi    x2,x2,-16
    3010:   c606                    sw  x1,12(x2)
    3012:   953e                    add x10,x10,x15
    3014:   00000797            auipc   x15,0x0
    3018:   0187a783            lw  x15,24(x15) # 302c <_GLOBAL_OFFSET_TABLE_+0x4>
    301c:   c388                    sw  x10,0(x15)
    301e:   37cd                    jal 3000 <more_fun>
    3020:   40b2                    lw  x1,12(x2)
    3022:   050d                    addi    x10,x10,3
    3024:   0141                    addi    x2,x2,16
    3026:   8082                    ret

Disassembly of section .bss:

00005000 <x>:
    5000:   0000                    unimp
    ...

Disassembly of section .data:

00006000 <y>:
    6000:   0005                    c.nop   1
    ...

Disassembly of section .got:

00003028 <_GLOBAL_OFFSET_TABLE_>:
    3028:   0000                    unimp
    302a:   0000                    unimp
    302c:   5000                    lw  x8,32(x8)
    302e:   0000                    unimp
    3030:   6000                    flw f8,0(x8)
    3032:   0000                    unimp
    3034:   ffff                    .2byte  0xffff
    3036:   ffff                    .2byte  0xffff
    3038:   0000                    unimp
    ...

AUIPC (add upper immediate to pc) is used to build pc-relative addresses and uses the U-type format. AUIPC forms a 32-bit offset from the 20-bit U-immediate, filling in the lowest 12 bits with zeros, adds this offset to the address of the AUIPC instruction, then places the result in register rd.

In this case I put the .got in the same memory region as .text, so no major adjustment is needed here (the auipc immediate is 0). Get to the GOT, then use the GOT to get to the data.
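To check the disassembler's annotations by hand, here is a minimal C sketch of the address arithmetic for an auipc followed by a lw, assuming RV32 (32-bit wraparound). The function names auipc_value and load_address are made up for illustration and are not part of any toolchain.

```c
#include <assert.h>
#include <stdint.h>

/* Value auipc writes to rd on RV32: pc + (imm20 << 12), wrapping at 32 bits. */
uint32_t auipc_value(uint32_t pc, uint32_t imm20)
{
    assert(imm20 <= 0xFFFFFu);              /* the U-immediate is 20 bits wide */
    return pc + (imm20 << 12);
}

/* Effective address of "lw rd, imm12(rs1)"; imm12 is a signed 12-bit offset. */
uint32_t load_address(uint32_t base, int32_t imm12)
{
    assert(imm12 >= -2048 && imm12 <= 2047);
    return base + (uint32_t)imm12;          /* unsigned add wraps like the hardware */
}
```

For the listing above, load_address(auipc_value(0x3004, 0x0), 44) gives 0x3030 and load_address(auipc_value(0x3014, 0x0), 24) gives 0x302c, the two GOT slots the disassembler points at.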

MEMORY
{
    mem0 : ORIGIN = 0x00003000, LENGTH = 0x1000
    mem1 : ORIGIN = 0x00004000, LENGTH = 0x1000
    mem2 : ORIGIN = 0x00005000, LENGTH = 0x1000
    mem3 : ORIGIN = 0x00006000, LENGTH = 0x1000
}
SECTIONS
{
    .text   : { *(.text*)   } > mem0
    .rodata : { *(.rodata*) } > mem1
    .bss    : { *(.bss*)    } > mem2
    .data   : { *(.data*)   } > mem3
}

Disassembly of section .text:

00003000 <more_fun>:
    3000:   0000006f            j   3000 <more_fun>

00003004 <fun>:
    3004:   00003797            auipc   x15,0x3
    3008:   0087a783            lw  x15,8(x15) # 600c <_GLOBAL_OFFSET_TABLE_+0x8>
    300c:   439c                    lw  x15,0(x15)
    300e:   1141                    addi    x2,x2,-16
    3010:   c606                    sw  x1,12(x2)
    3012:   953e                    add x10,x10,x15
    3014:   00003797            auipc   x15,0x3
    3018:   ff47a783            lw  x15,-12(x15) # 6008 <_GLOBAL_OFFSET_TABLE_+0x4>
    301c:   c388                    sw  x10,0(x15)
    301e:   37cd                    jal 3000 <more_fun>
    3020:   40b2                    lw  x1,12(x2)
    3022:   050d                    addi    x10,x10,3
    3024:   0141                    addi    x2,x2,16
    3026:   8082                    ret

Disassembly of section .bss:

00005000 <x>:
    5000:   0000                    unimp
    ...

Disassembly of section .data:

00006000 <y>:
    6000:   0005                    c.nop   1
    ...

Disassembly of section .got:

00006004 <_GLOBAL_OFFSET_TABLE_>:
    6004:   0000                    unimp
    6006:   0000                    unimp
    6008:   5000                    lw  x8,32(x8)
    600a:   0000                    unimp
    600c:   6000                    flw f8,0(x8)
    ...

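Apparently it tacks the .got onto the end of .data if you do not place it yourself. But it is all good. The auipc adds 0x3000 to its own address in the 0x3000 range to land in the 0x6000 range, and the lw offset then selects the GOT entry.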
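The immediates in that listing can be reproduced by splitting the pc-relative distance into an upper 20-bit part for the auipc and a signed 12-bit part for the following instruction. This is a minimal sketch of that split, assuming RV32; split_pcrel is a hypothetical helper name, and the real work is done by the linker when it resolves relocations.

```c
#include <assert.h>
#include <stdint.h>

/* Split the distance from pc to target into the auipc immediate (hi20)
 * and the signed 12-bit immediate of the next instruction (lo12).
 * The low part is sign-extended by the hardware, so when it comes out
 * negative the upper part has to be one larger to compensate. */
void split_pcrel(uint32_t pc, uint32_t target, uint32_t *hi20, int32_t *lo12)
{
    assert(hi20 != 0 && lo12 != 0);
    uint32_t delta = target - pc;               /* wraps modulo 2^32 */
    int32_t lo = (int32_t)(delta & 0xFFFu);     /* raw low 12 bits */
    if (lo >= 0x800)
        lo -= 0x1000;                           /* reinterpret as signed */
    *hi20 = (delta - (uint32_t)lo) >> 12;       /* what is left for auipc */
    *lo12 = lo;
}
```

With pc = 0x3004 and target = 0x600c this yields hi20 = 0x3 and lo12 = 8; with pc = 0x3014 and target = 0x6008 it yields hi20 = 0x3 and lo12 = -12, matching the two auipc/lw pairs above.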

The call to more_fun is a pc-relative offset.

The jump and link (JAL) instruction uses the J-type format, where the J-immediate encodes a signed offset in multiples of 2 bytes. The offset is sign-extended and added to the address of the jump instruction to form the jump target address. Jumps can therefore target a ±1 MiB range. JAL stores the address of the instruction following the jump (pc+4) into register rd. The standard software calling convention uses x1 as the return address register and x5 as an alternate link register.
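A minimal sketch of the range rule in that paragraph, assuming RV32 and the uncompressed JAL encoding (the compressed form has a much shorter reach). jal_reaches is a made-up name for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if a single JAL at pc can reach target: the offset has to be a
 * multiple of 2 and fit in a signed 21-bit value, i.e. [-1 MiB, +1 MiB). */
bool jal_reaches(uint32_t pc, uint32_t target)
{
    assert((pc & 1u) == 0);                     /* instructions are 2-byte aligned */
    int64_t offset = (int64_t)target - (int64_t)pc;
    return (offset % 2 == 0) && offset >= -(1 << 20) && offset < (1 << 20);
}
```

jal_reaches(0x301e, 0x3000) is true, so the call in the listings above fits in one instruction. jal_reaches(0x10, 0x10000000) is false, which is the situation where the tools have to use a longer sequence instead.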

So until the program gets very big (or you play linker games to make function calls far apart) that all works.

Here is the thing about position independence: think of the binary as a blob. If you load the binary above at 0x3000 then .data is at 0x6000, 0x3000 bytes away. But if you load it at 0x20003000 then .data is at 0x20006000, which is still 0x3000 bytes away.

But you have to update the GOT:

    600c:   0x20006000

But that is the whole point. You isolate the address of every global (or group of them) and put it in a table. Then if you want to relocate the program elsewhere, you or the loader of the program has to find and change the entries in the GOT, in this case by adding 0x20000000 to all of them. Then the code all works.

In a bootloader situation you are probably not an operating system parsing an ELF file, so put everything in one memory region:

MEMORY
{
    mem0 : ORIGIN = 0x00000000, LENGTH = 0x1000
}
SECTIONS
{
    .text   : { *(.text*)   } > mem0
    .rodata : { *(.rodata*) } > mem0
    .bss    : { *(.bss*)    } > mem0
    .data   : { *(.data*)   } > mem0
}


Disassembly of section .text:

00000000 <more_fun>:
   0:   0000006f            j   0 <more_fun>

00000004 <fun>:
   4:   00000797            auipc   x15,0x0
   8:   0347a783            lw  x15,52(x15) # 38 <_GLOBAL_OFFSET_TABLE_+0x8>
   c:   439c                    lw  x15,0(x15)
   e:   1141                    addi    x2,x2,-16
  10:   c606                    sw  x1,12(x2)
  12:   953e                    add x10,x10,x15
  14:   00000797            auipc   x15,0x0
  18:   0207a783            lw  x15,32(x15) # 34 <_GLOBAL_OFFSET_TABLE_+0x4>
  1c:   c388                    sw  x10,0(x15)
  1e:   37cd                    jal 0 <more_fun>
  20:   40b2                    lw  x1,12(x2)
  22:   050d                    addi    x10,x10,3
  24:   0141                    addi    x2,x2,16
  26:   8082                    ret

Disassembly of section .bss:

00000028 <x>:
  28:   0000                    unimp
    ...

Disassembly of section .data:

0000002c <y>:
  2c:   0005                    c.nop   1
    ...

Disassembly of section .got:

00000030 <_GLOBAL_OFFSET_TABLE_>:
  30:   0000                    unimp
  32:   0000                    unimp
  34:   0028                    addi    x10,x2,8
  36:   0000                    unimp
  38:   002c                    addi    x11,x2,8
    ...

In your bootstrap you would use auipc x15,0 to get the pc, then use normal (linker plus programming) techniques to get the offset to and the size of the GOT. You would then adjust each entry yourself before running any code that relies on the .got to find its data.
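A minimal C sketch of that adjustment, under loud assumptions: the GOT holds plain 32-bit addresses, the caller already knows where the table starts and how many entries to patch (for example from symbols defined in the linker script), and every entry passed in needs the same displacement. got_fixup and its parameters are hypothetical names, not something the toolchain provides.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Add the load displacement to each 32-bit GOT entry in place.
 * link_base is the address the image was linked for, run_base is
 * where it actually ended up (e.g. derived from auipc at startup). */
void got_fixup(uint32_t *got, size_t count, uint32_t link_base, uint32_t run_base)
{
    assert(got != NULL || count == 0);
    uint32_t delta = run_base - link_base;  /* unsigned wrap covers moves in either direction */
    for (size_t i = 0; i < count; i++)
        got[i] += delta;
}
```

For the single-region image above, the GOT slots at 0x34 and 0x38 hold 0x28 and 0x2c; after got_fixup(slots, 2, 0x0, 0x20000000) they hold 0x20000028 and 0x2000002c. The first slot holds zero in every listing here, so this sketch leaves it to the caller to decide whether to include it.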

Could the toolchain do this without a got?

Sure, but...

mem0 : ORIGIN = 0x10000000, LENGTH = 0x1000

Disassembly of section .text:

10000000 <more_fun>:
10000000:   0000006f            j   10000000 <more_fun>

10000004 <fun>:
10000004:   00000797            auipc   x15,0x0
10000008:   0287a783            lw  x15,40(x15) # 1000002c <y>
1000000c:   97aa                    add x15,x15,x10
1000000e:   1141                    addi    x2,x2,-16
10000010:   853e                    mv  x10,x15
10000012:   c606                    sw  x1,12(x2)
10000014:   00000717            auipc   x14,0x0
10000018:   00f72a23            sw  x15,20(x14) # 10000028 <x>
1000001c:   37d5                    jal 10000000 <more_fun>
1000001e:   40b2                    lw  x1,12(x2)
10000020:   050d                    addi    x10,x10,3
10000022:   0141                    addi    x2,x2,16
10000024:   8082                    ret

Disassembly of section .bss:

10000028 <x>:
10000028:   0000                    unimp
    ...

Disassembly of section .data:

1000002c <y>:
1000002c:   0005                    c.nop   1
    ...

But this origin

mem0 : ORIGIN = 0x00000000, LENGTH = 0x1000

created an optimization I did not want.

Disassembly of section .text:

00000000 <more_fun>:
   0:   0000006f            j   0 <more_fun>

00000004 <fun>:
   4:   02402783            lw  x15,36(x0) # 24 <y>
   8:   97aa                    add x15,x15,x10
   a:   1141                    addi    x2,x2,-16
   c:   853e                    mv  x10,x15
   e:   c606                    sw  x1,12(x2)
  10:   02f02023            sw  x15,32(x0) # 20 <x>
  14:   37f5                    jal 0 <more_fun>
  16:   40b2                    lw  x1,12(x2)
  18:   050d                    addi    x10,x10,3
  1a:   0141                    addi    x2,x2,16
  1c:   8082                    ret

Disassembly of section .bss:

00000020 <x>:
  20:   0000                    unimp
    ...

Disassembly of section .data:

00000024 <y>:
  24:   0005                    c.nop   1
    ...

I wanted this position independence

10000004:   00000797            auipc   x15,0x0
10000008:   0287a783            lw  x15,40(x15) # 1000002c <y>

But despite asking for position independence I got this, which is position dependent:

   4:   02402783            lw  x15,36(x0) # 24 <y>

This is -fpic vs -fpie. You probably want -fpie to make life much easier, but as shown you need to know the tools. The tools know how to do it, but we seem to be able to trip them up.

This one bothered me and delayed even writing this answer.

MEMORY
{
    mem0 : ORIGIN = 0x10003000, LENGTH = 0x1000
    mem1 : ORIGIN = 0x20004000, LENGTH = 0x1000
    mem2 : ORIGIN = 0x30005000, LENGTH = 0x1000
    mem3 : ORIGIN = 0x40006000, LENGTH = 0x1000
}
SECTIONS
{
    .text   : { *(.text*)   } > mem0
    .rodata : { *(.rodata*) } > mem1
    .bss    : { *(.bss*)    } > mem2
    .data   : { *(.data*)   } > mem3
}


Disassembly of section .text:

10003000 <more_fun>:
10003000:   0000006f            j   10003000 <more_fun>

10003004 <fun>:
10003004:   30003797            auipc   x15,0x30003
10003008:   0087a783            lw  x15,8(x15) # 4000600c <_GLOBAL_OFFSET_TABLE_+0x8>
1000300c:   439c                    lw  x15,0(x15)
1000300e:   1141                    addi    x2,x2,-16
10003010:   c606                    sw  x1,12(x2)
10003012:   953e                    add x10,x10,x15
10003014:   30003797            auipc   x15,0x30003
10003018:   ff47a783            lw  x15,-12(x15) # 40006008 <_GLOBAL_OFFSET_TABLE_+0x4>
1000301c:   c388                    sw  x10,0(x15)
1000301e:   37cd                    jal 10003000 <more_fun>
10003020:   40b2                    lw  x1,12(x2)
10003022:   050d                    addi    x10,x10,3
10003024:   0141                    addi    x2,x2,16
10003026:   8082                    ret

Disassembly of section .bss:

30005000 <x>:
30005000:   0000                    unimp
    ...

Disassembly of section .data:

40006000 <y>:
40006000:   0005                    c.nop   1
    ...

Disassembly of section .got:

40006004 <_GLOBAL_OFFSET_TABLE_>:
40006004:   0000                    unimp
40006006:   0000                    unimp
40006008:   5000                    lw  x8,32(x8)
4000600a:   3000                    fld f8,32(x8)
4000600c:   6000                    flw f8,0(x8)
4000600e:   4000                    lw  x8,0(x8)

LOL, I thought this was completely broken, but now I see: the disassembler split each GOT entry into 16-bit values, so the entries really are 0x30005000 and 0x40006000. Whew.
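To read such a dump without squinting, the two 16-bit pieces of each entry can be glued back together. A minimal sketch, assuming a little-endian target (so the piece at the lower address is the low half); word_from_parcels is a made-up helper name.

```c
#include <assert.h>
#include <stdint.h>

/* Rebuild one 32-bit little-endian word from the two consecutive 16-bit
 * values the disassembler printed; parcels[0] is the one at the lower address. */
uint32_t word_from_parcels(const uint16_t parcels[2])
{
    assert(parcels != 0);
    return ((uint32_t)parcels[1] << 16) | parcels[0];
}
```

The pair 0x5000, 0x3000 at 0x40006008 becomes 0x30005000 (the address of x), and the pair 0x6000, 0x4000 at 0x4000600c becomes 0x40006000 (the address of y).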

And just to confirm:

.section .mfun

.globl more_fun
more_fun:
    j .


MEMORY
{
    mem0 : ORIGIN = 0x00000000, LENGTH = 0x1000
    mem1 : ORIGIN = 0x20004000, LENGTH = 0x1000
    mem2 : ORIGIN = 0x30005000, LENGTH = 0x1000
    mem3 : ORIGIN = 0x40006000, LENGTH = 0x1000
    mem4 : ORIGIN = 0x10000000, LENGTH = 0x1000
}
SECTIONS
{
    .text   : { *(.text*)   } > mem0
    .rodata : { *(.rodata*) } > mem0
    .bss    : { *(.bss*)    } > mem0
    .data   : { *(.data*)   } > mem0
    .mfun   : { *(.mfun*)   } > mem4
}

Disassembly of section .text:

00000000 <fun>:
   0:   02402783            lw  x15,36(x0) # 24 <y>
   4:   97aa                    add x15,x15,x10
   6:   1141                    addi    x2,x2,-16
   8:   853e                    mv  x10,x15
   a:   c606                    sw  x1,12(x2)
   c:   02f02023            sw  x15,32(x0) # 20 <x>
  10:   10000097            auipc   x1,0x10000
  14:   ff0080e7            jalr    -16(x1) # 10000000 <more_fun>
  18:   40b2                    lw  x1,12(x2)
  1a:   050d                    addi    x10,x10,3
  1c:   0141                    addi    x2,x2,16
  1e:   8082                    ret

Disassembly of section .bss:

00000020 <x>:
  20:   0000                    unimp
    ...

Disassembly of section .data:

00000024 <y>:
  24:   0005                    c.nop   1
    ...

Disassembly of section .mfun:

10000000 <more_fun>:
10000000:   0000006f            j   10000000 <more_fun>

For -fpie that works fine: the far call becomes an auipc/jalr pair. And -fpic does not change it, despite its different assumptions.

la x5,hello
la x6,world

.data

hello: .word 0x1
world: .word 0x2

MEMORY
{
    mem0 : ORIGIN = 0x00000000, LENGTH = 0x1000
    mem1 : ORIGIN = 0x10004000, LENGTH = 0x1000
}
SECTIONS
{
    .text   : { *(.text*)   } > mem0
    .data   : { *(.data*)   } > mem1
}

Disassembly of section .text:

00000000 <.text>:
   0:   10004297            auipc   x5,0x10004
   4:   00028293            mv  x5,x5
   8:   10004317            auipc   x6,0x10004
   c:   ffc30313            addi    x6,x6,-4 # 10004004 <world>

Disassembly of section .data:

10004000 <hello>:
10004000:   0001                    .2byte  0x1
    ...

10004004 <world>:
10004004:   0002                    .2byte  0x2
    ...

Disassembly of section .text:

00000000 <.text>:
   0:   10004297            auipc   x5,0x10004
   4:   00c2a283            lw  x5,12(x5) # 1000400c <_GLOBAL_OFFSET_TABLE_+0x4>
   8:   10004317            auipc   x6,0x10004
   c:   00832303            lw  x6,8(x6) # 10004010 <_GLOBAL_OFFSET_TABLE_+0x8>

Disassembly of section .data:

10004000 <hello>:
10004000:   0001                    .2byte  0x1
    ...

10004004 <world>:
10004004:   0002                    .2byte  0x2
    ...

Disassembly of section .got:

10004008 <_GLOBAL_OFFSET_TABLE_>:
10004008:   0000                    .2byte  0x0
1000400a:   0000                    .2byte  0x0
1000400c:   4000                    .2byte  0x4000
1000400e:   1000                    .2byte  0x1000
10004010:   4004                    .2byte  0x4004
10004012:   1000                    .2byte  0x1000

Which of these two results you get depends on how you build that assembly language file.

Do I expect llvm to work exactly the same? Nope, I would personally go through the exercises before attempting to use that tool.

In general the parts of the toolchain (compiler, assembler, linker) work together; they pretty much have to. The compiler or even the assembler will generate what it can with what it sees for that one object, or within one optimization domain. Then the linker does its job, which depending on the ISA may mean modifying individual instructions or filling in addresses or offsets in a pool or table to resolve all the externals. Segment locations are external as well, since they are not known at compile/assemble time. Beyond that you can get into link-time optimization, or llvm has bytecode optimization between the frontend and backend that you can play with.

You have to know which items have to be pc-relative to each other, and from that which items can move. Take .text relative to .data, for example: you can move .text and not .data, move both, or move .data without moving .text. For some of those situations the distance from .text to .got has to stay fixed, but that is under your control.

If this is a bootloader situation then the loaded program is going into RAM, not split between some flash/ROM and some RAM, so you can lump it all into one memory space and not have a .got, or you can break it up and do the extra work, and so on.

The concept and construction is similar for other instruction sets too, the specific details may vary, but the tools have to work together generating the right instructions, right EXTRA instructions, or .pool or other so that the linker can patch it all together modifying instructions or pool/table data.

The risc-v documents are about the worst I have seen in my career, the information we need seems to be there, but the organization and ability to find things is dreadful.

AUIPC (add upper immediate to pc) is used to build pc-relative addresses and uses the U-type format. AUIPC forms a 32-bit offset from the 20-bit U-immediate, filling in the lowest 12 bits with zeros, adds this offset to the address of the AUIPC instruction, then places the result in register rd.

This is basically how we do (big) pc relative work in risc-v. The lower bits being zeroed out save having to do that ourselves or the linker having to do extra work with the offset in the following instruction(s). And as with most things you let the tools do the address work, you do not want to be counting instructions/bytes between things. And that address work is sometimes the compiler sometimes the assembler and sometimes the linker or a combination.

(I just did this .got thing yesterday or the day before here, and the tools were combining some data to make fewer entries in the .got, which is obviously a good thing. Could you imagine a program with a lot of globals or static locals? Position independence already adds enough overhead to the binary/data, but that would be...wow)