The RISC-V instruction auipc does rd = (imm << 12) + PC, where rd is the destination register and imm is a 20-bit immediate.
The result of the above instruction depends on the address at which the binary is running. Suppose a system uses a bootloader to boot a firmware image. In that case, the initial PC for the firmware image will be different from 0x0. The linker script reflects this with something like:
.text :
{
_text = .;
*(.text)
_etext = .;
} > FW_IMG
with FW_IMG being something like:
FW_IMG (rx): ORIGIN = 2048, LENGTH = 2304
My question is, how can this work?
I mean, suppose a 32-bit CPU, and that the 4th instruction the compiler generates is an auipc. Suppose the FW image is to be placed at address 0x20000000; then the PC will be 0x20000000 + 12 (the 4th instruction, assuming 4-byte instructions). Will the compiler be aware of this so that it generates the right values for that auipc instruction?
A good example of this is la. la is a pseudo-instruction that expands to an auipc and an addi. If the compiler generates code to load a symbol, the generated instructions will differ depending on where the image is to be located at runtime.
I have tried building the same image with two completely different linker scripts, with an la as the first instruction. The generated auipc instructions are indeed different in each case, and they calculate the right address.
The only explanation I can find is that, somehow, the assembler generates auipc 'placeholders' and the linker then fills them in with the right values.
Let us ask the toolchain.
so.c
unsigned int x;
unsigned int y=5;
unsigned int more_fun ( unsigned int );
unsigned int fun ( unsigned int a )
{
x=a+y;
return(more_fun(x)+3);
}
start.s
.globl more_fun
more_fun:
j .
so.ld
MEMORY
{
mem0 : ORIGIN = 0x00003000, LENGTH = 0x1000
mem1 : ORIGIN = 0x00004000, LENGTH = 0x1000
mem2 : ORIGIN = 0x00005000, LENGTH = 0x1000
mem3 : ORIGIN = 0x00006000, LENGTH = 0x1000
}
SECTIONS
{
.text : { *(.text*) } > mem0
.rodata : { *(.rodata*) } > mem1
.bss : { *(.bss*) } > mem2
.data : { *(.data*) } > mem3
.got : { *(.got*) } > mem0
}
There is no reason at this time for this to be an actually functioning program.
position dependent
Disassembly of section .text:
00003000 <more_fun>:
3000: 0000006f j 3000 <more_fun>
00003004 <fun>:
3004: 6799 lui x15,0x6
3006: 0007a783 lw x15,0(x15) # 6000 <y>
300a: 1141 addi x2,x2,-16
300c: c606 sw x1,12(x2)
300e: 953e add x10,x10,x15
3010: 6795 lui x15,0x5
3012: 00a7a023 sw x10,0(x15) # 5000 <x>
3016: 37ed jal 3000 <more_fun>
3018: 40b2 lw x1,12(x2)
301a: 050d addi x10,x10,3
301c: 0141 addi x2,x2,16
301e: 8082 ret
Disassembly of section .sbss:
00005000 <x>:
5000: 0000 unimp
...
Disassembly of section .sdata:
00006000 <y>:
6000: 0005 c.nop 1
position independent
Disassembly of section .text:
00003000 <more_fun>:
3000: 0000006f j 3000 <more_fun>
00003004 <fun>:
3004: 00000797 auipc x15,0x0
3008: 02c7a783 lw x15,44(x15) # 3030 <_GLOBAL_OFFSET_TABLE_+0x8>
300c: 439c lw x15,0(x15)
300e: 1141 addi x2,x2,-16
3010: c606 sw x1,12(x2)
3012: 953e add x10,x10,x15
3014: 00000797 auipc x15,0x0
3018: 0187a783 lw x15,24(x15) # 302c <_GLOBAL_OFFSET_TABLE_+0x4>
301c: c388 sw x10,0(x15)
301e: 37cd jal 3000 <more_fun>
3020: 40b2 lw x1,12(x2)
3022: 050d addi x10,x10,3
3024: 0141 addi x2,x2,16
3026: 8082 ret
Disassembly of section .bss:
00005000 <x>:
5000: 0000 unimp
...
Disassembly of section .data:
00006000 <y>:
6000: 0005 c.nop 1
...
Disassembly of section .got:
00003028 <_GLOBAL_OFFSET_TABLE_>:
3028: 0000 unimp
302a: 0000 unimp
302c: 5000 lw x8,32(x8)
302e: 0000 unimp
3030: 6000 flw f8,0(x8)
3032: 0000 unimp
3034: ffff .2byte 0xffff
3036: ffff .2byte 0xffff
3038: 0000 unimp
...
AUIPC (add upper immediate to pc) is used to build pc-relative addresses and uses the U-type format. AUIPC forms a 32-bit offset from the 20-bit U-immediate, filling in the lowest 12 bits with zeros, adds this offset to the address of the AUIPC instruction, then places the result in register rd.
In this case I put the .got in the same memory region as .text (mem0), so the auipc immediate is 0 and no major adjustment is needed here. Get to the got, use the got to get to the data.
MEMORY
{
mem0 : ORIGIN = 0x00003000, LENGTH = 0x1000
mem1 : ORIGIN = 0x00004000, LENGTH = 0x1000
mem2 : ORIGIN = 0x00005000, LENGTH = 0x1000
mem3 : ORIGIN = 0x00006000, LENGTH = 0x1000
}
SECTIONS
{
.text : { *(.text*) } > mem0
.rodata : { *(.rodata*) } > mem1
.bss : { *(.bss*) } > mem2
.data : { *(.data*) } > mem3
}
Disassembly of section .text:
00003000 <more_fun>:
3000: 0000006f j 3000 <more_fun>
00003004 <fun>:
3004: 00003797 auipc x15,0x3
3008: 0087a783 lw x15,8(x15) # 600c <_GLOBAL_OFFSET_TABLE_+0x8>
300c: 439c lw x15,0(x15)
300e: 1141 addi x2,x2,-16
3010: c606 sw x1,12(x2)
3012: 953e add x10,x10,x15
3014: 00003797 auipc x15,0x3
3018: ff47a783 lw x15,-12(x15) # 6008 <_GLOBAL_OFFSET_TABLE_+0x4>
301c: c388 sw x10,0(x15)
301e: 37cd jal 3000 <more_fun>
3020: 40b2 lw x1,12(x2)
3022: 050d addi x10,x10,3
3024: 0141 addi x2,x2,16
3026: 8082 ret
Disassembly of section .bss:
00005000 <x>:
5000: 0000 unimp
...
Disassembly of section .data:
00006000 <y>:
6000: 0005 c.nop 1
...
Disassembly of section .got:
00006004 <_GLOBAL_OFFSET_TABLE_>:
6004: 0000 unimp
6006: 0000 unimp
6008: 5000 lw x8,32(x8)
600a: 0000 unimp
600c: 6000 flw f8,0(x8)
...
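Apparently the linker tacks the .got onto the end of .data if the script does not place it. But it is all good: auipc x15,0x3 adds 0x3000 to the pc (0x3004) to get 0x6004, and the lw offset of 8 then reaches the got entry at 0x600c.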
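To check that arithmetic against the listings without counting by hand, here is a minimal C model of an auipc followed by a load with a 12-bit offset. The function names are mine (hypothetical, not from any toolchain), and it assumes RV32, so everything wraps modulo 2^32.

```c
#include <assert.h>
#include <stdint.h>

/* Model of AUIPC on RV32: rd = pc + (imm20 << 12).
   imm20 is the raw 20-bit U-immediate as printed by the disassembler. */
uint32_t auipc(uint32_t pc, uint32_t imm20)
{
    assert(imm20 <= 0xFFFFFu);           /* only 20 bits are encodable */
    return pc + (imm20 << 12);           /* unsigned math wraps modulo 2^32 */
}

/* Address reached by the instruction that follows the auipc
   (lw/sw/addi), which adds a signed 12-bit offset to rd. */
uint32_t auipc_pair_target(uint32_t pc, uint32_t imm20, int32_t lo12)
{
    assert(lo12 >= -2048 && lo12 <= 2047);
    return auipc(pc, imm20) + (uint32_t)lo12;
}
```

Feeding it the values from the disassembly, auipc_pair_target(0x3004, 0x3, 8) gives 0x600c and auipc_pair_target(0x3014, 0x3, -12) gives 0x6008, the two got entries. Move the pc to 0x20003004 and the first result moves to 0x2000600c: same distance, different base.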
The call to more_fun is a pc-relative offset.
The jump and link (JAL) instruction uses the J-type format, where the J-immediate encodes a signed offset in multiples of 2 bytes. The offset is sign-extended and added to the address of the jump instruction to form the jump target address. Jumps can therefore target a ±1 MiB range. JAL stores the address of the instruction following the jump (pc+4) into register rd. The standard software calling convention uses x1 as the return address register and x5 as an alternate link register.
So until the program gets very big (or you play linker games to make function calls far apart) that all works.
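As a rough sketch of what "very big" means here, the range check a linker would have to make before using a plain jal can be modeled like this. The function name is hypothetical, and it assumes the usual two's complement conversion from unsigned to signed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* JAL's J-immediate is a signed 21-bit byte offset whose bit 0 is
   implied zero, so the reachable offsets are -2^20 .. 2^20 - 2,
   relative to the address of the jal itself. */
bool jal_reaches(uint32_t pc, uint32_t target)
{
    /* Signed distance; relies on two's complement conversion. */
    int32_t offset = (int32_t)(target - pc);

    return offset >= -(1 << 20)
        && offset <= (1 << 20) - 2
        && (offset & 1) == 0;            /* targets are 2-byte aligned */
}
```

jal_reaches(0x3016, 0x3000) is true, which is why the call to more_fun fits in a single jal. A call from 0x10 to 0x10000000 is out of range, so something longer than a jal is needed for it.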
Here is the thing about position independence: think of the binary as a blob. If you load the binary above at 0x3000 then .data is at 0x6000, 0x3000 bytes away. But if you load at 0x20003000 then .data is at 0x20006000, which is still 0x3000 bytes away.
But, you have to update the got
600c: 0x20006000
But that is the whole point. You isolate the address of every global (or group of them) and put it in a table. Then if you want to relocate the program elsewhere, you or the loader of the program has to find and change the entries in the got. In this case, add 0x20000000 to all of them. Then the code all works.
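The adjustment itself is tiny. A minimal sketch in C, assuming 32-bit got entries and assuming the caller already knows where the table is and how many entries it wants patched (relocate_got is a made-up name, not a toolchain function):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Add the load offset (runtime address minus link-time address) to
   each 32-bit entry of a got-like table. Which entries to include,
   for example whether to skip a reserved first slot, is left to the
   caller in this sketch. */
void relocate_got(uint32_t *got, size_t entries, uint32_t delta)
{
    assert(got != NULL || entries == 0);

    for (size_t i = 0; i < entries; i++) {
        got[i] += delta;                 /* wraps modulo 2^32 */
    }
}
```

With the two entries from the listing (0x5000 for x and 0x6000 for y) and a delta of 0x20000000, the table ends up holding 0x20005000 and 0x20006000.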
Now take a bootloader situation, where you are probably not an operating system parsing an ELF file:
MEMORY
{
mem0 : ORIGIN = 0x00000000, LENGTH = 0x1000
}
SECTIONS
{
.text : { *(.text*) } > mem0
.rodata : { *(.rodata*) } > mem0
.bss : { *(.bss*) } > mem0
.data : { *(.data*) } > mem0
}
Disassembly of section .text:
00000000 <more_fun>:
0: 0000006f j 0 <more_fun>
00000004 <fun>:
4: 00000797 auipc x15,0x0
8: 0347a783 lw x15,52(x15) # 38 <_GLOBAL_OFFSET_TABLE_+0x8>
c: 439c lw x15,0(x15)
e: 1141 addi x2,x2,-16
10: c606 sw x1,12(x2)
12: 953e add x10,x10,x15
14: 00000797 auipc x15,0x0
18: 0207a783 lw x15,32(x15) # 34 <_GLOBAL_OFFSET_TABLE_+0x4>
1c: c388 sw x10,0(x15)
1e: 37cd jal 0 <more_fun>
20: 40b2 lw x1,12(x2)
22: 050d addi x10,x10,3
24: 0141 addi x2,x2,16
26: 8082 ret
Disassembly of section .bss:
00000028 <x>:
28: 0000 unimp
...
Disassembly of section .data:
0000002c <y>:
2c: 0005 c.nop 1
...
Disassembly of section .got:
00000030 <_GLOBAL_OFFSET_TABLE_>:
30: 0000 unimp
32: 0000 unimp
34: 0028 addi x10,x2,8
36: 0000 unimp
38: 002c addi x11,x2,8
...
In your bootstrap you would auipc x15,0 to get the pc then you would use normal (linker plus programming) techniques to get the offset to and size of the got. And you would make the adjustment to each entry yourself before running code that relies on the .got to find the data.
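A sketch of the bookkeeping that bootstrap would do, written as plain C arithmetic. All the names are hypothetical: runtime_pc stands for what auipc returned, link_pc for the address the linker assigned to that same instruction, and link_got_start/link_got_end for symbols you would define around the .got in your own linker script.

```c
#include <assert.h>
#include <stdint.h>

struct got_window {
    uint32_t delta;    /* runtime address minus link-time address */
    uint32_t start;    /* runtime address of the first .got entry */
    uint32_t entries;  /* number of 32-bit entries in the table   */
};

/* Work out where the got actually lives and how far the image moved. */
struct got_window locate_got(uint32_t runtime_pc, uint32_t link_pc,
                             uint32_t link_got_start, uint32_t link_got_end)
{
    struct got_window w;

    assert(link_got_end >= link_got_start);
    w.delta   = runtime_pc - link_pc;
    w.start   = link_got_start + w.delta;
    w.entries = (link_got_end - link_got_start) / 4u;
    return w;
}
```

For the image linked at 0 above, with the got occupying 0x30 through 0x3b, loading at 0x20000000 gives a delta of 0x20000000, a table starting at 0x20000030, and 3 entries to consider.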
Could the toolchain do this without a got?
Sure, but...
mem0 : ORIGIN = 0x10000000, LENGTH = 0x1000
Disassembly of section .text:
10000000 <more_fun>:
10000000: 0000006f j 10000000 <more_fun>
10000004 <fun>:
10000004: 00000797 auipc x15,0x0
10000008: 0287a783 lw x15,40(x15) # 1000002c <y>
1000000c: 97aa add x15,x15,x10
1000000e: 1141 addi x2,x2,-16
10000010: 853e mv x10,x15
10000012: c606 sw x1,12(x2)
10000014: 00000717 auipc x14,0x0
10000018: 00f72a23 sw x15,20(x14) # 10000028 <x>
1000001c: 37d5 jal 10000000 <more_fun>
1000001e: 40b2 lw x1,12(x2)
10000020: 050d addi x10,x10,3
10000022: 0141 addi x2,x2,16
10000024: 8082 ret
Disassembly of section .bss:
10000028 <x>:
10000028: 0000 unimp
...
Disassembly of section .data:
1000002c <y>:
1000002c: 0005 c.nop 1
...
But this
mem0 : ORIGIN = 0x00000000, LENGTH = 0x1000
created an optimization I did not want.
Disassembly of section .text:
00000000 <more_fun>:
0: 0000006f j 0 <more_fun>
00000004 <fun>:
4: 02402783 lw x15,36(x0) # 24 <y>
8: 97aa add x15,x15,x10
a: 1141 addi x2,x2,-16
c: 853e mv x10,x15
e: c606 sw x1,12(x2)
10: 02f02023 sw x15,32(x0) # 20 <x>
14: 37f5 jal 0 <more_fun>
16: 40b2 lw x1,12(x2)
18: 050d addi x10,x10,3
1a: 0141 addi x2,x2,16
1c: 8082 ret
Disassembly of section .bss:
00000020 <x>:
20: 0000 unimp
...
Disassembly of section .data:
00000024 <y>:
24: 0005 c.nop 1
...
I wanted this position independence
10000004: 00000797 auipc x15,0x0
10000008: 0287a783 lw x15,40(x15) # 1000002c <y>
but despite asking for position independence I got this which is position dependent.
4: 02402783 lw x15,36(x0) # 24 <y>
The difference is -fpic vs -fpie. You probably want -fpie to make life much easier, but as shown you need to know the tools. The tools know how to do it, but we seem to be able to trip them up.
This one bothered me and delayed even writing this answer.
MEMORY
{
mem0 : ORIGIN = 0x10003000, LENGTH = 0x1000
mem1 : ORIGIN = 0x20004000, LENGTH = 0x1000
mem2 : ORIGIN = 0x30005000, LENGTH = 0x1000
mem3 : ORIGIN = 0x40006000, LENGTH = 0x1000
}
SECTIONS
{
.text : { *(.text*) } > mem0
.rodata : { *(.rodata*) } > mem1
.bss : { *(.bss*) } > mem2
.data : { *(.data*) } > mem3
}
Disassembly of section .text:
10003000 <more_fun>:
10003000: 0000006f j 10003000 <more_fun>
10003004 <fun>:
10003004: 30003797 auipc x15,0x30003
10003008: 0087a783 lw x15,8(x15) # 4000600c <_GLOBAL_OFFSET_TABLE_+0x8>
1000300c: 439c lw x15,0(x15)
1000300e: 1141 addi x2,x2,-16
10003010: c606 sw x1,12(x2)
10003012: 953e add x10,x10,x15
10003014: 30003797 auipc x15,0x30003
10003018: ff47a783 lw x15,-12(x15) # 40006008 <_GLOBAL_OFFSET_TABLE_+0x4>
1000301c: c388 sw x10,0(x15)
1000301e: 37cd jal 10003000 <more_fun>
10003020: 40b2 lw x1,12(x2)
10003022: 050d addi x10,x10,3
10003024: 0141 addi x2,x2,16
10003026: 8082 ret
Disassembly of section .bss:
30005000 <x>:
30005000: 0000 unimp
...
Disassembly of section .data:
40006000 <y>:
40006000: 0005 c.nop 1
...
Disassembly of section .got:
40006004 <_GLOBAL_OFFSET_TABLE_>:
40006004: 0000 unimp
40006006: 0000 unimp
40006008: 5000 lw x8,32(x8)
4000600a: 3000 fld f8,32(x8)
4000600c: 6000 flw f8,0(x8)
4000600e: 4000 lw x8,0(x8)
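LOL, I thought this was completely broken, but now I see: the disassembler broke each 32-bit got entry into two 16-bit values (low half first, since this is little endian), so the entries really are 0x30005000 and 0x40006000 and the code is actually going to the right places. Whew.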
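If you want to read a dump like that without squinting, this small C helper (my own name, nothing standard) glues the 16-bit pieces back together, assuming little-endian order as in the listing:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Rebuild 32-bit entry `index` from a little-endian dump that lists
   each word as two 16-bit values, low half first. */
uint32_t got_entry(const uint16_t *halves, size_t count, size_t index)
{
    assert(2 * index + 1 < count);       /* both halves must be present */
    return ((uint32_t)halves[2 * index + 1] << 16) | halves[2 * index];
}
```

Given the six halfwords from the .got dump (0x0000, 0x0000, 0x5000, 0x3000, 0x6000, 0x4000), entry 1 comes out as 0x30005000 and entry 2 as 0x40006000.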
And just to confirm:
.section .mfun
.globl more_fun
more_fun:
j .
MEMORY
{
mem0 : ORIGIN = 0x00000000, LENGTH = 0x1000
mem1 : ORIGIN = 0x20004000, LENGTH = 0x1000
mem2 : ORIGIN = 0x30005000, LENGTH = 0x1000
mem3 : ORIGIN = 0x40006000, LENGTH = 0x1000
mem4 : ORIGIN = 0x10000000, LENGTH = 0x1000
}
SECTIONS
{
.text : { *(.text*) } > mem0
.rodata : { *(.rodata*) } > mem0
.bss : { *(.bss*) } > mem0
.data : { *(.data*) } > mem0
.mfun : { *(.mfun*) } > mem4
}
Disassembly of section .text:
00000000 <fun>:
0: 02402783 lw x15,36(x0) # 24 <y>
4: 97aa add x15,x15,x10
6: 1141 addi x2,x2,-16
8: 853e mv x10,x15
a: c606 sw x1,12(x2)
c: 02f02023 sw x15,32(x0) # 20 <x>
10: 10000097 auipc x1,0x10000
14: ff0080e7 jalr -16(x1) # 10000000 <more_fun>
18: 40b2 lw x1,12(x2)
1a: 050d addi x10,x10,3
1c: 0141 addi x2,x2,16
1e: 8082 ret
Disassembly of section .bss:
00000020 <x>:
20: 0000 unimp
...
Disassembly of section .data:
00000024 <y>:
24: 0005 c.nop 1
...
Disassembly of section .mfun:
10000000 <more_fun>:
10000000: 0000006f j 10000000 <more_fun>
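For fpie that works fine: the call to more_fun, now far away in its own section, becomes an auipc/jalr pair. And fpic does not change it, based on different assumptions.
Now the la case from the question: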
la x5,hello
la x6,world
.data
hello: .word 0x1
world: .word 0x2
MEMORY
{
mem0 : ORIGIN = 0x00000000, LENGTH = 0x1000
mem1 : ORIGIN = 0x10004000, LENGTH = 0x1000
}
SECTIONS
{
.text : { *(.text*) } > mem0
.data : { *(.data*) } > mem1
}
Disassembly of section .text:
00000000 <.text>:
0: 10004297 auipc x5,0x10004
4: 00028293 mv x5,x5
8: 10004317 auipc x6,0x10004
c: ffc30313 addi x6,x6,-4 # 10004004 <world>
Disassembly of section .data:
10004000 <hello>:
10004000: 0001 .2byte 0x1
...
10004004 <world>:
10004004: 0002 .2byte 0x2
...
or
Disassembly of section .text:
00000000 <.text>:
0: 10004297 auipc x5,0x10004
4: 00c2a283 lw x5,12(x5) # 1000400c <_GLOBAL_OFFSET_TABLE_+0x4>
8: 10004317 auipc x6,0x10004
c: 00832303 lw x6,8(x6) # 10004010 <_GLOBAL_OFFSET_TABLE_+0x8>
Disassembly of section .data:
10004000 <hello>:
10004000: 0001 .2byte 0x1
...
10004004 <world>:
10004004: 0002 .2byte 0x2
...
Disassembly of section .got:
10004008 <_GLOBAL_OFFSET_TABLE_>:
10004008: 0000 .2byte 0x0
1000400a: 0000 .2byte 0x0
1000400c: 4000 .2byte 0x4000
1000400e: 1000 .2byte 0x1000
10004010: 4004 .2byte 0x4004
10004012: 1000 .2byte 0x1000
Depending on how you build it from that assembly language file.
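The values in the first listing (auipc plus addi) come from splitting the pc-relative distance into an upper 20-bit part and a signed lower 12-bit part. Here is a sketch of that split in C. The struct and function names are mine, and the rounding by 0x800 is the usual trick to compensate for the low part being sign-extended, not something lifted from the toolchain sources.

```c
#include <assert.h>
#include <stdint.h>

struct pcrel_pair {
    uint32_t hi20;   /* U-immediate for the auipc                 */
    int32_t  lo12;   /* signed 12-bit immediate for addi/lw/sw    */
};

/* Split (target - pc) so that pc + (hi20 << 12) + lo12 == target. */
struct pcrel_pair split_pcrel(uint32_t pc, uint32_t target)
{
    uint32_t offset = target - pc;       /* wraps modulo 2^32 */
    struct pcrel_pair p;

    /* Round up when the low 12 bits would sign-extend negative. */
    p.hi20 = ((offset + 0x800u) >> 12) & 0xFFFFFu;
    p.lo12 = (int32_t)(offset & 0xFFFu);
    if (p.lo12 >= 0x800) {
        p.lo12 -= 0x1000;
    }

    assert(pc + (p.hi20 << 12) + (uint32_t)p.lo12 == target);
    return p;
}
```

For the second la (pc 8, world at 0x10004004) this yields hi20 = 0x10004 and lo12 = -4, which is the auipc x6,0x10004 / addi x6,x6,-4 pair in the disassembly.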
Do I expect llvm to work exactly the same? Nope, I would personally go through the exercises before attempting to use that tool.
In general the toolchain (compiler, assembler, linker) works together; it pretty much has to. The compiler or even the assembler will generate what it can with what it sees for that one object, or within one optimization domain. Then the linker does its job, which depending on the ISA may mean modifying individual instructions or filling in addresses or offsets in a pool or similar to resolve all the externals. Segment locations count as externals too, since they are not known at compile/assemble time. But then you can get into link time optimization, or llvm has bytecode optimization between the frontend and backend that you can play with.
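To make the "modify individual instructions" part concrete, here is a sketch of the bit surgery involved on RISC-V: the assembler emits the instruction with a zero immediate plus a relocation record (as far as I know the relevant ones for this case are R_RISCV_PCREL_HI20 and R_RISCV_PCREL_LO12_I), and the linker later drops the real value into the immediate field. The function names are hypothetical, and this is a simplified model, not the linker's actual code.

```c
#include <assert.h>
#include <stdint.h>

/* Replace the U-immediate (bits 31:12) of a U-type instruction such
   as auipc, keeping the opcode and rd in bits 11:0. */
uint32_t patch_utype(uint32_t insn, uint32_t hi20)
{
    assert(hi20 <= 0xFFFFFu);
    return (insn & 0x00000FFFu) | (hi20 << 12);
}

/* Replace the I-immediate (bits 31:20) of an I-type instruction such
   as addi or lw, keeping rs1, funct3, rd and opcode in bits 19:0. */
uint32_t patch_itype(uint32_t insn, int32_t lo12)
{
    assert(lo12 >= -2048 && lo12 <= 2047);
    return (insn & 0x000FFFFFu) | (((uint32_t)lo12 & 0xFFFu) << 20);
}
```

Starting from the placeholder encodings 0x00000317 (auipc x6,0) and 0x00030313 (addi x6,x6,0), patching in 0x10004 and -4 produces 0x10004317 and 0xffc30313, the same words the la x6,world disassembly shows.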
You have to know which items have to be pc-relative to each other, and from that which items can move. Take .text relative to .data, for example: you can move .text and not .data, or move both, or move .data without moving .text. But the distance from .text to .got has to be fixed for some of those situations, and that is under your control.
If this is a bootloader situation, then the loaded program is going entirely into RAM, not split between some flash/ROM and some RAM, so you can lump it all into one memory space and not have a .got, or you can break it up and do the extra work.
The concept and construction is similar for other instruction sets too, the specific details may vary, but the tools have to work together generating the right instructions, right EXTRA instructions, or .pool or other so that the linker can patch it all together modifying instructions or pool/table data.
The risc-v documents are about the worst I have seen in my career, the information we need seems to be there, but the organization and ability to find things is dreadful.
AUIPC (add upper immediate to pc) is used to build pc-relative addresses and uses the U-type format. AUIPC forms a 32-bit offset from the 20-bit U-immediate, filling in the lowest 12 bits with zeros, adds this offset to the address of the AUIPC instruction, then places the result in register rd.
This is basically how we do (big) pc relative work in risc-v. The lower bits being zeroed out save having to do that ourselves or the linker having to do extra work with the offset in the following instruction(s). And as with most things you let the tools do the address work, you do not want to be counting instructions/bytes between things. And that address work is sometimes the compiler sometimes the assembler and sometimes the linker or a combination.
(I just did this .got thing yesterday or the day before here, and the tools were combining some data to make fewer entries in the .got, which is obviously a good thing. Could you imagine a program with a lot of globals or static locals? Position independence already adds enough overhead to the binary/data, but that would be...wow)