How does the assembler allocate memory for machine code and know what address to jump to for an external function?

I have been studying the C compilation process but cannot find an answer to this question.

The assembly source and object code are produced before linking. Code like this compiles into a .s or .o file with no problem.

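/* Declared here, defined in another translation unit. */
int add(int a, int b);

int main(void) {
    int tmp = add(1, 2);
    (void)tmp; /* silence the unused-variable warning */
    return 0;
}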

At this stage, neither the compiler nor the assembler knows the length of the add function, yet it seems a memory section must be allocated as text for it. For example, in the assembly output, executing add requires a jump instruction to the add function. How does the assembler know the address of add at this stage? What does the assembler do in this step? Does it reserve some amount of memory for the possible add function?

Solution

The object module contains information telling the linker what changes need to be made when the object module is linked with other object modules.

There are multiple object module formats, so the information in this answer is general and conceptual. Suppose there is a call instruction in the program. In the assembly language, it might look like call add. Say the machine language code for a call instruction is 0x73 followed by the four-byte address of the routine to be called. Then, in the text section in the object module, the assembler would write five bytes: 0x73 0x00 0x00 0x00 0x00. The assembler does not reserve any space for add itself; it only emits a placeholder where the address will go.

There is another part of the object module called a relocation table that lists places that need to be changed when the module is linked. These are often called “fixups.” In this table, the assembler will put information about this call instruction. An entry in this table will provide information about each change to be made, including:

Which program section is to be changed.
Where in the program section the change is (by bytes from its start).
What type of change it is.
The name of the symbol referred to.
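
The assembler's side of this can be sketched in C. This is only an illustration of the idea, not any real object format: the struct layouts, the names reloc, object_module, and emit_call, and the fixed buffer sizes are all hypothetical, and 0x73 is the made-up call opcode from the text above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical "type of change" codes. */
enum reloc_type { RELOC_ABSOLUTE, RELOC_RELATIVE };

/* One relocation table entry with the four fields listed above. */
struct reloc {
    const char *section;   /* which program section is to be changed */
    size_t offset;         /* where the change is, in bytes from the section start */
    enum reloc_type type;  /* what type of change it is */
    const char *symbol;    /* the name of the symbol referred to */
};

/* A toy object module: one text section plus a relocation table. */
struct object_module {
    uint8_t text[64];
    size_t text_len;
    struct reloc relocs[8];
    size_t reloc_count;
};

/* Emit "call <symbol>": the opcode, four placeholder bytes, and a
   relocation entry describing where the address must be filled in. */
void emit_call(struct object_module *m, const char *symbol)
{
    assert(m->text_len + 5 <= sizeof m->text);
    assert(m->reloc_count < sizeof m->relocs / sizeof m->relocs[0]);

    m->text[m->text_len++] = 0x73;      /* made-up call opcode */
    size_t field = m->text_len;         /* offset of the address field */
    for (int i = 0; i < 4; i++)
        m->text[m->text_len++] = 0x00;  /* placeholder for the address */

    struct reloc *r = &m->relocs[m->reloc_count++];
    r->section = "text";
    r->offset = field;
    r->type = RELOC_ABSOLUTE;
    r->symbol = symbol;
}
```

Calling emit_call(&m, "add") when 33 bytes of text have already been emitted puts the opcode at offset 33 and records a fixup at offset 34. Nothing about the size or location of add is needed at this point.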

So, for example, the table entry for call add might indicate that the change is in the text section, 34 bytes from the start, that it is a whole replacement of the address, and that the name of the symbol is add. When the linker decides where add will be located in the program, it writes that address into the text section, 34 bytes from the start.
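
The linker's side of that fixup could look roughly like the following. Again this is a sketch under assumptions: the reloc struct and apply_absolute are hypothetical names, and the four-byte address is assumed to be stored little-endian, which the answer above does not specify.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical "type of change" codes. */
enum reloc_type { RELOC_ABSOLUTE, RELOC_RELATIVE };

/* Hypothetical relocation table entry. */
struct reloc {
    const char *section;   /* section to be changed */
    size_t offset;         /* bytes from the start of that section */
    enum reloc_type type;  /* type of change */
    const char *symbol;    /* symbol referred to */
};

/* Store a 32-bit value as four little-endian bytes (an assumption). */
static void put_le32(uint8_t *p, uint32_t v)
{
    for (int i = 0; i < 4; i++)
        p[i] = (uint8_t)(v >> (8 * i));
}

/* Absolute fixup: overwrite the placeholder with the address the
   linker finally chose for the symbol. */
void apply_absolute(uint8_t *text, size_t text_len,
                    const struct reloc *r, uint32_t symbol_addr)
{
    assert(r->type == RELOC_ABSOLUTE);
    assert(r->offset <= text_len && text_len - r->offset >= 4);
    put_le32(text + r->offset, symbol_addr);
}
```

With the entry { "text", 34, RELOC_ABSOLUTE, "add" } and add placed at the illustrative address 0x00401000, bytes 34 through 37 of the text section become 0x00 0x10 0x40 0x00, and the opcode at offset 33 is left alone.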

That is a simplified example using an absolute address. It is more common these days that a relative address would be used. Instead of taking an absolute address, an instruction will calculate an address as an offset from the current program counter, which contains the address of the instruction the processor is executing or is about to execute. With relative addressing, the relocation entry says the type of change is relative instead of absolute. Then the linker, instead of writing the absolute address into the text section, will calculate the difference between the call destination and the address of the call instruction (or its following instruction) and write that difference into the text section.
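
A relative fixup differs only in the value that gets written. The sketch below assumes the offset is measured from the instruction following the call, that the address field is the last four bytes of the instruction, and that the field is little-endian; real instruction sets vary on all three points. The function names and the text_base parameter are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Difference between the call destination and the address of the
   instruction following the call (assumed to start right after the
   four-byte field). */
int32_t relative_displacement(uint32_t target_addr, uint32_t field_addr)
{
    int64_t next_insn = (int64_t)field_addr + 4;
    int64_t diff = (int64_t)target_addr - next_insn;
    assert(diff >= INT32_MIN && diff <= INT32_MAX);
    return (int32_t)diff;
}

/* Relative fixup: write the displacement, not the address itself.
   text_base is the address the linker assigned to the section start. */
void apply_relative(uint8_t *text, size_t text_len, size_t offset,
                    uint32_t text_base, uint32_t target_addr)
{
    assert(offset <= text_len && text_len - offset >= 4);
    int32_t disp = relative_displacement(target_addr,
                                         text_base + (uint32_t)offset);
    uint32_t bits = (uint32_t)disp;  /* two's complement bit pattern */
    for (int i = 0; i < 4; i++)
        text[offset + i] = (uint8_t)(bits >> (8 * i));
}
```

For instance, if the text section is loaded at 0x1000 and the field is at offset 34, the next instruction starts at 0x1026. A call to a routine at 0x2000 gets the displacement 0xFDA, and a call back to 0x1000 gets -38.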

Other codes for the type of change might indicate that a symbol value has to be added to the data already in the text section, instead of wholly replacing it, or that a value has to be fit into a bit-field in a particular type of instruction or encoded in a particular way. But the general idea is the same: The values in the text section contain initial information, and a relocation table indicates how to change the values when linking the program (possibly including linking that occurs when loading the program into memory).
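
The "added to the data already in the text section" case can be sketched the same way. Here the assembler is assumed to have left a small constant in the field (for example, an offset into an array), and the linker adds the symbol's address to it. The names and the little-endian layout are assumptions for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read four little-endian bytes as a 32-bit value (layout assumed). */
static uint32_t get_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Store a 32-bit value as four little-endian bytes. */
static void put_le32(uint8_t *p, uint32_t v)
{
    for (int i = 0; i < 4; i++)
        p[i] = (uint8_t)(v >> (8 * i));
}

/* Additive fixup: the field already holds initial information, and
   the symbol value is added to it rather than replacing it. */
void apply_additive(uint8_t *text, size_t text_len, size_t offset,
                    uint32_t symbol_value)
{
    assert(offset <= text_len && text_len - offset >= 4);
    uint32_t initial = get_le32(text + offset);
    put_le32(text + offset, initial + symbol_value);
}
```

If the field initially holds 8 and the symbol ends up at 0x5000, the field holds 0x5008 after linking.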