Is it called during the execution of an instruction itself (after it has been fetched & decoded)?
Or does it happen beforehand (but how would that be possible before we know that something is a (virtual) address?..)
Or does the CPU access the MMU at some other moment?..
My basic understanding is that the CPU decodes an instruction, gets a virtual address, calls the MMU to convert it to a physical one, and then operates (loads/stores, etc.) on the physical address.
Is it correct?
Is it called during the execution of an instruction itself (after it has been fetched & decoded)?
For the original 80386 (if paging is enabled), whenever the CPU needs to access virtual memory it would use page directories and then page tables to determine the correct physical address(es). E.g. for an instruction like "mov eax,[ebx]" the CPU would convert the linear address of the instruction into a physical address, then fetch the instruction, then convert the address in ebx into a physical address and fetch the data. Worse; if something is split across a page boundary (e.g. the first half of an instruction at the end of one page, the last half of the same instruction at the start of the next page) the CPU might need to convert 2 linear addresses into physical addresses to get both pieces.
This seems relatively slow now; but back then the speed of cache (and the speed of RAM) was closer to the speed of the CPU, so it wasn't as bad as it seems.
As CPUs got faster, the cost of converting linear addresses into physical addresses became more of a problem. For the 80486 there were 2 changes to help with this - caches were added to the CPU itself (instead of being an external part on the motherboard) to make cache access faster, and TLBs (Translation Look-aside Buffers) were added to cache previously performed "linear address to physical address" translations. The TLBs helped a lot - if there's a "TLB hit" the CPU can skip accessing the page directory (in cache or RAM) and skip accessing the page table (in cache or RAM). Of course if there's a "TLB miss" it will still have to do all the work to convert a linear address into a physical address (and cache the translation in the TLB).
As time went by the fundamental concept remained the same but everything grew - more and larger caches, more and larger TLBs. Then (for 64-bit) the number of levels of tables grew to 4 (page map level 4, page directory pointer table, page directory, page table) and the cost of a "TLB miss" increased. To help cope with that, Intel added higher-level translation caches (e.g. to cache page directory pointer table entries) so that if there's a TLB miss but a "higher level translation cache hit" it can avoid some of the memory accesses (e.g. avoid fetching from the PML4 and page directory).
Intel also started getting more advanced with the L1 instruction cache - adding extra tags to each cache line so the CPU could skip the translation for instruction fetches.
I should also point out that a lot of this happens in parallel while the CPU is also doing other things (the CPU could be fetching some instructions while decoding other instructions while doing linear-to-physical address translation for other instructions while ...); which also helps to hide the cost of linear-to-physical address translation.