I am trying to figure out whether it is possible to run a Linux VM whose RAM is only backed by a single physical page.
To simulate this, I modified the nested page fault handler in KVM to remove the present bit from all nested page table (NPT) entries, except the one corresponding to the currently processed page fault.
While trying to start a Linux guest, I observed that assembly instructions that use memory operands, like
add [rbp+0x820DDA], ebp
lead to a page fault loop until I restore the present bit for the page containing the
instruction as well as for the page referenced in the operand (in this example [rbp+0x820DDA]
).
I am wondering why this is the case. Shouldn't the CPU access the memory pages sequentially, i.e. first read the instruction and then access the memory operand? Or does x86 require that the instruction page as well as all operand pages are accessible at the same time?
I am testing on AMD Zen 1.
Yes, they do require the machine code and all memory operands.
Shouldn't the CPU access the memory pages sequentially, i.e. first read the instruction and then access the memory operand?
Yes that's logically what happens, but a page-fault exception interrupts that 2-step process and discards any progress. The CPU doesn't have any way to remember what instruction it was in the middle of when a page-fault occurred.
When a page-fault handler returns after handling a valid page fault, RIP= the address of the faulting instruction, so the CPU retries executing it from scratch.
It would be legal for the OS to modify the machine code of the faulting instruction and expect it to execute a different instruction after iret
from the page-fault handler (or any other exception or interrupt handler). So AFAIK it's architecturally required that the CPU redoes code-fetch from CS:RIP in the case you're talking about. (Assuming it even does return to the faulting CS:RIP instead of scheduling another process while waiting for disk on hard page fault, or delivering a SIGSEGV to a signal handler on an invalid page fault.)
It's probably also architecturally required for hypervisor entry/exit. And even if it's not explicitly forbidden on paper, it's not how CPUs work.
@torek comments that Some (CISC) microprocessors partially decode instructions and dump microregister state on a page fault, but x86 is not like that. m68k does do that, for example, letting it page-fault in the middle of a memory-indirect store after some but not all bytes of a misaligned store have made it to memory, overwriting the pointer the memory-indirect addressing mode used. (x86 doesn't have memory-indirect addressing, and faults are taken at instruction boundaries.)
A few instructions are interruptible and can make partial progress, like rep movs
(memcpy in a can) and other string instructions, or gather loads/scatter stores. But the only mechanism is updating architectural registers like RCX / RSI / RDI for string ops, or the destination and mask registers for gathers / scatters (such as AVX2 vpgatherdd
). Not keeping the opcode / decode results in some hidden internal register and restarting it after iret
from a page fault handler. These are instructions that do multiple separate data accesses.
The semantics for those instructions updating the architectural register state when a fault happens are such that re-executing from scratch with the new reg values produces the same final result as if it had executed the whole way through without faulting.
Also keep in mind that x86 (like most ISAs) guarantees that instructions are atomic wrt. interrupts and also exceptions: they either fully happen, or don't happen at all, before an interrupt. Interrupting an assembly instruction while it is operating. So for example add [mem], reg
would be required to discard the load if the store part faulted, even without a lock
prefix.
(ISAs like m68k that save/restore microarchitectural state on the stack for partial progress still only take external interrupts at instruction boundaries, but not exceptions like page faults. Apparently the worst-case number of pages a single m68k instruction can touch is 16, but they don't all have to be resident at once, so it's a bit closer to an x86 gather or rep movs where partial progress gets saved.)
The worst case number of guest user-space pages present to make forward progress might be 6 (plus separate guest-kernel page-table subtrees for each one):
movsq
or movsw
2-byte instruction spanning a page boundary, so both pages are needed for it to decode.[rsi]
also a page-split[rdi]
also a page-splitIf any of these 6 pages fault, we're back to square one.
rep movsd
is also a 2-byte instruction, and making progress on one step of it would have the same requirement. Similar cases like push [mem]
or pop [mem]
could be constructed with a misaligned stack.
One of the reasons (or side benefits) for/of making gather loads / scatter stores "interruptible" (updating the mask vector with their progress) is to avoid increasing this minimum footprint to execute a single instruction. Also to improve efficiency of handling multiple faults during one gather or scatter.
As pointed out in comments, the iret
in the guest kernel's page-fault handler presumably also has to present, but if we're talking about nested page tables and the hypervisor playing tricks, it could swap out a guest-kernel page in favour of the 6 guest pages needed for rep movsd
itself to make progress. Those page-faults are to the hypervisor, not to the guest kernel that thinks it has all the pages resident plus its own kernel memory.
@Brandon points out in comments that a guest will need its page tables in memory, and the user-space page splits can also be 1GiB splits so the two sides are in different sub-trees of the top level PML4. HW page walk will need to touch all of these guest page-table pages to make progress. A situation this pathological is unlikely to happen by chance.
The TLB (and page-walker internals) are allowed to cache some of the page-table data, and aren't required to restart page-walk from scratch unless the OS did invlpg
or set a new CR3 top-level page directory. Neither of these are necessary when changing a page from not-present to present; x86 on paper guarantees that it's not needed (so "negative caching" of not-present PTEs isn't allowed, at least not visible to software). So the CPU might not VMexit even if some of the guest-physical page-table pages are not actually present.
PMU performance counters can be enabled and configured such that the instruction also requires a perf event to a write into a PEBS buffer for that instruction. With a counter's mask configured to count only user-space instructions, not kernel, it could well be that it keeps trying to overflow the counter and store a sample in the buffer every time you return to userspace, producing a page-fault.