assembly x86 paging virtual-memory osdev

Why is a jump needed after changing `PG` with `mov CR0, ...` when that instruction is not completely serializing?


The Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A, section 9.3 SERIALIZING INSTRUCTIONS, says:

  • When an instruction is executed that enables or disables paging (that is, changes the PG flag in control register CR0), the instruction should be followed by a jump instruction. The target instruction of the jump instruction is fetched with the new setting of the PG flag (that is, paging is enabled or disabled), but the jump instruction itself is fetched with the previous setting. The Pentium 4, Intel Xeon, and P6 family processors do not require the jump operation following the move to register CR0 (because any use of the MOV instruction in a Pentium 4, Intel Xeon, or P6 family processor to write to CR0 is completely serializing). However, to maintain backwards and forward compatibility with code written to run on other IA-32 processors, it is recommended that the jump operation be performed.

"Serializing instructions" will "serialize the instruction execution stream" before they run, to avoid reordering.

Q:

  1. What is the purpose of the "jump instruction" after the special serializing instruction related to paging (i.e. a mov with CR0 as one operand)? Does it imply refreshing the page table or something else?

  2. What does "completely serializing" imply, such that no jump is needed after the "serializing instruction"?

Edited:

In the March 2023 version of the manual (Order Number 325462-079US), jmp is not listed under "Non-privileged serializing instructions" or "Privileged serializing instructions", although it seems that jmp is a serializing instruction. This is where my confusion comes from: the manual suddenly uses the term "jump" in one item of section 9.3 without using it in any other item of that section.

Then, after rereading the manual following the hints in the answers, I found that 10.9.2 Switching Back to Real-Address Mode says:

  1. Execute a far JMP instruction to jump to a real-address mode program. This operation flushes the instruction queue and loads the appropriate base-address value in the CS register.

It also has an asm example in 10.10.2 STARTUP.ASM Listing (I include the line numbers here) showing the "flush" of "the instruction queue":

179 ; clear prefetch queue
180 JMP CLEAR_LABEL

One more small question after reading the answers:

  1. Is jmp a serializing instruction?

Solution

  • "Completely serializing" implies that code prefetch buffers / pipeline stages are discarded, so code-fetch of those instruction bytes is re-done with the new interpretation of CS:EIP (as a virtual address to be translated by the page tables), rather than executing already-fetched instructions using the old linear=physical interpretation. (Or if disabling paging, then the reverse, old being linear=virtual.)

    Without a jump, on old CPUs where mov cr0, reg is non-serializing, execution of later instructions could use the results of code fetches that used the old interpretation of CS:EIP.
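
    As a rough sketch of the sequence under discussion (NASM syntax), assuming the CPU is already in 32-bit protected mode with flat segments and interrupts disabled, and that the code, stack and tables all sit in the first 4 MiB of physical memory. The labels enable_paging, page_directory, page_table and paging_on are made up for illustration; they are not from the manual.

    ```nasm
    bits 32

    section .bss
    alignb 4096
    page_directory: resb 4096        ; 1024 page-directory entries
    page_table:     resb 4096        ; 1024 page-table entries (first 4 MiB)

    section .text
    ; Assumption: called as a function; clobbers EAX, ECX, EDI.
    enable_paging:
        cld
        ; Zero the page directory so every PDE starts out not-present.
        mov  edi, page_directory
        xor  eax, eax
        mov  ecx, 1024
        rep  stosd

        ; Identity-map the first 4 MiB:
        ; PTE i -> physical address i * 4096, Present | Writable.
        mov  edi, page_table
        mov  eax, 0x00000003
        mov  ecx, 1024
    .fill:
        mov  [edi], eax
        add  eax, 0x1000
        add  edi, 4
        dec  ecx
        jnz  .fill

        ; PDE 0 points at the page table (Present | Writable).
        mov  eax, page_table
        or   eax, 0x00000003
        mov  [page_directory], eax

        mov  eax, page_directory
        mov  cr3, eax                ; physical address of the page directory

        mov  eax, cr0
        or   eax, 0x80000000         ; set CR0.PG (bit 31)
        mov  cr0, eax

        jmp  short paging_on         ; the jump the manual recommends
    paging_on:
        ; Per the manual's wording, this target is fetched with PG = 1.
        ret                          ; assumes the stack is inside the mapped 4 MiB
    ```

    Since everything here is identity-mapped, the jump should not change behaviour on CPUs where mov cr0 is already serializing; it costs two bytes of code.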

    (A serializing instruction like cpuid or the new serialize is actually needed even on new CPUs for a case like if (new_code_written_by_another_core) jmp new_code - release/acquire semantics happen for free (no asm fence instructions) on x86 due to its strongly-ordered memory model, but code-fetch is separate from data loads. A thread has to run as-if loads ran in program order, but that only applies to data loads, not code-fetch, so old machine code could have already been fetched before seeing the flag that says new machine code has been stored. If machine code for later instructions might be stale, run a serializing instruction to stop it from being fetched until after all the loads/stores before the serializing instruction. Perhaps this helps in understanding the point of being fully serializing.)
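
    The consumer side of that cross-modifying case might look like the sketch below (NASM syntax, 64-bit). code_ready and new_code are hypothetical symbols assumed to be defined elsewhere: another core writes machine code at new_code and only then stores a non-zero value to code_ready.

    ```nasm
    bits 64
    default rel

    extern code_ready               ; hypothetical flag byte, set by the writer core
    extern new_code                 ; hypothetical buffer holding the new machine code

    section .text
    global run_new_code
    run_new_code:
    .wait:
        pause                       ; spin-wait hint
        cmp  byte [code_ready], 0   ; ordinary data load of the flag
        je   .wait

        xor  eax, eax               ; select CPUID leaf 0; the leaf is irrelevant here
        cpuid                       ; serializing; clobbers EAX, EBX, ECX, EDX
                                    ; (RBX is callee-saved in common ABIs: save it
                                    ; first if the caller still needs it)
        jmp  new_code               ; fetched only after the serializing instruction
    ```

    Without the cpuid, the data load of code_ready would still be ordered, but nothing would stop stale bytes at new_code from having been fetched already.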

    Out-of-order exec CPUs (P6 being Intel's first) don't rename CR0, so writing CR0 had to be special anyway. As well as PG, it contains other critical bits like PE (protected-mode enable, which changes the semantics of mov Sreg, r/m16).
    Making mov cr0, src also flush the front-end makes the old mov/jmp sequence still guarantee everything that was previously guaranteed for enabling/disabling paging, but puts the cost in the mov to CR0. So jmp can do branch prediction and speculative execution without any checks for a recently-changed-CR0, which is desirable since jmp is widely used in normal code.

    jmp/jcc/loop/call aren't "serializing instructions" in the x86 technical-terminology sense, but on older in-order CPUs they had semantics for self-modifying code based on how 8086 discarded the prefetch queue on every jump. (Only self-modifying, not cross-modifying from another thread, though. Modern x86 has stronger SMC detection that avoids making jmp special at all, because that was the highest-performance way to still satisfy the requirements in the paper manuals and be compatible with older CPUs.) See also "Is there a cheaper serializing instruction than cpuid?" re: true serializing instructions vs. ordering other things (like lfence ordering execution but not code-fetch or the store buffer).


    Normally the code to enable or disable paging is in an identity-mapped page (virtual = physical address), so it shouldn't matter much (if at all?) whether or not you jump right away. Unless your code is near the end of that page and the next one isn't identity-mapped. Or unless there's some interaction with other machine-state changing instructions you might run later, something other than "stale" code fetch that's relevant. I can't think of any, but maybe I'm overlooking something. Like perhaps setting the Accessed bit in the page-table entry for this code page right away, which wouldn't happen with non-paged accesses. That's not important for most code, but Intel doesn't want to make any assumptions.

    Also, if the page containing mov cr0, eax is not identity-mapped, you can't count on instruction fetch from the virtual page (mapped to a different physical address) happening for the address right after the mov. That's one reason Intel needs to at least document the behaviour, even though I'm pretty sure most code doesn't actually need a jmp right away.

    A jump right away seems like a reasonable defensive-coding practice (on CPUs before P6 / P4) that ensures you get a page-fault right away if your page tables are wrong, not confusingly at some later point once the prefetch buffer gets to instructions that were fetched from virtual CS:EIP.

    Intel manuals have tended to be conservative about recommended code sequences for stuff like this. For example, their sequence for changing from real mode to long mode involved temporarily entering protected mode. Current CPUs in practice work fine if you go directly from real to long mode, and the docs for the control-register bits imply it should work, but some people are reluctant to assume future CPUs will support that shorter sequence because Intel's manuals don't show an example or even mention doing it that way. See https://wiki.osdev.org/Setting_Up_Long_Mode. Most OSes only do this once, so performance is irrelevant, and it only takes a bit of extra startup code (maybe a few tens of bytes of kernel size), so it's not crazy to follow the manual's example, because that's pretty much guaranteed not to break in future CPUs.

    In this case I suspect this was just the easiest and simplest way to communicate that if you get the page tables wrong, you'll get a page fault in fetching the jump target, not the instruction right after mov cr0, eax, unless you're on a newer CPU. Because software bugs do happen, and it's hard enough to understand them, especially back in the bad old days when it was more common to test only on real hardware, not in an emulator where you could single-step and have it tell you why the system crashed even if you're in the middle of switching modes and don't have an IVT or IDT set up with any crash-reporting double-fault handler.

    On paper, Intel doesn't guarantee that all later CPUs (from Intel or other vendors) will still have a serializing mov cr0, src, so they recommend doing the jmp here (probably just for early and consistent detection of problems with page tables).