Tags: performance, x86, cpu-architecture, page-fault

Is it possible to “abort” when loading a register from memory rather than triggering a page fault?


I am thinking about 'Minimizing page faults (and TLB faults) while “walking” a large graph'.

'How to know whether a pointer is in physical memory or it will trigger a Page Fault?' is a related question looking at the problem from the other side, but does not have a solution.

I wish to be able to load some data from memory into a register, but have the load abort rather than taking a page fault if the memory is currently paged out. I need the code to work in user space on both Windows and Linux without needing any non-standard permissions.

(Ideally, I would also like to abort on a TLB fault.)


Solution

  • There's unfortunately no instruction that just queries the TLB or the current page table with the result in a register, on x86 (or any other ISA I know of). Maybe there should be, because it could be implemented very cheaply.

    (For querying virtual memory for pages being paged out or not, there is the Linux system call mincore(2), which produces a bitmap of present/absent for a range of pages (given as void* start / size_t length). That's maybe similar to the HW page tables, so it could probably let you avoid page faults without having to touch the memory first, but it's unrelated to TLB or cache. And it maybe doesn't rule out soft page faults, only hard ones. And of course that's only the current situation: pages could be evicted between query and access.)
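    As a rough illustration (my own sketch, not from the question), a helper like the hypothetical page_is_resident below uses mincore(2) to take a snapshot of whether the page backing a pointer is currently in RAM. The result can already be stale by the time you dereference the pointer:

    ```c
    #define _DEFAULT_SOURCE          /* for mincore() on glibc */
    #include <stdint.h>
    #include <unistd.h>
    #include <sys/mman.h>

    /* Hypothetical helper: 1 if the page containing p is resident in RAM,
     * 0 if not, -1 on error (e.g. unmapped address -> ENOMEM).
     * This is only a snapshot: the kernel may evict or fault in the page
     * between this query and the actual load. */
    static int page_is_resident(const void *p)
    {
        long pagesz = sysconf(_SC_PAGESIZE);
        uintptr_t page = (uintptr_t)p & ~((uintptr_t)pagesz - 1); /* round down to page boundary */
        unsigned char vec;

        if (mincore((void *)page, 1, &vec) != 0)
            return -1;
        return vec & 1;               /* low bit of each byte = resident */
    }
    ```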


    Would a CPU feature like this be useful? Probably yes, for a few cases

    Such a thing would be hard to use in a way that paid off, because every "false" attempt is CPU time / instructions that didn't accomplish any useful work. But a case like this could possibly be a win, when you don't care what order you traverse a tree / graph in, and some nodes might be hot in cache, TLB, or even just RAM while others are cold or even paged out to disk.

    When memory is tight, touching a cold page could even evict a currently-hot page before you get to it.

    Normal CPUs (like modern x86) can do speculative / out-of-order page walks (to fill TLB entries), and definitely speculative loads into cache, but not page faults. Page faults are handled in software by the kernel. Taking a page-fault can't happen speculatively, and is serializing. (CPUs don't rename the privilege level.)

    So software prefetch can cheaply get the hardware to fill TLB and cache while you touch other memory, if the one you're going to touch 2nd was cold. If it was hot and you touch the cold side first, that's unfortunate. If there was a cheap way to check hot/cold, it might be worth using it to always go the right way (at least on the first step) in traversal order when one pointer is hot and the other is cold. Unless a read-only transaction is quite cheap, it's probably not worth actually using Margaret's clever answer.
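    For example (my sketch, with an assumed node layout), a traversal can issue a software prefetch for the child it will visit second before working on the first, so a cache miss (and, as noted above, a TLB fill) for the second child overlaps with useful work instead of stalling later. Prefetches to a not-present page are simply dropped, so this never triggers a page fault:

    ```c
    #include <stddef.h>

    struct node {
        struct node *left, *right;    /* assumed layout for this sketch */
        int key;
    };

    static void visit(struct node *n, void (*f)(struct node *))
    {
        if (!n)
            return;
        /* Hint the hardware to start pulling the right child into cache
         * (read access, high temporal locality) while we work on the left.
         * A prefetch of a not-present page is dropped, not faulted on. */
        if (n->right)
            __builtin_prefetch(n->right, 0, 3);
        f(n);
        visit(n->left, f);
        visit(n->right, f);
    }
    ```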

    If you have 2 pointers you will eventually dereference, and one of them points to a page that's been paged out while the other is hot, the best case would be to somehow detect this and get the OS to start paging in one page from disk in the background while you traverse the side that's already in RAM. (e.g. with Windows PrefetchVirtualMemory or Linux madvise(MADV_WILLNEED). See answers on the OP's other question: Minimizing page faults (and TLB faults) while "walking" a large graph)

    This will require a system call, but system calls are expensive and pollute caches + TLBs, especially on current x86 where Spectre + Meltdown mitigation adds thousands of clock cycles. So it's not worth it to make a VM prefetch system call for one of every pair of pointers in a tree. You'd get a massive slowdown for cases when all the pointers were in RAM.
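    If a hint like that is worth making at all, batching it over a whole cold region amortizes the syscall cost far better than one call per pointer. A minimal Linux-only sketch (my example; the helper name prefetch_region is made up) could look like this, with PrefetchVirtualMemory playing the equivalent role on Windows:

    ```c
    #define _DEFAULT_SOURCE          /* for madvise() on glibc */
    #include <stdint.h>
    #include <unistd.h>
    #include <sys/mman.h>

    /* Ask the kernel to start reading [p, p+len) into RAM in the background.
     * madvise() needs a page-aligned start address; it's only a hint, so we
     * ignore failures. One call covering many nodes amortizes the syscall
     * (and Spectre/Meltdown mitigation) overhead. */
    static void prefetch_region(const void *p, size_t len)
    {
        long pagesz = sysconf(_SC_PAGESIZE);
        uintptr_t start = (uintptr_t)p & ~((uintptr_t)pagesz - 1);

        (void)madvise((void *)start, len + ((uintptr_t)p - start), MADV_WILLNEED);
    }
    ```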


    CPU design possibilities

    Like I said, I don't think any current ISAs have this, but I think it would be easy to support in hardware with instructions that run kind of like load instructions, but produce a result based on the TLB lookup instead of fetching data from L1d cache.

    There are a couple possibilities that come to mind:

    The query idea would only be useful for performance, not correctness, because there'd always be a gap between querying and using during which the page could be unmapped. (Like the Windows IsBadXxxPtr functions, except that those probably check the logical memory map, not the hardware page tables.)

    A try_load insn that also sets/clears flags instead of raising #PF could avoid the race condition. You could have different versions of it, or it could take an immediate to choose the abort condition (e.g. TLB miss without attempting a page walk).

    These instructions could easily decode to a load uop, probably just one. The load ports on modern x86 already support normal loads, software prefetch, broadcast loads, zero or sign-extending loads (movsx r32, m8 is a single uop for a load port on Intel), and even vmovddup ymm, m256 (two in-lane broadcasts) for some reason, so adding another kind of load uop doesn't seem like a problem.

    Loads that hit a TLB entry they don't have permission for (kernel-only mapping) do currently behave specially on some x86 uarches (the ones that aren't vulnerable to Meltdown). See The Microarchitecture Behind Meltdown on Henry Wong's blog (stuffedcow.net). According to his testing, some CPUs produce a zero for speculative execution of later instructions after a TLB/page miss (entry not present). So we already know that doing something with a TLB hit/miss result should be able to affect the integer result of a load. (Of course, a TLB miss is different from a hit on a privileged entry.)

    Setting flags from a load is not something that ever normally happens on x86 (only from micro-fused load+alu), so maybe it would be implemented with an ALU uop as well, if Intel ever did implement this idea.

    Aborting on a condition other than TLB/page miss or L1d miss would require outer levels of cache to also support this special request, though. A try_load that runs if it hits L3 cache but aborts on L3 miss would need support from the L3 cache. I think we could do without that, though.

    The low-hanging fruit for this CPU-architecture idea is reducing page faults and maybe page walks, which are significantly more expensive than L3 cache misses.

    I suspect that trying to branch on L3 cache misses would cost you too much in branch misses for it to really be worth it vs. just letting out-of-order exec do its thing. Especially if you have hyperthreading so this latency-bound process can happen on one logical core of a CPU that's also doing something else.