The x86 architecture has used prefetching ever since its original ancestor, the 8086. To my knowledge, all modern microarchitectures do some kind of prefetching, e.g. into the instruction cache.
Let's assume that a program has set up a CS segment and is currently executing instructions towards the end of said segment. If the last instruction were to fall through, i.e. not be a jump back (or a far jump "out"), I would definitely expect the next instruction to fault, as it lies outside of CS.
If some kind of prefetching is enabled, I would expect it to want to fetch outside of CS. Clearly, this shouldn't fault right away, as EIP has not yet reached the forbidden area. Is it guaranteed that the prefetcher either ignores these faults? Or is prefetching even guaranteed not to stray outside of CS? Is this documented somewhere and, if so, where?
L2 hardware prefetch (e.g. Intel's "streamer") is based on physical address and knows nothing of segmentation. It might tend to stop at 4K physical page boundaries because contiguous virtual addresses might not be physically contiguous at a page boundary. (And modern x86 CPUs are optimized for paging being enabled, since that's how modern OSes use them, so they wouldn't bother to have extra hardware to keep going in modes where paging is disabled.)
Modern Intel and AMD might not use hardware prefetch into L1i cache at all, just demand code-fetch from wherever branch prediction speculatively takes the front-end. https://chipsandcheese.com/p/amds-zen-4-part-1-frontend-and-execution-engine doesn't mention any for Zen 4. If there is any, it would probably be based just on physical address, certainly not knowing about CS.limit. (Check Intel's or AMD's optimization guides for details about their HW prefetch.)
> Is it guaranteed that the prefetcher either ignores these faults?
Hardware cache prefetch doesn't use logical addresses, so there is no fault to ignore.
Speculative fetch (and decode and exec) encountering a fault and ignoring it is totally normal for any kind of fault. E.g. consider a `do { p = p->next; } while (p != NULL)` loop; the final iteration will mispredict and actually run the loop body again, trying to load from address 0, which would #PF page fault.
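To make that concrete, here's a minimal, self-contained C sketch of such a loop (the list-building code and names are mine, purely for illustration). On the final iteration the loop branch is typically predicted as taken again, so the CPU speculatively runs the body once more with p == NULL and tries a load from near address 0; that would #PF, but the mispredict is discovered first and the speculative work, fault included, is thrown away.

```c
#include <stdio.h>

struct node { struct node *next; int val; };

/* Sum a non-empty, NULL-terminated singly linked list.
 * After the last node, branch prediction usually assumes the loop
 * continues, so the CPU speculatively runs the body with p == NULL,
 * i.e. it tries to load from near address 0.  That load would #PF,
 * but the fault is never architecturally delivered: the mispredict
 * is detected first and the speculative work is discarded. */
static int sum_list(struct node *head)
{
    struct node *p = head;
    int total = 0;
    do {
        total += p->val;
        p = p->next;   /* speculative load from near address 0 on the extra iteration */
    } while (p != NULL);
    return total;
}

int main(void)
{
    /* Build a tiny 3-element list: 1 -> 2 -> 3 -> NULL. */
    struct node c = { NULL, 3 }, b = { &c, 2 }, a = { &b, 1 };
    printf("%d\n", sum_list(&a));   /* prints 6 */
    return 0;
}
```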
CPUs handle this by not actually doing anything about faults until the faulting instruction reaches retirement, i.e. becomes non-speculative because all previous instructions have retired. If any previous instruction was a mispredicted branch, or raised an exception that should have been taken, the CPU discards the speculative work done past that point and starts fetching from the correct place.
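As a rough illustration of that retirement rule, here's a toy software model (all names are made up; this assumes nothing about any real microarchitecture): each in-flight instruction gets a ROB entry recording whether it faulted or was a mispredicted branch, and only the in-order retire loop acts on those flags.

```c
#include <stdbool.h>
#include <stdio.h>

enum { ROB_SIZE = 8 };

struct rob_entry {
    const char *name;
    bool done;          /* finished executing (possibly speculatively) */
    bool faulted;       /* e.g. would #PF or #GP(limit) */
    bool mispredicted;  /* branch whose prediction turned out wrong */
};

/* Retire strictly in program order; faults and mispredicts only matter here. */
static void retire(struct rob_entry *rob, int count)
{
    for (int i = 0; i < count; i++) {
        if (!rob[i].done)
            break;                          /* oldest not finished: stop for now */
        if (rob[i].mispredicted) {
            printf("%s: mispredict -> flush %d younger entries, refetch\n",
                   rob[i].name, count - i - 1);
            return;                         /* younger speculative faults never taken */
        }
        if (rob[i].faulted) {
            printf("%s: fault taken at retirement\n", rob[i].name);
            return;
        }
        printf("%s: retired normally\n", rob[i].name);
    }
}

int main(void)
{
    /* Final real loop iteration: the loop branch is mispredicted "taken",
     * and the speculatively executed load behind it faults.  The fault is
     * discarded because the mispredicted branch retires first. */
    struct rob_entry rob[ROB_SIZE] = {
        { "add (loop body)",          true, false, false },
        { "jnz loop (mispredicted)",  true, false, true  },
        { "mov eax,[0] (spec fault)", true, true,  false },
    };
    retire(rob, 3);
    return 0;
}
```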
So it's pretty much up to the CPU architect where/when EIP is checked against CS.limit; you could check it early enough to shut down useless prefetch past a fetch that would have faulted. CPUs already stop fetch/decode when they encounter an instruction like `int` or `syscall` that execution can never proceed past. For data loads/stores, the logical (seg:off) to linear address calculation already special-cases base=0 (the flat memory model that all mainstream OSes use), giving 1 cycle lower latency from address to load result. Code-fetch is probably the same, basically using EIP / RIP directly as the linear address to look up in the TLB when CS.base=0. Limit checking can be done in parallel by the hardware, with an out-of-limit fetch treated like an `int` instruction, stopping further fetches.
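Here's a small C sketch of that seg:off-to-linear calculation and the independent limit check (field and function names are mine; real hardware does this in fetch/AGU logic, this is only to show why base=0 makes the linear address trivially equal to EIP, and why the limit comparison can proceed in parallel with the addition):

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct segment {
    uint32_t base;
    uint32_t limit;   /* highest valid offset (expand-up, byte granular here) */
};

/* Compute the linear address for a code fetch and report whether the
 * offset is within the segment limit. */
static bool fetch_addr(const struct segment *cs, uint32_t eip,
                       uint32_t *linear_out)
{
    /* Linear address: with base == 0 (flat model) this is just EIP,
     * which is why that case can be a cycle faster in real designs. */
    *linear_out = cs->base + eip;

    /* The limit check doesn't depend on the addition above, so hardware can
     * do both at once; an out-of-limit result means #GP instead of a fetch. */
    return eip <= cs->limit;
}

int main(void)
{
    struct segment cs = { .base = 0, .limit = 0x0FFF };  /* 4 KiB code segment */
    uint32_t lin;
    bool ok;

    ok = fetch_addr(&cs, 0x0FFE, &lin);
    printf("EIP=0x0FFE ok=%d linear=0x%X\n", ok, lin);

    ok = fetch_addr(&cs, 0x1000, &lin);
    printf("EIP=0x1000 ok=%d (out of limit: would #GP, fetch stops)\n", ok);
    return 0;
}
```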
Or, if it saves transistors, you could optimize for the `CS.limit = unlimited` case that all mainstream x86 OSes use in protected mode (and which is the only option in 64-bit mode), and only mark those bytes as having been fetched from out of bounds? (But then you'd need extra bits in the pipeline to carry that mark, including in every ROB (reorder buffer) entry, so this probably doesn't make sense.)
I don't know if there is branch-prediction / speculation past a `jmp far` that loads a new CS base and limit; I'd guess maybe not, since it's used so rarely. It would presumably need to predict both the new EIP / RIP (since a later `call` has to push that as data) and the new CS.base (or the linear address for CS:EIP) for code-fetch. But it wouldn't have to predict a CS.limit; that check could be deferred until retirement.
Anyway, however the hardware accomplishes it, you definitely don't get spurious segment-limit faults from code-fetch. But hardware cache-fill certainly can pull in lines from past CS.limit (in write-back memory regions). And some bytes from outside CS.limit might even make it into the CPU core proper, getting fetched (or pre-fetched into the prefetch buffer on an in-order CPU). Or maybe not; that depends on the design.
An in-order CPU that (like the 8086) just discards its prefetch buffer on jumps (including far jumps which set a new CS) wouldn't get any benefit from prefetch past CS.limit, but its prefetch buffer is also small. It might or might not be worthwhile to build extra hardware to check CS.limit during prefetch; the only benefit would be in the rare case that execution is very close to the end of CS.limit, where a useless code-fetch memory cycle could delay a data load or store that would otherwise start sooner.
Given the limited transistor budgets in the 486 and earlier CPUs, I wouldn't be surprised if they only check CS.limit when actually decoding from the prefetch buffer, not when fetching into it.
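To illustrate that last point, here's a toy software model of an in-order CPU whose fetch side ignores CS.limit and whose decode side checks it only when consuming bytes from the prefetch queue (entirely hypothetical names and structure, not a claim about actual 8086/486 internals):

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

enum { QUEUE_SIZE = 6 };

struct cpu {
    uint8_t  mem[64];                 /* pretend memory, CS.base = 0 */
    uint32_t cs_limit;                /* highest valid offset within CS */
    uint32_t ip;                      /* offset of next byte to decode */
    uint32_t fetch_ip;                /* offset of next byte to prefetch */
    uint8_t  queue[QUEUE_SIZE];
    int      qlen;
};

/* Prefetch side: blindly fetches the next byte, even past CS.limit. */
static void prefetch(struct cpu *c)
{
    if (c->qlen < QUEUE_SIZE) {
        c->queue[c->qlen++] = c->mem[c->fetch_ip % sizeof c->mem];
        c->fetch_ip++;
    }
}

/* Decode side: the CS.limit check happens only here, when a byte is consumed. */
static bool decode_one(struct cpu *c)
{
    if (c->qlen == 0)
        return false;                 /* nothing left in the queue */
    if (c->ip > c->cs_limit) {
        printf("#GP: IP=0x%X is past CS.limit=0x%X\n", c->ip, c->cs_limit);
        return false;
    }
    uint8_t op = c->queue[0];
    c->qlen--;
    memmove(c->queue, c->queue + 1, c->qlen);
    printf("decoded byte 0x%02X at IP=0x%X\n", op, c->ip);
    c->ip++;
    return true;
}

int main(void)
{
    struct cpu c = { .cs_limit = 2 };        /* only offsets 0..2 are inside CS */
    memset(c.mem, 0x90, sizeof c.mem);       /* fill memory with NOPs */

    for (int i = 0; i < QUEUE_SIZE; i++)     /* prefetch runs past the limit: no fault */
        prefetch(&c);
    while (decode_one(&c)) {}                /* faults only when IP itself passes the limit */
    return 0;
}
```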