Tags: cpu, cpu-architecture, prefetch

Which execution unit in the CPU executes the prefetch instruction?


According to Intel's manual, prefetch instructions generally do not trigger faults or exceptions, which is different from regular load instructions.

> PREFETCH provides a hint to the hardware; it does not generate exceptions or faults except for a few special cases (see Section 9.3.3, “Prefetch and Load Instructions”). However, excessive use of PREFETCH instructions may waste memory bandwidth and result in a performance penalty due to resource constraints.

Considering that different ports in the CPU microarchitecture are used to handle different types of instructions, do (software) prefetch and load instructions also use the same port during μop dispatch and execution stage?

How about hardware prefetch?


Solution

  • Hardware prefetch doesn't use an execution unit: a separate piece of hardware generates extra requests to pull data into L2, for example, or into L1d, not by feeding uops into the instruction stream to be executed. (In Intel CPUs this works by injecting requests into the superqueue or a line-fill buffer; the most important prefetcher is the L2 streamer, which watches for patterns of accesses to L2 cache.) It's similar to how the page-walkers can generate extra load requests to L1d in parallel with the load execution units, probably contending with them for limited cache read ports. (Ports as in multi-ported SRAM, not as in execution ports/units.)


    Software prefetch runs on a load execution unit. Any differences from normal (demand) loads are handled by minor extra functionality in the load execution unit.

    It needs address-generation and needs to work almost exactly like a load, even for x86 prefetchw or equivalent (write-intent, i.e. prefetch into MESI Exclusive state with a Read For Ownership. That's still a load if not present in cache.) But unlike a demand load, it can give up and do nothing if all buffers are already full for tracking new cache lines, since it's just a hint.
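    As a concrete illustration from the software side, here's a sketch using GCC/Clang's `__builtin_prefetch` builtin (the function name and the 16-element prefetch distance are arbitrary choices for illustration, not tuned values). Its second argument expresses write intent, which on x86 CPUs that support it can compile to `prefetchw`, requesting the line in Exclusive state via an RFO:

```c
#include <stddef.h>

/* Sketch: prefetch with write intent ahead of a read-modify-write loop.
 * __builtin_prefetch(addr, rw, locality): rw=1 means write intent
 * (can compile to x86 prefetchw where supported); locality=3 means
 * keep the line in all cache levels. The 16-element distance is an
 * arbitrary illustration, not a tuned value. */
void sum_and_update(int *a, size_t n) {
    size_t i = 0;
    for (; i + 16 < n; i++) {
        __builtin_prefetch(&a[i + 16], 1, 3);  /* hint only: may be dropped */
        a[i] += 1;
    }
    for (; i < n; i++)  /* tail: no prefetching past the end of the array */
        a[i] += 1;
}
```

Because it's only a hint, the compiler and CPU are both free to drop the prefetch; the loop's results are identical with or without it.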

    I'd be surprised if any microarchitecture for any ISA had different execution units for SW prefetch vs. loads (unless the ISA is very different from mainstream ones like x86 and AArch64). The tradeoff is adding a bit of flexibility to a load unit vs. adding a whole new unit: a prefetch-only execution unit would need most of what a load unit has, including address-generation, a TLB read port, and probing L1d tags to see if the data's already there before generating a request. Also snooping the store buffer, and checking already in-flight cache lines so you don't use up another buffer for a line that's already on its way.

    If you're going to build all that, it's vastly more useful to have it also able to run normal loads, instead of trying to share a smaller number of TLB and cache-tag read ports, or worse, having more ports but normally not using them all. (Or I guess you could put it on the same execution port / pipe as a load unit so they couldn't both start a uop in the same cycle, but you'd still be replicating a lot of functionality and would need to manage sharing of read ports for cache and TLB, etc.)


    Not faulting

    Not faulting is actually very easy: normal demand loads only fault if they reach retirement (become non-speculative). The normal way to handle this is that the ROB (ReOrder Buffer) entry tracking the load is marked as raising an exception if it retires, as part of the load execution unit completing the work of executing the uop and marking the ROB entry as complete (ready to retire). So there's already a fault-or-not bit per ROB entry, or perhaps multiple bits to encode which kind of fault.

    The fact that faulting loads don't do anything special until retirement is part of the key to Meltdown (along with the actual data being forwarded to dependent uops, which will never become architecturally visible ... except via timing side channels). See Out-of-order execution vs. speculative execution for more details about how speculative execution works. As far as I know, all out-of-order exec CPUs use this same strategy of not doing anything about faulting instructions until retirement; being vulnerable to Meltdown or not is a matter of whether dependent uops execute after a faulting load, and if so what data they see. (e.g. always-zero would be fine.)
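    As a toy model of that mechanism (not any real microarchitecture; the names and structure here are invented for illustration): execution just records a fault kind in the ROB entry, and nothing architectural happens until in-order retirement reaches that entry:

```c
#include <stddef.h>
#include <stdbool.h>

/* Toy model, not real hardware: a ROB entry carries a fault flag that
 * is only acted on at retirement. A SW prefetch would simply leave
 * fault == FAULT_NONE no matter what its page walk found. */
enum fault_kind { FAULT_NONE, FAULT_PAGE, FAULT_DIV0 };

struct rob_entry {
    bool complete;          /* execution unit marked the uop done */
    enum fault_kind fault;  /* checked only when this entry retires */
};

/* Retire completed entries in program order; stop at the first fault
 * (where a real CPU would flush and vector to a handler) or at the
 * oldest incomplete entry. Returns how many entries retired cleanly. */
size_t retire(const struct rob_entry *rob, size_t n, enum fault_kind *raised)
{
    *raised = FAULT_NONE;
    for (size_t i = 0; i < n; i++) {
        if (!rob[i].complete)
            return i;                 /* oldest uop not finished yet */
        if (rob[i].fault != FAULT_NONE) {
            *raised = rob[i].fault;   /* fault becomes architectural here */
            return i;
        }
    }
    return n;
}
```

Note that a younger uop can be complete (it executed, possibly with forwarded data, which is the Meltdown angle) yet never retire, because an older faulting entry stops retirement first.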

    Software prefetch instructions simply don't set that fault-on-retirement bit in the ROB entry, regardless of the address or the result of a page-walk. (Even a non-canonical address doesn't fault on x86-64.) They're only hints, so on a TLB miss a CPU might choose not to even start a page-walk if, say, both page-walk units are already busy.
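    You can observe the no-fault behaviour from user space. A sketch assuming GCC/Clang's `__builtin_prefetch` (which compiles to `prefetcht0` etc. on x86; the pointer value is an arbitrary garbage address chosen for illustration): prefetching through it is architecturally a no-op, where a demand load would raise SIGSEGV:

```c
#include <stdint.h>

/* Prefetching a bogus address -- even a non-canonical one on x86-64 --
 * never faults; the hint is just dropped. Dereferencing this pointer
 * with a demand load would crash the process. */
int prefetch_bad_address(void) {
    char *bad = (char *)(uintptr_t)0xdeadbeefdeadbeefULL;  /* non-canonical */
    __builtin_prefetch(bad, 0, 0);  /* read hint, low locality: no effect */
    return 0;  /* reached: no signal was delivered */
}
```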

    The same fault-on-retirement mechanism is used to make speculative execution possible for all instructions, including others like div that might fault for other reasons, and normal demand loads. (Older CPUs used to handle branches the same way, only recovering from mispredicts on retirement, but branches are special and real programs do have branch mispredicts on fast paths, so extra hardware (a "branch order buffer") enables "fast recovery", starting when a mispredict is first detected.)

    The trick is that every instruction is treated as speculative until retirement, whether any branch or could-fault instructions executed recently or not. Instructions that could fault or mispredict are too common in real code to optimize for any other case.

    The only problem that doesn't solve is stores: mis-speculated store data must not become visible to other cores. Normally that means we can't let them write directly to L1d cache. Store buffers are the solution to that problem: Can a speculatively executed CPU branch contain opcodes that access RAM?


    Not blocking retirement on cache-miss

    On x86, normal loads have to fully finish (data arrived, even on a cache miss) before they can retire. This is part of how most x86 CPUs maintain strong memory ordering, e.g. LoadLoad and LoadStore, unlike weakly-ordered ISAs, where a load can fully retire as soon as it's known to be non-faulting, using other (cheaper) mechanisms to track the fact that the register result is still waiting for a cache line to arrive, much as in-order CPUs can and do.

    But prefetch instructions don't need to wait for the data; once they feed the request for a cache line into a line-fill buffer (which tracks cache lines that L1d is waiting for), they can mark the uop as ready to retire in the ROB. (Even for normal loads, the uop can leave the scheduler when first dispatched, even if it results in a cache miss.)

    So x86 load execution units need to support that different behaviour, too. On weakly-ordered ISAs, even demand loads can be more "fire and forget" in terms of the ROB, with, I assume, just the load-buffer entry watching for the data to arrive and signaling dependent uops that their inputs are ready.

    But they already need to handle lots of cases like TLB miss, L2 TLB miss triggering a page-walk, eventual page-fault, cache-line split or even page-split requiring another cycle later on the same port, on ISAs that allow unaligned loads. As well as handling uncacheable (perhaps MMIO) vs. cacheable loads, e.g. x86 MTRR or PAT. (SW prefetch works like a byte load, so it doesn't have to worry about splits.)

    So the few modifications needed to support SW prefetch are pretty minor compared to the complexity of a load port, and they're used rarely enough that it wouldn't make sense to have another whole execution unit for prefetches.


    For x86, you can check https://uops.info/ and see they use a load port on Intel CPUs. (e.g. port 2 or 3 in Skylake / Ice Lake, or port 2/3/A in Alder Lake). uops.info doesn't actually have execution-port details for AMD for most non-SIMD instructions, but it would be the same there, too.