I'm trying to understand how the "fetch" phase of the CPU pipeline interacts with memory.
Let's say I have these instructions:
4: bb 01 00 00 00 mov $1,%ebx
9: bb 02 00 00 00 mov $2,%ebx
e: b3 03 mov $3,%bl
What happens if CPU1 writes 00 48 c7 c3 04 00 00 00
to memory address 8 (i.e. 64-bit aligned) while CPU2 is executing these same instructions? The instruction stream would atomically change from 2 instructions to 1 like this:
4: bb 01 00 00 00 mov $1,%ebx
9: 48 c7 c3 04 00 00 00 mov $4,%rbx
Since CPU1 is writing to the same memory that CPU2 is reading from, there's contention.
Would the write cause the CPU2 pipeline to stall while it refreshes its L1 cache?
Let's say that CPU2 has just completed the "fetch" phase for mov $2; would that be discarded in order to re-fetch the updated memory?
Additionally there's the issue of atomicity when changing 2 instructions into 1.
I found this quite old document that mentions "The instruction fetch unit fetches one 32-byte cache line in each clock cycle from the instruction cache memory" which I think can be interpreted to mean that each instruction gets a fresh copy of the cache line from L1, even if they share the same cache line. But I don't know if/how this applies to modern CPUs.
If the above is correct, that would mean after fetching mov $2 into the pipeline, it's possible the next fetch would get the updated value at address e and try to execute 00 00 (add %al,(%rax)), which would probably fail.
But if the fetch of mov $2 brings mov $3 into an "instruction cache", would it make sense to think that the next fetch would just get the instruction from that cache (and return mov $3) without re-querying L1? This would effectively make the fetch of these 2 instructions atomic, as long as they share a cache line.
So which is it? Basically there are too many unknowns and too much I can only speculate about, so I'd really appreciate a clock-cycle-by-clock-cycle breakdown of how 2 fetch phases of the pipeline interact with (changes in) the memory they access.
As Chris said, an RFO (Read For Ownership) can invalidate an I-cache line at any time.
Depending on how superscalar fetch-groups line up, the cache line can be invalidated after fetching the 5-byte mov at 9:, but before fetching the next instruction at e:.
When fetch eventually happens (this core gets a shared copy of the cache line again), RIP = e and it will fetch the last 2 bytes of the mov $4,%rbx. Cross-modifying code needs to make sure that no other core is executing in the middle of where it wants to write one long instruction. In this case, you'd get 00 00 add %al,(%rax).
Also note that the writing CPU needs to make sure the modification is atomic, e.g. with an 8-byte store (Intel P6 and later CPUs guarantee that stores up to 8 bytes at any alignment within 1 cache line are atomic; AMD doesn't), or lock cmpxchg or lock cmpxchg16b. Otherwise it's possible for a reader to see partially updated instructions. You can consider instruction-fetch to be doing atomic 16-byte loads or something like that.
"The instruction fetch unit fetches one 32-byte cache line in each clock cycle from the instruction cache memory" which I think can be interpreted to mean that each instruction gets a fresh copy of the cache line from L1,
No.
That wide fetch block is then decoded into multiple x86 instructions! The point of wide fetch is to pull in multiple instructions at once, not to redo it separately for each instruction. That document seems to be about P6 (Pentium III), although P6 only does 16 bytes of actual fetch at once, into a 32-byte wide buffer that lets the CPU take a 16-byte window.
P6 is 3-wide superscalar, and every clock cycle can decode up to 16 bytes of machine code containing up to 3 instructions. (But there's a pre-decode stage to find instruction lengths first...)
See Agner Fog's microarch guide (https://agner.org/optimize/) for the details, with a focus on what's relevant for tuning software performance. Later microarchitectures add queues between pre-decode and decode. See those sections of Agner Fog's microarch guide, and https://realworldtech.com/merom/ (Core 2).
And of course see https://realworldtech.com/sandy-bridge for more modern x86 with a uop cache. Also https://en.wikichip.org/wiki/amd/microarchitectures/zen_2#Core for recent AMD.
For good background before reading any of those, Modern Microprocessors: A 90-Minute Guide!.
For a core modifying its own code, see: Observing stale instruction fetching on x86 with self-modifying code - that's different (and harder) because out-of-order exec of the store has to be sorted out from code-fetch of earlier vs. later instructions in program order. i.e. the moment at which the store must become visible is fixed, unlike with another core where it just happens when it happens.
Update with random facts I didn't know 2 years ago:
A serializing instruction like cpuid (or the new serialize) is needed after a load sees a code_updated flag, to be guaranteed to see the new code. Unlike data accesses, where every load is an acquire load on x86, that's not guaranteed for cross-modifying code. Serializing instructions serialize the front-end as well, unlike mfence or lock add.
Also, code-fetch doesn't necessarily happen in aligned 16-byte chunks from L1i cache, so even an atomic store by one core could potentially be "torn" by a misaligned fetch-block, so even replacing one long instruction by 2 of the same total length isn't safe. Even though no other core could have a RIP that points into the middle of that changed region.
(I'm not 100% sure this is a real thing, but it might be. And it's not guaranteed on paper, which is a good reason for hot-patching in OSes like Windows to take care to quiesce all threads of a process before updating machine code.)