Tags: multithreading, caching, cpu-architecture, cpu-cache

Does the processor stall during a cache coherence operation?


Let's assume that variable a = 0

Processor1: a = 1
Processor2: print(a)

Processor1 executes its instruction first, then in the next cycle Processor2 reads the variable to print it. So:

  1. Is processor2 going to stall until the cache coherence operation completes, so that it prints 1?

    P1:   |--a=1--|---cache--coherence---|----------------
    P2:   ------|stalls due to coherence-|--print(a=1)---|
    time: ----------------------------------------------->
    
  2. Or will processor2 proceed before the cache coherence operation completes, keeping a stale view of memory until then, so that it prints 0?

    P1:   |--a=1--|---cache--coherence---|
    P2:   ----------|---print(a=0)---|----
    time: ------------------------------->
    

    In other words, can a processor have a stale view of memory until cache coherence operations are completed?


Solution

  • All modern ISAs use (a variant of) MESI for cache coherency. This maintains coherency at all times of the shared view of memory (through cache) that all processors have.

    See for example Can I force cache coherency on a multicore x86 CPU? It's a common misconception that stores go into cache while other cores still have old copies of the cache line, and then "cache coherence" has to happen.

    But that's not the case: to modify a cache line, a CPU needs to have exclusive ownership of the line (Modified or Exclusive state of MESI). This is only possible after receiving responses to a Read For Ownership that invalidates all other copies of the cache line, if it was in Shared or Invalid state before. See Will two atomic writes to different locations in different threads always be seen in the same order by other threads? for example.


    However, memory models allow local reordering of stores and loads. Sequential consistency would be too slow, so CPUs always allow at least StoreLoad reordering. See also Is mov + mfence safe on NUMA? for lots of details about the TSO (total store order) memory model used on x86. Many other ISAs use an even weaker model.
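The StoreLoad reordering mentioned above can be exercised with the classic litmus test; a minimal C++ sketch (function and variable names are my own, not from the linked answers). Under sequential consistency the outcome (r1, r2) == (0, 0) would be impossible, but with relaxed ordering (or a plain store followed by a load on TSO hardware) it is allowed:

```cpp
#include <atomic>
#include <thread>
#include <utility>

std::atomic<int> x{0}, y{0};

// One trial of the StoreLoad litmus test:
//   T1: x = 1; r1 = y;      T2: y = 1; r2 = x;
// Relaxed ordering permits each thread's load to be reordered
// ahead of its store, so (r1, r2) == (0, 0) is a legal outcome.
std::pair<int, int> trial() {
    x.store(0, std::memory_order_relaxed);
    y.store(0, std::memory_order_relaxed);
    int r1 = -1, r2 = -1;
    std::thread t1([&] {
        x.store(1, std::memory_order_relaxed);
        r1 = y.load(std::memory_order_relaxed);
    });
    std::thread t2([&] {
        y.store(1, std::memory_order_relaxed);
        r2 = x.load(std::memory_order_relaxed);
    });
    t1.join();
    t2.join();
    return {r1, r2};
}
```

Any single run can produce (0,0), (0,1), (1,0), or (1,1); observing (0,0) in practice usually takes many iterations.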

    For an unsynchronized reader in this case, there are three possibilities if both are running on separate cores: P2's load can execute before P1's store commits at all, so P2 reads 0 from its own still-valid copy of the line; the store can have executed but still be sitting in P1's store buffer, not yet globally visible, so P2 again reads 0; or the store can already have committed, invalidating P2's copy, so P2's load misses in cache and stalls until a coherence request brings in the line holding 1.

    So yes, the load will stall if the other core has already committed the store to cache before this core tries to load it.
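As an illustration of these possibilities, the question's scenario can be written with relaxed atomics so that the racy read is well-defined C++ (a sketch; the helper name is mine). Either printed value is a permitted outcome:

```cpp
#include <atomic>
#include <thread>

// The question's scenario: P1 stores 1, P2 reads concurrently.
// The reader may legitimately observe 0 (store not yet committed,
// or still in the writer's store buffer) or 1 (store committed,
// coherence request already served).
int run_once() {
    std::atomic<int> a{0};
    std::thread writer([&] {
        a.store(1, std::memory_order_relaxed);  // P1: a = 1
    });
    int seen = a.load(std::memory_order_relaxed);  // P2: print(a)
    writer.join();
    return seen;  // 0 or 1; both are coherent, neither is "wrong"
}
```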

    See also Can a speculatively executed CPU branch contain opcodes that access RAM? and Size of store buffers on Intel hardware? What exactly is a store buffer? for more about the effect of the store buffer on everything, including memory reordering. Also How does memory reordering help processors and compilers?

    Store-buffer reordering doesn't matter here because you have a write-only producer and a read-only consumer. The producer core doesn't wait for its store to become globally visible before continuing, and it can see its own store right away, before it becomes globally visible. It does matter when you have each thread looking at stores done by the other thread; then you need barriers, or sequentially-consistent atomic operations (which compilers implement with barriers). See https://preshing.com/20120515/memory-reordering-caught-in-the-act
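When the reader does need a guarantee of seeing the producer's store, release/acquire synchronization on a flag is enough; a minimal sketch (names are illustrative, not from the answer):

```cpp
#include <atomic>
#include <thread>

// Release/acquire pairing: once the reader observes done == true,
// it is also guaranteed to see the writer's earlier store to data.
int produce_consume() {
    int data = 0;
    std::atomic<bool> done{false};
    std::thread producer([&] {
        data = 42;                                    // plain store
        done.store(true, std::memory_order_release);  // publish
    });
    while (!done.load(std::memory_order_acquire)) {}  // spin until published
    int seen = data;  // guaranteed to be 42
    producer.join();
    return seen;
}
```

On x86 the release store and acquire load compile to plain mov instructions; the ordering guarantee mostly constrains the compiler there, while weaker ISAs need actual barrier or load-acquire/store-release instructions.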

    See also Is incrementing an int effectively atomic in specific cases? for how atomic RMW works with MESI; it's instructive for understanding the concept. (e.g. an atomic RMW can work by having a core hang on to a cache line in Modified state, delaying its response to RFOs or requests to share the line until the write part of the RMW has committed.)
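As a sketch of the consequence of that behavior, atomic RMWs never lose updates even under heavy contention; a concurrent counter in C++ (illustrative, not from the linked Q&A):

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Each fetch_add is one atomic read-modify-write: on typical hardware
// the core can keep the cache line in Modified state for the whole
// operation, so no two increments can interleave and get lost.
int count_to(int nthreads, int iters) {
    std::atomic<int> counter{0};
    std::vector<std::thread> ts;
    for (int i = 0; i < nthreads; ++i)
        ts.emplace_back([&] {
            for (int j = 0; j < iters; ++j)
                counter.fetch_add(1, std::memory_order_relaxed);
        });
    for (auto& t : ts) t.join();
    return counter.load();  // exactly nthreads * iters
}
```

A plain `counter++` on a non-atomic int here would typically lose updates, because the load, increment, and store of different cores could interleave between the RMW's read and write halves.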