I know that x86 processors use the TSO memory model, and I am curious about one thing. I will explain it through an example.
We have two processors (P1 and P2), where P1 stores X=1 to its store buffer and P2 stores X=2 to its store buffer. If P1 or P2 reads X, it first consults its own store buffer and then shared memory. Since X is in each store buffer, P1 will read 1 and P2 will read 2. However, once either of these stores reaches shared memory, it becomes the value read by both processors. For example, if X=2 (from P2's store buffer) reaches shared memory first, both P1 and P2 will read 2 (P1 will not read 1 anymore).
My question is: what happens to the store (X=1) in the other store buffer (P1's)? In other words, what happens to the store that "lost the race to shared memory"? Is it deleted somehow, or does it stay in the store buffer and eventually find its way to shared memory, so that at some point both P1 and P2 start to read 1 instead of 2?
Russ Cox (Golang) in his hardware memory model paper wrote the following:
All the processors are still connected to a single shared memory, but each processor queues writes to that memory in a local write queue. The processor continues executing new instructions while the writes make their way out to the shared memory. **A memory read on one processor consults the local write queue before consulting main memory**, but it cannot see the write queues on other processors. The effect is that a processor sees its own writes before others do. But—and this is very important—all processors do agree on the (total) order in which writes (stores) reach the shared memory, giving the model its name: total store order, or TSO. At the moment that a write reaches shared memory, any future read on any processor will see it and use that value (**until it is overwritten by a later write, or perhaps by a buffered write from another processor**).
The confusion comes from these two bolded parts, since they seem contradictory to me. In the first, Russ says that a read first consults the local buffer and then shared memory; in the second, he says that a buffered write can overwrite a previous write. Thus, I am confused. If a processor consults its buffer first and then shared memory, how will the processor that "lost the store race" (P1) be able to see the write of the processor that "won the store race" (P2)? It could only see it if its own buffer were updated. But then the second bolded text doesn't make sense, since the buffered write (X=1) no longer exists, and "until it is overwritten by a later write" already covers any later write.
What you're asking about is covered by the basics of MESI cache coherency and the answer doesn't change even with relaxed memory ordering like ARMv8. (Or like the ISO C++ memory model.) Whichever core gets exclusive ownership of the cache line first can commit its store first, then the second core overwrites that value with its store. Loads see a value in the modification-order for that location.
If two cores both have pending stores, the modification-order hasn't been established yet, but those cores will see their own stores via store-forwarding. (Other cores won't see stores that haven't yet committed to L1d cache, because that's the only way for stores to become visible to other cores at all in a memory model that doesn't allow IRIW reordering. Commit to L1d cache is when a store becomes globally visible.)
Cores that have a pending store to a location will always see that store as the most recent one, i.e. its value is what they load from there, because it's definitely later in the modification-order for that location than whatever's already in coherent cache. It's not yet established how much later, but (unless the store was the result of mis-speculation and gets rolled back along with loads of its value) it will commit at some point later than whatever store set the value that's already in cache.
If X=2 from P2's buffer is written to shared memory before X=1 (stored in P1's buffer), what will P1 see when it reads X (1 from its buffer or 2 from shared memory)? Also, what happens with X=1 in P1's buffer when X=2 is written to shared memory?
Loads from P1 that are in program order before P1's store will see X=2 (or some earlier value if they read cache before even P2's store commits).
Loads from P1 that are in program order after its store will see X=1. This continues to be true even after that store commits to cache and P1 is reading it from cache instead of store-forwarding. i.e. P1's store comes after P2's store in the modification order for that location (C/C++ terminology), from the point-of-view of all cores including P1.
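To make that concrete, here is a toy Python model of the scenario from the question. It is only a sketch under simplifying assumptions: the Core class, its method names, and the dict standing in for coherent cache are all invented for illustration, and the model executes strictly in order, so P1's earlier load can only see the initial value 0 here (the "some earlier value" case), not X=2.

```python
from collections import deque

class Core:
    """Toy core: a FIFO store buffer in front of one shared, coherent memory."""

    def __init__(self, memory):
        self.memory = memory          # dict shared by all cores
        self.store_buffer = deque()   # pending (address, value) pairs, oldest first

    def store(self, addr, value):
        # The store is only queued; other cores cannot see it yet.
        self.store_buffer.append((addr, value))

    def load(self, addr):
        # Store forwarding: the youngest pending store to addr wins over memory.
        for pending_addr, value in reversed(self.store_buffer):
            if pending_addr == addr:
                return value
        return self.memory[addr]

    def commit_oldest(self):
        # Hypothetical stand-in for "get exclusive ownership, write to L1d".
        addr, value = self.store_buffer.popleft()
        self.memory[addr] = value

memory = {"X": 0}
p1, p2 = Core(memory), Core(memory)

before = p1.load("X")     # in program order before P1's store
p1.store("X", 1)          # pending in P1's buffer
p2.store("X", 2)          # pending in P2's buffer
p2.commit_oldest()        # X=2 reaches shared memory first
p1_mid = p1.load("X")     # forwarded from P1's own buffer
p2_mid = p2.load("X")     # P2's buffer is empty now, so this reads memory
p1.commit_oldest()        # X=1 was never deleted; it now overwrites X=2
final = (p1.load("X"), p2.load("X"))

print(before, p1_mid, p2_mid, final)   # 0 1 2 (1, 1)
```

The pending X=1 is not dropped when X=2 commits. It just sits in P1's buffer, keeps being forwarded to P1's own loads, and ends up as the later entry in the modification order once it commits.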
Things only get interesting (in terms of effects) when multiple cores have pending stores to the same location. Then two cores can read different values for the same location. Whichever store commits second will be the long-term value for that location, and the other core will start seeing that value, as you'd expect for the core whose store turned out to be first.
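A small sketch of both possible outcomes, again as a toy model rather than real hardware: the run function and the plain lists used as store buffers are made up for this example. Each tuple in the result is what (P1, P2) would load from X at that point: before any commit, after the first commit, and after the second.

```python
def run(commit_order):
    """Return what (P1, P2) load from X before and after each commit."""
    memory = {"X": 0}
    # Each core starts with one pending store to the same location.
    buffers = {"P1": [("X", 1)], "P2": [("X", 2)]}

    def load(core, addr):
        # A core checks its own buffer (youngest entry first), then memory.
        for pending_addr, value in reversed(buffers[core]):
            if pending_addr == addr:
                return value
        return memory[addr]

    def commit(core):
        # The oldest pending store of this core becomes globally visible.
        addr, value = buffers[core].pop(0)
        memory[addr] = value

    trace = [(load("P1", "X"), load("P2", "X"))]
    for core in commit_order:
        commit(core)
        trace.append((load("P1", "X"), load("P2", "X")))
    return trace

print(run(["P2", "P1"]))   # [(1, 2), (1, 2), (1, 1)]
print(run(["P1", "P2"]))   # [(1, 2), (1, 2), (2, 2)]
```

In both orders the two cores disagree for a while, and the disagreement only ends when the second store commits. At that point the core whose store committed first switches over to the other core's value.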
It only really gets weird when you also have stores to multiple locations, since then store-forwarding effects aren't just StoreLoad reordering. You can have multiple cores observing their own stores early, before they become globally visible.
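One way to picture the multi-location case is a toy trace where each core stores to its own variable, reloads it, and then loads the other core's variable. The names here (x, y, r1 to r4, and the helper functions) are invented for the sketch, and the interleaving is hand-picked to show one outcome that x86 permits, not the only one.

```python
from collections import deque

memory = {"x": 0, "y": 0}                     # shared, coherent memory
buffers = {"P1": deque(), "P2": deque()}      # per-core FIFO store buffers

def store(core, addr, value):
    buffers[core].append((addr, value))       # queued, not globally visible

def load(core, addr):
    # Store forwarding from this core's own buffer, else read shared memory.
    for pending_addr, value in reversed(buffers[core]):
        if pending_addr == addr:
            return value
    return memory[addr]

def drain(core):
    # Commit every pending store of this core in program order.
    while buffers[core]:
        addr, value = buffers[core].popleft()
        memory[addr] = value

store("P1", "x", 1)        # P1: x = 1
store("P2", "y", 1)        # P2: y = 1
r1 = load("P1", "x")       # P1 sees its own store early (forwarded)
r2 = load("P1", "y")       # P2's store is still invisible to P1
r3 = load("P2", "y")       # P2 sees its own store early (forwarded)
r4 = load("P2", "x")       # P1's store is still invisible to P2
drain("P1")
drain("P2")

print((r1, r2, r3, r4), memory)   # (1, 0, 1, 0) {'x': 1, 'y': 1}
```

Taken at face value, P1 observed "x before y" and P2 observed "y before x". That doesn't contradict a single total order of stores, because each core's view of its own store came from store-forwarding, before that store had a position in the global order.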
Related: Globally Invisible load instructions discusses the weird effects produced by store-forwarding. A core sees its own stores before they become globally visible; as long as the store is still pending in the store buffer, it's the latest store to that location from this core's PoV. Once it eventually commits (after getting exclusive ownership of the cache line via MESI read-for-ownership), it's part of the global total order of stores since caches are coherent.
See also Can x86 reorder a narrow store with a wider load that fully contains it? for a tricky but illustrative case.