I'm reading A Primer on Memory Consistency and Cache Coherence, Second Edition by Vijay Nagarajan et al., and I'm somewhat confused by their formal definition of TSO. The definition has three parts. The first requires that memory order respect program order, except for a store followed by a load. The third defines fences as operations that memory operations can't be reordered across. And finally, the second - which is concerned with the values obtained by loads - says:
Every load gets its value from the last store before it to the same address:
• Value of L(a) = Value of MAX_<m { S(a) | S(a) <m L(a) or S(a) <p L(a) }
This last mind-bending equation says that the value of a load is the value of the last store to the same address that is either (a) before it in memory order or (b) before it in program order (but possibly after it in memory order), with option (b) taking precedence (i.e., write buffer bypassing overrides the rest of the memory system).
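For instance, as I read it (my own worked example, not from the book): suppose core C1 executes S1: x=1 and then L1: r=x, and another core's store S2: x=2 lands between S1 and L1 in memory order. Both S1 and S2 are in the candidate set, and taking the MAX with respect to <m picks S2, so r=2. But if S1 is still in C1's write buffer when L1 executes - so S1 <p L1 but L1 <m S1 - then S1 is still in the set, is the latest candidate in memory order, and wins, so r=1. That seems to be the "(b) takes precedence" case.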
Perhaps I'm misreading it, but the phrasing seems to imply that a store from the same core will always take precedence over a store from another core, no matter how far in time the store and subsequent (in program order) load are separated.
As far as I understand it, this isn't quite right, as (b) will only take precedence over (a) for the duration of the store being in the write buffer. In other words, from the programmer's side of the TSO memory consistency contract, all you can say is that the load will get the value of the directly preceding store (to the same location) with respect to either program order or memory order, and you must account for both possibilities. Is this right?
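To make the two possibilities concrete, here's a minimal litmus-test sketch of my own (not from the book), assuming x86-64, where relaxed atomics compile to plain loads and stores so the hardware's TSO rules decide the outcome:

    #include <atomic>
    #include <thread>

    std::atomic<int> x{0};
    int r0;

    void core0() {
        x.store(1, std::memory_order_relaxed);  // may sit in core 0's store buffer for a while
        r0 = x.load(std::memory_order_relaxed); // candidates: core 0's own store (program order),
                                                // and core 1's store if it's already earlier
                                                // in memory order
    }

    void core1() {
        x.store(2, std::memory_order_relaxed);
    }

    int main() {
        std::thread t0(core0), t1(core1);
        t0.join();
        t1.join();
        // Under TSO, r0 can be 1 (own store forwarded from the write buffer, or
        // simply last in memory order) or 2 (own store already drained, and core 1's
        // store landed after it but before the load in memory order). It can never
        // be 0, because core 0's own store precedes the load in program order.
        return 0;
    }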
SMALL ASIDE: Are TSO and TSO-x64 the same? They mostly seem to be used interchangeably, but at times their usage seems to imply some unstated difference.
(b) will only take precedence over (a) for the duration of the store being in the write buffer.
Correct.
The store buffer drains itself to L1d as fast as it can. Once that happens, the store is globally visible and becomes part of the total order of stores, and it's no longer special for the core that did it: snooping the store buffer won't find it anymore for store-to-load forwarding.
A core sees its own stores as "most recent" until they commit, for each byte separately.
Slow-path store forwarding, aka a store-forwarding stall, has to merge data from cache with data from the store buffer for a load that only partially overlaps a recent store, or merge data from multiple store-buffer entries for a wide load that overlaps multiple recent narrow stores. (There's reasonable evidence, IIRC, that in practice CPUs don't just stall until the store buffer drains to L1d cache in that case.) See Globally Invisible load instructions for discussion of the case where that can let a core load a value that was never visible to another core.
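As a rough sketch of that partial-overlap case (my own illustration, not from the linked Q&A; it assumes x86-64, little-endian layout, and uses deliberately loose type punning just to show what the hardware can do):

    #include <atomic>
    #include <cstdint>
    #include <cstring>
    #include <thread>

    alignas(4) std::atomic<uint8_t> buf[4]; // shared; all four bytes start as 0
    uint32_t observed;

    void core_a() {
        buf[0].store(1, std::memory_order_relaxed); // 1-byte store, may still be in A's store buffer
        uint32_t w;
        std::memcpy(&w, buf, sizeof w);             // 4-byte load partially overlapping that store
        // Byte 0 can be forwarded from A's store buffer (the slow path) while
        // bytes 1-3 come from cache, so w may mix A's not-yet-globally-visible
        // byte with a byte core B has already made visible - a combination that
        // no other core could observe in memory at any single point in time.
        observed = w;
    }

    void core_b() {
        buf[2].store(1, std::memory_order_relaxed); // another core writes a different byte
    }

    int main() {
        std::thread a(core_a), b(core_b);
        a.join();
        b.join();
        return 0;
    }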
I don't know the notation the paper is using for its formal definition, but the text of what they say is compatible with this: "... with option (b) taking precedence (i.e., write buffer bypassing overrides the rest of the memory system)" - it's not write-buffer bypassing (aka store forwarding) if the data isn't still in the store buffer.
Presumably their formal model has provisions for stores leaving the L set of locally queued stores and becoming part of S, the total store order, if that's what their symbols stand for.
Are TSO and TSO-x64 the same?
Probably? I don't know if SPARC's TSO has any differences from x86 in what's formally allowed or what CPUs can actually do in practice.