Two different threads within a single process can share a common memory location by reading and/or writing to it.
Usually, such (intentional) sharing is implemented with atomic operations using the `lock`
prefix on x86, which has fairly well-known costs both for the `lock`
prefix itself (i.e., the uncontended cost) and additional coherence costs when the cache line is actually shared (true or false sharing).
Here I'm interested in producer-consumer costs where a single thread `P`
writes to a memory location, and another thread `C` reads from the memory location, both using plain reads and writes.
What are the latency and throughput of such an operation when performed on separate cores on the same socket, and, in comparison, when performed on sibling hyperthreads on the same physical core, on recent x86 cores?
In the title I'm using the term "hyper-siblings" to refer to two threads running on the two logical threads of the same core, and inter-core siblings to refer to the more usual case of two threads running on different physical cores.
The killer problem is that the core performs speculative reads, which means that each time a write to the speculatively read address (or, more correctly, to the same cache line) arrives before the read is "fulfilled", the CPU must undo the read (at least on x86), which effectively means it cancels all speculative instructions from that instruction onward.
At some point before the read retires it gets "fulfilled", i.e., no earlier instruction can fail and there is no longer any reason to reissue it, and the CPU can act as if it had executed all preceding instructions.
Other core example
These two threads are playing cache ping-pong in addition to cancelling instructions, so one would expect this to be worse than the HT version.
Let's start at some point in the process where the cache line with the shared data has just been marked shared because the Consumer has asked to read it.
So the Consumer can advance during the period between getting the cache line in the shared state and it being invalidated again. It is unclear how many reads can be fulfilled at the same time; most likely 2, as the CPU has 2 read ports. And it probably doesn't need to rerun them once the internal state of the CPU is satisfied that they can't fail in between.
Same core HT
Here the two hyperthreads share the core and must share its resources.
The cache line should stay in the exclusive state the whole time, as the two threads share the cache and therefore don't need the cache-coherence protocol.
Now why does it take so many cycles on the HT core? Let's start with the Consumer just having read the shared value.
So for every read of the shared value the Consumer is reset.
Conclusion
The different-core version apparently advances so much between each cache ping-pong that it performs better than the HT one.
What would have happened if the CPU waited to see if the value had actually changed?
For the test code the HT version would have run much faster, maybe even as fast as the private-write version. The different-core version would not have run faster, as the cache miss was covering the reissue latency.
But if the data had been different, the same problem would arise, except it would be worse for the different-core version, as it would then also have to wait for the cache line and then reissue.
So if the OP can swap some of the roles, letting the timestamp producer read from the shared location and take the performance hit, it would be better.