Why is `std::atomic`'s `store`:
std::atomic<int> my_atomic;
my_atomic.store(1, std::memory_order_seq_cst);
doing an `xchg` when a store with sequential consistency is requested?
Shouldn't, technically, a normal store with a read/write memory barrier be enough? Equivalent to:
_ReadWriteBarrier(); // Or `asm volatile("" ::: "memory");` for gcc/clang
my_atomic.store(1, std::memory_order_acquire);
I'm explicitly talking about x86 & x86_64. Where a store has an implicit acquire fence.
`xchg` or `mov`-store + `mfence` are both valid ways to implement a sequential-consistency store on x86 (when seq_cst loads are implemented with just `mov`). The implicit `lock` prefix on an `xchg` with memory makes it a full memory barrier, like all atomic RMW operations on x86.
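For reference, here's a minimal sketch of the two lowerings for the store in the question (the exact asm is compiler- and tuning-dependent):

```c++
#include <atomic>

std::atomic<int> my_atomic;

void seq_cst_store() {
    // Valid x86 lowerings for this store:
    //   mov   eax, 1
    //   xchg  dword ptr [my_atomic], eax   ; implicit lock prefix = full barrier
    // or
    //   mov   dword ptr [my_atomic], 1
    //   mfence
    my_atomic.store(1, std::memory_order_seq_cst);
}

int seq_cst_load() {
    return my_atomic.load(std::memory_order_seq_cst);  // just a plain mov load on x86
}
```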
Sequential consistency requires that all seq_cst operations happen in some interleaving of program order across threads, which includes blocking StoreLoad reordering between seq_cst operations. (Although not necessarily between a seq_cst store and later non-seq_cst loads/stores: that's an x86 implementation detail, but AArch64 allows it, the same way x86 allows seq_cst loads to reorder with earlier `release` or weaker stores.)
Plain `mov` is not sufficient; it only has release semantics, not sequential-release. See Jeff Preshing's article on acquire / release semantics. Regular stores can reorder with later loads; x86's memory model is what 486 did naturally: program order plus a store buffer with store-forwarding. So we get `release` and `acquire` for free, but not `seq_cst`. (By "regular", I mean anything that isn't part of a `lock`ed instruction or `xchg`, like `mov`, or a memory-source or memory-destination `add` for example, which is like `x.store(x.load(acquire) + src, release)`.)
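To see the difference concretely, here's the classic StoreLoad litmus test as a minimal sketch: with only release stores and acquire loads (plain `mov` on x86), both threads can read 0, an outcome that sequential consistency forbids.

```c++
#include <atomic>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1, r2;

void t1() {
    x.store(1, std::memory_order_release);   // plain mov store on x86
    r1 = y.load(std::memory_order_acquire);  // plain mov load; can take its value before the store commits
}

void t2() {
    y.store(1, std::memory_order_release);
    r2 = x.load(std::memory_order_acquire);
}

int main() {
    std::thread a(t1), b(t2);
    a.join(); b.join();
    // r1 == 0 && r2 == 0 is allowed here (StoreLoad reordering),
    // but would be impossible if all four operations were seq_cst.
}
```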
If a release-store is releasing a lock, it's ok for later stuff to appear to happen inside the critical section.
Compare AArch64's `stlr` / `ldar` instructions: `ldar` is a seq_cst load that can't reorder with an earlier `stlr`, but can reorder with other earlier stores or non-acquire loads. This is as strong as seq_cst requires but no stronger. This choice was obviously motivated by C++11 having seq_cst as the default memory ordering, and by Java, which only had seq_cst when AArch64 was on the drawing board. AArch64 later introduced `ldapr`, a partially-ordered `acquire` load that can reorder with earlier `stlr` stores. x86's `mov` loads are like `ldapr`, but x86 lacks an equivalent to `ldar`, so seq_cst on x86 needs a full barrier either before seq_cst stores or after seq_cst loads. Keeping loads cheap is generally best. (AArch64's normal `ldr`/`str` are much weaker: relaxed, not acquire / release.)
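As a rough reference, here's how typical compilers map C++ atomics onto those AArch64 instructions (a sketch; details depend on compiler version and target `-march` level):

```c++
#include <atomic>

std::atomic<int> a;

int load_seq_cst() { return a.load(std::memory_order_seq_cst); }   // typically ldar
int load_acquire() { return a.load(std::memory_order_acquire); }   // ldar, or ldapr when targeting ARMv8.3+ (RCpc)
int load_relaxed() { return a.load(std::memory_order_relaxed); }   // plain ldr

void store_seq_cst(int v) { a.store(v, std::memory_order_seq_cst); }  // stlr, no extra barrier needed
void store_release(int v) { a.store(v, std::memory_order_release); }  // also stlr
```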
Why are x86's `lock`ed instructions full barriers?
RMW atomicity only requires that no other thread can modify this cache line between the load and the store. x86 could have allowed the load side of an atomic RMW to reorder with earlier stores, but instead chose to make `lock`ed operations fully drain the store buffer before even their load can happen. It does make sense to wait until we're almost ready to commit the store side before taking exclusive ownership of the cache line, though; during that time, other cores can't even load anything in the whole line.
My understanding is that internally (on modern x86 CPUs), a `lock`ed instruction includes a load micro-op that locks the cache line (requiring MESI Exclusive or Modified state, and preventing the cache from responding to requests from other cores to share or invalidate it), followed by a store-unlock.
The current SMP memory model comes from 486, which probably still used an external bus-lock instead of just a cache-lock for aligned cacheable atomic RMWs. Taking a bus-lock early would be even more disastrous, blocking other cores from accessing any memory.
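For concreteness, here's what an atomic RMW looks like in C++ and the `lock`-prefixed instructions it typically becomes on x86 (a sketch; exact codegen varies by compiler):

```c++
#include <atomic>

std::atomic<int> counter{0};

void bump() {
    // Result unused: typically compiles to  lock add dword ptr [counter], 1
    counter.fetch_add(1, std::memory_order_relaxed);
}

int bump_and_get() {
    // Old value needed: typically compiles to  lock xadd
    return counter.fetch_add(1, std::memory_order_relaxed);
}
// Note: on x86 the lock prefix makes these full barriers even though
// the source only asks for relaxed ordering.
```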
There are performance differences between `mfence` and `xchg` on different CPUs, and maybe in the hot vs. cold cache and contended vs. uncontended cases, and/or for throughput of many operations back-to-back in the same thread vs. for one on its own, and for allowing surrounding code to overlap execution with the atomic operation.
See https://shipilev.net/blog/2014/on-the-fence-with-dependencies for actual benchmarks of `mfence` vs. `lock addl $0, -8(%rsp)` vs. `(%rsp)` as a full barrier. (When you need a barrier that's not part of a store via `xchg`, a dummy `lock`ed operation is still better than `mfence`.)
On Intel Skylake hardware, `mfence` blocks out-of-order execution of independent ALU instructions, but `xchg` doesn't. (See my test asm + results at the bottom of this SO answer.) Intel's manuals don't require it to be that strong; only `lfence` is documented to do that. But as an implementation detail, it's very expensive for out-of-order execution of surrounding code on Skylake.
I haven't tested other CPUs, and this may be a result of a microcode fix for erratum SKL079, "MOVNTDQA From WC Memory May Pass Earlier MFENCE Instructions". The existence of the erratum basically proves that SKL used to be able to execute instructions after MFENCE. I wouldn't be surprised if they fixed it by making MFENCE stronger in microcode, a blunt-instrument approach that significantly increases the impact on surrounding code.
I've only tested the single-threaded case where the cache line is hot in L1d cache (not when it's cold in memory, or when it's in Modified state on another core). `xchg` has to load the previous value, creating a "false" dependency on the old value that was in memory. But `mfence` forces the CPU to wait until previous stores commit to L1d, which also requires the cache line to arrive (and be in M state). So they're probably about equal in that respect, but Intel's `mfence` forces everything to wait, not just loads.
AMD's optimization manual recommends `xchg` for atomic seq-cst stores. I thought Intel recommended `mov` + `mfence`, which older gcc uses, but Intel's compiler also uses `xchg` here.
When I tested, I got better throughput on Skylake for `xchg` than for `mov`+`mfence` in a single-threaded loop storing to the same location repeatedly. See Agner Fog's microarch guide and instruction tables for some details, but he doesn't spend much time on `lock`ed operations.
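A rough sketch of the kind of single-threaded throughput loop being described (a hypothetical harness, not the exact test that was run):

```c++
#include <atomic>
#include <chrono>
#include <cstdio>

std::atomic<int> target{0};

int main() {
    constexpr long iters = 100000000;
    auto t0 = std::chrono::steady_clock::now();
    for (long i = 0; i < iters; ++i) {
        // Compiles to xchg, or mov+mfence, depending on compiler/tuning.
        target.store(1, std::memory_order_seq_cst);
    }
    auto t1 = std::chrono::steady_clock::now();
    std::printf("%.2f ns per seq_cst store\n",
                std::chrono::duration<double, std::nano>(t1 - t0).count() / iters);
}
```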
See gcc/clang/ICC/MSVC output on the Godbolt compiler explorer for a C++11 seq-cst `my_atomic = 4;`. gcc uses `mov` + `mfence` when SSE2 is available (use `-m32 -mno-sse2` to get gcc to use `xchg` too). The other 3 compilers all prefer `xchg` with default tuning, or for `znver1` (Ryzen) or `skylake`.
The Linux kernel uses `xchg` for `__smp_store_mb()`.
Update: recent GCC (like GCC10) changed to using `xchg` for seq-cst stores like other compilers do, even when SSE2 for `mfence` is available.
Another interesting question is how to compile `atomic_thread_fence(mo_seq_cst);`. The obvious option is `mfence`, but `lock or dword [rsp], 0` is another valid option (and is used by `gcc -m32` when MFENCE isn't available). The bottom of the stack is usually already hot in cache in M state. The downside is introducing latency if a local was stored there. (If it's just a return address, return-address prediction is usually very good, so delaying `ret`'s ability to read it is not much of a problem.) So `lock or dword [rsp-4], 0` could be worth considering in some cases. (gcc did consider it, but reverted it because it makes valgrind unhappy. That was before it was known that it might be better than `mfence` even when `mfence` was available.)
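To make the mapping concrete, a minimal sketch of the stand-alone fence (actual codegen depends on compiler version and flags):

```c++
#include <atomic>

void full_barrier() {
    // Stand-alone seq_cst fence. Compilers typically emit either
    //   mfence
    // or a dummy locked RMW on the stack, e.g.
    //   lock or dword ptr [rsp], 0
    std::atomic_thread_fence(std::memory_order_seq_cst);
}
```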
All compilers currently use `mfence` for a stand-alone barrier when it's available. Those are rare in C++11 code, but more research is needed on what's actually most efficient for real multi-threaded code that has real work going on inside the threads that are communicating locklessly.
But multiple sources recommend using `lock add` to the stack as a barrier instead of `mfence`, so the Linux kernel recently switched to using it for the `smp_mb()` implementation on x86, even when SSE2 is available.
See https://groups.google.com/d/msg/fa.linux.kernel/hNOoIZc6I9E/pVO3hB5ABAAJ for some discussion, including a mention of some errata for HSW/BDW about `movntdqa` loads from WC memory passing earlier `lock`ed instructions. (The opposite of Skylake, where it was `mfence` rather than `lock`ed instructions that was a problem. But unlike SKL, there's no fix in microcode. This may be why Linux still uses `mfence` for its `mb()` for drivers, in case anything ever uses NT loads to copy back from video RAM or something but can't let the reads happen until after an earlier store is visible.)
In Linux 4.14, `smp_mb()` uses `mb()`, which uses `mfence` if available, otherwise `lock addl $0, 0(%esp)`.
`__smp_store_mb` (store + memory barrier) uses `xchg` (and that doesn't change in later kernels).
In Linux 4.15, `smp_mb()` uses `lock; addl $0,-4(%esp)` or `%rsp`, instead of using `mb()`. (The kernel doesn't use a red-zone even in 64-bit, so the `-4` may help avoid extra latency for local vars.)
`mb()` is used by drivers to order access to MMIO regions, but `smp_mb()` turns into a no-op when compiled for a uniprocessor system. Changing `mb()` is riskier because it's harder to test (it affects drivers), and CPUs have errata related to `lock` vs. `mfence`. But anyway, `mb()` uses `mfence` if available, else `lock addl $0, -4(%esp)`. The only change is the `-4`.
In Linux 4.16, no change except removing the `#if defined(CONFIG_X86_PPRO_FENCE)`, which defined stuff for a more weakly-ordered memory model than the x86-TSO model that modern hardware implements.
x86 & x86_64. Where a store has an implicit acquire fence
You mean release, I hope. my_atomic.store(1, std::memory_order_acquire);
won't compile, because write-only atomic operations can't be acquire operations. See also Jeff Preshing's article on acquire/release semantics.
> Or `asm volatile("" ::: "memory");`
No, that's a compiler barrier only; it prevents all compile-time reordering across it, but doesn't prevent runtime StoreLoad reordering, i.e. the store being buffered until later, and not appearing in the global order until after a later load. (StoreLoad is the only kind of runtime reordering x86 allows.)
Anyway, another way to express what you want here is:
my_atomic.store(1, std::memory_order_release); // mov
// with no operations in between, there's nothing for the release-store to be delayed past
std::atomic_thread_fence(std::memory_order_seq_cst); // mfence
Using a release fence would not be strong enough (it and the release-store could both be delayed past a later load, which is the same thing as saying that release fences don't keep later loads from happening early). An acquire or acq_rel fence after the store wouldn't do the trick either: acquire fences only keep earlier loads from reordering with later operations, so neither flavour stops the store itself from being delayed past a later load. Only a seq_cst fence blocks that StoreLoad reordering.
Related: Jeff Preshing's article on fences being different from release operations.
But note that seq-cst is special according to C++11 rules: only seq-cst operations are guaranteed to have a single global / total order which all threads agree on seeing. So emulating them with weaker ordering + fences might not be exactly equivalent in general on the C++ abstract machine, even if it is on x86. (On x86, all stores have a single total order which all cores agree on. See also Globally Invisible load instructions: loads can take their data from the store buffer, so we can't really say that there's a total order for loads + stores.)
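The classic IRIW (independent reads of independent writes) litmus test illustrates that single-total-order point; a sketch, not tied to any particular hardware:

```c++
#include <atomic>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1, r2, r3, r4;

// With seq_cst everywhere, the outcome r1==1, r2==0, r3==1, r4==0 is forbidden:
// both readers must agree on the order of the two independent stores.
// With only acquire loads and release stores, the C++ model allows the
// two readers to disagree about which store happened first.
void writer_x()  { x.store(1, std::memory_order_seq_cst); }
void writer_y()  { y.store(1, std::memory_order_seq_cst); }
void reader_xy() { r1 = x.load(std::memory_order_seq_cst);
                   r2 = y.load(std::memory_order_seq_cst); }
void reader_yx() { r3 = y.load(std::memory_order_seq_cst);
                   r4 = x.load(std::memory_order_seq_cst); }

int main() {
    std::thread t1(writer_x), t2(writer_y), t3(reader_xy), t4(reader_yx);
    t1.join(); t2.join(); t3.join(); t4.join();
}
```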