Consider the definition of `atomic_compare_exchange_strong_explicit`:

```c
_Bool atomic_compare_exchange_strong_explicit( volatile A* obj,
                                               C* expected, C desired,
                                               memory_order succ,
                                               memory_order fail );
```
I'm confused by the use-case of `memory_order fail`. Considering x86, it's perfectly clear that `lock cmpxchg` may fail, but it's also clearly defined that (emphasis mine):

> Reads or writes cannot be reordered with I/O instructions, **locked instructions**, or serializing instructions

which makes `memory_order fail` kind of irrelevant, since `lock` guarantees sequential consistency in any case.
Example:
```c
#include <stdatomic.h>

void fail_seqcst(volatile atomic_int *p, int *expected, int desired){
    atomic_compare_exchange_strong_explicit(p, expected, desired, memory_order_release, memory_order_seq_cst);
}

void fail_relaxed(volatile atomic_int *p, int *expected, int desired){
    atomic_compare_exchange_strong_explicit(p, expected, desired, memory_order_release, memory_order_relaxed);
}
```
Compiles to:
```asm
fail_relaxed:
        mov     ecx, edx
        mov     eax, DWORD PTR [rsi]
        lock cmpxchg DWORD PTR [rdi], ecx
        mov     DWORD PTR [rsi], eax
        sete    al
        movzx   eax, al
        ret
fail_seqcst:
        mov     ecx, edx
        mov     eax, DWORD PTR [rsi]
        lock cmpxchg DWORD PTR [rdi], ecx
        mov     DWORD PTR [rsi], eax
        sete    al
        movzx   eax, al
        ret
```
Is there any optimization the compiler may do that would differentiate the code for `memory_order_relaxed` and `memory_order_seq_cst` in such a case on x86? Or is there an architecture out there where this difference would be significant?
The failure order is only relevant in the asm on non-x86 ISAs, where separate barrier instructions are often needed and can be placed so they execute only on the success path.
It can also allow compile-time reordering, although current x86 compilers might not do it. In code that branches on a CAS attempt, a relaxed load (or non-atomic read) before the CAS could be moved into the failure path if its result is only used there. That's possible because the failure order of a CAS can't be a release operation, only acquire or relaxed. (And only a relaxed or non-atomic load can move this way: a `load(acquire)` or stronger can't, since it would have to happen-before the success path too, and before the CAS itself, perhaps ruling out some values.)
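A sketch of what that could look like (invented names; as noted, no mainstream compiler is known to actually perform this transformation today). Here `hint` is only used on failure, so with a relaxed failure order the compiler would be allowed to sink the load below the `lock cmpxchg`, into the failure branch:

```c
#include <stdatomic.h>

// Hypothetical example: the relaxed load of *data is only used when the
// CAS fails. With memory_order_relaxed as the failure order, a compiler
// could legally delay the load into the failure branch; if this were an
// acquire (or stronger) load, it couldn't move past the CAS.
int try_claim(_Atomic int *flag, _Atomic int *data, int desired)
{
    int expected = 0;
    int hint = atomic_load_explicit(data, memory_order_relaxed);
    if (atomic_compare_exchange_strong_explicit(flag, &expected, desired,
                                                memory_order_release,
                                                memory_order_relaxed))
        return 0;       // success path: hint is dead here
    return hint;        // failure path: the only use of hint
}
```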
All possible choices of `memory_order` for CAS_weak and CAS_strong generate the same x86 asm as a seq_cst `atomic_compare_exchange_strong(ptr, &expected, desired)`: a `lock cmpxchg`.
Whether it succeeds or fails, `lock cmpxchg` is a full barrier like `atomic_thread_fence(seq_cst)`. That's unlike a seq_cst load, store, or RMW on a weakly-ordered ISA such as AArch64, where an SC RMW wouldn't prevent a later relaxed store from reordering with an earlier relaxed store: the later store can reorder with the store side of the RMW, and the earlier one with the load side. Operations are different from fences in ISO C++, and on some real ISAs. So `lock cmpxchg` is much stronger than ISO C++ requires for a seq_cst operation, as strong as a full fence.
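To illustrate that AArch64 point, a sketch (invented variable names): with an `ldaxr`/`stlxr` implementation of the seq_cst RMW, another thread could observe `b == 1` while `a == 0`, which a full fence between the stores would forbid.

```c
#include <stdatomic.h>

_Atomic int a, b, counter;

// On x86, the locked RMW keeps the two relaxed stores ordered, like a full
// fence would. On AArch64 with an ldaxr/stlxr retry loop, the store to b
// can hoist above the release-store half of the RMW, and the store to a
// can sink below its acquire-load half, so the two stores can swap order.
void two_stores(void)
{
    atomic_store_explicit(&a, 1, memory_order_relaxed);
    atomic_fetch_add_explicit(&counter, 1, memory_order_seq_cst);
    atomic_store_explicit(&b, 1, memory_order_relaxed);
}
```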
And x86 `lock cmpxchg` even dirties the cache line on failure (again unlike an LL/SC implementation, which branches over the store attempt if the compare is false). Although even the LL might try to get the cache line into exclusively-owned state, so it might not be much better for contention, even if it avoids the later write-back. I'm not sure how this works.
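A common mitigation for that cost, sketched below with a hypothetical helper name, is the test-and-test-and-set idiom: check with a plain relaxed load first, so attempts that would fail anyway only need the cache line in shared state.

```c
#include <stdatomic.h>
#include <stdbool.h>

// Sketch of test-and-test-and-set (hypothetical helper): only run the
// lock cmpxchg, which needs the line in exclusive/modified state, when
// a relaxed load says the CAS could plausibly succeed.
bool try_lock(_Atomic int *lock)
{
    if (atomic_load_explicit(lock, memory_order_relaxed) != 0)
        return false;   // contended: read-only access, line stays shared
    int expected = 0;
    return atomic_compare_exchange_strong_explicit(lock, &expected, 1,
                                                   memory_order_acquire,
                                                   memory_order_relaxed);
}
```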
As far as the CPU is concerned, `lock cmpxchg` is just different ALU and flag-setting vs. always-succeeding operations like `lock add` or `xchg`. These are all full barriers, and in fact a dummy `lock or dword [rsp], 0` or similar is how most compilers implement `atomic_thread_fence(seq_cst)`, because it's faster than `mfence`.
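For example, a trivial function like this (just for looking at compiler output on a compiler explorer) is where you'd see that dummy locked RMW instead of `mfence` from compilers that have made the switch:

```c
#include <stdatomic.h>

// Depending on compiler and version, on x86-64 this compiles to either
// mfence or a dummy locked RMW such as `lock or dword ptr [rsp], 0`.
void full_fence(void)
{
    atomic_thread_fence(memory_order_seq_cst);
}
```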
Intel even documents it as always storing, so the processor never produces a locked read without also producing a locked write, although the same sentence talks about "the interface to the processor's bus". Aligned `lock`ed operations on normal (cacheable) memory regions just take a cache lock (delaying the MESI response), not anything that's actually visible on a bus outside the CPU core. So that's probably not really meaningful documentation anymore, more of an as-if thing, since P6 (Pentium Pro) in 1995 at least.