Tags: c++, assembly, x86, memory-barriers, stdatomic

Is fail ordering relevant for x86 atomic operations?


Consider the definition of atomic_compare_exchange_strong_explicit:

_Bool atomic_compare_exchange_strong_explicit( volatile A* obj,
                                               C* expected, C desired,
                                               memory_order succ,
                                               memory_order fail );

I'm confused by the use-case of memory_order fail. Considering x86, it's perfectly clear that lock cmpxchg may fail, but it's also clearly defined that (emphasis mine)

Reads or writes cannot be reordered with I/O instructions, *locked instructions*, or serializing instructions

which makes the memory_order fail parameter seem irrelevant, since lock guarantees sequential consistency in any case.

Example:

#include <stdatomic.h>

void fail_seqcst(volatile atomic_int *p, int *expected, int desired){
    atomic_compare_exchange_strong_explicit(p, expected, desired, memory_order_release, memory_order_seq_cst);
}

void fail_relaxed(volatile atomic_int *p, int *expected, int desired){
    atomic_compare_exchange_strong_explicit(p, expected, desired, memory_order_release, memory_order_relaxed);
}

Compiles to:

fail_relaxed:
        mov       ecx, edx                                      
        mov       eax, DWORD PTR [rsi]                          
        lock      
        cmpxchg   DWORD PTR [rdi], ecx                          
        mov       DWORD PTR [rsi], eax                          
        sete      al                                            
        movzx     eax, al                                       
        ret 

fail_seqcst:
        mov       ecx, edx                                      
        mov       eax, DWORD PTR [rsi]                          
        lock      
        cmpxchg   DWORD PTR [rdi], ecx                          
        mov       DWORD PTR [rsi], eax                          
        sete      al                                            
        movzx     eax, al                                       
        ret                                                     

Godbolt

Is there any optimization the compiler may do which would differentiate the code for memory_order_relaxed and memory_order_seq_cst in such a case on x86? Or maybe there's an architecture out there which would make this difference significant?


Solution

  • The fail order is only relevant in the generated asm on non-x86 ISAs, where separate barrier instructions are often needed and can be placed so they run only on the success path of execution, as in the sketch below.
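
    A minimal conceptual sketch of that placement (my own illustration, not real compiler output): on a weakly-ordered ISA, a CAS with acquire success ordering and relaxed failure ordering can be thought of as a fully relaxed CAS plus an acquire fence that runs only when the exchange actually happened:

        #include <stdatomic.h>

        /* Sketch: the acquire barrier sits only on the success path;
           the failure path pays nothing for it. */
        _Bool cas_acq_success_relaxed_fail(atomic_int *p, int *expected, int desired)
        {
            _Bool ok = atomic_compare_exchange_strong_explicit(
                p, expected, desired,
                memory_order_relaxed, memory_order_relaxed);
            if (ok)
                atomic_thread_fence(memory_order_acquire); /* skipped on failure */
            return ok;
        }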

    The failure order can also allow compile-time reordering, although current x86 compilers might not do it. In code that branches on a CAS attempt, a relaxed load (or non-atomic read) before the CAS could be moved into the failure path if its result is only used there. But only if the CAS failure order isn't a release operation, i.e. it's only acquire or relaxed. (A .load(acquire) or stronger can't reorder this way, since it would happen-before the success path too, and before the CAS itself, perhaps ruling out some values.)
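
    For example (a hypothetical function of mine, just to make the reordering concrete), a compiler would in principle be allowed to sink the relaxed load of other below the lock cmpxchg, into the failure branch only:

        #include <stdatomic.h>

        /* r is only consumed on the failure path, and the failure order
           is relaxed (not release), so sinking the load past the CAS into
           that branch would be legal, even if compilers don't do it yet. */
        int cas_or_fallback(atomic_int *p, atomic_int *other,
                            int *expected, int desired)
        {
            int r = atomic_load_explicit(other, memory_order_relaxed);
            if (atomic_compare_exchange_strong_explicit(
                    p, expected, desired,
                    memory_order_release, memory_order_relaxed))
                return desired;   /* success path: r is dead here */
            return r;             /* failure path: the only use of r */
        }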


    All possible choices of memory_order for CAS_weak and CAS_strong generate the same x86 asm as atomic_compare_exchange_strong(ptr, &expected, desired), which is seq_cst: a lock cmpxchg.
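
    For instance, even the weakest possible form (a snippet of mine; GCC and Clang for x86-64 compile it to the same lock cmpxchg sequence as the functions in the question):

        #include <stdatomic.h>

        /* Weak CAS, relaxed/relaxed: still lock cmpxchg on x86-64. */
        _Bool cas_weak_relaxed(volatile atomic_int *p, int *expected, int desired)
        {
            return atomic_compare_exchange_weak_explicit(
                p, expected, desired,
                memory_order_relaxed, memory_order_relaxed);
        }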

    Whether it succeeds or fails, lock cmpxchg is a full barrier like atomic_thread_fence(seq_cst). That's unlike a seq_cst load, store, or RMW on a weakly-ordered ISA such as AArch64, where a SC RMW wouldn't prevent a later relaxed store from reordering with an earlier relaxed store: the later one can reorder with the store side of the RMW, and the earlier one with the load side of the RMW. Operations are different from fences in ISO C++, and on some real ISAs. So lock cmpxchg is much stronger than ISO C++ requires for a seq_cst operation, as strong as a full fence.
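
    To make the operation-vs.-fence distinction concrete (my own litmus-style sketch): in the first function below, the two relaxed stores may become visible out of order on AArch64 despite the seq_cst RMW between them; in the second, the full fence forbids that:

        #include <stdatomic.h>

        atomic_int a, b, rmw_target;

        void stores_around_rmw(void)
        {
            atomic_store_explicit(&a, 1, memory_order_relaxed);
            /* A seq_cst *operation*: on AArch64, b = 1 can reorder with
               the store side of the RMW and a = 1 with its load side,
               so b = 1 may become globally visible before a = 1. */
            atomic_fetch_add_explicit(&rmw_target, 1, memory_order_seq_cst);
            atomic_store_explicit(&b, 1, memory_order_relaxed);
        }

        void stores_around_fence(void)
        {
            atomic_store_explicit(&a, 1, memory_order_relaxed);
            /* A full *fence*: b = 1 can never pass a = 1. */
            atomic_thread_fence(memory_order_seq_cst);
            atomic_store_explicit(&b, 1, memory_order_relaxed);
        }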

    And x86 lock cmpxchg even dirties the cache line on failure (again unlike an LL/SC implementation, which branches over the store attempt if the compare is false). Although even the LL might try to get the cache line into exclusively-owned state, so it might not be much better for contention, even if it avoids a later write-back; I'm not sure how this works.

    As far as the CPU is concerned, lock cmpxchg just does different ALU and flag-setting work than always-succeeding operations like lock add or xchg. These are all full barriers, and in fact a dummy lock or dword [rsp], 0 or similar is how most compilers implement atomic_thread_fence(seq_cst), because it's faster than mfence.
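
    For example (another snippet of mine), recent GCC for x86-64 compiles this standalone fence to exactly such a dummy locked RMW:

        #include <stdatomic.h>

        void full_fence(void)
        {
            /* Recent GCC emits roughly:
                   lock or DWORD PTR [rsp], 0
                   ret
               instead of mfence, which is slower on most CPUs. */
            atomic_thread_fence(memory_order_seq_cst);
        }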

    Intel even documents it as always storing, so the processor never produces a locked read without also producing a locked write, although the same sentence talks about "the interface to the processor's bus". Aligned locked operations on normal (cacheable) memory regions just take a cache lock (delaying the MESI response), not anything that's actually visible on a bus outside the CPU core. So that's probably not really meaningful documentation anymore, more of an as-if thing since P6 (Pentium Pro) in 1995 at least.