multithreading, c++11, c11, memory-barriers, happens-before

The sequential consistent order of C++11 vs traditional GCC built-ins like `__sync_synchronize`


So I've come across Jeff Preshing's wonderful blog posts on what Acquire/Release semantics are and how they may be achieved with various CPU barriers.

I've also read that SeqCst is about a single total order that's guaranteed to be consistent with the coherence-ordered-before relation - though at times it might contradict the happens-before relation established by plain Acquire/Release operations, for historical reasons.

My question is, how do the old GCC built-ins map onto the memory model introduced by C++11 (and later revisions)? In particular, what does `__sync_synchronize()` map to in C++11 or later modern C/C++?

In the GCC manual this call is simply described as a full memory barrier, which I suppose means all four major kinds of barrier (LoadLoad, LoadStore, StoreLoad and StoreStore) at once. But is `__sync_synchronize` equivalent to `std::atomic_thread_fence(memory_order_seq_cst)`? Or, formally speaking, is one of them stronger than the other? I suppose a SeqCst fence should in general be stronger, since it requires the toolchain/platform to establish a global total order somehow. Perhaps it just happens that most CPUs out there provide a single instruction satisfying both at once (a full memory barrier for `__sync_synchronize`, total sequential ordering for `std::atomic_thread_fence(memory_order_seq_cst)`) - for example x86 `mfence` and PowerPC `hwsync`?

Whether `__sync_synchronize` and `std::atomic_thread_fence(memory_order_seq_cst)` are formally equivalent or merely effectively equivalent (i.e. formally different, but no commercial CPU bothers to distinguish the two), technically speaking a `memory_order_relaxed` load of the same atomic still may not be relied upon to synchronize-with, and thus create a happens-before relation with, either fence, no?

I.e. technically speaking all of these assertions are allowed to fail, right?

// Experiment 1, using C11 `atomic_thread_fence`: assertion is allowed to fail, right?

// global
static atomic_bool lock = false;
static atomic_bool critical_section = false;

// thread 1
atomic_store_explicit(&critical_section, true, memory_order_relaxed);
atomic_thread_fence(memory_order_seq_cst);
atomic_store_explicit(&lock, true, memory_order_relaxed);

// thread 2
if (atomic_load_explicit(&lock, memory_order_relaxed)) {
    // We should really `memory_order_acquire` the `lock`
    // or `atomic_thread_fence(memory_order_acquire)` here,
    // or this assertion may fail, no?
    assert(atomic_load_explicit(&critical_section, memory_order_relaxed));
}

// Experiment 2, using `SeqCst` directly on the atomic store: assertion is allowed to fail, right?

// global
static atomic_bool lock = false;
static atomic_bool critical_section = false;

// thread 1
atomic_store_explicit(&critical_section, true, memory_order_relaxed);
atomic_store_explicit(&lock, true, memory_order_seq_cst);

// thread 2
if (atomic_load_explicit(&lock, memory_order_relaxed)) {
    // Again we should really `memory_order_acquire` the `lock`
    // or `atomic_thread_fence(memory_order_acquire)` here,
    // or this assertion may fail, no?
    assert(atomic_load_explicit(&critical_section, memory_order_relaxed));
}

// Experiment 3, using GCC built-in: assertion is allowed to fail, right?

// global
static atomic_bool lock = false;
static atomic_bool critical_section = false;

// thread 1
atomic_store_explicit(&critical_section, true, memory_order_relaxed);
__sync_synchronize();
atomic_store_explicit(&lock, true, memory_order_relaxed);

// thread 2
if (atomic_load_explicit(&lock, memory_order_relaxed)) {
    // we should somehow put a `LoadLoad` memory barrier here,
    // or the assert might fail, no?
    assert(atomic_load_explicit(&critical_section, memory_order_relaxed));
}

I've tried these snippets on my RPi 5, but I don't see the assertions fail. Yes, that doesn't formally prove anything, but it also doesn't shed any light on the difference between `__sync_synchronize` and `std::atomic_thread_fence(memory_order_seq_cst)`.


Solution

  • Yes, __sync_synchronize() is at least in practice equivalent to std::atomic_thread_fence(memory_order_seq_cst).

    Formally, __sync_synchronize() operates in terms of memory barriers and blocking memory reordering, since it predates the existence of C++11's formal memory model. atomic_thread_fence operates in terms of C++11's memory model; compiling to a full-barrier instruction is an implementation detail.

    So, for example, the standard doesn't require `thread_fence` to do anything in a program with no `std::atomic<>` objects, because its behaviour is only defined in terms of atomics. `__sync_synchronize()` (and, in practice, `thread_fence` as an implementation detail in GCC/clang), on the other hand, could let you hack up synchronization on plain `int` variables. That's UB in C++11, and a bad idea even on a known implementation like GCC; see Who's afraid of a big bad optimizing compiler? regarding the obvious and non-obvious badness (like invented loads) that can happen when you use bare memory barriers, instead of `std::atomic` with `relaxed`, on shared variables to stop the compiler from keeping them in registers.
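    The contrast can be sketched like this (hypothetical names `publish_plain` / `publish_atomic` are mine). The plain-`int` version is a data race, hence UB, in C11/C++11 even though the builtin emits a full barrier; the relaxed-atomic version says the same thing in a well-defined way:

    ```c
    #include <stdatomic.h>

    // BROKEN in portable C11/C++11: `ready_plain` is a non-atomic shared
    // variable, so concurrent access is a data race (UB). The full barrier
    // constrains the hardware, but the compiler may still invent loads or
    // otherwise treat the plain variables as unshared.
    int data_plain, ready_plain;
    void publish_plain(int v) {
        data_plain = v;
        __sync_synchronize();   // full barrier, but the race remains
        ready_plain = 1;
    }

    // Well-defined: relaxed atomics stop the compiler from caching the
    // variables in registers; the fence supplies the ordering.
    _Atomic int data_atomic, ready_atomic;
    void publish_atomic(int v) {
        atomic_store_explicit(&data_atomic, v, memory_order_relaxed);
        atomic_thread_fence(memory_order_release);  // seq_cst also works
        atomic_store_explicit(&ready_atomic, 1, memory_order_relaxed);
    }
    ```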

    But my point is: in practice they work the same, even though they come from different memory models. The `__sync` builtins are defined in terms of barriers against local reordering of accesses to cache-coherent shared memory (i.e. a CPU-architecture view), whereas C++11's `std::atomic` machinery is defined in terms of its own formalism of modification orders and synchronizes-with / happens-before. That formalism allows some outcomes that aren't plausible on a real CPU with cache-coherent shared memory.


    Yes, in your code blocks the assertion is allowed to fail, and could actually fail on a CPU where LoadLoad reordering is possible. (It's probably not observable with both variables in the same cache line.) See C++ atomic variable memory order problem can not reproduce LoadStore reordering example for another case of trying to reproduce memory reordering.
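    One way to make the reader side sound is to pair the writer's fence with an acquire load, as the comments in the question suggest. A runnable sketch with pthreads (the names `writer`/`reader`/`lock_flag` are mine; a release fence would also suffice on the writer side):

    ```c
    #include <assert.h>
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    static atomic_bool lock_flag = false;
    static atomic_bool critical_section = false;

    static void *writer(void *arg) {
        (void)arg;
        atomic_store_explicit(&critical_section, true, memory_order_relaxed);
        atomic_thread_fence(memory_order_seq_cst);  // release fence also works
        atomic_store_explicit(&lock_flag, true, memory_order_relaxed);
        return NULL;
    }

    static void *reader(void *arg) {
        (void)arg;
        // Acquire load: once it reads true, the writer's fence
        // synchronizes-with this load, so the earlier relaxed store to
        // critical_section happens-before the assert below.
        while (!atomic_load_explicit(&lock_flag, memory_order_acquire))
            ;  // spin
        assert(atomic_load_explicit(&critical_section, memory_order_relaxed));
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, writer, NULL);
        pthread_create(&t2, NULL, reader, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        puts("assertion held");
        return 0;
    }
    ```

    Here the guarantee comes from the fence-to-atomic rule: a release (or stronger) fence sequenced before a relaxed store synchronizes-with an acquire operation that reads that store's value.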