Tags: c, gcc, atomic, qemu

Why does QEMU use __atomic_thread_fence() together with barrier()?


QEMU atomic.h has these definitions:

#define smp_mb()                     ({ barrier(); __atomic_thread_fence(__ATOMIC_SEQ_CST); })
#define smp_mb_release()             ({ barrier(); __atomic_thread_fence(__ATOMIC_RELEASE); })
#define smp_mb_acquire()             ({ barrier(); __atomic_thread_fence(__ATOMIC_ACQUIRE); })
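
If I'm reading the header right, the same file also defines the smp_wmb()/smp_rmb() barriers mentioned in the comment below in terms of these:

#define smp_wmb()   smp_mb_release()
#define smp_rmb()   smp_mb_acquire()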

And it has comments explaining why barrier(), a compiler barrier, is necessary:

/* Manual memory barriers
 *
 * __atomic_thread_fence does not include a compiler barrier; instead,
 * the barrier is part of __atomic_load/__atomic_store's "volatile-like"
 * semantics. If smp_wmb() is a no-op, absence of the barrier means that
 * the compiler is free to reorder stores on each side of the barrier.
 * Add one here, and similarly in smp_rmb() and smp_read_barrier_depends().
 */

I haven't used __atomic_thread_fence before, but my searches suggest that it prevents both the compiler and the CPU from reordering memory accesses. For example, its reference pages (here and here) don't say it's only a CPU barrier, and an answer here says explicitly that it's both a compiler barrier and a CPU barrier.

Does that mean barrier() in those definitions is redundant? (I'm just curious)


Solution

  • It's redundant for smp_mb(): __atomic_thread_fence(__ATOMIC_SEQ_CST) doesn't let any operations reorder across it in either direction. But it does no harm, so you might as well leave it in for consistency.

    It's not redundant with the RELEASE or ACQUIRE fences. On paper, even ACQ_REL fences allow reordering of earlier stores with later loads (StoreLoad). So the compiler is allowed to do that reordering at compile time, as well as to not emit any instructions that would stop it from happening at run time.
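
    For illustration, here's a hedged sketch (made-up variable names, pre-C11 Linux-style plain shared variables) of the StoreLoad reordering that a plain release fence still permits, both for the compiler and for the CPU:

    int flag, other_flag;   /* hypothetical shared variables */

    int store_then_load(void) {
        flag = 1;
        __atomic_thread_fence(__ATOMIC_RELEASE); /* doesn't order the store before the later load */
        return other_flag;  /* may effectively be read before flag = 1 becomes visible */
    }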

    But the Linux kernel's definitions of smp_rmb() and smp_wmb() are in terms of asm("..." ::: "memory") GNU C inline asm, which blocks all compile-time reordering.
    Linux's barrier() is defined as asm("" ::: "memory").
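
    For reference, the usual GNU C form of such a compiler barrier (essentially what both Linux's barrier() and QEMU's barrier() expand to):

    /* Emits no instructions; the "memory" clobber just tells the compiler that
     * memory may be read or written here, so it can't reorder or cache memory
     * accesses across this point at compile time. */
    #define barrier() __asm__ __volatile__("" ::: "memory")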


    In practice, GCC probably treats any __atomic_thread_fence as a full compiler barrier; see Does gcc treat relaxed atomic operation as a Compiler-fence? - GCC currently won't even combine increments of the same variable before and after a relaxed operation, but Clang will.
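
    A hedged sketch (made-up names) of the kind of test that question is about: whether the compiler will combine the two plain increments across a relaxed atomic store. Per the above, GCC currently keeps them separate, while Clang is willing to merge them into a single counter += 2:

    #include <stdatomic.h>

    int counter;                /* plain, non-atomic */
    _Atomic int ready;

    void bump_around_relaxed(void) {
        counter++;
        atomic_store_explicit(&ready, 1, memory_order_relaxed); /* imposes no ordering on counter */
        counter++;              /* the standard allows the two increments to be combined */
    }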

    Practical demo of the difference

    int read_twice(int* x) {
        int tmp = *x;
        //barrier();
        __atomic_thread_fence(__ATOMIC_RELEASE); // Doesn't block LoadLoad
        tmp += *x;
        return tmp;
    }
    

    The latest GCC loads twice either way.
    Clang correctly optimizes it to a single load without barrier(), but can't with it. (Godbolt)

    # x86-64 clang 19, NO barrier()
    read_twice(int*):
            mov     eax, dword ptr [rdi]
            add     eax, eax
            ret
    
    # x86-64 clang 19, WITH barrier()
    read_twice_barrier(int*):
            mov     eax, dword ptr [rdi]
            add     eax, dword ptr [rdi]
            ret
    

    Obviously this is a silly example where the barrier makes no sense, but keep in mind that optimizations are possible after inlining small functions.

    Code that would break without barrier() is probably already unsafe, e.g. it's likely using non-atomic (and non-volatile) accesses to shared variables without synchronization. In code that uses fences properly (and/or atomic loads with appropriate memory orders), the optimizations allowed without barrier() are still safe.
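
    For example, a minimal release/acquire publish/consume pair written with C11 atomics (made-up names) gets its ordering from the atomic accesses themselves, so an extra compiler barrier wouldn't change its correctness:

    #include <stdatomic.h>

    int payload;                /* plain data, published via data_ready */
    _Atomic int data_ready;

    void publish(int v) {
        payload = v;
        atomic_store_explicit(&data_ready, 1, memory_order_release);
    }

    int consume(void) {         /* returns -1 if nothing published yet */
        if (atomic_load_explicit(&data_ready, memory_order_acquire))
            return payload;     /* ordered after the acquire load */
        return -1;
    }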

    See also Who's afraid of a big bad optimizing compiler? on the perils of plain accesses to shared data: besides the obvious pitfalls, there can be subtle effects like invented loads, where a temporary is optimized away and the compiler reloads the shared data.
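
    A hedged sketch (made-up names) of such an invented load: with a plain access, the compiler may drop the temporary and re-read the shared variable at each use, so the bounds check and the return can observe different values. This is what READ_ONCE() / volatile accesses are for:

    int shared_val;             /* written concurrently by another thread */

    int clamp_shared(void) {
        int tmp = shared_val;   /* plain load: compiler may reload shared_val instead of keeping tmp */
        if (tmp > 0 && tmp < 10)
            return tmp;         /* could be re-read here and return an out-of-range value */
        return 0;
    }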

    But anyway, for full belt-and-suspenders strict compatibility with the Linux kernel smp_* memory barrier functions, blocking all compile-time reordering across them is correct.

