Consider the following:
```cpp
#include <atomic>

std::atomic<unsigned> var;
unsigned foo;
unsigned bar;

unsigned is_this_a_full_fence() {
    var.store(1, std::memory_order_release);
    var.load(std::memory_order_acquire);  // dummy load, result unused
    bar = 5;
    return foo;
}
```
My thinking is that the dummy load of var should prevent the subsequent accesses to foo and bar from being reordered before the store.
The code seems to create a barrier against reordering, and at least on x86, release and acquire operations require no special fencing instructions.
Is this a valid way to code a full fence (LoadStore/StoreStore/StoreLoad/LoadLoad)? What am I missing?
I think the release creates a LoadStore and StoreStore barrier, and the acquire creates a LoadStore and LoadLoad barrier. Does the dependency between the store and the load of var then supply the missing StoreLoad barrier?
EDIT: change barrier to full fence. Make snippet C++.
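For comparison, here's a sketch of the portable way to spell the genuine full fence I'm trying to emulate (the function name is mine):

```cpp
#include <atomic>

std::atomic<unsigned> var;
unsigned foo;
unsigned bar;

unsigned with_real_fence() {
    var.store(1, std::memory_order_release);
    // A seq_cst fence orders everything before it against everything
    // after it, including StoreLoad.
    std::atomic_thread_fence(std::memory_order_seq_cst);
    bar = 5;
    return foo;
}
```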
One major issue with this code is that the store to and subsequent load from the same memory location are clearly not synchronizing with any other thread. In the C++ memory model, data races are undefined behavior, so the compiler may assume your code is race-free. The only way your load could observe a value different from the one just stored is through a race, so under the C++ memory model the compiler is entitled to assume the load observes the stored value.
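To make the consequence concrete, here's a sketch (function names are mine) of the transformation the compiler is permitted to make, which discards exactly the ordering the dummy load was supposed to provide:

```cpp
#include <atomic>

std::atomic<unsigned> var;
unsigned foo;
unsigned bar;

// The sequence from the question.
unsigned original() {
    var.store(1, std::memory_order_release);
    var.load(std::memory_order_acquire);  // result unused
    bar = 5;
    return foo;
}

// What the compiler may turn it into: absent a (UB) race, the load
// must observe the 1 just stored, so its value is known and the load
// itself is redundant and can be removed.
unsigned transformed() {
    var.store(1, std::memory_order_release);
    bar = 5;
    return foo;
}
```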
This exact atomic code sequence appears in my C++ standards committee paper "No Sane Compiler Would Optimize Atomics" (N4455), under the "Redundant load eliminated" example. There's a longer CppCon version of the paper on YouTube.
Now imagine C++ weren't such a pedant, and the load / store were guaranteed to stay there despite their inherently racy nature. Real-world ISAs offer such guarantees, which C++ doesn't. Acquire / release gives you a happens-before relationship with other threads, but it doesn't give you a single total order that all threads agree on. So yes, this would act as a fence, but it wouldn't be the same as obtaining sequential consistency, or even total store order: some architectures allow threads to observe events in a well-defined but different order, which is perfectly fine for some applications.

You'll want to look into IRIW (independent reads of independent writes) to learn more about this topic. The x86-TSO paper discusses it specifically in the context of the ad-hoc x86 memory model, as implemented in various processors.
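A sketch of the IRIW litmus test (names and setup are mine): two threads store to independent variables, and two readers load both in opposite orders. With only acquire/release, the outcome r1 == 1, r2 == 0, r3 == 1, r4 == 0 is permitted on some architectures (e.g. POWER), meaning the readers disagree on the order of the two independent stores; making every access memory_order_seq_cst forbids that outcome.

```cpp
#include <atomic>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1, r2, r3, r4;

void writer_x() { x.store(1, std::memory_order_release); }
void writer_y() { y.store(1, std::memory_order_release); }

// Reads x, then y.
void reader_xy() {
    r1 = x.load(std::memory_order_acquire);
    r2 = y.load(std::memory_order_acquire);
}

// Reads y, then x.
void reader_yx() {
    r3 = y.load(std::memory_order_acquire);
    r4 = x.load(std::memory_order_acquire);
}
// With acquire/release only, r1==1 && r2==0 && r3==1 && r4==0 is an
// allowed (if rarely observed) outcome; with seq_cst it is not.
```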