c++ multithreading atomic

Is it safe to read a pointer to an atomic without synchronization?


Is it safe to read a pointer to an atomic without synchronization, assuming it is never going to be a nullptr? For example, in the following code, assume there are two threads running writer and reader concurrently.

std::atomic<int>* g_atomic = new std::atomic<int>{};

void writer()
{
    for (int i = 0; i < 101; i++)
    {
        auto* new_atomic = new std::atomic<int>{i};
        std::atomic_thread_fence(std::memory_order_seq_cst); // memory barrier.
        g_atomic = new_atomic; // ignore the memory leak
    }
}

void reader()
{
    auto value = g_atomic->load();
    while (value < 100)
    {
        assert(value >= 0 && value <= 100);
        value = g_atomic->load();
    }
}

By safe I mean: I will always read a value from 0 to 100, and I won't read an invalid pointer or read the pointed-to object before its initialization.

My intuition tells me this is safe because:

  1. pointers are read or written atomically on all architectures.
  2. the pointed-to value is atomically read; it has to be fetched from the RAM, and the memory barrier before the write guarantees the RAM is always correct.

So, is this safe? Or is it maybe only safe on common architectures?


Solution

  • pointers are read or written atomically on all architectures.

    Yes, in assembly. But you're writing in C++, where what matters is the language's memory model and its abstract machine. Your code has data-race UB from concurrent unsynchronized write+read on g_atomic.

    In practice the compiler will hoist the load of g_atomic out of your reader loop. Since it's not atomic, it would be UB for another thread to be changing it, which the compiler can assume doesn't happen. In other words, making data races on non-atomic variables UB is what lets compilers keep doing important optimizations for single-threaded code. (They don't know which globals are shared and which aren't.)
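
    As an illustration (a sketch of one transformation the as-if rule permits, not literal compiler output), the original reader can legally be compiled as if you'd written this, which spins forever if the one pointer value it loaded points to an int below 100:

    void reader_as_if()   // hypothetical rewrite of the question's reader()
    {
        std::atomic<int>* p = g_atomic;  // non-atomic load hoisted out of the loop:
                                         // the compiler may assume no other thread writes g_atomic
        auto value = p->load();
        while (value < 100)
        {
            value = p->load();  // keeps re-reading the same object through the stale pointer
        }
    }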

    In C++, if you want the behaviour you hoped for, you have to ask for it with std::atomic< T* >, with relaxed store and load if you don't need any ordering.

    But you actually do need ordering, so you don't publish the pointer before the pointed-to object is fully constructed. You're using a slow thread_fence(seq_cst) to get that; you actually only need release, preferably as a .store(val, release) rather than a separate fence, so it can compile to AArch64 stlr instead of a separate barrier instruction.

    In your case, T is std::atomic<int> aka std::atomic_int. So your code should be written like this to compile to the asm you thought you'd get from the original. I renamed g_atomic to g_ptr because it and the pointed-to thing both have to be atomic.

    (You actually don't need the pointed-to objects to be atomic for this to be safe, i.e. std::atomic< int* > g_ptr; would work, assuming you aren't going to have threads writing those pointed-to objects. The initialization happens-before the read thanks to synchronization between writer and reader. That's not formally true here, since I used relaxed in the reader because consume is deprecated, but it compiles to the correct asm in practice. A sketch of that plain-int variant follows the main example below.)

    #include <atomic>
    using std::memory_order::relaxed, std::memory_order::release;
    std::atomic<int> initval = 0;
    std::atomic< std::atomic<int>* > g_ptr = &initval;
    
    void writer()
    {
        for (int i = 0; i < 101; i++)
        {
            auto* new_atomic = new std::atomic<int>{i};
            //std::atomic_thread_fence(std::memory_order_release);
            g_ptr.store(new_atomic, release);
        }
    }
    
    void reader()
    {
        auto value = g_ptr.load(relaxed)->load(relaxed);
        while (value < 100)
        {
            //assert(value >= 0 && value <= 100);  // the loop condition already checks for <100, what's the point here?  Just the >= 0 part?
            auto tmp_ptr = g_ptr.load(relaxed);  // formally should be consume, but compilers strengthen it unnecessarily.
            value = tmp_ptr->load(relaxed);
            //value = g_ptr.load(relaxed)->load(relaxed);  // or without a tmp var
        }
    }
    

    (Godbolt)
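
    Here's a minimal sketch of the plain-int variant from the parenthetical above, where only the pointer itself is atomic. It assumes no thread writes the pointed-to ints after publishing them; the names (initval2, g_intptr, writer2, reader2) are just placeholders so it doesn't clash with the example above:

    #include <atomic>
    using std::memory_order::relaxed, std::memory_order::release;
    
    int initval2 = 0;
    std::atomic<int*> g_intptr = &initval2;
    
    void writer2()
    {
        for (int i = 0; i < 101; i++)
        {
            int* new_int = new int{i};
            g_intptr.store(new_int, release);  // publish only after *new_int is fully written
        }
    }
    
    void reader2()
    {
        // Relaxed pointer load, like the reader above: relies on data-dependency
        // ordering in practice (formally this would want consume, or acquire).
        auto value = *g_intptr.load(relaxed);
        while (value < 100)
        {
            value = *g_intptr.load(relaxed);
        }
    }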


  • the pointed-to value is atomically read

    Yes, because it's std::atomic<int>.

  • it has to be fetched from the RAM

    Well, from shared memory, so probably actually from cache. Cache is coherent between the cores that std::thread runs across, on all real-world C++ implementations. (On boards with cores that share memory but aren't cache-coherent, e.g. an ARM DSP + microcontroller, you don't run threads of the same program across those cores.)

    In C++ you formally need at least std::memory_order_consume, but it's been deprecated because the original definition was too complicated for compilers to implement. Compilers only ever implemented it by promoting it to acquire, not actually taking advantage of the memory dependency ordering that all ISAs except DEC Alpha provide.

    In this case there's no plausible way for a compiler to figure out what value the pointer load must have produced, so it has to emit asm that loads and then dereferences the pointer. So there's a data dependency between g_ptr.load(relaxed) and tmp_ptr->load(relaxed), and real ISAs guarantee that there can't be LoadLoad reordering in that case. So it works in practice; the Linux kernel uses this for RCU, for example, to avoid any memory barriers in read paths.
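
    If you want that guarantee from the C++ memory model itself rather than from dependency ordering in practice, an acquire load of the pointer is the portable choice (it's what compilers promote consume to anyway, at the cost of a barrier on some weakly-ordered ISAs). A sketch of the reader from the example above, reusing its globals and using-declarations, with only the pointer load changed:

    void reader_acquire()   // hypothetical variant of reader() above
    {
        using std::memory_order::acquire;
        auto value = g_ptr.load(acquire)->load(relaxed);  // acquire pairs with the writer's release store
        while (value < 100)
        {
            value = g_ptr.load(acquire)->load(relaxed);
        }
    }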

    See C++11: the difference between memory_order_relaxed and memory_order_consume and similar answers about consume and memory dependency ordering.