Tags: c++, multithreading, atomic, volatile, stdatomic

Does atomic read guarantee reading of the latest value?


In C++ we have the volatile keyword and the std::atomic class template. The difference between them is that volatile does not guarantee thread-safe concurrent reading and writing, but ensures that the compiler will not cache the variable's value and will instead load it directly from memory, while atomic does guarantee thread-safe concurrent reading and writing.

As we know, an atomic read operation is indivisible, i.e. no thread can write a new value to the variable while one or more threads are reading its value. That makes me think that we always read the latest value, but I'm not sure :)

So, my question is: if we declare a variable as atomic, do we always get the latest value of the variable when calling the load() operation?
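For example, a minimal sketch of the situation I'm asking about (the thread setup is just illustrative):

    #include <atomic>
    #include <thread>

    std::atomic<int> value{0};

    void writer() {
        value.store(42);        // seq_cst store by default
    }

    void reader() {
        int v = value.load();   // seq_cst load by default
        // Once writer() has executed its store, is v guaranteed to be 42,
        // or can this load still return the old value 0?
    }

    int main() {
        std::thread t1(writer), t2(reader);
        t1.join();
        t2.join();
    }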


Solution

  • On real CPUs, loads take the newest value they can see at the moment they execute (i.e. when they take a value from cache, or from store-forwarding from their own earlier store). Stores can be in the store buffer, not yet visible to any other cores. CPUs like to load early so data is ready ASAP for out-of-order exec, and store late so a store buffer can decouple speculative out-of-order exec from access to cache, and from cache misses. But out-of-order exec windows are under 1000 instructions, and at multiple instructions per clock cycle on multi-GHz CPUs, the total OoO exec window size is small in nanoseconds, and trying to fight it will make your thread run slower. You often need ordering between operations within the same thread to establish synchronization with whatever thread stored the value you loaded, and that's what memory_order is for.
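
    As a concrete sketch of what that ordering is for (establishing synchronization, not getting a "fresher" value), here is the usual release/acquire publish pattern; the variable names are just illustrative:

        #include <atomic>
        #include <cassert>
        #include <thread>

        int payload = 0;                    // plain, non-atomic data
        std::atomic<bool> ready{false};

        void producer() {
            payload = 42;                                   // 1. write the data
            ready.store(true, std::memory_order_release);   // 2. publish: orders the payload write before the flag store
        }

        void consumer() {
            while (!ready.load(std::memory_order_acquire)) {
                // spin: waiting is the only way to "get" the newer value
            }
            assert(payload == 42);   // the acquire load synchronizes-with the release store
        }

        int main() {
            std::thread t1(producer), t2(consumer);
            t1.join();
            t2.join();
        }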

    "Latest" is generally not well-defined when there are stores from other cores in flight. You'd have to define the moment store "happens", in terms of its existence meaning the previous value is no longer the latest, but there's no obvious point to pick in the process of how a modern CPU executes a store instruction by writing address + data into a store buffer and later committing that to L1d cache at some point after the store retires from out-of-order execution (which means it's now non-speculative, all previous ops have been found to not fault or be mispredicted branches etc.), and after getting MESI Exclusive ownership of the cache line to maintain coherence with other caches on other cores.

    "Latest" is generally not a useful way to think about things. For correctness, ordering is what matters. For performance on real CPUs, inter-thread latency is often relevant: how long it takes in best / average / worst cases for a store to become visible to loads in other threads. It can never be instant.

    Understanding how hardware works can give you a better idea of what you're going to get in practice.

    On a typical x86 desktop, it takes maybe 40 nanoseconds for a core to invalidate other cores' copies of the cache line before it can commit its store, after which no other core has a cached value it can read. (Of course they could have loaded at some earlier time before the invalidate, with out-of-order exec, but that's a small time window, and blocking it would hurt the common fast case a lot.)

    https://preshing.com/20120710/memory-barriers-are-like-source-control-operations/ provides a useful mental model for memory ordering which is easier to think about than the C++ formalism and closer to real CPUs: a shared "server" (coherent cache) and different agents accessing it (CPU cores), perhaps using barriers to order their own accesses to it. (Some hardware, like IBM POWER, can store-forward from one logical core to another before a store becomes visible to all cores, but most hardware, like x86 and AArch64, works like that model, with stores only becoming visible to other cores via coherent cache.)

    Nothing you do in the reader thread can make a load see a value from a store in another thread except waiting, e.g. retrying the load in a spin-wait loop (perhaps using .wait() / .notify_all(), or a timed sleep). And nothing you do in the writer thread can make a store commit much sooner; the store buffer drains on its own as fast as it can. All you can do is delay other stuff until that happens. Memory barriers just order your own cache accesses with respect to each other.
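
    As a sketch of that "just wait" approach, here is what a C++20 wait/notify version can look like (.wait() may block in the OS instead of burning CPU, depending on the implementation; the names are just illustrative):

        #include <atomic>
        #include <thread>

        std::atomic<int> state{0};

        void waiter() {
            state.wait(0);             // C++20: block/spin while the value still equals 0
            int seen = state.load();   // guaranteed to be something other than 0 here
            (void)seen;
        }

        void signaller() {
            state.store(1);
            state.notify_all();        // wake any threads blocked in state.wait()
        }

        int main() {
            std::thread t1(waiter), t2(signaller);
            t1.join();
            t2.join();
        }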

    When people start talking about wanting the "latest value", the C++ standard's guarantee for atomic RMWs (like .exchange or .fetch_add(0)) often gets brought up, frequently with the misleading phrasing that atomic RMWs are guaranteed to read "the latest value". The actual guarantee in [atomics.order] p10 is that the read part sees the last value in the modification order before the store part; it has nothing to do with time or freshness / staleness, only with atomicity of the RMW on that object. This is what serializes atomic RMWs on the same object with each other; it doesn't make them better for reading in general.
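
    That guarantee is still what makes, for example, a shared counter exact: each fetch_add reads the value written by the immediately preceding RMW in the modification order, so no increment is lost. A minimal sketch (thread and iteration counts chosen arbitrarily):

        #include <atomic>
        #include <cstdio>
        #include <thread>
        #include <vector>

        std::atomic<int> counter{0};

        int main() {
            std::vector<std::thread> threads;
            for (int t = 0; t < 4; ++t) {
                threads.emplace_back([] {
                    // Each RMW reads the previous value in the modification order and
                    // writes the next one; the guarantee is atomicity, not "freshness".
                    for (int i = 0; i < 100000; ++i)
                        counter.fetch_add(1, std::memory_order_relaxed);
                });
            }
            for (auto &th : threads) th.join();
            std::printf("%d\n", counter.load());   // always prints 400000
        }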

    On real CPUs, the load side of an RMW does have to wait for MESI Exclusive ownership of the cache line, unlike plain loads. Or, if a CPU actually reads while the line is still in Shared state, it has to verify that the cache line wasn't invalidated between then and getting Exclusive ownership, and either restart the RMW or have the store-conditional part of an LL/SC report failure.

    But other stores which end up later in the modification order can already be in the store buffers of other cores while an RMW succeeds, and even have been read by store-forwarding on those cores. See Is the order of a side effect in the modification order determined by when the side effect is produced?
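
    At the C++ level, the closest you get to seeing that retry is a compare_exchange loop. A sketch (the max-update is just an arbitrary example of an RMW built from a CAS):

        #include <atomic>

        // Atomically set a to max(a, candidate). compare_exchange_weak can fail
        // spuriously or because another core modified the variable in between;
        // on LL/SC machines that's the store-conditional failing, and the loop
        // simply retries with the freshly reloaded value in `observed`.
        void atomic_max(std::atomic<int> &a, int candidate) {
            int observed = a.load(std::memory_order_relaxed);
            while (observed < candidate &&
                   !a.compare_exchange_weak(observed, candidate,
                                            std::memory_order_relaxed)) {
                // on failure, `observed` now holds the value that was actually there
            }
        }

        int main() {
            std::atomic<int> x{3};
            atomic_max(x, 7);   // x is now 7
        }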

    There is a C++ guarantee that a consistent modification order exists for each atomic variable separately. And if you've seen one value for that variable, later reads in the same thread are guaranteed to see that value or a later one. (Read-Read coherence and so on; [intro.races] p19 in the standard.)
    On real hardware, this order is established by cores getting exclusive ownership of the cache line holding the variable and committing stores to cache.
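
    A sketch of what that per-variable coherence guarantee buys you (the values are just illustrative):

        #include <atomic>
        #include <cassert>
        #include <thread>

        std::atomic<int> x{0};

        void writer() { x.store(1, std::memory_order_relaxed); }

        void reader() {
            int first  = x.load(std::memory_order_relaxed);
            int second = x.load(std::memory_order_relaxed);
            // Read-Read coherence: once this thread has seen 1, a later load of the
            // same variable can't go back to the older value 0, so this can't fire:
            assert(!(first == 1 && second == 0));
        }

        int main() {
            std::thread t1(writer), t2(reader);
            t1.join();
            t2.join();
        }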

    A load will see a very recent value if there are ongoing stores.

    If there was only one recent store, a load will either see it or the previous value.

    On real systems, if the store was longer ago than maybe 100 nanoseconds, or maybe a microsecond or two in really high-contention cases, loads in other threads will see it. (Where the time of the store is what an rdtsc would have seen if you'd done one in the same thread as the store, i.e. before it even retires and sends out a request to other cores to invalidate their copies.)

    That is, I'm proposing a definition of simultaneity where the writer and reader both run an rdtsc instruction within a few cycles of when their store and load execute in the out-of-order back end. That's very different from when readers can actually expect to see stores from other threads.
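
    If you want to put a rough number on that for your own machine, a sketch of the kind of measurement I mean is below. It uses std::chrono::steady_clock instead of raw rdtsc for portability, and the result also includes thread-startup overhead, so treat it as an estimate rather than a precise cache-to-cache latency:

        #include <atomic>
        #include <chrono>
        #include <cstdio>
        #include <thread>

        std::atomic<long long> stamp{0};   // 0 means "not stored yet"

        int main() {
            using clock = std::chrono::steady_clock;

            std::thread writer([] {
                long long t = clock::now().time_since_epoch().count();
                stamp.store(t, std::memory_order_relaxed);   // publish the writer's timestamp
            });

            long long seen;
            while ((seen = stamp.load(std::memory_order_relaxed)) == 0) {
                // spin until the store becomes visible to this core
            }
            long long now = clock::now().time_since_epoch().count();
            std::printf("saw the store roughly %lld clock ticks after it was timestamped\n",
                        now - seen);
            writer.join();
        }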

    Even a seq_cst atomic RMW doesn't wait for other cores to drain their store buffers (or make it happen any faster) to make executed but not committed stores visible, so it's not fundamentally better.


    Re: "latest value" concerns, see the following.


    Another answer on this question suggests that stale data would be possible if compilers didn't emit extra asm to explicitly "publish" stored data (make it globally visible). But all real systems have coherent cache across all the cores that C++ std::thread will start threads across. It's hypothetically possible for std::thread to run threads across cores with non-coherent shared memory, but that would be extremely slow. See When to use volatile with multi threading? (answer: never; it was obsoleted by C++11, but legacy code, and the Linux kernel, still use volatile to roll their own atomics).

    Just a plain store instruction in assembly creates inter-core visibility because hardware is cache-coherent, using MESI. That's what you get from volatile. No "publish" is necessary. If you want this core to wait until the store is globally visible before doing later loads/stores, that's what a memory barrier does, to create ordering between this store and operations on other objects. Nothing to do with guaranteeing or speeding up visibility of this store.

    The default std::memory_order is seq_cst; plain volatile is like relaxed on C++ implementations where it works for hand-rolled atomics. In ISO C++, volatile has undefined behaviour on data races; only std::atomic makes that safe. But real implementations, other than clang -fsanitize=thread or similar, don't do race detection.

    Of course don't actually use volatile for threading. I mention this only to help understanding of how CPUs work, for thinking about performance and to help debugging accidental data races. C/C++11 made volatile obsolete for that purpose. Unless you're writing Linux kernel code (and then use their macros which just happen to use volatile under the hood).

    BTW, volatile doesn't stop loads/stores from hitting in cache, but doesn't need to because cache is coherent. It stops compilers from "caching" the value in a register which is thread-private, nothing to do with hardware CPU caches. With compilers like GCC that supported volatile for pre-C++11 lock-free stuff, it compiles to pretty much the same asm that std::atomic with memory_order_relaxed gives you, so use that instead if it's what you want, because as you say, the C++ standard doesn't guarantee anything for volatile.
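
    For comparison, a sketch of the legacy hand-rolled style next to the relaxed-atomic version you should actually write (the flag names are just illustrative):

        #include <atomic>

        // Legacy pre-C++11 style: the data race on stop_legacy is UB in ISO C++,
        // even though it "works" on mainstream compilers and cache-coherent CPUs.
        volatile bool stop_legacy = false;

        // What to write instead: essentially the same generated code on typical
        // targets, but well-defined.
        std::atomic<bool> stop{false};

        void worker() {
            while (!stop.load(std::memory_order_relaxed)) {
                // ... do work. The relaxed load can't be hoisted out of the loop
                // into a register, which is the property people used volatile for.
            }
        }

        void requester() {
            stop.store(true, std::memory_order_relaxed);   // becomes visible via cache coherence
        }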