Understanding `memory_order_acquire` and `memory_order_release` in C++11

I'm reading through the documentation for `std::memory_order`, more specifically:

memory_order_acquire: A load operation with this memory order performs the acquire operation on the affected memory location: no reads or writes in the current thread can be reordered before this load. All writes in other threads that release the same atomic variable are visible in the current thread (see Release-Acquire ordering below).

memory_order_release: A store operation with this memory order performs the release operation: no reads or writes in the current thread can be reordered after this store. All writes in the current thread are visible in other threads that acquire the same atomic variable (see Release-Acquire ordering below) and writes that carry a dependency into the atomic variable become visible in other threads that consume the same atomic (see Release-Consume ordering below).

These two bits:

from memory_order_acquire

... no reads or writes in the current thread can be re-ordered before this load...

from memory_order_release

... no reads or writes in the current thread can be re-ordered after this store...

What exactly do they mean?

There's also this example

#include <thread>
#include <atomic>
#include <cassert>
#include <string>

std::atomic<std::string*> ptr;
int data;

void producer()
{
    std::string* p  = new std::string("Hello");
    data = 42;
    ptr.store(p, std::memory_order_release);
}

void consumer()
{
    std::string* p2;
    while (!(p2 = ptr.load(std::memory_order_acquire)))
        ;
    assert(*p2 == "Hello"); // never fires
    assert(data == 42); // never fires
}

int main()
{
    std::thread t1(producer);
    std::thread t2(consumer);
    t1.join(); t2.join();
}

But I can't figure out where the two bits I've quoted apply. I understand what's happening, but because the code is so small I don't see where any re-ordering could occur.

Solution

Acquire and release are ordering constraints attached to atomic operations (the standalone equivalents are fences, `std::atomic_thread_fence`). They work as a pair: if an acquire load in one thread reads the value written by a release store to the same atomic variable in another thread, then every write the storing thread made before that store, atomic or not, is visible to the loading thread after the load. Separately, every atomic variable has a single modification order that all threads agree on, whatever memory order is used; even memory_order_relaxed guarantees that much. What relaxed does not give you is any ordering of the other memory around the atomic, and that is what release/acquire adds: it extends the atomic's order to the surrounding memory. You can use an atomic to indicate that something is 'finished' or 'ready', but without that extension a consumer that reads anything other than the atomic itself couldn't rely on 'seeing' the right 'versions' of that memory, and atomics would have limited value.
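Since acquire and release are often described as barriers, it may help to see the question's example rewritten with explicit fences. The sketch below (names such as `fence_ptr`, `fence_data` and `run_fence_demo` are hypothetical, chosen for illustration) uses relaxed atomic operations and moves the ordering into `std::atomic_thread_fence` calls: a release fence before the store and an acquire fence after the load that observes it give the same guarantee as the release store/acquire load pair.

```cpp
#include <atomic>
#include <cassert>
#include <string>
#include <thread>

// Hypothetical globals mirroring ptr/data from the question.
std::atomic<std::string*> fence_ptr{nullptr};
int fence_data = 0;

void fence_producer()
{
    std::string* p = new std::string("Hello");
    fence_data = 42;
    // Release fence: the writes above may not be reordered past the store below.
    std::atomic_thread_fence(std::memory_order_release);
    fence_ptr.store(p, std::memory_order_relaxed);
}

// Returns true if the consumer saw the fully published state.
bool fence_consumer()
{
    std::string* p2;
    while (!(p2 = fence_ptr.load(std::memory_order_relaxed)))
        ; // busy-wait until the pointer is published
    // Acquire fence: the reads below may not be reordered before the load above.
    std::atomic_thread_fence(std::memory_order_acquire);
    bool ok = (*p2 == "Hello") && (fence_data == 42);
    delete p2;
    return ok;
}

bool run_fence_demo()
{
    // Reset shared state; starting the threads below orders these writes for them.
    fence_ptr.store(nullptr);
    fence_data = 0;
    bool ok = false;
    std::thread t1(fence_producer);
    std::thread t2([&ok] { ok = fence_consumer(); });
    t1.join();
    t2.join();
    return ok;
}
```

Calling `run_fence_demo()` from `main()` should always return `true`. The memory-order argument on the operations themselves is usually the simpler choice; fences are useful when one fence has to order several relaxed operations.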

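The statements about 'moving before' or 'moving after' are constraints on re-ordering, both by the compiler's optimizer and by the CPU itself, which may execute memory operations, or make them visible to other cores, out of program order. Optimizers are very good at re-ordering instructions and even omitting redundant reads/writes, and if they re-organised the code across an acquire or release operation they could unwittingly violate that order. Concretely, nothing sequenced after an acquire load may be moved before it, and nothing sequenced before a release store may be moved after it; movement in the other direction (an earlier write sinking below an acquire, or a later read rising above a release) is still allowed. In your example, that means the construction of the string and `data = 42` must stay before `ptr.store(p, std::memory_order_release)`, and the reads of `*p2` and `data` must stay after `ptr.load(std::memory_order_acquire)`.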
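The one-directional nature of these rules is easiest to see in a lock, where acquire marks the start of a protected region and release marks its end. The following is a minimal spinlock sketch (the `SpinLock` class and `run_spinlock_demo` are hypothetical, not standard types): operations inside the region cannot leak out past `lock()` or `unlock()`, which is why a plain, non-atomic `counter` stays consistent.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical minimal spinlock built only from acquire/release operations.
class SpinLock {
public:
    void lock()
    {
        // Acquire: reads/writes after lock() cannot be hoisted above this exchange.
        while (flag_.exchange(true, std::memory_order_acquire))
            ; // spin until the previous owner releases the lock
    }
    void unlock()
    {
        // Release: reads/writes before unlock() cannot sink below this store.
        flag_.store(false, std::memory_order_release);
    }
private:
    std::atomic<bool> flag_{false};
};

// Increments a plain (non-atomic) counter from several threads.
long run_spinlock_demo(int threads, int iterations)
{
    SpinLock spin;
    long counter = 0;
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([&] {
            for (int i = 0; i < iterations; ++i) {
                spin.lock();
                ++counter; // must stay between lock() and unlock()
                spin.unlock();
            }
        });
    }
    for (auto& w : workers)
        w.join();
    return counter;
}
```

Each `unlock()` release store is read by the next `lock()` acquire exchange, so every increment happens-before the next one and `run_spinlock_demo(4, 10000)` should return exactly 40000.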

Your code relies on (a) the std::string object having been constructed in producer() before ptr is assigned and (b) the constructed version of that string (i.e. the version of the memory it occupies) being the one that consumer() reads. Put simply, consumer() will eagerly read the string as soon as it sees ptr assigned, so it had better see a valid, fully constructed object or bad times will ensue. In that code, the act of assigning ptr is how producer() 'tells' consumer() that the string (and data) are 'ready', and the release/acquire pair exists to make sure that's what consumer() sees. The while loop matters here: the guarantee only applies once the acquire load has actually returned the pointer written by the release store.
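To make the last point concrete, here is a sketch of a non-blocking consumer (the names `published_ptr`, `published_data`, `publish`, `try_consume` and `unpublish` are hypothetical). If the acquire load returns `nullptr`, no synchronization has taken place and the consumer must not read `published_data`; only when it returns the released pointer are the earlier writes guaranteed to be visible.

```cpp
#include <atomic>
#include <cassert>
#include <string>

// Hypothetical globals mirroring ptr/data from the question.
std::atomic<std::string*> published_ptr{nullptr};
int published_data = 0;

void publish()
{
    std::string* p = new std::string("Hello");
    published_data = 42;                               // ordinary write
    published_ptr.store(p, std::memory_order_release); // the 'ready' signal
}

// The guarantee only applies when the acquire load returns the released pointer.
bool try_consume(std::string& out, int& out_data)
{
    std::string* p = published_ptr.load(std::memory_order_acquire);
    if (p == nullptr)
        return false;          // nothing published yet: do not touch published_data
    out = *p;                  // safe: construction happened-before this read
    out_data = published_data; // safe: the write of 42 happened-before this read
    return true;
}

// Frees the published string; call only once no reader can still be using it.
void unpublish()
{
    delete published_ptr.exchange(nullptr, std::memory_order_acq_rel);
}
```

A consumer thread would typically call `try_consume` periodically and do other work when it returns `false`, rather than spinning as the original example does.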

Conversely, if ptr were declared as an ordinary std::string*, the two threads would have a data race, which is undefined behaviour. In practice the compiler could decide to optimize p away, assign the allocated address directly to ptr, and only then construct the object and assign data. That would likely be a disaster for the consumer thread, which uses that assignment as the indicator that the objects producer() is preparing are ready. Worse, the consumer might never see the value assigned at all (the compiler is free to hoist the read of a non-atomic ptr out of the loop), or on some architectures it might read a torn value in which only some of the bytes have been written, so it points to a garbage memory location. However, those last two problems are about ptr being atomic, not about the wider memory ordering.

Footnote: The code provides a good demonstration of release/acquire ordering, and the assert()s that never fire illustrate the ordering guarantee. However, it's worth noting that the design is not recommended for a scalable system, because the consumer thread performs 'busy waiting': it loops until the string object is assigned to ptr, competing for compute cycles with other threads.
A more scalable design would use a std::condition_variable, which provides 'dormant' waiting: the consumer sleeps until it is notified. Of course, the code provided is an illustrative example, and the time spent busy waiting can be expected to be very brief. But in general, busy waiting by consumers is not a recommended implementation of the producer/consumer pattern.
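A condition-variable version of the same handoff might look like the following sketch (all names, such as `cv_mutex`, `cv_ready`, `cv_ptr` and `run_cv_demo`, are hypothetical). Because every access to the shared state happens under a mutex, and unlocking/locking a mutex provides release/acquire ordering, the pointer no longer needs to be atomic.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <string>
#include <thread>

// Hypothetical shared state; the mutex replaces the atomic, so cv_ptr is a plain pointer.
std::mutex cv_mutex;
std::condition_variable cv_ready;
std::string* cv_ptr = nullptr;
int cv_data = 0;

void cv_producer()
{
    std::string* p = new std::string("Hello");
    {
        std::lock_guard<std::mutex> lock(cv_mutex);
        cv_data = 42;
        cv_ptr = p;
    } // unlocking releases; the consumer's re-lock inside wait() acquires
    cv_ready.notify_one();
}

std::string cv_consumer(int& out_data)
{
    std::unique_lock<std::mutex> lock(cv_mutex);
    // Sleeps instead of spinning; the predicate also handles spurious wakeups
    // and the case where the producer notified before the consumer started waiting.
    cv_ready.wait(lock, [] { return cv_ptr != nullptr; });
    out_data = cv_data;
    std::string result = *cv_ptr;
    delete cv_ptr;
    cv_ptr = nullptr;
    return result;
}

bool run_cv_demo()
{
    cv_data = 0; // reset before the threads start
    int d = 0;
    std::string s;
    std::thread t2([&] { s = cv_consumer(d); });
    std::thread t1(cv_producer);
    t1.join();
    t2.join();
    return s == "Hello" && d == 42;
}
```

The thread start order in `run_cv_demo` doesn't matter: thanks to the predicate, the consumer either finds `cv_ptr` already set or sleeps until `notify_one()` wakes it.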