c++cpu-architecture memory-barriers stdatomic rdma

C++ atomics and memory_order with RDMA

When using one-sided RDMA on modern memory, lock-free, the question arises of how a remote reader can safely view their incoming data if the data objects span multiple cache lines.

In the Derecho open-source multicast and replicated logging library (on https://GitHub.com/Derecho-Project) we have this pattern. A writer W is granted permission to write to a range of memory in a reader, R. Memory is properly pinned and mapped. Now, suppose that the write involves some sort of vector of data spanning many cache lines, which is common. We use a guard: a counter (also in RDMA accessible memory, but in some other cache line) that gets incremented. R spins, watching the counter… when it sees a change, this tells R “you have a new message”, and R then reads the data in the vector. Later we have a second pattern whereby R says to W, “I am done with that message, you can send another one.”

My question: With modern memory models, which flavor of C++ atomic should be used for the memory into which the vector will be written? Would this be denoted as relaxed consistency? I want my code to work on ARM and AMD, not just Intel with its strong TSO memory model.

Then for my counter, when R spins watching for the counter update, how do I want the counter declared? Would it need to be declared as an acquire-release atomic?

Finally, is there any merit to declaring everything as relaxed, but then using a memory_order fence here, after R observes the counter to have been incremented, in terms of speed or correctness? My thinking is that with this second approach, I use a minimum consistency model on all the RDMA memory (and the same model for all such memory), plus I only need to invoke the more costly memory_order fence after the counter is observed to increment. So it happens just once, prior to accessing my vector, whereas the acquire release atomic counter would trigger a memory fencing mechanism every time my polling thread loops. To me this sounds hugely expensive.

That last thought leads to one more question: must I also declare this memory as volatile, so that the C— compiler will realize the data can change under its feet, or does it suffice that the compiler itself can see the std::atomic type declarations? On Intel, with total store ordering, TSO plus volatile is definitely needed.

[Edit: New information] (I'm trying to attract a bit of help here!)

One option seems to be to declare the RDMA memory region as std::atomic<relaxed_consistency> but then to use a lock every time our predicate evaluation thread retests the guard (which, being in RDMA memory, would be declared with this same relaxed property). We would retain the C++ volatile annotation.

The reasoning is that with the lock, which has acquire-release semantics, the memory coherence hardware would be warned that it needs to fence prior updates. The lock itself (the mutex) can be declared local to the predicate thread, and would then live in local DRAM, which is cheap, and since this is not a lock anything contends for, locking it is probably as inexpensive as a test_and_set, and unlocking is just a write of 0. If the predicate is true, our triggered code body is running after the lock was accessed (probably after the lock release), so we establish the sequential ordering needed to ensure that the hardware will fetch the guarded object using actual memory reads. But every cycle through our predicate testing -- every "spin" -- we end up doing a lock acquire/release on every predicate. So this causes some slowdown.

Option two, seemingly less overhead, also declares the RDMA region as std::atomic with relaxed consistency, but omits the lock and does testing as we do now. Then when a predicate tests true, we would execute an explicit memory-fence (std::memory-order) with semantics. We get the same barrier, but only pay the cost when predicates evaluate to true, hence less overhead.

But now we run into a question of a different kind. Intel has total store order, TSO, and because any thread does some write-then-read actions, Intel is probably forced to fetch the guard variables from memory out of precaution, worrying that TSO could otherwise be violated. C++ with volatile is sure to include the fetch instruction. But on ARM and AMD, is it possible that the hardware itself might stash some guard variable for a very long time in a hardware register or something, causing extreme delays in our "spin-like" loop? Not knowing anything about ARM and AMD, this seems like a worry. But perhaps one of you knows a lot more than I do?

Solution

Well, there seems to be a lack of expertise on this issue at this time. Probably the newness of the std::atomics options and the general sense of uncertainty about precisely how ARM and AMD will implement relaxed consistency make it hard for people to know the answer, and speculation isn't helpful.

As I'm understanding this, the right answers seem to be:

The entire problem won't be seen on Intel because of its TSO (total store order) policy. With TSO, because the guard gets updated after the vector it guards, hence in any total store order, the guard was updated last. Seeing the guard change then guarantees that the receiver will see the updated vector elements. Moreover, the default on AMD and ARM is likely to mimic TSO.
By explicitly declaring the RDMA memory region to have relaxed_consistency, a developer is opting for a cheaper memory model, but taking on the obligation to insert a memory fence. The most obvious way to do this is to just acquire a lock before reading the guard, then release the lock after doing so. This has a cost even if no other thread contends for the lock. First, the lock operation itself requires a few clock cycles. But more broadly, locking a random mutex will have some unknown impact on caches because the hardware must assume that the lock actually is contended for, a wait may have occurred, and values could have changed under its feet. This will result in a cost that needs to be quantified.
Equivalently, the guard can be declared to use acquire_release consistency. Seemingly, this creates a memory fence and the prior updates used to write the vector will be visible to any reader who has seen the guard value change. Again, cost needs to be quantified.
Perhaps, one could do a fenced read at the top of the code block triggered by the predicate. This would get the fence out of the main predicate loop, so the costs of the fence would only be paid once, and only paid when the predicate actually is true.

We also need to tag our atomics as volatile in C++. In fact, C++ probably should notice when a std::atomic type is accessed, and treat that like access to a volatile. However, at present it isn't obvious that C++ compilers are implementing this policy.