In C++, I have two threads. Each thread first does a store to one variable and then a load from the other, with the two threads touching the variables in opposite order:
#include <atomic>
#include <cstdint>

std::atomic<bool> please_wake_me_up{false};
uint32_t cnt{0};

void thread_1() {
    std::atomic_ref atomic_cnt(cnt);
    please_wake_me_up.store(true, std::memory_order_seq_cst);
    atomic_cnt.load(std::memory_order_seq_cst); // <-- Is this line necessary or can it be omitted?
    futex_wait(&cnt, 0); // <-- The performed syscall must read the counter.
                         //     But with which memory ordering?
}

void thread_2() {
    std::atomic_ref atomic_cnt(cnt);
    atomic_cnt.store(1, std::memory_order_seq_cst);
    if (please_wake_me_up.load(std::memory_order_seq_cst)) {
        futex_wake(&cnt);
    }
}
Full code example: Godbolt.
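For context, futex_wait and futex_wake are assumed here to be thin wrappers around the futex(2) syscall, roughly like the following sketch (the exact flags and the missing error handling are assumptions; the Godbolt link has the real code):

#include <cstdint>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

// Sleep only if *addr still equals expected; the kernel compares and blocks atomically.
void futex_wait(uint32_t* addr, uint32_t expected) {
    syscall(SYS_futex, addr, FUTEX_WAIT_PRIVATE, expected, nullptr, nullptr, 0);
}

// Wake up to one waiter blocked on addr.
void futex_wake(uint32_t* addr) {
    syscall(SYS_futex, addr, FUTEX_WAKE_PRIVATE, 1, nullptr, nullptr, 0);
}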
If all of the four atomic accesses are performed with sequential consistency, it's guaranteed that at least one thread will see the store of the other thread when performing the load. This is what I want to achieve.
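Stripped of the futex machinery, this is the classic store-buffering pattern. A stand-alone sketch (with hypothetical names flag and counter, not part of the original code) where seq_cst forbids the outcome in which both threads miss each other's store:

#include <atomic>
#include <cstdint>
#include <cstdio>
#include <thread>

std::atomic<bool>     flag{false};   // plays the role of please_wake_me_up
std::atomic<uint32_t> counter{0};    // plays the role of cnt

int main() {
    uint32_t counter_seen_by_t1;
    bool     flag_seen_by_t2;

    std::thread t1([&] {
        flag.store(true, std::memory_order_seq_cst);
        counter_seen_by_t1 = counter.load(std::memory_order_seq_cst);
    });
    std::thread t2([&] {
        counter.store(1, std::memory_order_seq_cst);
        flag_seen_by_t2 = flag.load(std::memory_order_seq_cst);
    });
    t1.join();
    t2.join();

    // Forbidden under seq_cst: in the single total order over all four operations,
    // whichever load comes last must observe the other thread's earlier store.
    // With only release/acquire, both loads could still read the old values.
    if (counter_seen_by_t1 == 0 && !flag_seen_by_t2)
        std::puts("unreachable with seq_cst");
}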
As the futex syscall must internally perform a load of the variable it operates on, I'm wondering whether I can omit the (duplicated) load right before the syscall.

If the futex syscall is guaranteed to read the counter, is it safe to omit the marked line? Is there any guarantee that the load inside the syscall happens with sequential consistency?

Wouldn't a std::atomic_thread_fence(std::memory_order_seq_cst) be better, since I don't need the value, just the fence?

If the answer is architecture-specific, I'm interested in x86_64 and arm64.
Any syscall is a compiler barrier, like any non-inline function call. Syscalls aren't necessarily full barriers against runtime reordering, though they might well be in practice, especially since they usually take long enough that the store buffer would probably have time to drain on its own. (Especially with Spectre and MDS mitigations in place, which on x86 run extra microcode to flush buffers, adding many extra cycles between reaching the syscall entry point and actually dispatching to a kernel function.)
An atomic_thread_fence is probably worse: on x86-64, for example, it would be an extra mfence or dummy lock'ed operation, while an atomic load would be basically free, since the line will normally still be hot in L1d from the xchg used for the seq_cst store.
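A sketch of the two user-space options, with the x86-64 instructions they typically compile to noted in comments (the function names are illustrative, and the codegen is an assumption based on current GCC/Clang output):

#include <atomic>
#include <cstdint>

std::atomic<bool> please_wake_me_up{false}; // same globals as in the question
uint32_t cnt{0};

// Option A: keep the seq_cst reload (the marked line).
void store_then_reload() {
    std::atomic_ref atomic_cnt(cnt);
    please_wake_me_up.store(true, std::memory_order_seq_cst); // xchg: already a full barrier on x86-64
    atomic_cnt.load(std::memory_order_seq_cst);               // plain mov; the line is hot in L1d, so ~free
}

// Option B: replace the reload with a stand-alone fence.
void store_then_fence() {
    please_wake_me_up.store(true, std::memory_order_seq_cst); // xchg
    std::atomic_thread_fence(std::memory_order_seq_cst);      // extra mfence or dummy lock'ed op: pure cost here
}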
On AArch64, stlr / ldar is still sufficient: the reload can't happen until the store commits to cache, and is itself an acquire load. So yes, it will keep all later loads and stores (including the futex syscall's read of cnt) after the please_wake_me_up.store. It should be no worse than a stand-alone full barrier, which would have to drain all previous stores from the store buffer, not just stlr (seq_cst / release) stores. Earlier cache-miss stores could potentially still be in flight... except that stlr is a release store, so all earlier loads and stores need to have completed before it can commit anyway.
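The same comparison on AArch64, with the expected codegen noted in comments (again an assumption based on current GCC/Clang output, illustrative function names):

#include <atomic>
#include <cstdint>

std::atomic<bool> please_wake_me_up{false}; // same globals as in the question
uint32_t cnt{0};

void store_then_reload() {
    std::atomic_ref atomic_cnt(cnt);
    please_wake_me_up.store(true, std::memory_order_seq_cst); // stlr: release store
    atomic_cnt.load(std::memory_order_seq_cst);               // ldar: can't take its value until the stlr
                                                              // above commits, and is itself an acquire load,
                                                              // so the later futex read of cnt stays after the store
}

void store_then_fence() {
    please_wake_me_up.store(true, std::memory_order_seq_cst); // stlr
    std::atomic_thread_fence(std::memory_order_seq_cst);      // dmb ish: waits for *all* earlier stores
                                                              // to drain, not just the stlr
}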
If the kernel's internal read of the futex word uses an ldar (instead of ARMv8.3 ldapr, which is only acquire, not seq_cst), then you'd still be safe without the reload in user space, and more work could get into the pipeline while waiting for the please_wake_me_up.store to drain from the store buffer. But unfortunately there's no guarantee of that; the futex man page doesn't say it performs a seq_cst load.