I was testing the behavior of control dependencies as described in the Linux kernel's memory-barriers.txt documentation, and ran into a problem with where the fence has to go. I was testing on AArch64 (a Qualcomm Snapdragon 835), compiled with ARM64 GCC 9.3 at -O3.
Here is the test snippet:
#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> x;
std::atomic<int> y;
int r1;
int r2;

void thread1_func() {
    r1 = x.load(std::memory_order_relaxed);
    if (r1) {
        r2 = y.load(std::memory_order_relaxed);
    }
}

void thread2_func() {
    y.store(42, std::memory_order_relaxed);
    x.store(1, std::memory_order_release);
}

void thread3_func() {
    if (r2 == 0) {
        return;
    }
    if (r1 == 0) {
        printf("r1: %d, r2: %d\n", r1, r2);
    }
}

int main() {
    while (1) {
        // reset
        x.store(0);
        y.store(0);
        r1 = 0;
        r2 = 0;
        std::thread t1(thread1_func);
        std::thread t2(thread2_func);
        std::thread t3(thread3_func);
        t1.join();
        t2.join();
        t3.join();
    }
    return 0;
}
Sometimes thread3_func enters the if (r1 == 0) branch (I'll call it the printf branch below), which is not what I want: thread3 has seen the side effect of thread1 loading y into r2, but not of thread1 loading x into r1 (because of a branch predictor or something). So I added acquire ordering in thread1_func, as shown below:
void thread1_func() {
    r1 = x.load(std::memory_order_acquire);
    if (r1) {
        r2 = y.load(std::memory_order_relaxed);
    }
}
or
void thread1_func() {
    r1 = x.load(std::memory_order_relaxed);
    if (r1) {
        std::atomic_thread_fence(std::memory_order_acquire);
        r2 = y.load(std::memory_order_relaxed);
    }
}
but neither of these two examples worked (the printf branch was still reached)! Then I tried the fence placement below:
void thread1_func() {
    r1 = x.load(std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_acquire);
    if (r1) {
        r2 = y.load(std::memory_order_relaxed);
    }
}
This time thread3_func never enters the printf branch. So my question is: why do the first two examples not work, while only the third one works correctly?
As suggested by @Peter Cordes, this is the atomic version (turning r1 and r2 into atomic<int>) of example 1:
void thread1_func() {
    r1.store(x.load(std::memory_order_acquire), std::memory_order_relaxed);
    if (r1.load(std::memory_order_relaxed)) {
        r2.store(y.load(std::memory_order_relaxed), std::memory_order_relaxed);
    }
}

void thread2_func() {
    y.store(42, std::memory_order_relaxed);
    x.store(1, std::memory_order_release);
}

void thread3_func() {
    if (r2.load(std::memory_order_relaxed) == 0) {
        return;
    }
    if (r1.load(std::memory_order_relaxed) == 0) {
        printf("r1: %d, r2: %d\n", r1.load(std::memory_order_relaxed),
               r2.load(std::memory_order_relaxed));
    }
}
Example 2 differs only in thread1_func:
void thread1_func() {
    r1.store(x.load(std::memory_order_relaxed), std::memory_order_relaxed);
    if (r1.load(std::memory_order_relaxed)) {
        std::atomic_thread_fence(std::memory_order_acquire);
        r2.store(y.load(std::memory_order_relaxed), std::memory_order_relaxed);
    }
}
Both of them never enter the print branch (5,000,000 iterations seems like enough to show it). Why is it that the atomic versions of examples 1 and 2 never print, while the non-atomic version does? Their asm looks no different to me.
non-atomic version of example 1: Godbolt (GCC 9.3)
atomic version of example 1: Godbolt (GCC 9.3)
int r1, r2;
are non-atomic variables. They're named "r" for register, like thread-private local vars that hold load results in a litmus test like this. Making them globals for convenience of printing is fine, and you can I guess think of them as "result" instead of "register" when used this way.
But don't read them until after they're both fully written (since they're non-atomic). So read them in main after .join
on the first 2 threads. Not from a separate thread3 that can race with 1 and 2. That's introducing another possible source of memory-reordering beyond the litmus test you're trying to verify.
And since they're non-atomic, your code has data-race undefined behaviour between thread3 reading r2 (and r1), and thread1 writing them.
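A minimal sketch of that shape (my rewrite, not code from the question or this answer; the helper name one_iteration is made up, and the question's globals and thread functions are assumed):

// Assumes the question's globals (x, y, r1, r2) and thread1_func/thread2_func.
void one_iteration() {
    x.store(0);
    y.store(0);
    r1 = 0;
    r2 = 0;
    std::thread t1(thread1_func);
    std::thread t2(thread2_func);
    t1.join();
    t2.join();
    // Both joins happen-before these reads, so there's no data race.
    // The interesting outcome for the control-dependency test is
    // r1 == 1 (saw x == 1) but r2 == 0 (missed y == 42).
    if (r1 == 1 && r2 == 0)
        printf("reordering: r1=%d r2=%d\n", r1, r2);
}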
If you wanted to make a three-thread litmus test where t3 reads outputs of t1, standard naming would be a,b or w,z or something, not r1,r2, if they're not just pure results. I don't know why you want to introduce the possibility of LoadLoad reordering for reading r1 and r2, but that's what you've done (including in the atomic relaxed version).
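If you did want to keep the three-thread shape while ruling out reordering on the result variables themselves, one option (my sketch, not from the question or this answer) is to publish r2 with a release store in t1 and read it with an acquire load in t3; then seeing r2 == 42 guarantees also seeing r1 == 1, while the x/y part of the test stays relaxed:

std::atomic<int> r1, r2;   // result variables, now atomic

void thread1_func() {
    r1.store(x.load(std::memory_order_relaxed), std::memory_order_relaxed);
    if (r1.load(std::memory_order_relaxed)) {
        // release: the store to r1 above can't become visible after this store
        r2.store(y.load(std::memory_order_relaxed), std::memory_order_release);
    }
}

void thread3_func() {
    // acquire: if we see t1's store to r2, we also see its earlier store to r1
    if (r2.load(std::memory_order_acquire) == 0)
        return;
    if (r1.load(std::memory_order_relaxed) == 0)
        printf("should never print\n");
}

thread2_func and main stay as in the question.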
Once you fixed your UB, you got the expected result. So the only remaining question is what exactly happened in your UB test-case. (And I guess why there isn't LoadLoad reordering in the version that uses relaxed atomic stores and t3 loads on r1, r2. It should be broken as well.)
For the UB, we have to look at the exact asm produced by the compiler version you used; apparently G++ 9.3 -O3 with no other options, targeting AArch64. (Most distros config GCC with -fstack-protector-strong and -fPIE on by default, unlike Godbolt, but I'm going to assume the Godbolt asm matches the asm you tested locally, otherwise what would be the point of linking it.)
I also don't see any obviously-important differences like static reordering of the loads, although of course thread3 is different. Some missed-optimizations with std::atomic, like generating a full pointer in a register instead of using a displacement in the addressing mode, are irrelevant for memory reordering. The non-atomic version reuses the r2 load result for printf, and uses an immediate 0 since the if only runs if r1 == 0. The atomic version doesn't optimize away repeated loads or stores. This is all after the r2-load + cbnz and r1-load + cbnz (over or to a ret, respectively), so it seems unlikely to have an effect even with speculative out-of-order exec.
If the compiler had statically reordered the loads in thread3, that would obviously explain it, but that's not what happened. With current compilers, that could only happen with non-atomic vars; current compilers don't speculate atomic loads, and the second load won't happen if the function returns early.
With the loop in main destroying and spawning new threads every iteration, there shouldn't be anything tricky there; we have a happens-before between the seq_cst r2 = 0 assignment in main and the start of any thread code executing. The stlr vs. str there should be equivalent on real AArch64.
LoadLoad reordering between thread3's reads is possible (including with std::atomic relaxed) but unlikely to happen at runtime in practice, since you didn't do anything to put any of your variables in separate cache lines (e.g. alignas(64) or 128; see C++ atomic variable memory order problem can not reproduce LoadStore reordering example).
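For example (my sketch, not code from this answer), padding each global out to its own line makes runtime reordering much more likely to be visible:

// Hypothetical variant of the question's globals: alignas(64) puts each one
// in its own 64-byte cache line (or alignas(128), per the "or 128" above).
alignas(64) std::atomic<int> x;
alignas(64) std::atomic<int> y;
alignas(64) int r1;   // or std::atomic<int> in the atomic version
alignas(64) int r2;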
Same goes for StoreStore reordering of thread1's stores to r1 and r2. If the first doesn't commit before the second retires, they're both "graduated" (ready to commit), just waiting for exclusive ownership of the cache line. Most non-x86 ISAs are weakly-ordered and can commit stores to L1d cache out of order.
If runtime reordering was the only cause of printf running for the non-atomic version, we'd expect it to also happen with the atomic r1, r2 version using relaxed loads (and stores). Unless...
The layout of the globals in memory being different could be relevant: disabling "filter directives" on Godbolt (and adding -g0 to remove the debug-info directives), we see a mysterious extra 4 bytes of padding (.zero 4) between the atomic<int> vars, but not between them and r2, or between r2 and r1. (It's not part of sizeof(r1); that's still 4 bytes. And in the source code, I see libstdc++ doing alignas( max(alignof(T), sizeof(T)) ) T _M_i; in std::__atomic_base<T>.)
.bss
.align 3
.set .LANCHOR0,. + 0 # a reference point for adrp for nearby vars
.type y, %object
.size y, 4
y:
.zero 4
.zero 4
.type x, %object
.size x, 4
x:
.zero 4
# .zero 4 # present only in the atomic_int r2 version
.type r2, %object
.size r2, 4
r2:
.zero 4
# .zero 4 # present only in the atomic_int r1 version
.type r1, %object
.size r1, 4
r1:
.zero 4
(I don't know where that extra 4 bytes is coming from. This is G++ asm output that it will feed to the assembler; I'd expect that if it were padding for alignment, it would use .p2align instead of keeping track of size itself. Unless that's only for code, not data.)
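For what it's worth, a quick check (my addition, not part of the original answer) confirms the extra 4 bytes aren't part of the atomic object itself:

#include <atomic>
#include <cstdio>

int main() {
    // On AArch64 with libstdc++, std::atomic<int> has the same size and
    // alignment as plain int, so the .zero 4 above isn't object padding.
    static_assert(sizeof(std::atomic<int>) == sizeof(int), "same size");
    static_assert(alignof(std::atomic<int>) == alignof(int), "same alignment");
    printf("sizeof %zu, alignof %zu\n",
           sizeof(std::atomic<int>), alignof(std::atomic<int>));
}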
If the non-atomic version has r1
near the end of a cache line and r2
near the start of the next, the extra padding could push them both into the same cache line. This could make runtime reordering not happen in one version where it did in the other.
Checking nm ./a.out or GDB for symbol addresses could confirm or rule out this hypothesis that r1 and r2 end up in different 64-byte cache lines. And testing with alignas(64) on each variable could show whether that lets the std::atomic version also print the case you're expecting not to happen, due to LoadLoad reordering of r1 and r2 in t3.
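As a quick alternative to nm or GDB (my addition, assuming the 64-byte line size discussed above), the program itself can report whether r1 and r2 landed in the same line:

#include <cstdint>
#include <cstdio>

// Assumes the question's globals r1 and r2 are visible here.
static void report_cache_lines() {
    auto line = [](const void *p) {
        return reinterpret_cast<std::uintptr_t>(p) / 64;
    };
    printf("&r1=%p &r2=%p, same 64-byte line: %s\n",
           (void *)&r1, (void *)&r2,
           line(&r1) == line(&r2) ? "yes" : "no");
}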