I am writing some (toy) code to better understand concurrency/multiprocessing, etc. I wrote the following C code as a mutual exclusion algorithm[1]: it simply increments a shared int once in each of two threads, using atomic flag variables to ensure mutual exclusion:
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <threads.h>
#define mo_relaxed memory_order_relaxed
#define mo_release memory_order_release
#define mo_acquire memory_order_acquire
#define mo_acq_rel memory_order_acq_rel
#define aload atomic_load_explicit
#define astore atomic_store_explicit
typedef struct {
    int counter;
    atomic_bool f1;
    atomic_bool f2;
} Shared;
static Shared* makeShared() {
    Shared* shared = malloc(sizeof(Shared));
    if (shared == NULL) { goto shared_fail; }
    shared->counter = 0;
    atomic_init(&shared->f1, false);
    atomic_init(&shared->f2, false);
    return shared;
shared_fail:
    return NULL;
}
static void freeShared(Shared* const shared) {
    free(shared);
}
#define LIMIT 100
static int t1(Shared* const shared) {
    astore(&shared->f1, true, mo_release);       /* announce intent to enter */
    while (aload(&shared->f2, mo_acquire)) {     /* other thread's flag up? */
        astore(&shared->f1, false, mo_release);  /* back off ... */
        thrd_yield();
        astore(&shared->f1, true, mo_release);   /* ... and retry */
    }
    shared->counter++;                           /* critical section */
    astore(&shared->f1, false, mo_release);      /* unlock */
    return 0;
}
static int t2(Shared* const shared) {
    astore(&shared->f2, true, mo_release);       /* announce intent to enter */
    while (aload(&shared->f1, mo_acquire)) {     /* other thread's flag up? */
        astore(&shared->f2, false, mo_release);  /* back off ... */
        thrd_yield();
        astore(&shared->f2, true, mo_release);   /* ... and retry */
    }
    shared->counter++;                           /* critical section */
    astore(&shared->f2, false, mo_release);      /* unlock */
    return 0;
}
int main() {
    thrd_t t1_t, t2_t;
    Shared* shared = makeShared();
    if (shared == NULL) { return 1; }
    thrd_create(&t1_t, (int (*)(void*))t1, shared);
    thrd_create(&t2_t, (int (*)(void*))t2, shared);
    thrd_join(t1_t, NULL);
    thrd_join(t2_t, NULL);
    printf("%d\n", shared->counter);
    freeShared(shared);
    return 0;
}
I expected the release/acquire orderings on the stores/loads of the flag variables to ensure mutual exclusion and correct incrementing of the counter variable; however, fuzzing the code revealed that about 0.0000625% of the time, counter only gets incremented once.
An inspection of the (cleaned-up and annotated[2]) generated assembly revealed this:
t2:
    pushq   %rbp
    movq    %rdi, %rbp
    pushq   %rbx
    leaq    5(%rdi), %rbx
    subq    $8, %rsp          ; rsp -= 8 ;; stack alignment for thrd_yield?
    movb    $1, 5(%rdi)       ; shared->f2 = 1
    jmp     .attempt_lock
.yield_briefly:
    movb    $0, (%rbx)        ; shared->f2 = 0
    call    thrd_yield@PLT
    movb    $1, (%rbx)        ; shared->f2 = 1
.attempt_lock:
    movzbl  4(%rbp), %eax     ; eax = shared->f1
    testb   %al, %al
    jne     .yield_briefly    ; if eax != 0 goto yield_briefly
    addl    $1, 0(%rbp)       ; shared->counter++
    xorl    %eax, %eax
    movb    $0, 5(%rbp)       ; shared->f2 = 0
    addq    $8, %rsp
    popq    %rbx
    popq    %rbp
    ret
It seems that the counter increment (the addl $1, 0(%rbp) store) has no lock prefix or memory fence to ensure that it is propagated to other threads before the flag store movb $0, 5(%rbp). This is despite the source-level shared->counter++ statement lexically preceding the store astore(&shared->f2, false, mo_release), the latter of which should ensure that all stores that happen-before it are propagated to any thread that performs an acquire operation on the same atomic variable. At least this was my impression from reading the below snippet on cppreference.com:
Release-Acquire ordering
If an atomic store in thread A is tagged memory_order_release, an atomic load in thread B from the same variable is tagged memory_order_acquire, and the load in thread B reads a value written by the store in thread A, then the store in thread A synchronizes-with the load in thread B.
All memory writes (including non-atomic and relaxed atomic) that happened-before the atomic store from the point of view of thread A, become visible side-effects in thread B. That is, once the atomic load is completed, thread B is guaranteed to see everything thread A wrote to memory. This promise only holds if B actually returns the value that A stored, or a value from later in the release sequence.
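As I understand it, that guarantee covers the classic message-passing pattern, e.g. a toy sketch like the following (my own illustration, entirely separate from the program above), where the reader is guaranteed to see data == 42 once it observes ready == true:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>
#include <threads.h>

static int data;                  /* non-atomic payload */
static atomic_bool ready = false; /* publication flag */

static int writer(void* arg) {
    (void)arg;
    data = 42;                                                  /* plain write */
    atomic_store_explicit(&ready, true, memory_order_release);  /* publish */
    return 0;
}

static int reader(void* arg) {
    (void)arg;
    while (!atomic_load_explicit(&ready, memory_order_acquire)) /* wait for flag */
        thrd_yield();
    printf("%d\n", data);  /* release/acquire makes the plain write visible: prints 42 */
    return 0;
}

int main(void) {
    thrd_t w, r;
    thrd_create(&w, writer, NULL);
    thrd_create(&r, reader, NULL);
    thrd_join(w, NULL);
    thrd_join(r, NULL);
    return 0;
}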
My question is: how have I misunderstood/misused the C memory model here? Why does my release operation on a shared (atomic) flag not correctly propagate stores to shared (non-atomic) data across threads?
System Target Triple: x86_64-pc-linux-gnu
Compiler Version: gcc (GCC) 15.1.1 20250425
Compiler Command: gcc -O3 -Wall -Wextra -fsanitize=address
[1] I am aware that this is not an ideal mutual exclusion algorithm due to the possibility of unbounded waiting; this code is simply a way to learn about concurrency by incrementally building up to well-known algorithms.
[2] Unedited assembly: https://pastebin.com/vUbQG4UA
Let's see the start of your lock sequence:
astore(&shared->f1, true, mo_release);
while (aload(&shared->f2, mo_acquire)) {
The first line has release semantics, which ensures that writes made before it will be seen before it (there are no writes before it here, though). The second line has acquire semantics, which ensures that no following loads will be made before it. Importantly, the first line allows following loads to be made before it, and the second line allows preceding stores to be made after it. So these two operations may be reordered with each other. This can cause both threads to read the initial value of both f1 and f2 and proceed to stamp on each other while updating the non-atomic counter variable. It doesn't matter what memory ordering is used for the final unlock, because the damage happens before that.
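To make that concrete, the implementation is allowed to behave as if your entry sequence had been written like this (a hypothetical reordering of the first loop iteration only, shown purely for illustration, reusing your macros):

bool other = aload(&shared->f2, mo_acquire);  /* acquire load hoisted above the release store */
astore(&shared->f1, true, mo_release);        /* own flag becomes visible too late */
if (!other) {
    /* both threads can take this branch: each read the other's flag while it
       was still false, so both enter and race on the non-atomic counter */
    shared->counter++;
}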
What you need is for the store in the first line to be visible before the load in the second line, i.e. you need a StoreLoad barrier. The load must be a sequentially consistent operation (and so must at least both of the "true" stores, per the C/C++ rules for sequential consistency).
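A minimal sketch of how the entry protocol could look (for simplicity every flag access here is seq_cst, which covers the load and the "true" stores mentioned above; t2 is symmetric):

static int t1(Shared* const shared) {
    /* seq_cst store: on x86-64 gcc typically emits xchg (or mov + mfence),
       which is the StoreLoad barrier the plain release store lacks */
    astore(&shared->f1, true, memory_order_seq_cst);
    while (aload(&shared->f2, memory_order_seq_cst)) {
        astore(&shared->f1, false, memory_order_seq_cst);
        thrd_yield();
        astore(&shared->f1, true, memory_order_seq_cst);
    }
    shared->counter++;                                 /* critical section */
    astore(&shared->f1, false, memory_order_seq_cst);  /* unlock */
    return 0;
}

An alternative is to keep the release/acquire orderings and put atomic_thread_fence(memory_order_seq_cst) between each "true" store and the load that follows it; that fence provides the same StoreLoad ordering.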