I am writing some (toy) code to better understand concurrency/multiprocessing, etc. I wrote the following C code as a mutual exclusion algorithm[1]: it simply increments a shared int once in each of two threads, using atomic flag variables to ensure mutual exclusion:
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <threads.h>
#define mo_relaxed memory_order_relaxed
#define mo_release memory_order_release
#define mo_acquire memory_order_acquire
#define mo_acq_rel memory_order_acq_rel
#define aload atomic_load_explicit
#define astore atomic_store_explicit
typedef struct {
    int counter;
    atomic_bool f1;
    atomic_bool f2;
} Shared;
static Shared* makeShared() {
    Shared* shared = malloc(sizeof(Shared));
    if (shared == NULL) { goto shared_fail; }
    shared->counter = 0;
    atomic_init(&shared->f1, false);
    atomic_init(&shared->f2, false);
    return shared;
shared_fail:
    return NULL;
}
static void freeShared(Shared* const shared) {
    free(shared);
}
#define LIMIT 100
static int t1(Shared* const shared) {
    astore(&shared->f1, true, mo_release);       /* announce intent to enter */
    while (aload(&shared->f2, mo_acquire)) {     /* other thread's flag up? */
        astore(&shared->f1, false, mo_release);  /* back off ... */
        thrd_yield();
        astore(&shared->f1, true, mo_release);   /* ... and retry */
    }
    shared->counter++;                           /* critical section */
    astore(&shared->f1, false, mo_release);      /* unlock */
    return 0;
}
static int t2(Shared* const shared) {
    astore(&shared->f2, true, mo_release);       /* announce intent to enter */
    while (aload(&shared->f1, mo_acquire)) {     /* other thread's flag up? */
        astore(&shared->f2, false, mo_release);  /* back off ... */
        thrd_yield();
        astore(&shared->f2, true, mo_release);   /* ... and retry */
    }
    shared->counter++;                           /* critical section */
    astore(&shared->f2, false, mo_release);      /* unlock */
    return 0;
}
int main() {
    thrd_t t1_t, t2_t;
    Shared* shared = makeShared();
    if (shared == NULL) { return 1; }
    thrd_create(&t1_t, (int (*)(void*))t1, shared);
    thrd_create(&t2_t, (int (*)(void*))t2, shared);
    thrd_join(t1_t, NULL);
    thrd_join(t2_t, NULL);
    printf("%d\n", shared->counter);
    freeShared(shared);
    return 0;
}
I expected the release/acquire orderings on the stores/loads of the flag variables to ensure mutual exclusion and correct incrementing of the counter variable; however, fuzzing the code revealed that about 0.0000625% of the time, counter only gets incremented once.
An inspection of the (cleaned-up and annotated[2]) generated assembly revealed this:
t2:
    pushq   %rbp
    movq    %rdi, %rbp
    pushq   %rbx
    leaq    5(%rdi), %rbx
    subq    $8, %rsp          ; rsp -= 8 ;; stack alignment for thrd_yield?
    movb    $1, 5(%rdi)       ; shared->f2 = 1
    jmp     .attempt_lock
.yield_briefly:
    movb    $0, (%rbx)        ; shared->f2 = 0
    call    thrd_yield@PLT
    movb    $1, (%rbx)        ; shared->f2 = 1
.attempt_lock:
    movzbl  4(%rbp), %eax     ; eax = shared->f1
    testb   %al, %al
    jne     .yield_briefly    ; if eax != 0 goto yield_briefly
    addl    $1, 0(%rbp)       ; shared->counter++
    xorl    %eax, %eax
    movb    $0, 5(%rbp)       ; shared->f2 = 0
    addq    $8, %rsp
    popq    %rbx
    popq    %rbp
    ret
It seems that the counter increment (the addl $1, 0(%rbp) store) has no lock prefix or memory fence to ensure that it is propagated to other threads before the flag store movb $0, 5(%rbp). This is despite the source-level shared->counter++ statement lexically preceding the store astore(&shared->f2, false, mo_release), the latter of which should ensure that all stores that happen-before it are propagated to any thread that performs an acquire operation on the same atomic variable. At least this was my impression from reading the below snippet on cppreference.com:
Release-Acquire ordering
If an atomic store in thread A is tagged memory_order_release, an atomic load in thread B from the same variable is tagged memory_order_acquire, and the load in thread B reads a value written by the store in thread A, then the store in thread A synchronizes-with the load in thread B.
All memory writes (including non-atomic and relaxed atomic) that happened-before the atomic store from the point of view of thread A, become visible side-effects in thread B. That is, once the atomic load is completed, thread B is guaranteed to see everything thread A wrote to memory. This promise only holds if B actually returns the value that A stored, or a value from later in the release sequence.
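As I understand it, that guarantee covers the classic message-passing pattern, e.g. a toy sketch like the following (my own illustration, entirely separate from the program above), where the reader is guaranteed to see data == 42 once it observes ready == true:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>
#include <threads.h>

static int data;                  /* non-atomic payload */
static atomic_bool ready = false; /* publication flag */

static int writer(void* arg) {
    (void)arg;
    data = 42;                                                  /* plain write */
    atomic_store_explicit(&ready, true, memory_order_release);  /* publish */
    return 0;
}

static int reader(void* arg) {
    (void)arg;
    while (!atomic_load_explicit(&ready, memory_order_acquire)) /* wait for flag */
        thrd_yield();
    printf("%d\n", data);  /* release/acquire makes the plain write visible: prints 42 */
    return 0;
}

int main(void) {
    thrd_t w, r;
    thrd_create(&w, writer, NULL);
    thrd_create(&r, reader, NULL);
    thrd_join(w, NULL);
    thrd_join(r, NULL);
    return 0;
}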
My question is: how have I misunderstood/misused the C memory model here? Why does my release operation on a shared (atomic) flag not correctly propagate stores to shared (non-atomic) data across threads?
System Target Triple: x86_64-pc-linux-gnu
Compiler Version: gcc (GCC) 15.1.1 20250425
Compiler Command: gcc -O3 -Wall -Wextra -fsanitize=address
[1] I am aware that this is not an ideal mutual exclusion algorithm due to the possibility of unbounded waiting; this code is simply a way to learn about concurrency by incrementally building up to well-known algorithms.
[2] Unedited assembly: https://pastebin.com/vUbQG4UA
Let's see the start of your lock sequence:
astore(&shared->f1, true, mo_release);
while (aload(&shared->f2, mo_acquire)) {
The first line has release semantics, which ensures that writes made before it will be seen before it (there are no writes before it here, though). The second line has acquire semantics, which ensures that no following loads will be made before it. Importantly, the first line allows following loads to be made before it, and the second line allows preceding stores to be made after it. So these two operations may be reordered with each other. This can cause both threads to read the initial value of both f1 and f2 and proceed to stamp on each other while updating the non-atomic counter variable. It doesn't matter what memory ordering is used for the final unlock, because the damage happens before that.
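To make that concrete, the implementation is allowed to behave as if your entry sequence had been written like this (a hypothetical reordering of the first loop iteration only, shown purely for illustration, reusing your macros):

bool other = aload(&shared->f2, mo_acquire);  /* acquire load hoisted above the release store */
astore(&shared->f1, true, mo_release);        /* own flag becomes visible too late */
if (!other) {
    /* both threads can take this branch: each read the other's flag while it
       was still false, so both enter and race on the non-atomic counter */
    shared->counter++;
}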
What you need is for the store in the first line to be visible before the load in the second line, i.e. you need a StoreLoad barrier. The load must be a sequentially consistent operation (and so must at least both of the "true" stores, per the C/C++ rules for sequential consistency).
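A minimal sketch of how the entry protocol could look (for simplicity every flag access here is seq_cst, which covers the load and the "true" stores mentioned above; t2 is symmetric):

static int t1(Shared* const shared) {
    /* seq_cst store: on x86-64 gcc typically emits xchg (or mov + mfence),
       which is the StoreLoad barrier the plain release store lacks */
    astore(&shared->f1, true, memory_order_seq_cst);
    while (aload(&shared->f2, memory_order_seq_cst)) {
        astore(&shared->f1, false, memory_order_seq_cst);
        thrd_yield();
        astore(&shared->f1, true, memory_order_seq_cst);
    }
    shared->counter++;                                 /* critical section */
    astore(&shared->f1, false, memory_order_seq_cst);  /* unlock */
    return 0;
}

An alternative is to keep the release/acquire orderings and put atomic_thread_fence(memory_order_seq_cst) between each "true" store and the load that follows it; that fence provides the same StoreLoad ordering.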