Tags: c, assembly, arm, atomic, memory-barriers

Do we need a memory acquire barrier for one-shot spinlocks?


For locks (spinlocks, mutexes), we usually need acquire semantics when taking the lock, so that loads and stores in the critical section cannot be reordered before the lock is acquired.
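For reference, a conventional spinlock pairs an acquire CAS in lock() with a release store in unlock() (a minimal sketch; the names are mine, not from the question):

```c
#include <stdatomic.h>

// Conventional spinlock: acquire on lock, release on unlock, so the
// critical section cannot leak outside the lock/unlock pair.
typedef struct { atomic_int locked; } spinlock_t;

static void spin_lock(spinlock_t *l)
{
    int expected = 0;
    while (!atomic_compare_exchange_weak_explicit(
            &l->locked, &expected, 1,
            memory_order_acquire,    // sync-with the previous unlock
            memory_order_relaxed))
        expected = 0;                // a failed CAS rewrites expected
}

static void spin_unlock(spinlock_t *l)
{
    // Release: publishes all stores done inside the critical section.
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}
```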

But is this necessary for one-shot spin locks? For example:

#include <assert.h>
#include <stdatomic.h>

int val = 0;
atomic_int lock = 0;

void thread0(void)
{
    int tmp = 0;
    if (atomic_compare_exchange_strong_explicit(&lock, &tmp, 1, memory_order_relaxed, memory_order_relaxed)) { // do we need memory_order_acquire here ?
        assert(!val); // will it always succeed?
        val = 1;
    }
}

// same as thread0
void thread1(void)
{
    int tmp = 0;
    if (atomic_compare_exchange_strong_explicit(&lock, &tmp, 1, memory_order_relaxed, memory_order_relaxed)) {
        assert(!val);
        val = 1;
    }
}

More specifically, is the following code correct on the armv7-a architecture (it may differ slightly from the C code above)?

val:
    .long 0
lock:
    .long 0

core0:
    ldr r0, =val   // literal-pool load; mov #imm can't encode an arbitrary address
    ldr r1, =lock
    mov r4, #1
2:
    ldrex r2, [r1]
    cmp r2, #0
    beq 1f
    bx  lr  // ret
1:
    strex r3, r4, [r1]
    cmp r3, #0
    bne 2b

    // without acquire fence
    ldr r5, [r0] // is r5 != 0 allowed?



core1:
    ldr r0, =val
    ldr r1, =lock
    mov r4, #1
2:
    ldrex r2, [r1]
    cmp r2, #0
    beq 1f
    bx  lr  // ret
1:
    strex r3, r4, [r1]
    cmp r3, #0
    bne 2b

    dmb ish  // acquire fence
    str r4, [r0]  // store 1

A more specific example (do work and do the clean must not race):

#include <assert.h>
#include <stdatomic.h>

#define EXIT_FLAG 1
#define WORK_FLAG 2

atomic_int state = 0;

void thread0(void)
{
    int tmp;
    while (1) {
        tmp = 0;
        if (!atomic_compare_exchange_strong_explicit(&state, &tmp, WORK_FLAG, memory_order_relaxed, memory_order_relaxed)) { // do we need acquire here?
            assert(tmp == EXIT_FLAG);
            return;
        }

        // do work

        tmp = WORK_FLAG;
        if (!atomic_compare_exchange_strong_explicit(&state, &tmp, 0, memory_order_release, memory_order_relaxed)) {
            assert(tmp == (EXIT_FLAG | WORK_FLAG));
            // do the clean
            return;
        }
    }
}

void thread1(void)
{
    int tmp = 0;

    while (1) {
        if (atomic_compare_exchange_strong_explicit(&state, &tmp, tmp | EXIT_FLAG, memory_order_acquire, memory_order_relaxed)) // we need acquire here to fit with release in thread0
            break;
    }

    if (!(tmp & WORK_FLAG)) {
        // do the clean
    }
}

Solution

  • This section is about the first code block, where both threads do
    if (lock.CAS_strong(0, 1, relaxed)){ assert(val==0); val=1; }

    The assert (in the first code block) always succeeds: only one thread or the other ever runs the if body, and the assert is sequenced before the val = 1 in that same thread.

    The atomic RMW decides which wins the race to be that thread, but it doesn't need to sync with any previous writer to prevent overlap of their "critical sections". (https://preshing.com/20120913/acquire-and-release-semantics/)

    You don't have a critical section. The load of val could have happened before the CAS, but that would still be fine because you'd still load the initial value then. And there's no release store so nothing lets other threads know that your val update is complete.
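    To see why relaxed is enough to pick a single winner: all RMWs on one atomic object form a single modification order, so exactly one CAS can observe the 0 -> 1 transition. A runnable sketch of that claim (the run_race harness and the pthread scaffolding are mine, not from the question):

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static atomic_int lock_var;   // 0 = free, 1 = taken
static atomic_int winners;    // how many threads won the CAS

// Each racer attempts the same relaxed CAS the question uses.
static void *racer(void *arg)
{
    (void)arg;
    int expected = 0;
    if (atomic_compare_exchange_strong_explicit(
            &lock_var, &expected, 1,
            memory_order_relaxed, memory_order_relaxed))
        atomic_fetch_add_explicit(&winners, 1, memory_order_relaxed);
    return NULL;
}

// Returns the number of winning CASes; the single modification order of
// lock_var guarantees this is exactly 1, however the threads interleave.
int run_race(void)
{
    atomic_store(&lock_var, 0);
    atomic_store(&winners, 0);
    pthread_t a, b;
    pthread_create(&a, NULL, racer, NULL);
    pthread_create(&b, NULL, racer, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return atomic_load(&winners);
}
```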

    I wouldn't call it a one-shot spinlock; that name has potentially misleading implications, like that other threads will sync-with the lock variable and can see when you're done. (BTW, lock needs to be atomic_int, aka _Atomic int, not plain int.)

    This is a bit like a guard variable for a non-constant initializer of a static local variable, which is one-shot for the whole program. Unfortunately, that case actually does still need acquire, unless you can separate the case where one thread very recently finished the init from cases where we've already synced-with it. That might take a third value for the guard variable, plus something like a global all-threads membarrier() system call that runs a barrier on all cores / threads.
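    The guard-variable pattern being alluded to looks something like this (a hypothetical sketch; the names and the 0/1/2 encoding are mine). Note that, unlike the question's code, readers here DO need acquire, to sync-with the initializer's release store and see fully initialized data:

```c
#include <stdatomic.h>

static atomic_int guard;     // 0 = uninitialized, 1 = in progress, 2 = done
static int expensive_value;  // the lazily initialized data

int get_value(void)
{
    // Fast path: acquire load syncs-with the initializer's release store.
    if (atomic_load_explicit(&guard, memory_order_acquire) == 2)
        return expensive_value;

    int expected = 0;
    if (atomic_compare_exchange_strong_explicit(
            &guard, &expected, 1,
            memory_order_acquire, memory_order_acquire)) {
        expensive_value = 42;    // stand-in for the real initializer
        atomic_store_explicit(&guard, 2, memory_order_release);
    } else {
        // Another thread is initializing: wait until it publishes.
        while (atomic_load_explicit(&guard, memory_order_acquire) != 2)
            ;
    }
    return expensive_value;
}
```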


    Second code block

    The new code has one thread running an infinite loop around try_lock() / do work / unlock. The other thread spins on a CAS(acquire) to set another bit that will make the first thread stop, and to sync-with its release store.

    But there are no shared variables other than atomic_int state;, so there's no difference between relaxed and release/acquire. Operations on a single object don't reorder with each other within the same thread even for relaxed. seq_cst would forbid store-forwarding where one thread sees its own stores before they're globally visible (to all threads).
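    That single-object guarantee can be shown in isolation (a single-threaded sketch, not from the question): even with relaxed everywhere, operations on one atomic object take effect in program order within a thread.

```c
#include <stdatomic.h>

#define EXIT_FLAG 1
#define WORK_FLAG 2

// Per-object coherence: these three relaxed operations all target the
// SAME atomic object, so within this thread they cannot reorder.
int coherence_demo(void)
{
    atomic_int state = 0;
    atomic_store_explicit(&state, WORK_FLAG, memory_order_relaxed);
    atomic_fetch_or_explicit(&state, EXIT_FLAG, memory_order_relaxed);
    return atomic_load_explicit(&state, memory_order_relaxed);
}
```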

    The first loop will exit (via one if or the other) on its first failed CAS_strong, i.e. as soon as the second thread succeeds at its CAS to set the EXIT_FLAG bit.

    This isn't a locking algorithm: the second thread can succeed at setting EXIT_FLAG while the first thread is inside its "critical section", i.e. while WORK_FLAG is set in state. No strengthening of the memory_order parameter to any of the operations can change this.

    Without acquire in the first thread's CAS which "takes the lock" (setting WORK_FLAG), later operations can become visible to other threads before state changes. That's a problem for a normal lock, but this is far enough from being a lock that it's not obvious what exactly do work and do the clean are supposed to be.

    Only one thread will run do the clean; either the first or second thread depending on when the second thread succeeds at a CAS.