c multithreading atomic memory-model relaxed-atomics

C11 atomics: How does a relaxed load interact with a release store on the same variable?


Context: I have been writing a multithreaded program that uses atomics extensively. I've noticed that these atomics are very slow, especially on ARM, because the compiler inserts too many fences, sometimes even inside loops. So I want to eliminate the unnecessary ones by using weaker memory orders.

I've stumbled upon this case, but I'm not sure whether it's safe to use a relaxed load. Take this simple parameter-reading example:

typedef struct {
    big_struct Data;
    _Atomic bool bDataReadDone;
} worker_thread_parameter;

static int WorkerThreadFunction(void* Parameter) {
    // Read Data
    worker_thread_parameter* pWorkerParameter = Parameter;
    big_struct Data = pWorkerParameter->Data;

    // Notify that reading Data is done
    // Use release store to ensure Data is read before this.
    atomic_store_explicit(&pWorkerParameter->bDataReadDone, true, memory_order_release);
        
    // Do something with Data
    return 0;
}

int main() {
    thrd_t aWorkerThread[8];
    for (size_t i = 0; i < 8; ++i) {
        worker_thread_parameter WorkerParameter = { /* .Data = something, */ .bDataReadDone = false };
        thrd_create(&aWorkerThread[i], WorkerThreadFunction, &WorkerParameter);

        // Wait for Data to be read
        // Use relaxed load because this thread doesn't read Data anymore,
        // so we don't need to synchronize with the flag.
        while (!atomic_load_explicit(&WorkerParameter.bDataReadDone, memory_order_relaxed));
    }
}

Or this example:

// Initialized before the threads are started
_Atomic bool bUsingData = true;
big_struct* pData = malloc(sizeof(*pData));

static int WorkerThread() {
    Use(pData);

    // Notify the cleaner thread to free Data
    // Use release store to ensure Data is used before this.
    atomic_store_explicit(&bUsingData, false, memory_order_release);
    return 0;
}

static int CleanerThread() {
    // Use relaxed load because this thread doesn't read Data anymore,
    // so we don't need to synchronize with the flag.
    while (atomic_load_explicit(&bUsingData, memory_order_relaxed));
    free(pData);
    return 0;
}

And this example:

_Atomic int X = 0;
_Atomic int Y = 0;

// Thread 1

atomic_store_explicit(&X, 99, memory_order_relaxed);
atomic_store_explicit(&Y, 1, memory_order_release);

// Thread 2

if (atomic_load_explicit(&Y, memory_order_relaxed)) {
    atomic_store_explicit(&X, 100, memory_order_relaxed);
    printf("%i", atomic_load_explicit(&X, memory_order_relaxed));
}

// Does thread 2 always print 100?

Solution

  • Your relaxed loads don't create a happens-before relationship with the release store. So under the ISO standard's memory model, in your last example the X=100 (relaxed) store could end up before the X=99 (relaxed) store in X's modification order, with the reload able to see either value depending on timing.


    I don't think that's possible on real hardware, though (i.e. thread 2 in that example could only print 100, or nothing if the if isn't taken), because stores can't commit to coherent cache until they're non-speculative. On an OoO exec CPU, that means all earlier instructions must have already retired from the ROB (Reorder Buffer).
    Loads can retire as soon as they're known to be non-faulting (on ISAs with weak memory models like ARM but not x86), but the branch to implement the if has to wait for the load result before it can execute to check the prediction. On an in-order CPU, speculative exec past a branch wouldn't happen in the first place.

    So on real hardware, I think a release store to order the stores relative to each other is sufficient for your last example.

    The X.store(100, relaxed) can of course execute locally (writing its store to the store buffer), and the reload by the same thread can read that value from the store buffer (store-forwarding) as the CPU speculates into the if body, but a store sitting in this core's store buffer is later in the modification-order than any that have already committed to cache.

    I'm not sure if cross-SMT store-forwarding can cause any interesting effects here (e.g. on POWER, or NVidia ARMv7). I don't think so; even if thread 2 is seeing X and Y values via store-forwarding from another logical core, I think this logical core still has to treat its own stores to X as newer than ones from other cores. Otherwise acquire wouldn't work.


    Fun fact: Linux's memory model is based on explicit barriers like smp_rmb() being a read memory barrier for shared memory between cores. (As opposed to full rmb() being a read-memory-barrier including for I/O accesses.)

    https://github.com/torvalds/linux/blob/master/tools/memory-model/Documentation/control-dependencies.txt documents that a control dependency (such as your if()) does guarantee LoadStore ordering when the load is the branch condition like in your case. Linux's memory model is based on things that are true across all the ISAs it cares about.

    Quoting from it: "However, stores are not speculated. This means that ordering is (usually) guaranteed for load-store control dependencies, as in the following example:" (see the linked document for the example itself).

    I think the "usually" is there because of more complex possible examples, not because some machines don't guarantee this example, but anyone planning to depend on it should read the whole document.

    So yes, I'm pretty sure the asm your C compiles to will be safe, but it's not guaranteed in the C abstract machine.
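
If you want the guarantee from ISO C itself rather than from how current hardware behaves, here is a minimal sketch of the last example. It assumes only the reader side needs to change: either an acquire load of Y, or the relaxed load followed by atomic_thread_fence(memory_order_acquire) once the flag has been seen. The function names and the int (*)(void*) signatures (chosen so they could be passed to thrd_create) are my own, not from the question, and the readers return the reloaded value instead of printing it.

```c
#include <stdatomic.h>
#include <stddef.h>

static _Atomic int X = 0;
static _Atomic int Y = 0;

// Thread 1: unchanged from the question.
static int Thread1(void* Unused) {
    (void)Unused;
    atomic_store_explicit(&X, 99, memory_order_relaxed);
    atomic_store_explicit(&Y, 1, memory_order_release);
    return 0;
}

// Thread 2, variant A: acquire load of Y.
// If it reads 1, it synchronizes with Thread 1's release store, so
// X = 99 happens-before X = 100 and 99 precedes 100 in X's modification order.
// Returns the reloaded X, or -1 if Y was still 0.
static int Thread2Acquire(void* Unused) {
    (void)Unused;
    if (atomic_load_explicit(&Y, memory_order_acquire)) {
        atomic_store_explicit(&X, 100, memory_order_relaxed);
        return atomic_load_explicit(&X, memory_order_relaxed);
    }
    return -1;
}

// Thread 2, variant B: relaxed load, then one acquire fence.
// The fence is sequenced after the load that read the release store's value,
// which gives the same happens-before edge as variant A.
static int Thread2Fence(void* Unused) {
    (void)Unused;
    if (atomic_load_explicit(&Y, memory_order_relaxed)) {
        atomic_thread_fence(memory_order_acquire);
        atomic_store_explicit(&X, 100, memory_order_relaxed);
        return atomic_load_explicit(&X, memory_order_relaxed);
    }
    return -1;
}
```

With either reader, once Y == 1 has been seen, the reload can only return 100. The fence form is the one that carries over to your spin loops: keep the load inside the loop relaxed and put a single acquire fence after the loop exits.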