In general, we need to use memory barriers (which prevent the compiler or CPU from reordering memory accesses) to ensure correct multi-threaded synchronization. For example:
int x = 0;
_Atomic int y = 0;

void thread1(void)
{
    x = 1;
    atomic_store_explicit(&y, 1, memory_order_release);
}

void thread2(void)
{
    int tmp_y = atomic_load_explicit(&y, memory_order_acquire);
    int tmp_x = x;
    if (tmp_y)
        assert(tmp_x); // must pass
}
But there is a situation where I am not sure whether a memory barrier is needed; it is easier to show than to describe. Consider this code:
int x = 0;
_Atomic(int *) px = NULL;

void thread1(void)
{
    x = 1;
    atomic_store_explicit(&px, &x, memory_order_release);
}

void thread2(void)
{
    int *p = atomic_load_explicit(&px, memory_order_relaxed); // Do we need memory_order_acquire here?
    if (p)
        assert(*p); // Is the assert guaranteed to pass?
}
Or another example:
int x = 0;
int y = 0;
_Atomic(int *) pv = &y;

void thread1(void)
{
    x = 1;
    atomic_store_explicit(&pv, &x, memory_order_release);
}

void thread2(void)
{
    int *p = atomic_load_explicit(&pv, memory_order_relaxed); // Do we need memory_order_acquire here?
    int val = *p;
    if (p == &x)
        assert(val); // Is the assert guaranteed to pass?
}
My doubts about the last two examples are twofold: whether the asserts are guaranteed to pass at the C language level, and whether they are guaranteed to pass at the hardware level.
I think on x86 the asserts must pass, because x86 does not reorder reads with other reads. But on ARM, I noticed that an acquire load is accompanied by a `dmb ish` instruction, and I am not sure whether using relaxed has any impact. The generated assembly is as follows:
thread1:
movw r0, :lower16:x
mov r1, #1
movt r0, :upper16:x
str r1, [r0]
movw r1, :lower16:px
movt r1, :upper16:px
dmb ish
str r0, [r1]
bx lr
@ -- End function
thread2:
movw r0, :lower16:px
movt r0, :upper16:px
ldr r0, [r0]
cmp r0, #0
beq .LBB1_3
@ do we need a dmb ish here?
ldr r0, [r0]
cmp r0, #0
bxne lr
@ assert failed
.LBB1_3:
bx lr
@ -- End function
As mentioned in the other answer, if you load the pointer with memory_order_relaxed, you have a data race, and the behavior of the program is undefined. Formally, you would be unable to prove, according to the rules of the C memory model (C23 5.1.2.5), that the store to x happens before the load of x via the pointer px.
The typical cause of failure would be reordering in thread2, either via compiler optimizations or CPU memory reordering, such that x is loaded before px. You might wonder how this is possible, since the program should not know that it's loading from x until it sees the address loaded from px, so this would seem to violate causality. But it can be done by speculation: the compiler or CPU guesses what the address might be and proceeds on that assumption, being willing to start over if the guess was wrong. So the code could be transformed, either by the compiler or the CPU, into the equivalent of the following:
int tmp = x;
// some time passes...
int *p = atomic_load_explicit(&px, memory_order_relaxed);
if (p == &x)
    assert(tmp);
else
    assert(*p);
So now it's entirely possible that x was loaded well before the store of 1 from thread1, such that tmp == 0.
Since C wants to allow for transformations like this, it makes your examples UB, so that the compiler / machine are not obligated to make them work in the "expected" way.
Now, in practice, most real-life architectures will not speculate addresses like this, and so this sort of address dependency would be sufficient to impose the desired ordering at the level of the machine. So if one could ensure the compiler doesn't transform the code (which, again, memory_order_relaxed does not ensure), then one would not need to pay the price of an acquire barrier on the load of px.
This is the rationale for memory_order_consume, and your examples are the prototypical use case. Your first example could safely be written as

int *p = atomic_load_explicit(&px, memory_order_consume);
if (p)
    assert(*p); // must pass

because the load of px carries a dependency to *p (5.1.2.5 p14), and this can be used to show that x = 1 happens before the read of *p.
On a platform like ARM64, the compiler would then be able to emit code like
ldr x1, [px] // not ldar nor even ldapr
cbz x1, return
ldr x0, [x1]
bl assert
because the ARM64 spec promises to preserve this dependency. In Section B2.3 of the Architecture Reference Manual (I'm using version K.a), see the definition of "Address dependency" and how it propagates into the definitions of Ordered-before and Completes-before.
(One example of a machine which does not promise to preserve address dependencies is DEC Alpha; see for instance "Dependent loads reordering in CPU". On such a system, memory_order_consume actually would need to emit a memory barrier.)
However, this has proved to be very hard to implement in practice, since memory_order_consume can require compilers to refrain from transformations that otherwise seem perfectly innocuous. As an example, you might think it would be safe (if not efficient) for the compiler to instead emit the following code:
ldr x1, [px]
cmp x1, &x
b.eq is_x
b is_not_x
is_x:
ldr x0, [x]
// ...
is_not_x:
// ...
But you would be wrong, because the address dependency has been replaced by a control dependency, and ARM64 is allowed to speculate a load past a control dependency. Basically, it can predict the conditional branch b.eq as taken and speculatively perform the loads on the taken path, all before the load from px has completed.
You can come up with sillier examples, like
// thread 1
array[0] = 1;
atomic_store_explicit(&atomic_i, 1, memory_order_release);

// thread 2
int i = atomic_load_explicit(&atomic_i, memory_order_consume);
if (i != 0)
    assert(array[i-i]); // must pass
which is well-defined. If the compiler emits naive code like
ldr x1, [atomic_i]
cbz x1, skip
sub x2, x1, x1
add x3, array, x2, lsl #3
ldr x0, [x3]
then all is well. But if it does the obvious optimization of i-i into 0, then it might well emit
ldr x1, [atomic_i]
cbz x1, skip
ldr x0, [array]
which is not okay, because there is now only a control dependency between the two instructions, and the machine can reorder the loads.
For this reason, AFAIK no current compiler actually implements memory_order_consume as intended; they all just treat it as an alias for memory_order_acquire. So your example above, even with memory_order_consume, will likely just compile on ARM64 to

ldar x1, [px] // or ldapr if so equipped
ldr x0, [x1]

and you'll pay the cost of an acquire barrier that you didn't truly need.
It was recently reported that, for such reasons, the C++ standard committee intends to remove memory_order_consume in C++26, so most likely the C standard will someday do so as well. Then you'll have no choice but to use memory_order_acquire.