Tags: gcc, arm, atomic, volatile, memory-barriers

arm gcc: store-store ordering without volatile?


I am trying to use a shared index to indicate that data has been written to a shared circular buffer. Is there an efficient way to do this on ARM (arm gcc 9.3.1 for cortex M4 with -O3) without using the discouraged volatile keyword?

The following C functions work fine on x86:

void Test1(int volatile* x) { *x = 5; }
void Test2(int* x) { __atomic_store_n(x, 5, __ATOMIC_RELEASE); }

Both compile efficiently and identically on x86:

0000000000000000 <Test1>:
   0:   c7 07 05 00 00 00       movl   $0x5,(%rdi)
   6:   c3                      retq   
   7:   66 0f 1f 84 00 00 00    nopw   0x0(%rax,%rax,1)
   e:   00 00 

0000000000000010 <Test2>:
  10:   c7 07 05 00 00 00       movl   $0x5,(%rdi)
  16:   c3                      retq   

However, on ARM the __atomic builtin generates a Data Memory Barrier, while volatile does not:

00000000 <Test1>:
   0:   2305            movs    r3, #5
   2:   6003            str     r3, [r0, #0]
   4:   4770            bx      lr
   6:   bf00            nop

00000000 <Test2>:
   0:   2305            movs    r3, #5
   2:   f3bf 8f5b       dmb     ish
   6:   6003            str     r3, [r0, #0]
   8:   4770            bx      lr
   a:   bf00            nop

How do I avoid the memory barrier (or similar inefficiencies) while also avoiding volatile?


Solution

  • The volatile assignment isn't a release-store, and doesn't even give you StoreStore ordering, which might be all you need here.

    volatile is basically equivalent to __ATOMIC_RELAXED ordering, except that it also prevents compile-time reordering with other volatile accesses. It does nothing to prevent run-time reordering, which CPU memory models other than x86 do allow. (As for actual atomicity: for naturally aligned types no wider than a register, you do get it in practice with compilers like GCC and Clang; the Linux kernel relies on this to roll its own atomics out of volatile plus inline asm for fences.)

    See also When to use volatile with multi threading? (answer: never); volatile doesn't give you anything you can't get with atomics for the purposes of multi-threading. Use GNU C __atomic builtins or C++20 std::atomic_ref with memory_order_relaxed instead of volatile if other parts of your program need plain non-atomic access to the variable. Or, more simply, use C11 stdatomic.h _Atomic int or C++11 std::atomic<> if you never need to point a plain int* at it.
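
    For example, a minimal sketch of the stdatomic.h version of the functions from the question (Test3 and Test4 are made-up names, assuming you're free to declare the index _Atomic): relaxed compiles to the same plain str you got from volatile, and release is what adds the dmb.

     #include <stdatomic.h>

     void Test3(_Atomic int *x) {
         atomic_store_explicit(x, 5, memory_order_relaxed);  // plain str, no barrier
     }
     void Test4(_Atomic int *x) {
         atomic_store_explicit(x, 5, memory_order_release);  // dmb ish + str on ARMv7-M / Cortex-M4
     }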


    dmb ishst is at least a StoreStore barrier, so in hand-written asm you could get release semantics wrt. earlier stores but not earlier loads. That isn't sufficient for std::memory_order_release aka __ATOMIC_RELEASE (which also requires LoadStore ordering), so there's no way to get a compiler to emit it for you. (None of the operations or fences in https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html map to that.)
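
    If StoreStore ordering really is all you need, here's a sketch of how you could spell that barrier by hand with GNU C inline asm (publish_idx is a made-up helper, and whether dmb ishst is actually cheaper than dmb ish on your particular core is a separate question):

     #include <stdatomic.h>

     void publish_idx(_Atomic int *shared_idx, int idx) {
         __asm__ volatile("dmb ishst" ::: "memory");  // order all earlier stores before the store below
         atomic_store_explicit(shared_idx, idx, memory_order_relaxed);
     }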

    So unfortunately on ARMv7 and earlier, you need a full barrier (dmb ish) for any standard C / C++ memory_order other than relaxed. ARMv8 fixed that.


    With -mcpu=cortex-a53 or other ARMv8 CPUs, stl is available as a release-store instruction even in AArch32 state (and lda as an acquire-load). So target ARMv8 to avoid an expensive dmb ish full barrier for release stores or acquire loads: https://godbolt.org/z/1hzvGMbon

    # GCC -O2 -mcpu=cortex-a53      (or -march=armv8-a)
    Test2(int*):
            movs    r3, #5
            stl     r3, [r0]       // release store
            bx      lr
    

    Single-core systems

    On your single-core Cortex M4, all "threads" will run on the same core, so run-time memory reordering isn't possible. An interrupt leading to a context-switch is equivalent to a signal handler in the C11 / C++11 memory models.

    You can use atomic_signal_fence to roll your own same-core-acquire / same-core-release for relaxed loads/stores.

      // assumes #include <stdatomic.h>, with shared_idx declared as an _Atomic int
      // writer
      buffer[idx] = xyz;
      atomic_signal_fence(memory_order_release);  // prevent compile-time reordering, no run-time cost
      atomic_store_explicit(&shared_idx, idx, memory_order_relaxed);

      // reader
      int idx = atomic_load_explicit(&shared_idx, memory_order_relaxed);
      atomic_signal_fence(memory_order_acquire);  // prevent compile-time reordering, no run-time cost
      int tmp = buffer[idx];
    

    Porting such code to multi-core by changing atomic_signal_fence to atomic_thread_fence is safe, but it's worse for performance on some ISAs, notably ARMv8, where a separate barrier instruction is expensive but a release-store operation can just use stl.
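
    For concreteness, a sketch of the two multi-core options (shared_idx as in the snippets above, assumed to be an _Atomic int):

     #include <stdatomic.h>

     extern _Atomic int shared_idx;

     void publish_fence(int idx) {
         // portable swap: thread fence instead of signal fence (may cost a separate dmb ish)
         atomic_thread_fence(memory_order_release);
         atomic_store_explicit(&shared_idx, idx, memory_order_relaxed);
     }

     void publish_release(int idx) {
         // cheaper on ARMv8: the ordering folds into the store itself (stl / stlr)
         atomic_store_explicit(&shared_idx, idx, memory_order_release);
     }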