I'm trying to better understand atomic instructions on the ARM64 architecture.
So I'm testing this simple C code, using a Microsoft intrinsic (compiled with VS C++ 2022):
long v = 0;
_interlockedbittestandset_acq(&v, 0);
This translates to the following assembly code:
str wzr, [sp] ; store 0 in the variable
mov x10, sp
lbl:
ldaxrb w9, [x10] ; load byte from [x10] into w9, set exclusive access
orr w8, w9, #1 ; w8 = w9 | 1
stxrb wip1, w8, [x10] ; store byte from w8 in [x10], only if exclusive access
cbnz wip1, lbl ; jump back to 'lbl' if wip1 != 0 (or, if we didn't have "exclusive access")
dmb ish ; "acquire" memory barrier, i.e. the _acq part of the intrinsic
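(For comparison, here is portable C11 code that I'd expect to compile to essentially the same loop; this is a sketch of my own, not the intrinsic itself. Note that MSVC's long is 32-bit, while on Linux AArch64 long is 64-bit, which only changes the operand size.)

#include <stdatomic.h>

atomic_long v;

int bts_acquire(void)
{
    /* fetch-or with acquire ordering; compilers typically lower this to
       the same ldaxr/stxr retry loop. Returns the previous value of
       bit 0, like the intrinsic does. */
    long old = atomic_fetch_or_explicit(&v, 1L, memory_order_acquire);
    return (old & 1L) != 0;
}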
I have several questions here, if you don't mind:
If we follow the documentation for the ldaxrb instruction, it loads a byte from the address in x10 into w9 (zero-extended to 32 bits) and marks that address for "exclusive access". Then stxrb stores the byte from w8 to the address in x10, but only if we still have "exclusive access" to it; if the store succeeds, it sets wip1 to 0, otherwise to 1.
So my question is where is that "exclusive access" information kept?
And also what can change the "exclusive access" that I mentioned above?
Finally, what is the purpose of the dmb ish instruction after the loop?
As an interesting aside, if I break in with a debugger and try to step through that loop above, the cbnz instruction always takes the jump back to the lbl address. In other words, stepping through the loop with a debugger resets the "exclusive access". Otherwise, how else would you explain such behavior?
The behavior of the load-exclusive and store-exclusive instructions is explained at length in the ARM A-profile Architecture Reference Manual (I'm reading version DDI0487-K.a), Section B2.17.
In brief, the load-exclusive instruction (I'll just use ldxr as the example) sets a "monitor" on the address, or rather, on the "exclusive reservation granule" that contains it. (Normally, "exclusive reservation granule" is just a fancy term for a single cache line, which is commonly 64 bytes.) Then, the subsequent store-exclusive instruction (stxr) succeeds only if the monitor is still set. To your question #2, the manual explains all the various events that can clear the monitor and thus cause the stxr to fail.
Actually, there are two monitors: a "local monitor" that is cleared only by various events happening on the same core that did the ldxr, and a "global monitor" that can be cleared by events on other cores. Clearing either one causes the stxr to fail.
To your question #1, the monitor exists as some hardware internal to the CPU; the architecture spec doesn't specify how it is to be implemented, it isn't something you can access architecturally, and it doesn't appear in the CPU's memory space. Presumably the CPU has some internal state to track it. But there's normally only one monitor per core, so this doesn't require a large amount of resources: you can only monitor one address at a time. That means it's not possible to nest ldxr/stxr pairs, and you can't use them to perform atomic transactions involving multiple addresses. That's not what they're for.
The monitor-clearing event that most people care about is a store to that address (or to another address within the same granule) from another core; this clears the global monitor. This behavior is what ensures that the ldxr/stxr pair acts as an atomic read-modify-write operation on the address.
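This retry structure is exactly what C11's weak compare-exchange exposes: atomic_compare_exchange_weak is allowed to fail spuriously precisely because it maps to a stxr that may fail. A minimal sketch of a hand-rolled fetch-or built that way (the function name is mine):

#include <stdatomic.h>

/* Generic read-modify-write structured the way ldxr/stxr works:
   load, compute, attempt the store, retry if the monitor was cleared. */
long fetch_or_by_hand(atomic_long *p, long bits)
{
    long old = atomic_load_explicit(p, memory_order_relaxed);
    /* compare_exchange_weak may fail spuriously -- the stxr failure
       mode -- which is why it must sit in a loop; on failure, 'old'
       is refreshed with the value currently in memory. */
    while (!atomic_compare_exchange_weak_explicit(
               p, &old, old | bits,
               memory_order_acquire, memory_order_relaxed))
        ;
    return old;
}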
However, the local monitor is cleared by, among other things, an eret return-from-exception instruction. This effectively ensures that if an interrupt or exception occurs in between the ldxr and the stxr, then the stxr will fail and the section will need to be retried. As a result, the ldxr/stxr section, in addition to being atomic with respect to accesses by other cores, is also atomic with respect to accesses on the same core. So, for instance, you can safely use ldxr/stxr on memory that might be accessed asynchronously by an interrupt handler running on your own core.
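A user-space analogue, with a Unix signal standing in for the interrupt (a sketch; the names and the use of SIGINT are mine): both the handler and the code it interrupts do read-modify-writes on the same word, and the monitor-clearing eret on the way back from the kernel is what keeps the interrupted one atomic.

#include <signal.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_long flags;

static void on_sigint(int sig)
{
    (void)sig;
    /* A lock-free atomic RMW is async-signal-safe. If this signal landed
       between the main loop's ldxr and stxr, the eret back into main
       clears the local monitor and the interrupted stxr just retries. */
    atomic_fetch_or_explicit(&flags, 1L, memory_order_relaxed);
}

int main(void)
{
    signal(SIGINT, on_sigint);
    /* Hammer bit 1 with RMWs until the handler sets bit 0. */
    while (!(atomic_load_explicit(&flags, memory_order_relaxed) & 1L))
        atomic_fetch_xor_explicit(&flags, 2L, memory_order_relaxed);
    puts("bit 0 set by the handler");
    return 0;
}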
(I believe that on 32-bit ARM, returning from an exception doesn't automatically clear the monitor, and so to be safe, the interrupt or exception handler needs to execute a CLREX instruction, whose sole purpose is to clear the monitor.)
To your "aside" question: in particular, since stepping in a debugger is implemented by having the CPU trap to the kernel's debug exception handler after every instruction, either using the CPU's single-step mode (see D2.11) or by temporarily replacing the following instruction with a brk
instruction. The kernel would then normally suspend the process being debugged, and do a context switch (on this core) to some other task that needs to run, as well as waking up the debugger process (either on this core or another one) so that it can show you what happened. Even if all this stuff doesn't clear the local monitor of its own accord (e.g. by the kernel or the other tasks doing their own ldxr
), when we eventually return to the debugged process to step another instruction, this is done by an eret
that will certainly clear it.
So indeed, it's not possible to step your debugger through an ldxr/stxr section and have it succeed. Instead, set a breakpoint following the stxr and let the process run freely until it's hit.
To your question #3: the dmb ish instruction is a memory barrier, ensuring that all memory accesses that follow your stxrb become visible to other cores after the stxrb store itself and cannot be reordered with it (which otherwise would be allowed). As was mentioned in the comments, this is not actually needed in order to have acquire semantics under the C++ memory model; it's sufficient that the exclusive load has acquire semantics (i.e. it's ldaxrb instead of plain ldxrb). (Actually, the weaker "Acquire-RCpc" ordering would suffice for C++ acquire, but it doesn't appear that ARM64 provides an exclusive load with this semantic.) In fact, dmb ish wouldn't even be needed if you'd requested sequentially consistent ordering for this operation; then we could just use stlxrb, which despite its name is just enough stronger than normal release ordering for C++ seq_cst (it can't be reordered with a later ldar). AFAIK, the only way dmb ish would actually be needed is if you'd requested a std::atomic_thread_fence(std::memory_order_seq_cst).
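If you want to see the compiler emit a standalone dmb ish, the seq_cst fence is the canonical case. A minimal sketch (the function name is mine); on AArch64, GCC and Clang compile the fence below to dmb ish:

#include <stdatomic.h>

atomic_int x, y;

int store_then_load(void)
{
    atomic_store_explicit(&x, 1, memory_order_relaxed);
    /* The fence keeps the store to x and the load of y from being
       reordered with each other; this is what lowers to dmb ish. */
    atomic_thread_fence(memory_order_seq_cst);
    return atomic_load_explicit(&y, memory_order_relaxed);
}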