[SOLVED] Can ARM exclusive load-store implementing lock-free atomics?

Can ARM exclusive load-store implementing lock-free atomics?

Eventhough exclusions are cleared on trap boundaries, they appear atomic to application codes in the functional regard.

But I'm not certain about the progress guarantee that'd otherwise be made by true lock-free atomic operations.

I think this could be relevant if compatibility with ARMv8.0 where FEAT_LSE (Large System Extension) is "optional" - this extension includes CAS and a few atomic instructions among others.

Solution

Atomic operations implemented with LDXR/STXR (and their variants) are lock-free, in the sense that if multiple cores execute such operations concurrently, at least one will succeed in finite time. In particular, this suffices for implementing lock-free operations as defined by ISO C++, which indeed only ask that one of the operations should succeed in finite time.

To see why, let's review ARM's conceptual model for exclusive load/store; see Section B2.12, "Synchronization and semaphores", in the A-Profile Architecture Reference Manual, version L.a. Each core ("PE" in their jargon) has a local monitor and a global monitor. Both monitors are set by LDXR, and an STXR succeeds if and only if both monitors are still set (on the same address).

(Technically the global monitor is not defined as a per-PE object; it could be shared by the entire system. But it is capable of tracking one address per PE, which amounts to the same thing.)

The local monitor of a given PE is cleared only by actions taken locally by that PE itself: mainly, taking an exception, or doing something specifically forbidden (unrelated memory accesses, executing system instructions, etc). We'll assume those don't happen, and ignore the local monitor.

The global monitor of a given PE is cleared if another PE stores to the monitored address. Crucially, a load-exclusive by another PE does not clear this PE's global monitor (see B2.12.2.1).

So if PEs A and B are both trying an LDXR/work/STXR sequence on the same address:

One of them (let's say A) does LDXR and sets its monitors.
The other one, core B, does LDXR and sets its monitors. The global monitor of A is unaffected and remains set.
One of them (let's say B) completes its work and attempts STXR. Its global monitor is still set, as nothing has happened to clear it, so STXR succeeds and stores to the target address. B has now made progress.
The store by B clears the global monitor of A, so when it completes its work, its STXR will fail. So A does not make progress.

So one of the two PEs made progress (namely B), and this is sufficient for lock-free. In essence, the only way for any given PE to fail to make progress is if another one succeeded in doing so.

There is also a very explicit Note in B2.12.5:

In the event of repeatedly-contending LoadExcl/StoreExcl instruction sequences from multiple PEs, an implementation must ensure that forward progress is made by at least one PE.

Notes are not normative language for the specification, but this does seem to be an accurate summary of the formal requirements of the section.

Such operations are not wait-free, in the sense that it is not guaranteed that all PEs complete their operations in bounded time. It is possible in principle that, if when two or more cores are repeatedly executing LDXR/STXR sequences, that core A always succeeds and core B always fails. Presumably implementers would be encouraged to have fairness heuristics to minimize the chances of this happening in typical usage patterns, as a quality-of-implementation issue.