Tags: java, concurrency, cpu-architecture, atomic, compare-and-swap

What exactly makes a compare-and-swap (CAS) loop a better choice in a highly concurrent environment?


Assume we have just one CPU core, and for the sake of example take the following CAS loop (in Java; taken from here), although the question is about CAS loops in general, not this code in particular:

private AtomicInteger count = new AtomicInteger(0);

public void increment() {
  int current, next;
  do {
    current = count.get();   // snapshot the current value
    next = current + 1;      // compute the desired new value
  } while (!count.compareAndSet(current, next));   // retry if the snapshot went stale
}

Theoretically, nothing prevents some particular thread from getting stuck in this loop.

For example, suppose a context switch occurs right after next = current + 1;, and some other thread changes the value of the atomic count. Once the unlucky thread resumes, the expression in the while statement evaluates to true (the CAS fails), so the loop starts again. Then another context switch occurs after next = current + 1;, and things keep repeating like that on and on.

So, would it be correct to say that a CAS loop is safe to use only because it is a practically zero-risk bet that the CPU won't context-switch that often? By safety here I mean that no thread gets stuck in the loop forever; think of a mission-critical app (whatever that means), just to emphasize zero tolerance for unhappy scenarios. If so, how do we know the CPU works like that? How does one cultivate an intuition for roughly how many operations a CPU usually executes before it performs a context switch? That is, in fact, the main question I would like answered in this scenario.

However, if it is not because of that low-risk bet, then what makes us believe the loop will eventually terminate, and when? For example, what would be an upper bound on the number of retries of the loop, or are there some underlying (OS-level) guarantees about that?


Solution

  • Most modern systems are SMP (multi-core, and sometimes even multi-socket), so other threads can be running simultaneously with this one. That means CAS can fail even without a context switch on the core running your thread. Actual context switches happen extremely infrequently compared to how long a CAS retry loop takes, so that's pretty much a non-problem.

    See Is incrementing an int effectively atomic in specific cases? for more about how real CPUs handle atomic RMWs, especially on x86.

    what makes us believe the loop will eventually terminate and when?

    A CAS retry loop is lock-free: at least one thread makes progress every time they all attempt an iteration, because a CAS can only fail when some other thread's CAS succeeded in changing the value, so every failed retry implies system-wide progress. (Of course the threads don't actually run in lock-step, and a CAS attempt can only happen while the core holds exclusive ownership of the cache line... See Anything in std::atomic is wait-free? for another take on that.)
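    (As a concrete illustration: the JDK itself uses exactly this pattern. The following is a close paraphrase of the Java 8 implementation of AtomicInteger.updateAndGet, a generalized version of the question's increment loop; later JDKs restructure it but keep the same retry idea.)

        // Inside AtomicInteger; IntUnaryOperator is java.util.function.IntUnaryOperator.
        public final int updateAndGet(IntUnaryOperator updateFunction) {
            int prev, next;
            do {
                prev = get();                              // snapshot the current value
                next = updateFunction.applyAsInt(prev);    // compute the replacement
            } while (!compareAndSet(prev, next));          // retry if someone else's CAS won
            return next;
        }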

    Java and other languages don't provide fairness guarantees for their lock-free atomics, but in practice most hardware does try to avoid starving any core of access to a cache line it's waiting for. A given core's CAS attempt could still fail, but you'd have to be infinitely unlucky for it to fail an infinite number of times in a row, with other threads winning the race to do the CAS every time.

    But that assumes all threads are doing similar-speed computations between the load and the CAS attempt; if many other threads are doing non-stop increments while you're trying to atomically replace x with slow_function(x), you might never succeed (sketched below).
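    As a hypothetical sketch of that losing race (slowFunction and the method names here are made up for illustration): the long computation widens the window in which the snapshot can go stale, so fast incrementers can win every time.

        // Hypothetical: a slow transform racing against fast increments.
        // If other threads keep doing count.getAndIncrement(), the snapshot
        // taken here is likely stale by the time the CAS runs, so this loop
        // may retry indefinitely.
        public void applySlowFunction(AtomicInteger count) {
            int current, next;
            do {
                current = count.get();
                next = slowFunction(current);   // long computation => wide race window
            } while (!count.compareAndSet(current, next));
        }

        private int slowFunction(int x) {
            return x * 31 + 7;   // stand-in for any expensive pure function of x
        }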

    If there's so much contention that your CAS retry loops often retry more than once, that's not good; lock-free works best when contention is low enough that CAS retries aren't common. So, for example, avoid having every thread contend to increment a single shared counter in a tight loop if you can. Break the work up into regions handled by smaller pools of threads, or have each thread claim a chunk of 16 items at a time by doing an atomic += 16 instead of += 1 (sketched below).
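    Here is a minimal sketch of that chunk-claiming idea (class and method names are mine, not from the original): one atomic += 16 claims a whole block of indices, and the per-item work then happens with no contention at all.

        import java.util.concurrent.atomic.AtomicInteger;

        class ChunkedWork {
            static final int CHUNK = 16;
            final AtomicInteger nextIndex = new AtomicInteger(0);

            void worker(int totalItems) {
                int start;
                // One atomic += 16 claims a whole block, cutting contention
                // on nextIndex by roughly 16x vs. claiming items one at a time.
                while ((start = nextIndex.getAndAdd(CHUNK)) < totalItems) {
                    int end = Math.min(start + CHUNK, totalItems);
                    for (int i = start; i < end; i++) {
                        process(i);   // contention-free work on this thread's block
                    }
                }
            }

            void process(int i) { /* per-item work */ }
        }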

    But even with high contention, a loop like this tends to degrade fairly gracefully, not completely fall on its face as contention increases. Things would have to get very extreme before you'd start seeing hundreds of CAS failures in a row. Once a core gets ownership of a cache line, it only takes a few nanoseconds (tens of clock cycles) to make a CAS attempt, and latency between cores to move cache lines around is like 50 to 100 nanoseconds on typical CPUs. (The more cores, the more hops in the interconnect.)
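    If you want to measure that yourself, one rough way (my own instrumentation, not from the original answer) is to count failed CAS attempts alongside successful increments:

        import java.util.concurrent.atomic.AtomicInteger;
        import java.util.concurrent.atomic.LongAdder;

        class RetryCounter {
            final AtomicInteger count = new AtomicInteger(0);
            final LongAdder failedCASes = new LongAdder();   // low-contention stats counter

            public void increment() {
                int current, next;
                while (true) {
                    current = count.get();
                    next = current + 1;
                    if (count.compareAndSet(current, next))
                        return;                    // success: this attempt won the race
                    failedCASes.increment();       // one failed CAS = one retry
                }
            }
        }

    Hammering increment() from many threads and comparing failedCASes.sum() to count.get() shows how rare retries actually are at a given level of contention.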


    On LL/SC machines, CAS itself and other atomic RMWs like getAndIncrement require a retry loop to avoid spurious failures (which is why C++11 compare_exchange_weak exists: a version that's allowed to fail spuriously and thus can be used in retry loops). Livelock with no threads making progress is possible in theory; avoiding that is, I think, up to CPU architects having cores hang on to cache-line ownership a bit longer, perhaps adaptively after noticing that they've failed repeatedly.
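    Java exposes the spurious-failure flavour too: AtomicInteger.weakCompareAndSet is specified to be allowed to fail even when the value matches, precisely so it can map to a single LL/SC pair, which makes it correct only inside a retry loop. A sketch of the question's loop using it:

        private AtomicInteger count = new AtomicInteger(0);

        // weakCompareAndSet may fail spuriously (like C++ compare_exchange_weak),
        // e.g. if the underlying LL/SC pair is interrupted, so it only makes
        // sense inside a retry loop. (Java 9 deprecated it in favour of
        // weakCompareAndSetPlain; note the weak/plain variants also have weaker
        // memory-ordering guarantees than compareAndSet.)
        public void increment() {
            int current, next;
            do {
                current = count.get();
                next = current + 1;
            } while (!count.weakCompareAndSet(current, next));
        }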

    Or better, architects can provide single-instruction atomic RMWs as an alternative to LL/SC, as ARM did with ARMv8.1, and preferably a rich set of atomic RMWs to directly support methods like getAndIncrement without a CAS retry loop. (CAS retry loops are still often needed, e.g. when some data in an object you're publishing needs to be stored before you CAS a reference to it, or to implement a hypothetical getAndRightShift or whatever.) See https://www.anandtech.com/show/15578/cloud-clash-amazon-graviton2-arm-against-intel-and-amd/2 for benchmarks of average core-to-core round-trip latency using LDREX/STREX vs. ARMv8.1 single-instruction CAS on a 64-core system.
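    The publish-then-CAS case is the classic lock-free stack push; here is a minimal sketch of it with AtomicReference (my example, not from the linked benchmarks):

        import java.util.concurrent.atomic.AtomicReference;

        class LockFreeStack<T> {
            private static final class Node<T> {
                final T value;
                Node<T> next;
                Node(T value) { this.value = value; }
            }

            private final AtomicReference<Node<T>> head = new AtomicReference<>();

            public void push(T value) {
                Node<T> node = new Node<>(value);
                Node<T> oldHead;
                do {
                    oldHead = head.get();
                    node.next = oldHead;   // data must be in place *before* publishing
                } while (!head.compareAndSet(oldHead, node));   // ...then CAS makes it visible
            }
        }

    The node's fields are fully written before compareAndSet publishes it, which is exactly the store-then-publish ordering a single getAndAdd-style RMW instruction can't express.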

    For example, what would be the upper bound of the number of retries of the loop or are there some underlying (os-level) guarantees regarding that?

    There is none: neither the hardware, the JVM, nor the OS guarantees any upper bound on the number of retries.