Tags: c++, c++11, x86, atomic, compare-and-swap

Does compare_exchange based loop benefit from pause?


In a CAS based loop, for example the one below, is the use of pause beneficial on x86?

#include <atomic>
#include <immintrin.h>   // _mm_pause
using std::atomic;

void atomicLeftShift(atomic<int>& var, int shiftBy)
{
    while(true) {
        int oldVal = var;                  // separate pure load
        int newVal = oldVal << shiftBy;
        if(var.compare_exchange_weak(oldVal, newVal))
            break;
        else
            _mm_pause();
    }
}

Solution

  • No, I don't think so. This isn't a spin-wait. It's not waiting for another thread to store a 0 or something. It does make sense to try again right away after lock cmpxchg fails, rather than sleep ~100 cycles (on Skylake and later) or ~5 cycles (on earlier Intel CPUs).

    For lock cmpxchg to complete at all (success or fail) means that the cache line is now in Modified (or maybe just Exclusive?) state on this core, so right now is the perfect time to try again.

    Real use cases for lockless atomics are usually not extremely heavily contended, or else you should normally use a fallback to an OS-assisted sleep/wake.

    (But if there is contention, there is hardware arbitration for locked instructions; in a highly contended case I don't know if it's likely for a core to get to execute a 2nd locked instruction before losing the cache line again. But hopefully yes.)

    lock cmpxchg can't fail spuriously, so actual livelock is impossible: in an algorithm like this, out of every round of each core having a go, at least one core makes progress because its CAS succeeds. On an LL/SC architecture, compare_exchange_weak can fail spuriously, so portability to non-x86 might need to care about livelock, depending on implementation details, but I think even so it's unlikely. (And of course _mm_pause specifically is x86-only.)


    Another reason for using pause is to avoid a memory-order mis-speculation when leaving a spin-wait loop that spins read-only waiting to see the lock unlocked before trying to atomically claim it. (This is better than spinning on xchg or lock cmpxchg and having all waiting threads hammering on the cache line.)

    But that's again not an issue here because the retry loop already includes lock cmpxchg which is a full barrier as well as atomic RMW, so I think that avoids memory-order mis-speculation.
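    To illustrate the read-only spin pattern described above (not the CAS-retry loop from the question), here is a minimal test-and-test-and-set spinlock sketch. The class and function names are illustrative, not from the original; `_mm_pause` is guarded so the snippet still compiles as a no-op on non-x86:

    ```cpp
    #include <atomic>
    #include <cassert>
    #include <thread>
    #if defined(__x86_64__) || defined(_M_X64) || defined(__i386__)
    #include <immintrin.h>
    #define cpu_pause() _mm_pause()
    #else
    #define cpu_pause() ((void)0)   // no-op fallback on non-x86
    #endif

    // Test-and-test-and-set spinlock: attempt the atomic RMW, and if that
    // fails, spin read-only (with pause) until the lock looks free.
    class SpinLock {
        std::atomic<int> locked{0};
    public:
        void lock() {
            while (locked.exchange(1, std::memory_order_acquire)) {
                // Lost the race: spin on a plain load so waiting cores can
                // share the cache line instead of hammering it with RMWs.
                while (locked.load(std::memory_order_relaxed))
                    cpu_pause();
            }
        }
        void unlock() { locked.store(0, std::memory_order_release); }
    };

    // Small demo: two threads each add 100000 to a counter under the lock.
    int run_demo() {
        SpinLock lk;
        int counter = 0;
        auto work = [&] {
            for (int i = 0; i < 100000; ++i) {
                lk.lock();
                ++counter;
                lk.unlock();
            }
        };
        std::thread t1(work), t2(work);
        t1.join(); t2.join();
        return counter;
    }
    ```

    The pause in the inner loop is exactly the case it was designed for: leaving that read-only loop is where memory-order mis-speculation could otherwise happen.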


    That's especially true if you write the loop efficiently / correctly, using the load result from a failed cmpxchg when retrying and removing the pure load of var from the loop.

    This is the canonical way to construct an arbitrary atomic operation out of a CAS primitive. compare_exchange_weak updates its first arg if the compare fails, so you don't need another load inside the loop.

    #include <atomic>
    
    int atomicLeftShift(std::atomic<int>& var, int shiftBy)
    {
        int expected = var.load(std::memory_order_relaxed);
        int desired;
        do {
            desired = expected << shiftBy;
        } while( !var.compare_exchange_weak(expected, desired) );  // seq_cst
        return desired;
    }
    

    compiles with clang7.0 -O3 for x86-64 to this asm, on the Godbolt compiler explorer:

    atomicLeftShift(std::atomic<int>&, int):      
        mov     ecx, esi
        mov     eax, dword ptr [rdi]        # pure load outside the loop
    
    .LBB0_1:                              # do {
        mov     edx, eax
        shl     edx, cl                     # desired = expected << count
        lock            cmpxchg dword ptr [rdi], edx   # eax = implicit expected, updated on failure
        jne     .LBB0_1                   # } while(!CAS)
        mov     eax, edx                  # return value
        ret
    

    The only memory access in the retry-loop is lock cmpxchg, which can't suffer from memory-order mis-speculation. There's no need for pause for that reason.
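    As a quick sanity check that the canonical loop really is atomic under contention, here is a small demo (the function is repeated from above so the snippet is self-contained; the `demo` wrapper is illustrative). Left shifts by 1 compose the same way in any interleaving, so the result is deterministic:

    ```cpp
    #include <atomic>
    #include <cassert>
    #include <thread>

    // The canonical CAS-retry loop from above.
    int atomicLeftShift(std::atomic<int>& var, int shiftBy)
    {
        int expected = var.load(std::memory_order_relaxed);
        int desired;
        do {
            desired = expected << shiftBy;
        } while (!var.compare_exchange_weak(expected, desired));
        return desired;
    }

    // Two threads each double the value 10 times: 20 doublings total,
    // so starting from 1 the final value must be 1 << 20, regardless of
    // how the threads interleave.
    int demo()
    {
        std::atomic<int> var{1};
        auto work = [&] {
            for (int i = 0; i < 10; ++i)
                atomicLeftShift(var, 1);
        };
        std::thread t1(work), t2(work);
        t1.join(); t2.join();
        return var.load();
    }
    ```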


    There's no need for pause for simple backoff delay either, unless you have lots of contention and want to let one thread do multiple things in a row to the same shared variable to increase throughput. i.e. to have other threads back off in the rare case where cmpxchg fails.

    This only makes sense if it's normal for one thread to do multiple atomic operations in a row on the same variable (or one in the same cache line if you have false-sharing problems), instead of putting more operations into one CAS-retry.

    This is probably rare in real code, but common in a synthetic microbenchmark where you let multiple threads hammer away on a shared variable repeatedly with no other work in between.
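    If you really did have that kind of synthetic contention, the backoff would look something like this sketch: exponential pause counts after a failed CAS, capped so the delay stays bounded. This is an illustration of the backoff idea, not a recommendation; as argued above, plain immediate retry is usually better:

    ```cpp
    #include <atomic>
    #include <cassert>
    #if defined(__x86_64__) || defined(_M_X64) || defined(__i386__)
    #include <immintrin.h>
    #define cpu_pause() _mm_pause()
    #else
    #define cpu_pause() ((void)0)   // no-op fallback on non-x86
    #endif

    // CAS-retry loop with exponential backoff (function name is illustrative).
    // Only worthwhile under heavy contention, to let the current owner of the
    // cache line complete several operations in a row.
    int atomicLeftShiftBackoff(std::atomic<int>& var, int shiftBy)
    {
        int expected = var.load(std::memory_order_relaxed);
        int backoff = 1;
        while (!var.compare_exchange_weak(expected, expected << shiftBy)) {
            for (int i = 0; i < backoff; ++i)
                cpu_pause();              // back off before retrying
            if (backoff < 64)
                backoff *= 2;             // cap the maximum delay
        }
        return expected << shiftBy;       // on success, expected = old value
    }
    ```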