Tags: c++, performance, c++20, atomics, spinlock

Which spinlock method is more efficient: retry test_and_set(), or spin read-only on test()?


Which spinlock method is better (in terms of efficiency)?

#include <atomic>

#define METHOD 1


int main( )
{
    std::atomic_flag lock { };   // since C++20, default-constructed in the clear state

#if METHOD == 1
    // "Test and test-and-set": spin read-only on test() and only retry
    // the atomic RMW test_and_set() once the lock looks free.
    while ( lock.test_and_set( std::memory_order_acquire ) )
    {
        while ( lock.test( std::memory_order_relaxed ) );
    }
#else
    // Plain test-and-set: retry the atomic RMW on every iteration.
    while ( lock.test_and_set( std::memory_order_acquire ) );
#endif

    lock.clear( std::memory_order_release );
}

This example comes from cppreference. What happens when we add/remove the call to test(std::memory_order_relaxed) inside the outer loop?

I see a noticeable difference in generated code between the two methods (here).


Solution

  • Generally the version that spins read-only on .test() is best, instead of stealing ownership of the cache line from the thread that's trying to unlock it. Especially if the spinlock shares a cache line with any other data, such as data the lock owner might only be reading, spinning on test_and_set() creates even worse false sharing.

    Also, if multiple threads are spin-waiting on the same spinlock, you don't want them wasting bandwidth on the interconnect between cores by ping-ponging the cache line containing the lock. (If multiple threads spinning on the same lock happens at all often, a pure spinlock is usually a bad choice anyway. Normally you'd want to eventually yield the CPU to another thread via OS-assisted sleep/wake, e.g. via futex. C++20 .wait() and .notify_one() can do this, as in the sketch below, or just use a good implementation of std::mutex or std::shared_mutex.)
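
    As a concrete sketch of that spin-then-sleep hybrid: the class name and the spin count of 64 below are arbitrary choices of mine, not from the original answer; everything else is standard C++20 std::atomic_flag API.

    #include <atomic>

    class spin_then_wait_lock          // hypothetical name
    {
        std::atomic_flag flag { };

    public:
        void lock( )
        {
            while ( flag.test_and_set( std::memory_order_acquire ) )
            {
                int spins = 64;        // arbitrary tuning choice
                // Spin read-only first, without disturbing the line's owner.
                while ( flag.test( std::memory_order_relaxed ) && --spins > 0 )
                    ;                  // ideally a pause hint here; see below
                if ( spins == 0 )      // still locked: sleep (futex on Linux) until notified
                    flag.wait( true, std::memory_order_relaxed );
            }
        }

        void unlock( )
        {
            flag.clear( std::memory_order_release );
            flag.notify_one( );        // wake one sleeping waiter, if any
        }
    };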

    Unfortunately C++ lacks a portable function like Rust's core::hint::spin_loop, which compiles to a pause instruction on x86, or the equivalent on other ISAs.

    So a read-only spin loop without pause wastes more execution resources on a CPU with hyperthreading (stealing them from the other logical core), but it wastes fewer store-buffer entries and generates less off-core traffic than retrying test_and_set(), if anything else is even reading the line. That matters especially when multiple threads are spin-waiting on the same lock, ping-ponging the cache line!

    If you don't mind an #ifdef __amd64__ / #include <immintrin.h> for _mm_pause(), then you can have that advantage, too.
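
    As a rough illustration, here's a minimal sketch of such a wrapper; the name spin_pause, the AArch64 branch, and the exact macros tested are my assumptions, not from the answer (GNU-style inline asm, so GCC/Clang only):

    #if defined(__amd64__) || defined(__i386__)
    #include <immintrin.h>                      // _mm_pause()
    #endif

    // Rough stand-in for Rust's core::hint::spin_loop (hypothetical helper).
    inline void spin_pause( ) noexcept
    {
    #if defined(__amd64__) || defined(__i386__)
        _mm_pause( );                           // x86 pause: friendlier to the sibling hyperthread
    #elif defined(__aarch64__)
        asm volatile( "yield" ::: "memory" );   // AArch64 spin-wait hint
    #else
        // no hint available on this ISA: plain busy iteration
    #endif
    }

    The read-only inner loop from METHOD 1 would then become while ( lock.test( std::memory_order_relaxed ) ) spin_pause();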