Tags: c++, multithreading, concurrency, locking, stdatomic

What would it take to implement a mutex using std::atomic?


#include <atomic>
#include <cassert>
#include <iostream>
#include <thread>
#include <vector>

class AtomicLock
{
    std::atomic_flag flag;

public:

    void lock() {
        while (flag.test_and_set(std::memory_order_acquire))
            flag.wait(true, std::memory_order_acquire);
    }

    void unlock() {
        flag.clear(std::memory_order_release);
        flag.notify_one();
    }
};

class ThreadAtomicLock
{
    static std::atomic_long id_generator;
    std::atomic<long> owner_thread = -1;

    static long get_id() {
        thread_local auto id = id_generator++;
        return id;
    }

public:

    void lock() {
        long expected = -1;
        while (!owner_thread.compare_exchange_weak(expected, get_id(), std::memory_order_acquire)) {
            // Guard against spurious CAS failure: only sleep if the lock is actually owned,
            // otherwise wait(-1) could block with nobody left to notify.
            if (expected != -1)
                owner_thread.wait(expected, std::memory_order_acquire);
            expected = -1;
        }
    }

    void unlock() {
        long expected = get_id();
        // Do the CAS outside assert() so it isn't compiled away under NDEBUG.
        bool owned = owner_thread.compare_exchange_strong(expected, -1, std::memory_order_release);
        assert(owned && "unlock of unowned lock");
        (void)owned;
        owner_thread.notify_one();
    }
};

std::atomic_long ThreadAtomicLock::id_generator = 0;

void testAtomicLock() {
    std::vector<std::thread> threads;
    std::atomic<bool> flag = false;
    // AtomicLock lock;
    ThreadAtomicLock lock;
    for (int i = 0; i < 100; i++) {
        threads.emplace_back([&] {
            while (!flag.load(std::memory_order_acquire));
            lock.lock();
            std::cout << "Hello";
            std::cout << " from thread ";
            std::cout << std::this_thread::get_id() << " ";
            int n = 10;
            while (n--) std::cout << "=";
            std::cout << "\n";
            lock.unlock();
        });
    }
    flag.store(true, std::memory_order_release);
    for (auto& t : threads) t.join();
}

The provided AtomicLock class works, but I realize this will allow any thread to unlock the lock even if it doesn't own it.

The class ThreadAtomicLock is a trial of making sure only the owner thread can unlock the lock.

  1. Though it works fine, I doubt that's all it takes to implement a lock. Is this implementation correct? Can some constraints be loosened?

  2. Can I replace compare_exchange_strong with compare_exchange_weak in the unlocking function?

  3. The ID generation mechanism here stores an additional ID per thread and performs an atomic operation to generate an ID per thread. Though negligible, I'm curious if there is a better way of implementing it.

  4. Can memory_order_relaxed be used in the wait call in the locking function, since we don't need to synchronize further at this point?


Solution

  • (1) Looks correct to me. C++20 .wait and .notify_one/all expose the functionality (like Linux futex) that library mutex implementations take advantage of to get OS-assisted sleep/wake. Some implementations spin-retry a few times before making a system call with .wait.

    The major thing you're missing is a scheme to avoid making any system calls in unlock when no other threads are waiting. For uncontended locking, this is the difference between a lock/unlock pair taking maybe 40 cycles on a modern x86 in the best case (L1d cache hit on the lock itself and on the data accessed in the critical section, e.g. when the same thread takes and releases the same lock repeatedly), or a few hundred cycles when uncontended but involving a cache miss, vs. a few thousand clocks for a system call (especially with any system call needing to do Spectre mitigation these days).

    Glibc's pthread mutex functions are open source, so you could take a look and see how threads that want the lock encode that fact into the lock integer before going to sleep. https://codebrowser.dev/glibc/glibc/nptl/pthread_mutex_lock.c.html#211 looks like a relevant comment.

    Ulrich Drepper's 2004 paper Futexes Are Tricky discusses some mutex algorithms of increasing sophistication, with the first being pretty much yours, the second avoiding a system call in unlock when it can be sure there are no waiters. The third mutex implementation refines further to reduce the number of atomic RMW operations in contention situations. I think the futex functionality it uses can be done portably with C++20 wait/notify_one.

    I'm sure there are other ways to go about it; I don't know if current glibc has any tricks not mentioned in this paper or even if it still uses that counter algorithm. Or what other implementations do.
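
    The "no system call when uncontended" idea can be sketched in portable C++20 along the lines of Drepper's second algorithm. This is my own sketch (the class name and exact state encoding are assumptions, not glibc's actual code): state 0 = unlocked, 1 = locked with no waiters, 2 = locked with possible waiters, so unlock only pays for a notify when someone may actually be asleep.

    ```cpp
    #include <atomic>

    class FutexLikeLock {
        std::atomic<int> state{0};   // 0 = unlocked, 1 = locked, 2 = locked, maybe waiters
    public:
        void lock() {
            int expected = 0;
            // Fast path: uncontended acquire with a single CAS.
            if (state.compare_exchange_strong(expected, 1, std::memory_order_acquire))
                return;
            // Slow path: advertise contention (state = 2) and sleep while the lock is held.
            // exchange returning 0 means we took a free lock (conservatively marked contended).
            while (state.exchange(2, std::memory_order_acquire) != 0)
                state.wait(2, std::memory_order_relaxed);
        }
        void unlock() {
            // Only the contended state requires a (possibly syscall-backed) wake-up.
            if (state.exchange(0, std::memory_order_release) == 2)
                state.notify_one();
        }
    };
    ```

    A thread that ever enters the slow path leaves the lock marked contended until the next unlock, which can cause one unnecessary notify; Drepper's third algorithm refines exactly this.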


  • (2) compare_exchange_weak can fail spuriously due to interrupts or false sharing. You need compare_exchange_strong since you're not using it in a retry loop.
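
    For contrast, here is a small sketch (the function name is my own, for illustration) of the situation where weak is appropriate: a CAS retry loop, where a spurious failure just costs one extra iteration.

    ```cpp
    #include <atomic>

    // In a retry loop, compare_exchange_weak is fine (and can be cheaper on
    // LL/SC architectures): on failure, `old` is reloaded with the current
    // value and we simply try again.
    long atomic_double_plus_one(std::atomic<long>& x) {
        long old = x.load(std::memory_order_relaxed);
        while (!x.compare_exchange_weak(old, old * 2 + 1,
                                        std::memory_order_relaxed))
            ;  // spurious or real failure: `old` was updated, retry
        return old * 2 + 1;  // the value we installed
    }
    ```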


  • (3) Since you generate a thread ID only once per thread, it doesn't need to be long; int is almost certainly fine. Although, with threads being destroyed and re-created, a 32-bit counter could plausibly overflow after a long time without ever having a ridiculous number of threads alive at once.

    So a type that's only guaranteed 32 bits isn't big enough; it's probably better to use uintptr_t, which gives 64-bit counters on basically all 64-bit systems, including Windows x64 where long is a 32-bit type. Using 32-bit counters on 32-bit systems seems reasonable-ish: a good tradeoff for keeping the lock word only one pointer width wide, which basically all modern systems support natively for atomic RMWs. (Most also support two-pointer-width atomic RMWs, but not all.)

    Looking again at Glibc's implementation of pthreads, the same function I linked before uses pid_t id = THREAD_GETMEM (THREAD_SELF, tid); to read a thread ID from thread-local storage (maybe with one fewer level of indirection than normal thread_local variables? I'd guess the TID is stored right in the thread info block, so on x86-64 GNU/Linux it could probably be read with an addressing mode like [fs: 0x40] or some similar small offset from the segment base). If that thread ID is from the OS, it's already a unique 16- or 32-bit integer, guaranteed unique among currently-running threads.
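
    Another hypothetical ID scheme (my sketch, not what glibc or the question does): use the address of a thread_local object as the per-thread ID. No shared atomic counter is touched at all, and the result is naturally pointer-width.

    ```cpp
    #include <cstdint>

    // Each thread's `marker` lives at a distinct TLS address while the thread
    // is alive, so the address works as a unique ID among running threads.
    // Caveat: like OS TIDs, an ID can be reused after its thread exits.
    inline std::uintptr_t my_thread_id() {
        thread_local char marker;  // one byte of TLS per thread
        return reinterpret_cast<std::uintptr_t>(&marker);
    }
    ```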


  • (4) Yes, wait(relaxed) is fine here. Stronger orders are useful when you can just start reading data produced by a thread that notified you, without first doing an atomic operation on a variable it also wrote.
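
    Applied to the question's AtomicLock, that loosening looks like this (a sketch under the assumption above: the acquire ordering comes from the successful test_and_set on the next loop iteration, so the wait itself only needs to observe the flag's value):

    ```cpp
    #include <atomic>

    class AtomicLockRelaxedWait {
        std::atomic_flag flag;  // default-initialized to clear since C++20
    public:
        void lock() {
            // test_and_set(acquire) on the successful iteration provides the
            // synchronization; the wait is just a sleep hint, so relaxed is fine.
            while (flag.test_and_set(std::memory_order_acquire))
                flag.wait(true, std::memory_order_relaxed);
        }
        void unlock() {
            flag.clear(std::memory_order_release);
            flag.notify_one();
        }
    };
    ```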

