c++ windows multithreading stl shared-lock

std::shared_mutex::unlock_shared() blocks even though there are no active exclusive locks on Windows


My team has encountered a deadlock that I suspect is a bug in the Windows implementation of SRW locks. The code below is a distilled version of real code. Here's the summary:

  1. Main thread acquires exclusive lock
  2. Main thread creates N child threads
  3. Each child thread:
    1. Acquires a shared lock
    2. Spins until all children have acquired a shared lock
    3. Releases the shared lock
  4. Main thread releases exclusive lock

Yes, this could be done with std::latch in C++20. That's not the point.

This code works most of the time. However, roughly 1 in 5000 loops deadlocks. When it does, exactly 1 child has successfully acquired a shared lock and the other N-1 children are stuck in lock_shared(). On Windows, this function calls into RtlAcquireSRWLockShared and blocks in NtWaitForAlertByThreadId.

The behavior is observed when using std::shared_mutex directly, when using std::shared_lock/std::unique_lock, and when calling the SRW functions directly.

A 2017 Raymond Chen post discusses this exact behavior, but blames user error.

This looks like an SRW bug to me. It may be worth noting that if a child doesn't attempt to latch and simply calls unlock_shared, that call wakes its blocked siblings. Nothing in the documentation for std::shared_lock or SRW locks suggests that a shared acquire is allowed to block when there is no active exclusive lock.
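For comparison, here is a minimal sketch of the non-latching variant mentioned above: same setup, but each reader releases the shared lock immediately instead of waiting for its siblings. `RunWithoutLatch` is a hypothetical helper name, not part of the real code, and the sketch assumes C++17. Going by the observation above, this shape should not hang, because every unlock_shared wakes any blocked siblings.

```cpp
#include <shared_mutex>
#include <thread>
#include <vector>

// Hypothetical helper (not in the original code): same structure as the
// repro, but readers never wait for each other while holding the lock.
// Returns the number of iterations that ran to completion.
int RunWithoutLatch(int iterations, int numThreads) {
    int completed = 0;
    for (int i = 0; i < iterations; ++i) {
        std::shared_mutex sharedMutex;

        // Main thread holds the exclusive lock while the readers start up.
        sharedMutex.lock();

        std::vector<std::thread> readers;
        readers.reserve(numThreads);
        for (int t = 0; t < numThreads; ++t) {
            readers.emplace_back([&sharedMutex] {
                sharedMutex.lock_shared();
                // No spin on a shared counter: release right away.
                sharedMutex.unlock_shared();
            });
        }

        // Release the exclusive lock so the readers can proceed.
        sharedMutex.unlock();

        for (auto& reader : readers) {
            reader.join();
        }
        ++completed;
    }
    return completed;
}
```

Calling `RunWithoutLatch(5000, 5)` from main is one way to compare it against the latching loop.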

This deadlock has not been observed on non-Windows platforms.

Example code:

#include <atomic>
#include <cstdint>
#include <iostream>
#include <memory>
#include <shared_mutex>
#include <thread>
#include <vector>

struct ThreadTestData {
    int32_t numThreads = 0;
    std::shared_mutex sharedMutex = {};
    std::atomic<int32_t> readCounter{0};
};

int DoStuff(ThreadTestData* data) {
    // Acquire reader lock
    data->sharedMutex.lock_shared();

    // wait until all read threads have acquired their shared lock
    data->readCounter.fetch_add(1);
    while (data->readCounter.load() != data->numThreads) {
        std::this_thread::yield();
    }

    // Release reader lock
    data->sharedMutex.unlock_shared();

    return 0;
}

int main() {
    int count = 0;
    while (true) {
        ThreadTestData data = {};
        data.numThreads = 5;

        // Acquire write lock
        data.sharedMutex.lock();

        // Create N threads
        std::vector<std::unique_ptr<std::thread>> readerThreads;
        readerThreads.reserve(data.numThreads);
        for (int i = 0; i < data.numThreads; ++i) {
            readerThreads.emplace_back(std::make_unique<std::thread>(DoStuff, &data));
        }

        // Release write lock
        data.sharedMutex.unlock();

        // Wait for all readers to succeed
        for (auto& thread : readerThreads) {
            thread->join();
        }

        // Cleanup
        readerThreads.clear();

        // Spew so we can tell when it's deadlocked
        count += 1;
        std::cout << count << std::endl;
    }

    return 0;
}

Here's a picture of the parallel stacks. You can see the main thread is correctly blocking on thread::join. One reader thread acquired the lock and is in a yield loop. Four reader threads are blocked within lock_shared.

[Image: screenshot of the parallel stacks]
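The stuck state described above can also be detected programmatically rather than by attaching a debugger. The following is a sketch only, assuming C++17; `CountStalledIterations` and its timeout parameter are hypothetical and not part of the original code. Each reader spins with a deadline. If the deadline passes before all readers have acquired the shared lock, the reader flags the iteration as stalled and releases the lock, which (per the observation earlier in the question) should wake the blocked siblings so the loop can continue.

```cpp
#include <atomic>
#include <chrono>
#include <shared_mutex>
#include <thread>
#include <vector>

// Hypothetical diagnostic (not in the original code): runs the same
// pattern as the repro, but bounds the spin with a deadline. Returns the
// number of iterations in which at least one reader gave up waiting.
int CountStalledIterations(int iterations, int numThreads,
                           std::chrono::milliseconds timeout) {
    int stalledIterations = 0;
    for (int i = 0; i < iterations; ++i) {
        std::shared_mutex sharedMutex;
        std::atomic<int> readCounter{0};
        std::atomic<bool> stalled{false};

        // Main thread holds the exclusive lock while the readers start up.
        sharedMutex.lock();

        std::vector<std::thread> readers;
        readers.reserve(numThreads);
        for (int t = 0; t < numThreads; ++t) {
            readers.emplace_back([&] {
                sharedMutex.lock_shared();
                readCounter.fetch_add(1);

                // Wait for the siblings, but only until the deadline.
                const auto deadline =
                    std::chrono::steady_clock::now() + timeout;
                while (readCounter.load() != numThreads) {
                    if (std::chrono::steady_clock::now() >= deadline) {
                        stalled.store(true);  // siblings never arrived
                        break;
                    }
                    std::this_thread::yield();
                }

                sharedMutex.unlock_shared();
            });
        }

        sharedMutex.unlock();

        for (auto& reader : readers) {
            reader.join();
        }
        if (stalled.load()) {
            ++stalledIterations;
        }
    }
    return stalledIterations;
}
```

On a platform without the problem, the expectation is that this returns 0 for any generous timeout. A nonzero result would correspond to the picture above: one reader holding the shared lock while its siblings sit inside lock_shared.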


Solution

  • This is a confirmed bug in the OS Slim Reader/Writer (SRW) lock API.

    I posted a thread in r/cpp on Reddit because I knew that Reddit user u/STL works on Microsoft's STL implementation and is active there.

    u/STL posted a comment declaring it an SRW bug. He filed OS bug report OS-49268777, "SRWLOCK can deadlock after an exclusive owner has released ownership and several reader threads are attempting to acquire shared ownership together". Unfortunately, this is in a Microsoft-internal bug tracker, so we can't follow it.

    Thanks to commenters in this thread (RbMm in particular) for helping fully explain and understand the observed behavior.

    RbMm posted a secondary answer which appears to show that "AcquireSRWLockShared some time can really acquires a slim SRW lock in exclusive mode". Read his response for details. I think almost everyone would be surprised by this behavior!