Tags: c++, multithreading, performance, optimization, x86

compare_exchange_weak() produces race condition with acquire-release on x86


Below I have a WriterLock struct. It's actually Reader-Writer but in this scenario I am only calling writer lock and unlock.

My unit test creates multiple threads, they wait and then request access to increment a global variable.

ThreadSanitizer reports no race conditions if I use compare_exchange_weak() with the default sequentially consistent memory barriers.

However, given that I am targeting x86, if I try to use acquire-release ordering for compare_exchange_weak(), a race condition is detected.

Is it possible to use compare_exchange_weak() here with less-restrictive barriers?

#include <cstdint>
#include <atomic>
#include <limits>  
#include <thread>
#include <vector>

struct WriterLock
{        
    void writer_lock()
    {
        while (true)
        {
            uint32_t prev_readers = _reader_count.load(std::memory_order_acquire);

            if (prev_readers == 0)
            {
                // race condition
                if (_reader_count.compare_exchange_weak(prev_readers, HAS_WRITER, std::memory_order_acquire, std::memory_order_release))

                // works
                //if (_reader_count.compare_exchange_weak(prev_readers, HAS_WRITER))
                {
                    return;
                }
            }
        }
    }

    void writer_unlock()
    {
        while (true)
        {
            uint32_t prev_readers = _reader_count.load(std::memory_order_acquire);

            if (prev_readers == HAS_WRITER)
            {
                // race condition
                if (_reader_count.compare_exchange_weak(prev_readers, 0, std::memory_order_acquire, std::memory_order_release))

                // Works
                //if (_reader_count.compare_exchange_weak(prev_readers, 0))
                {
                    return;
                }
            }
        }
    }

private:
    const uint32_t HAS_WRITER = std::numeric_limits<uint32_t>::max();
    std::atomic<uint32_t> _reader_count{0};
};

uint64_t counter{0};

WriterLock lock;

static const uint64_t num_times = uint64_t(10'000'000);

std::atomic<bool> go{false};

void increment()
{
    while(go == false)
    {

    }

    for(uint64_t j = 0; j < num_times; ++j)
    {
        lock.writer_lock();

        ++counter;
        
        lock.writer_unlock();
    }
}

int main()
{
    std::vector<std::thread> threads;

    const size_t num_threads = 10;

    for(uint64_t t = 0; t < num_threads; ++t)
    {
        threads.push_back(std::thread(increment));
    }

    go = true;  // release the workers, then wait for them to finish

    for(uint64_t t = 0; t < num_threads; ++t)
    {
        threads[t].join();
    }
}

Solution

  • Weaker orderings work fine if correctly chosen, but your choices seem to be rather confused, and it's not clear how you came up with them.

    First, pay attention to what the two ordering arguments to compare_exchange_weak actually mean. The first argument is the "success" ordering, which applies if the value of the atomic variable equals the expected value, in which case the new value is stored. In this case, we have performed a true read-modify-write operation, so the argument needs to specify the barriers applicable to both the load and the store. The second argument is the "failure" ordering, in case the atomic variable does not equal the expected value, and no new value is stored. In this case, only a load is performed, so the parameter only needs to specify the ordering for a load.

    This shows why release does not make sense for the failure ordering, since "release" is applicable only to a store. The C++ standard allows the library to assume that this nonsensical value will never be passed, so it declares it to cause undefined behavior.
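    The success/failure mechanics above can be seen in a single-threaded sketch. (I use compare_exchange_strong here rather than compare_exchange_weak only because the weak form is allowed to fail spuriously, which would make a deterministic demo awkward; the two-ordering signature is the same.)

    ```cpp
    #include <atomic>
    #include <cassert>
    #include <cstdint>

    int main()
    {
        std::atomic<uint32_t> v{0};
        uint32_t expected = 0;

        // Success path: v == expected, so 42 is stored and true is returned.
        // The first (success) ordering covers the full read-modify-write.
        bool ok = v.compare_exchange_strong(expected, 42,
                                            std::memory_order_acquire,
                                            std::memory_order_relaxed);
        assert(ok && v.load() == 42);

        // Failure path: v (42) != expected (0), so nothing is stored, false is
        // returned, and expected is overwritten with the actual value. Only a
        // load happened, which is why the failure ordering describes a load
        // and why release would be nonsensical there.
        expected = 0;
        ok = v.compare_exchange_strong(expected, 7,
                                       std::memory_order_acquire,
                                       std::memory_order_relaxed);
        assert(!ok && expected == 42 && v.load() == 42);
    }
    ```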

    But let's think about what we actually want.

    Lock function

    The initial load that precedes the compare_exchange_weak can be relaxed. It's simply an optimization to check if we should even bother trying to take the lock; it doesn't play a role in the actual acquisition of the lock, and so it needn't be ordered in any particular way with respect to the critical section.

    The success ordering on the compare_exchange_weak needs to be at least acquire. Loading the value 0 is proof that the lock was available, and so acquire ordering ensures that the critical section does not proceed until we have that proof in hand. The fact that the compare_exchange succeeded establishes that not only was the lock available, but that moreover we have taken it.

    Using memory_order_acquire as the success parameter effectively means that the store side of the compare-exchange is relaxed; it carries no particular ordering constraints. That is fine; again, it's the load side that provided the proof that the lock is ours. Regardless of memory ordering, compare_exchange_weak is still an atomic read-modify-write, so even if the store of the new value HAS_WRITER is delayed somewhat, we are still guaranteed that no other thread can store to it in the meantime. So memory_order_acq_rel or memory_order_seq_cst would also be valid here as they are strictly stronger, but are unnecessary.
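    To illustrate that last point: atomicity of the read-modify-write is independent of the ordering arguments. Even a fully relaxed compare-exchange loop never loses an update; the orderings only govern how *other* memory operations are allowed to move around it. A minimal multi-threaded check:

    ```cpp
    #include <atomic>
    #include <cassert>
    #include <cstdint>
    #include <thread>
    #include <vector>

    int main()
    {
        // Four threads each perform 100,000 relaxed compare-exchange
        // increments. If the RMW were not atomic, increments would be lost.
        std::atomic<uint32_t> n{0};
        auto inc = [&] {
            for (int i = 0; i < 100000; ++i)
            {
                uint32_t old = n.load(std::memory_order_relaxed);
                while (!n.compare_exchange_weak(old, old + 1,
                                                std::memory_order_relaxed,
                                                std::memory_order_relaxed))
                {
                    // On failure, old has been updated to the current value;
                    // just retry with old + 1.
                }
            }
        };

        std::vector<std::thread> ts;
        for (int t = 0; t < 4; ++t) ts.emplace_back(inc);
        for (auto& t : ts) t.join();

        assert(n.load() == 400000);  // no increment was ever lost
    }
    ```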

    The failure ordering value can be relaxed. If the compare-exchange fails, it simply tells us that we did not manage to take the lock. We are not going to enter the critical section at all until we do take it, so no barriers are needed in this case. All we've really accomplished is to load the updated value, and this can be relaxed for the same reasons noted above for the initial load. (With the code as it stands, we don't even use the updated value, but instead start the loop over and redo the initial load; this could probably be optimized better but I won't get into that here.)

    So I would write:

        void writer_lock()
        {
            while (true)
            {
                uint32_t prev_readers = _reader_count.load(std::memory_order_relaxed);
    
                if (prev_readers == 0)
                {
                    if (_reader_count.compare_exchange_weak(prev_readers, HAS_WRITER, std::memory_order_acquire, std::memory_order_relaxed))
                    {
                        return;
                    }
                }
            }
        }
    

    Note that in most cases, if you fail to take the lock, you should do something more polite than just spinning. Sleep for a bit, or yield the CPU in some other fashion. (std::shared_mutex, which you are reinventing here, is able to use system-specific primitives to go to sleep and awaken as soon as the lock becomes available.)
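    One polite variant, sketched here with std::this_thread::yield as the back-off (the release-store unlock discussed in the next section is included so the example is self-contained; the loop structure is my own rearrangement, not the only way to write it):

    ```cpp
    #include <atomic>
    #include <cstdint>
    #include <limits>
    #include <thread>

    struct WriterLock
    {
        void writer_lock()
        {
            uint32_t prev_readers = 0;  // expect "no readers, no writer"
            while (!_reader_count.compare_exchange_weak(prev_readers, HAS_WRITER,
                                                        std::memory_order_acquire,
                                                        std::memory_order_relaxed))
            {
                // Someone else holds the lock: give up the CPU instead of
                // spinning hot, then try again expecting 0.
                std::this_thread::yield();
                prev_readers = 0;
            }
        }

        void writer_unlock()
        {
            _reader_count.store(0, std::memory_order_release);
        }

    private:
        static constexpr uint32_t HAS_WRITER = std::numeric_limits<uint32_t>::max();
        std::atomic<uint32_t> _reader_count{0};
    };
    ```

    yield is still only a stopgap; a futex-style wait, as std::shared_mutex uses under the hood, avoids waking at all until the lock may be free.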

    Unlock function

    In the unlock function, the same logic as before would say that the initial load can safely be relaxed. However, here it's not clear why we are doing it at all. We can only call writer_unlock() from a thread that already holds a writer lock, so the value of _reader_count must already be HAS_WRITER. I suppose you could check the value basically as an assertion or BUG check, but if so, then if you find that it does not equal HAS_WRITER, you should do something useful to report the bug. As it stands, if that happens then the code will silently enter an infinite loop, which is not helpful.

    As for the compare-exchange, the success ordering needed here is release. We have to ensure that the critical section is complete and globally visible before storing the value that indicates that the lock is available. We do not need any particular ordering on the load side of the read-modify-write, as this would only serve to ensure that the non-critical code that follows the unlock does not start too soon. But since it's non-critical and doesn't conflict with any other threads, we don't care.

    Your code had acquire here, and that's the root cause of the data race detected by TSAN. There's nothing inherently wrong with acquire ordering on the load side (it's merely unnecessary), but the problem is that memory_order_acquire on a read-modify-write implicitly specifies (by omission) relaxed ordering on the store side. That's bad as it allows the critical section to be reordered after the store, to a time when some other thread may have acquired the lock and be working with the shared data. Changing it to acq_rel, as you discussed in comments, fixes that bug and eliminates the data race, but is still a little inefficient as it imposes acquire ordering on the load, which we don't need.

    Also, by the same logic as before, relaxed is sufficient for the failure ordering.

    Actually, since as mentioned we already know that the value of _reader_count must be HAS_WRITER at this point, the compare-exchange is also unnecessary; we could just do an unconditional store. So the whole unlock function can simply be

        void writer_unlock()
        {
            _reader_count.store(0, std::memory_order_release);
        }
    

    Though, when you get around to implementing the reader lock and unlock functions, you will need a read-modify-write in reader_unlock, since the correct new value depends on the old value of _reader_count (i.e. one less). With a compare-exchange loop, the correct orderings are once again release on success and relaxed on failure. (Alternatively, since we know that if we hold a reader lock, the value of _reader_count must be a positive integer and not HAS_WRITER, we could simply do _reader_count.fetch_sub(1, std::memory_order_release).)
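    To make that concrete, here is a sketch of what the full reader-writer lock might look like. The class name ReaderWriterLock and the exact loop shapes are my own; the orderings follow the reasoning above (acquire on every successful lock, release on every unlock, relaxed everywhere else):

    ```cpp
    #include <atomic>
    #include <cstdint>
    #include <limits>
    #include <thread>

    struct ReaderWriterLock
    {
        void writer_lock()
        {
            uint32_t prev = 0;  // expect no readers and no writer
            while (!_reader_count.compare_exchange_weak(prev, HAS_WRITER,
                                                        std::memory_order_acquire,
                                                        std::memory_order_relaxed))
            {
                std::this_thread::yield();
                prev = 0;
            }
        }

        void writer_unlock()
        {
            _reader_count.store(0, std::memory_order_release);
        }

        void reader_lock()
        {
            uint32_t prev = _reader_count.load(std::memory_order_relaxed);
            while (true)
            {
                if (prev == HAS_WRITER)
                {
                    // A writer holds the lock; back off, reload, retry.
                    std::this_thread::yield();
                    prev = _reader_count.load(std::memory_order_relaxed);
                    continue;
                }
                // Acquire on success so reads in the critical section cannot
                // begin before the lock is actually held.
                if (_reader_count.compare_exchange_weak(prev, prev + 1,
                                                        std::memory_order_acquire,
                                                        std::memory_order_relaxed))
                    return;
                // On failure, prev now holds the current value; loop again.
            }
        }

        void reader_unlock()
        {
            // We hold a reader lock, so the count is positive and not
            // HAS_WRITER: an unconditional atomic decrement suffices.
            _reader_count.fetch_sub(1, std::memory_order_release);
        }

    private:
        static constexpr uint32_t HAS_WRITER = std::numeric_limits<uint32_t>::max();
        std::atomic<uint32_t> _reader_count{0};
    };
    ```

    Note this sketch lets a steady stream of readers starve a waiting writer indefinitely; a production lock would add some form of writer preference.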