c++ · language-lawyer · memory-model · stdatomic · spinlock

How does the C++ Standard prevent deadlock in a spinlock mutex built with memory_order_acquire and memory_order_release?


TL;DR: if a mutex implementation uses acquire and release operations, could an implementation perform the kind of compile-time reordering that would normally be allowed, and overlap two critical sections that should be independent because they use different locks? That would lead to a potential deadlock.


Assume a mutex is implemented on top of std::atomic_flag:

#include <atomic>

struct mutex
{
   void lock()
   {
       // Spin until test_and_set reads false, i.e. we took the lock.
       while (flag.test_and_set(std::memory_order_acquire))
       {
          yield_execution(); // e.g. std::this_thread::yield()
       }
   }

   void unlock()
   {
       flag.clear(std::memory_order_release);
   }

   std::atomic_flag flag; // = ATOMIC_FLAG_INIT in pre-C++20
};

So far this looks fine for a single such mutex: the std::memory_order_release clear in unlock() synchronizes with the std::memory_order_acquire test_and_set in lock() that observes it.

The use of std::memory_order_acquire/std::memory_order_release here should not raise questions at first sight; it matches the cppreference example: https://en.cppreference.com/w/cpp/atomic/atomic_flag
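A minimal illustration of that synchronizes-with pairing (the names here are illustrative, not part of the mutex above):

#include <atomic>
#include <cassert>
#include <thread>

std::atomic<bool> ready{false};
int payload = 0;

void producer()
{
    payload = 42;                                 // plain write
    ready.store(true, std::memory_order_release); // release store
}

void consumer()
{
    while (!ready.load(std::memory_order_acquire)) // acquire load pairs with the release store
        std::this_thread::yield();
    assert(payload == 42); // the write to payload happens-before this read
}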

Now suppose there are two mutexes guarding different variables, and two threads accessing them in opposite orders:

struct data { void use(); }; // placeholder: any type with an operation to protect

mutex m1;
data  v1;

mutex m2;
data  v2;

void threadA()
{
    m1.lock();
    v1.use();
    m1.unlock();

    m2.lock();
    v2.use();
    m2.unlock();
}

void threadB()
{
    m2.lock();
    v2.use();
    m2.unlock();

    m1.lock();
    v1.use();
    m1.unlock();
}

A release operation can be reordered after a later, unrelated acquire operation (unrelated == on a different object), so the execution could be transformed as follows:

mutex m1;
data  v1;

mutex m2;
data  v2;

void threadA()
{
    m1.lock();
    v1.use();

    m2.lock();
    m1.unlock();

    v2.use();
    m2.unlock();
}

void threadB()
{
    m2.lock();
    v2.use();

    m1.lock();
    m2.unlock();

    v1.use();
    m1.unlock();
}

So it looks like a deadlock is possible: each thread can end up holding one lock while spinning on the other.

Questions:

  1. How does the Standard prevent such mutexes from deadlocking this way?
  2. What is the best way to implement a spinlock mutex that does not suffer from this issue?
  3. Is the unmodified mutex from the top of this post usable in some cases?

(Not a duplicate of "C++11 memory_order_acquire and memory_order_release semantics?", though it is in the same area.)


Solution

  • There's no problem in the ISO C++ standard; it doesn't distinguish compile-time vs. run-time reordering, and the code still has to execute as if it ran in source order on the C++ abstract machine. So the effects of m2.test_and_set(std::memory_order_acquire) trying to take the 2nd lock can become visible to other threads while still holding the first (i.e. before m1.unlock()), but failure there can't prevent m1 from ever being released.

    The only way we'd have a problem is if compile-time reordering nailed down that order into asm for some machine, such that the m2 lock retry loop had to exit before actually releasing m1.

    Also, ISO C++ only defines ordering in terms of synchronizes-with and what can see what, not in terms of operations being re-ordered into some new order. That would imply such an order existed. No order that multiple threads can all agree on is even guaranteed to exist for operations on separate objects, unless you use seq_cst operations. (A modification order for each object separately is guaranteed to exist, though.)

    The 1-way-barrier model of acquire and release operations (like the diagram in https://preshing.com/20120913/acquire-and-release-semantics) is a convenient way to think about things, and matches reality for pure loads and pure stores on x86 and AArch64, for example. But as far as language-lawyering goes, it's not how the ISO C++ standard defines things.
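    As an informal sketch of that barrier model (the variables here are illustrative, not standard wording):

    #include <atomic>

    std::atomic<int> a{0};
    int x = 0, y = 0;

    void one_way_barriers()
    {
        x = 1;                                     // may sink below the acquire load
        int r = a.load(std::memory_order_acquire); // later ops may not hoist above this
        y = r;                                     // must stay after the acquire

        y = 2;                                     // must stay before the release
        a.store(1, std::memory_order_release);     // earlier ops may not sink below this
        x = 2;                                     // may hoist above the release store
        // Note the gap: this model does not stop a release store from reordering
        // with a *later* acquire load on a different object -- exactly the
        // unlock()/lock() overlap this question worries about.
    }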


    You're reordering a whole retry loop, not just a single acquire

    Reordering an atomic operation across a long-running loop is a theoretical problem allowed by the C++ standard. P0062R1 ("When should compilers optimize atomics?") points out that delaying a store until after a long-running loop is technically allowed by the standard's wording of 1.10p28:

    An implementation should ensure that the last value (in modification order) assigned by an atomic or synchronization operation will become visible to all other threads in a finite period of time.

    But delaying the store across a potentially infinite loop would violate that: in the deadlock case the release would never become visible at all, let alone within a finite period of time, so compilers must not do that.
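    Sketched in isolation (done and payload_ready are illustrative names, not from P0062R1), the transformation being ruled out looks like this:

    #include <atomic>

    std::atomic<int> payload_ready{0};
    std::atomic<bool> done{false};

    void store_then_wait()
    {
        payload_ready.store(1, std::memory_order_release); // should become visible in finite time
        while (!done.load(std::memory_order_acquire))      // potentially unbounded wait
        {
        }
        // Sinking the store below this loop would be legal by the reordering
        // rules alone, but if the other thread only sets 'done' after seeing
        // payload_ready == 1, the store would never become visible at all --
        // the "finite period of time" recommendation is what rules this out.
    }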

    It's not "just" a quality-of-implementation issue. A successful mutex lock is an acquire operation, but you should not look at the retry loop as a single acquire operation. Any sane compiler won't.
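    On question 2, a common improvement over the plain retry loop is test-and-test-and-set: spin read-only with a relaxed load and only retry the RMW when the lock looks free. A minimal sketch, assuming C++20 for std::atomic_flag::test:

    #include <atomic>
    #include <thread>

    struct spinlock
    {
        void lock()
        {
            while (flag.test_and_set(std::memory_order_acquire))
            {
                // Spin read-only instead of hammering the line with RMWs.
                while (flag.test(std::memory_order_relaxed))
                    std::this_thread::yield();
            }
        }

        void unlock()
        {
            flag.clear(std::memory_order_release);
        }

        std::atomic_flag flag; // value-initialized to clear since C++20
    };

    The read-only inner loop keeps the cache line in a shared state while waiting, so contended waiters don't steal ownership of the line from each other on every iteration.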

    (The classic example of something that aggressive atomics optimization could break is a progress bar, where the compiler sinks all the relaxed stores out of a loop and then folds all the dead stores into one final store of 100%. See also this Q&A - current compilers don't, and basically treat atomic as volatile atomic until C++ solves the problem of giving programmers a way to let the compiler know when atomics can/can't be optimized safely.)
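    For concreteness, a sketch of that progress-bar pattern (do_chunk is a hypothetical unit of work; as noted above, current compilers do not perform this folding):

    #include <atomic>

    std::atomic<int> progress{0};
    void do_chunk(int); // hypothetical unit of work

    void worker()
    {
        for (int i = 0; i < 100; ++i)
        {
            do_chunk(i);
            progress.store(i + 1, std::memory_order_relaxed); // polled by a UI thread
        }
        // An aggressive-but-technically-legal optimizer could sink the store
        // out of the loop and fold the dead stores into one:
        //     for (int i = 0; i < 100; ++i) do_chunk(i);
        //     progress.store(100, std::memory_order_relaxed);
        // making the bar jump straight from 0% to 100%.
    }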