I used Google Benchmark to run the following 3 tests, and the result surprised me: the RW lock is ~4x slower than the plain mutex in release mode (and ~10x slower than the plain mutex in debug mode).
void raw_access() {
    (void) (gp->a + gp->b);
}

void mutex_access() {
    std::lock_guard<std::mutex> guard(g_mutex);
    (void) (gp->a + gp->b);
}

void rw_mutex_access() {
    boost::shared_lock<boost::shared_mutex> l(g_rw_mutex);
    (void) (gp->a + gp->b);
}
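The globals and the Google Benchmark registration aren't shown above; roughly, the harness looks like this (the struct behind gp is just a placeholder, the real one doesn't matter for the numbers):

#include <benchmark/benchmark.h>
#include <boost/thread/shared_mutex.hpp>
#include <boost/thread/locks.hpp>
#include <mutex>

// Globals used by the three functions above (placeholder struct).
struct Foo { int a = 0; int b = 0; };
Foo* gp = new Foo{};
std::mutex g_mutex;
boost::shared_mutex g_rw_mutex;

// ... raw_access(), mutex_access(), rw_mutex_access() as above ...

// Benchmark registration producing the BM_* rows below.
static void BM_RawAccess(benchmark::State& state) {
    for (auto _ : state) raw_access();
}
static void BM_MutexAccess(benchmark::State& state) {
    for (auto _ : state) mutex_access();
}
static void BM_RWMutexAccess(benchmark::State& state) {
    for (auto _ : state) rw_mutex_access();
}
BENCHMARK(BM_RawAccess);
BENCHMARK(BM_MutexAccess);
BENCHMARK(BM_RWMutexAccess);
BENCHMARK_MAIN();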
the result is:
2019-06-26 08:30:45
Running ./perf
Run on (4 X 2500 MHz CPU s)
CPU Caches:
L1 Data 32K (x2)
L1 Instruction 32K (x2)
L2 Unified 262K (x2)
L3 Unified 4194K (x1)
Load Average: 5.35, 3.22, 2.57
-----------------------------------------------------------
Benchmark Time CPU Iterations
-----------------------------------------------------------
BM_RawAccess 1.01 ns 1.01 ns 681922241
BM_MutexAccess 18.2 ns 18.2 ns 38479510
BM_RWMutexAccess 92.8 ns 92.8 ns 7561437
I didn't find enough information via Google, so I'm hoping for some help here.
Thanks
I don't know the particulars of how the standard library/boost/etc. implementations differ, although it seems like the standard library version is faster (congrats, whoever wrote it).
So instead I'll try to explain the speed differences between the various mutex types on a theoretical level, which will explain why a shared mutex should be slower.
More so as an academic exercise, consider the simplest thread-safe "mutex-like" implementation: a simple atomic spin lock. In essence, this is nothing more than an std::atomic<bool> or an std::atomic_flag, initialized to false. To "lock" the mutex, you simply do an atomic exchange (test-and-set) operation in a loop until you get back a false value (i.e. the previous value was false prior to atomically setting it to true).
std::atomic_flag flag = ATOMIC_FLAG_INIT;
// lock it by looping until we observe a false value
while (flag.test_and_set()) ;
// do stuff under "mutex" lock
// unlock by setting it back to false state
flag.clear();
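Wrapped up as a type that satisfies BasicLockable, the same idea can be dropped into std::lock_guard just like std::mutex (a minimal sketch; the names here are made up):

#include <atomic>
#include <mutex>  // std::lock_guard

// Minimal sketch of the spin lock above as a class.
class spin_lock {
    std::atomic<bool> locked_{false};
public:
    void lock()   { while (locked_.exchange(true, std::memory_order_acquire)) {} }
    void unlock() { locked_.store(false, std::memory_order_release); }
};

spin_lock g_spin;

void spin_access() {
    std::lock_guard<spin_lock> guard(g_spin);  // works because spin_lock is BasicLockable
    // ... do stuff under "mutex" lock ...
}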
However, due to the nature of this construct, it's what we call an "unfair" mutex, because the order in which threads acquire the lock is not necessarily the order in which they began their attempts to lock it. That is, under high contention, it's possible a thread tries to lock and simply never succeeds because other threads are luckier. It's very timing-sensitive. Imagine musical chairs.
Because of this, although it functions like a mutex, it's not what we consider a "mutex".
A mutex can be thought of as building on top of an atomic spin lock (although it's not typically implemented as such, since mutexes are usually implemented with support from the operating system and/or hardware).
In essence, a mutex is a step above atomic spin locks because it has a queue of waiting threads. This allows it to be "fair" because the order of lock acquisition is (more or less) the same as the order of locking attempts.
You may have noticed that if you check sizeof(std::mutex), it's a bit larger than you might expect. On my platform it's 40 bytes. That extra space is used to hold state information, notably including some way of accessing a lock queue for each individual mutex.
When you try to lock a mutex, it performs some low-level thread-safety operation to have thread-safe access to the mutex's status information (e.g. atomic spin lock), checks the state of the mutex, adds your thread to the lock queue, and (typically) puts your thread to sleep while you wait so you don't burn precious CPU time. The low-level thread-safety operation (e.g. the atomic spin lock) is atomically released at the same time the thread goes to sleep (this is typically where OS or hardware support is necessary to be efficient).
Unlocking is done by performing a low-level thread-safe operation (e.g. the atomic spin lock), popping the next waiting thread from the queue, and waking it up. The thread that has been awoken now "owns" the lock. Rinse, wash, repeat.
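To make the "queue + sleep" idea a bit more concrete, here is a toy fair mutex sketch. It substitutes a simple ticket counter for a real wait queue and C++20 std::atomic waits for the OS-level sleep/wake machinery, so it's an illustration of the concept rather than how std::mutex is actually implemented:

#include <atomic>

// Toy "fair" mutex: FIFO ordering via a ticket counter, blocking via
// C++20 std::atomic::wait/notify. Real mutexes use OS primitives and a
// proper wait queue instead; all names here are made up.
class toy_fair_mutex {
    std::atomic<unsigned> next_ticket_{0};   // ticket handed to each arriving thread
    std::atomic<unsigned> now_serving_{0};   // ticket that currently owns the lock
public:
    void lock() {
        unsigned me = next_ticket_.fetch_add(1);   // join the "queue" in FIFO order
        unsigned cur = now_serving_.load();
        while (cur != me) {                        // not our turn yet:
            now_serving_.wait(cur);                //   sleep until the counter changes
            cur = now_serving_.load();
        }
    }
    void unlock() {
        now_serving_.fetch_add(1);                 // admit the next ticket holder
        now_serving_.notify_all();                 // wake sleepers so they re-check their turn
    }
};

Unlike the bare spin lock, waiters here are served strictly in arrival order, and they sleep instead of burning CPU while they wait.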
A shared mutex takes this concept a step further. It can be owned by a single thread for read/write permissions (like a normal mutex), or by multiple threads for read-only permissions (semantically, anyway - it's up to the programmer to ensure it's safe).
Thus, in addition to the unique ownership queue (like a normal mutex), it also has a shared ownership state. The shared ownership state could be simply a count of the number of threads that currently have shared ownership. If you inspect sizeof(std::shared_mutex), you'll find it's typically even larger than std::mutex. On my system, for instance, it's 56 bytes.
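If you want to check the sizes on your own platform (the exact numbers vary by implementation; 40 and 56 bytes are just what I see on mine):

#include <iostream>
#include <mutex>
#include <shared_mutex>

int main() {
    std::cout << "sizeof(std::mutex)        = " << sizeof(std::mutex) << '\n';
    std::cout << "sizeof(std::shared_mutex) = " << sizeof(std::shared_mutex) << '\n';
}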
So when you go to lock a shared mutex, it has to do everything a normal mutex does, but additionally has to verify some other things. For instance, if you're trying to lock uniquely, it must verify that there are no shared owners. And if you're trying to lock in shared mode, it must verify that there are no unique owners.
Because we typically want mutexes to be "fair", once a unique locker is in the queue, future shared lock attempts must also queue instead of acquiring the lock, even though the mutex might currently be under shared (i.e. non-unique) lock by several threads. This is to ensure shared owners don't "bully" a thread that wants unique ownership.
But this also goes the other way: the queuing logic must ensure a shared locker is never placed into an empty queue during shared ownership (because it should immediately succeed and be another shared owner).
Additionally, if there's a unique locker, followed by a shared locker, followed by a unique locker, it must (roughly) guarantee that acquisition order. So each entry in the lock queue also needs a flag denoting its purpose (i.e. shared vs. unique).
And then there's the wake-up logic. When you unlock a shared mutex, the logic differs depending on the current ownership type of the mutex. If the unlocking thread has unique ownership, or is the last shared owner, it might have to wake up some thread(s) from the queue. It will either wake up all threads at the front of the queue that are requesting shared ownership, or a single thread at the front of the queue requesting unique ownership.
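To make that bookkeeping concrete, here is a toy reader-writer mutex sketched on top of std::mutex and std::condition_variable. It only illustrates the extra state described above (a shared-owner count, pending unique lockers, separate wake-ups); it is not how boost::shared_mutex or std::shared_mutex is actually implemented, and all the names are made up:

#include <condition_variable>
#include <mutex>

class toy_shared_mutex {
    std::mutex m_;                     // protects the state below
    std::condition_variable gate1_;    // readers and new writers wait here
    std::condition_variable gate2_;    // a writer waits here for readers to drain
    unsigned readers_ = 0;             // current shared owners
    bool writer_ = false;              // is there a unique owner?
    unsigned writers_waiting_ = 0;     // queued unique-lock attempts
public:
    void lock() {                      // unique (write) lock
        std::unique_lock<std::mutex> lk(m_);
        ++writers_waiting_;
        gate1_.wait(lk, [&] { return !writer_; });       // wait for the unique slot
        --writers_waiting_;
        writer_ = true;
        gate2_.wait(lk, [&] { return readers_ == 0; });  // wait for shared owners to drain
    }
    void unlock() {
        std::lock_guard<std::mutex> lk(m_);
        writer_ = false;
        gate1_.notify_all();           // wake everyone; the predicates decide who proceeds
    }
    void lock_shared() {               // shared (read) lock
        std::unique_lock<std::mutex> lk(m_);
        // A queued unique locker blocks new shared lockers, so writers aren't starved.
        gate1_.wait(lk, [&] { return !writer_ && writers_waiting_ == 0; });
        ++readers_;
    }
    void unlock_shared() {
        std::lock_guard<std::mutex> lk(m_);
        if (--readers_ == 0 && writer_)
            gate2_.notify_one();       // last shared owner out: let the pending unique owner in
    }
};

Notice that even lock_shared has to take the internal lock and inspect several pieces of state before it can proceed; that extra coordination is exactly the kind of work a plain mutex doesn't have to do.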
As you can imagine, all of this extra logic for who's locking for what reasons, and how it changes depending not only on the current owners but also on the contents of the queue, makes this potentially quite a bit slower. The hope is that you read significantly more frequently than you write, and thus you can have many sharing owners running concurrently, which mitigates the performance hit of coordinating all of that.
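As a rough illustration of that read-mostly pattern (all names made up): shared locks on the hot read path, a unique lock only for the occasional write:

#include <shared_mutex>
#include <string>
#include <unordered_map>

class Config {
    mutable std::shared_mutex m_;
    std::unordered_map<std::string, std::string> values_;
public:
    std::string lookup(const std::string& key) const {
        std::shared_lock<std::shared_mutex> lock(m_);   // many readers may hold this at once
        auto it = values_.find(key);
        return it == values_.end() ? std::string{} : it->second;
    }
    void update(const std::string& key, std::string value) {
        std::unique_lock<std::shared_mutex> lock(m_);   // exclusive: waits for all readers
        values_[key] = std::move(value);
    }
};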