Suppose I have an application with multiple threads that need to access some shared data.
I know that a mutex (Critical Section) can be used to ensure that at most one thread at a time can access the shared data. This prevents the case that one thread reads the shared data while another thread is currently modifying the shared data, which otherwise could lead to reading an inconsistent state. The mutex also prevents the case that multiple threads are modifying the shared data at the same time, which otherwise could lead to conflicting writes.
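For concreteness, here is roughly the pattern I am talking about (a minimal C++ sketch, with made-up names):

```cpp
#include <mutex>

std::mutex g_mutex;        // guards g_shared_counter
long g_shared_counter = 0; // the shared data

void increment()
{
    std::lock_guard<std::mutex> lock(g_mutex); // at most one thread in here
    ++g_shared_counter;                        // read-modify-write is now safe
}

long read_value()
{
    std::lock_guard<std::mutex> lock(g_mutex); // readers take the lock too
    return g_shared_counter;
}
```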
Now, on a machine with multiple processors or multiple CPU cores, there is another problem: if a thread A running on processor #1 modifies the shared data, and then a thread B running on processor #2 reads the shared data, thread B may still "see" some outdated data – even if mutual exclusion was enforced. That is because, in general, each processor (core) has its own local CPU cache. So, unless the caches are flushed, even after the shared data was modified in main memory (RAM) by one CPU, the other CPUs can still have an "old" version in their local caches!
To my understanding, a mutex alone does not fix this caching issue, because a mutex is all about enforcing a specific ordering of the accesses, but, in general, a mutex does nothing regarding the CPU caches. After all, there is not even a way to "bind" a mutex to certain memory addresses that would need to be flushed from the CPU caches when the mutex is acquired or released.
So, how do we deal with this "cache synchronization" problem in practice?
And how exactly do memory fences fit into the picture? I know they exist to prevent re-ordering of memory accesses, but, again, it is not clear how this affects multiple CPUs. Do I need to combine a mutex with a memory fence to prevent a thread from reading outdated data from its local cache?
I have read the mutex documentation (e.g. for Enter/LeaveCriticalSection()), but it does not clarify how the mutex functions interact with the processor caches...
Also, I have read that one can allocate memory pages, e.g. via the VirtualAlloc() function, with the special PAGE_NOCACHE flag, so that those pages will not be cached at all. But is this really necessary whenever data needs to be shared across processor cores?
pthread_mutex_lock() / pthread_mutex_unlock(), or C11 mtx_lock() / mtx_unlock(), or whatever, are responsible for using acquire and release operations when taking/releasing the lock, to control the visibility order of everything in the critical section so it stays between the lock/unlock. Hardware maintains cache coherence on its own, so ordering local accesses to L1d cache is sufficient for mutex functions.
(Yes, this is true for all CPUs we run threads across. If you have CPUs that share memory but without cache coherence, you don't run threads of the same program across those cores, like a DSP and a microcontroller.)
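For example, here is all it takes in practice (a minimal C++ sketch with illustrative names; the same applies to the pthread and C11 equivalents): plain lock/unlock calls, with no explicit cache or fence management anywhere:

```cpp
#include <iostream>
#include <mutex>
#include <thread>

std::mutex m;
int shared_data = 0;   // protected by m

void writer()
{
    std::lock_guard<std::mutex> lock(m);  // taking the lock is an acquire operation
    shared_data = 42;
}                                          // releasing it is a release operation:
                                           // it "publishes" the store

void reader()
{
    std::lock_guard<std::mutex> lock(m);  // acquire: synchronizes with the unlock
    std::cout << shared_data << '\n';     // prints 0 or 42 depending on who locked
}                                          // first, never a stale or torn value

int main()
{
    std::thread a(writer), b(reader);
    a.join();
    b.join();
}
```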
On a hypothetical machine without hardware cache coherence, lock/unlock would probably have to manually flush caches in their entirety. (Since the mutex API doesn't provide it with address ranges for the shared data.)
In no case do you have to call MemoryBarrier() or atomic_thread_fence(release) or _mm_mfence() yourself, as long as you're just using standard mutex functions correctly, not lock-free atomics.
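The exception is when you write lock-free code yourself. For instance, a sketch of release/acquire message-passing with C++ std::atomic (illustrative names), where you do pick the memory orderings explicitly:

```cpp
#include <atomic>

std::atomic<bool> ready{false};
int payload = 0;   // plain non-atomic data, published via 'ready'

void producer()
{
    payload = 123;
    ready.store(true, std::memory_order_release);   // release: makes the payload
}                                                    // store visible first

void consumer()
{
    while (!ready.load(std::memory_order_acquire))  // acquire: pairs with the
        ;                                            // release store above
    // payload is guaranteed to be 123 here; no separate fence call needed,
    // the orderings on the atomic operations themselves do the job
}
```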
If you look at C++'s std::mutex operations, or the C11 equivalent, they are formally documented as being release and acquire operations, which makes it 100% clear that they work between threads to create a happens-before relationship. (This usage for mutexes is where the names release and acquire for lock-free operations and memory barriers came from. See Jeff Preshing's excellent article, which I also linked earlier.)
If you look at the POSIX docs, they also talk about "release" and "acquire" of mutexes. Almost certainly there's a formal definition of those terms somewhere, in a way that guarantees they work between threads even if those threads are running concurrently on an SMP machine. If you didn't realize that "release" and "acquire" are technical terms, perhaps you missed their significance when reading those docs.
Somewhat related: How does a mutex lock and unlock functions prevents CPU reordering? also covers compile-time reordering, which the mutex functions prevent by being non-inline. (Or, if they were fully defined in headers using atomic ops, those atomic ops would limit compile-time reordering sufficiently.)
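A sketch of that compile-time effect, with hypothetical opaque_lock()/opaque_unlock() standing in for real mutex functions whose definitions live in another translation unit:

```cpp
// Declarations only; the definitions are in another translation unit,
// so the compiler must assume these calls may read or write 'shared'.
void opaque_lock();
void opaque_unlock();

extern int shared;

void critical_section()
{
    opaque_lock();    // no access to 'shared' can be hoisted above this call
    shared += 1;
    opaque_unlock();  // ...and the store can't be sunk below this call
}
```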