I recently ran into some synchronization problems, which led me to spinlocks and atomic counters. Then I searched a bit more to learn how these work, and found std::memory_order and memory barriers (mfence, lfence and sfence).
So now, it seems that I should use acquire/release for the spinlocks and relaxed for the counters.
Related references:
- x86 MFENCE - Memory Fence
- x86 LOCK - Assert LOCK# Signal
What is the machine code (edit: see below) for those three operations (lock = test_and_set, unlock = clear, increment = operator++ = fetch_add) with the default (seq_cst) memory order, and with acquire/release/relaxed (in that order for those three operations)? What is the difference (which memory barriers go where), and what is the cost (how many CPU cycles)?
I was just wondering how bad my old code (which did not specify a memory order, i.e. used seq_cst) really is, and whether I should create some class atomic_counter derived from std::atomic but using relaxed memory ordering (as well as a good spinlock with acquire/release instead of mutexes in some places... or use something from the Boost library - I have avoided Boost so far).
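Something like this is what I have in mind for the counter (just a sketch; the class name and interface are mine, and I used composition instead of deriving from std::atomic to keep the interface small):

#include <atomic>

// Sketch: a counter that always uses relaxed ordering.
class atomic_counter {
    std::atomic<long> value;
public:
    atomic_counter() : value(0) {}
    long operator++() { return value.fetch_add(1, std::memory_order_relaxed) + 1; }
    long get() const { return value.load(std::memory_order_relaxed); } // may be a few cycles stale
};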
So far I understand that a spinlock protects more than itself (it protects some shared resource/memory as well), so there must be something that makes the memory view coherent across multiple threads/cores (that would be those acquire/release semantics and memory fences). An atomic counter just lives for itself and only needs the atomic increment (no other memory is involved, and I do not really care about the value when I read it; it is informative and can be a few cycles old, no problem). There is some LOCK prefix, and some instructions like xchg have it implicitly. Here my knowledge ends; I don't know how the caches and buses really work and what is behind them (but I know that modern CPUs can reorder instructions, execute them in parallel, and use memory caches and some synchronization). Thank you for the explanation.
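For concreteness, the acquire/release spinlock I have in mind would be something like this (again just a sketch, untuned):

#include <atomic>

// Sketch: spinlock with acquire on lock and release on unlock.
class spin_lock {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;
public:
    void lock() {
        // acquire: nothing from the critical section may move above this
        while (flag.test_and_set(std::memory_order_acquire))
            ; // spin
    }
    void unlock() {
        // release: nothing from the critical section may move below this
        flag.clear(std::memory_order_release);
    }
};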
P.S.: I have an old 32-bit PC now; I can only see lock addl and simple xchg, nothing else - all versions look the same (except unlock), and memory_order makes no difference on my old PC (except for unlock, where release uses a plain mov instead of xchg). Will that be true for a 64-bit PC? (edit: see below) Do I have to care about memory order? (answer: no, not much; release on unlock saves a few cycles, that's all.)
#include <atomic>
using namespace std;

atomic_flag spinlock = ATOMIC_FLAG_INIT; // must be explicitly initialized before C++20
atomic<int> counter;

void inc1() {
    counter++;
}
void inc2() {
    counter.fetch_add(1, memory_order_relaxed);
}
void lock1() {
    while (spinlock.test_and_set()) ;
}
void lock2() {
    while (spinlock.test_and_set(memory_order_acquire)) ;
}
void unlock1() {
    spinlock.clear();
}
void unlock2() {
    spinlock.clear(memory_order_release);
}

int main() {
    inc1();
    inc2();
    lock1();
    unlock1();
    lock2();
    unlock2();
}
Edit: here is the 32-bit assembly:

__Z4inc1v:
__Z4inc2v:
lock addl $1, _counter ; both seq_cst and relaxed
ret
__Z5lock1v:
__Z5lock2v:
movl $1, %edx
L5:
movl %edx, %eax
xchgb _spinlock, %al ; both seq_cst and acquire
testb %al, %al
jne L5
rep ret
__Z7unlock1v:
movl $0, %eax
xchgb _spinlock, %al ; seq_cst
ret
__Z7unlock2v:
movb $0, _spinlock ; release
ret
Edit: and the 64-bit assembly; the only difference is the mfence in unlock1:

_Z4inc1v:
_Z4inc2v:
lock addl $1, counter(%rip) ; both seq_cst and relaxed
ret
_Z5lock1v:
_Z5lock2v:
movl $1, %edx
.L5:
movl %edx, %eax
xchgb spinlock(%rip), %al ; both seq_cst and acquire
testb %al, %al
jne .L5
ret
_Z7unlock1v:
movb $0, spinlock(%rip)
mfence ; seq_cst
ret
_Z7unlock2v:
movb $0, spinlock(%rip) ; release
ret
x86 has a mostly strong memory model: all the ordinary stores/loads implicitly have release/acquire semantics. The only exception is SSE non-temporal store operations, which require an sfence to be ordered as usual. All read-modify-write (RMW) instructions with the LOCK prefix imply a full memory barrier, i.e. seq_cst.
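For example (my illustration, not part of the question), a release store / acquire load pair needs no extra fencing on x86; both compile to a plain mov:

#include <atomic>

int payload;
std::atomic<bool> ready(false);

void producer() {
    payload = 42;                                  // ordinary store
    ready.store(true, std::memory_order_release);  // plain mov on x86
}

int consumer() {
    while (!ready.load(std::memory_order_acquire)) // plain mov on x86
        ;                                          // spin until published
    return payload;                                // guaranteed to read 42
}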
Thus on x86, we have the following:

- test_and_set can be coded with lock bts (for bit-wise operations), lock cmpxchg, or lock xchg (or just xchg, which implies the lock prefix). Other spin-lock implementations can use instructions like lock inc (or dec) if they need e.g. fairness. It is not possible to implement try_lock with a release/acquire fence (at the very least, you'd need a standalone memory barrier, mfence, anyway).
- clear is coded with lock and (for bit-wise operations) or lock xchg, though more efficient implementations would use a plain write (mov) instead of a locked instruction.
- fetch_add is coded with lock add.

Removing the lock prefix will not guarantee atomicity for RMW operations, so such operations cannot be interpreted strictly as having memory_order_relaxed in the C++ view. However, in practice you might want to access an atomic variable via a faster non-atomic operation when it is safe (e.g. in a constructor, or under a lock); see the sketch below.
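To illustrate that last point (a sketch of mine; the names are hypothetical): where exclusivity is already guaranteed, a relaxed store compiles to a plain mov with no LOCK prefix:

#include <atomic>
#include <mutex>

struct Stats {
    std::atomic<long> hits;
    std::mutex m;

    Stats() : hits(0) {}   // construction: no other thread can see the object yet

    void reset() {
        std::lock_guard<std::mutex> g(m);          // exclusivity via the lock
        hits.store(0, std::memory_order_relaxed);  // plain mov, no LOCK prefix
    }
};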
In our experience, it does not really matter which RMW atomic operation is performed; they all take almost the same number of cycles to execute (and mfence costs about 0.5x of a locked operation). You can estimate the performance of a synchronization algorithm by counting the number of atomic operations (and mfences), and the number of memory indirections (cache misses).
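If you want to count cycles on your own machine, a crude single-threaded micro-benchmark along these lines is enough (my sketch; it measures only the uncontended cost, and with gcc on x86 the seq_cst fence below compiles to mfence):

#include <atomic>
#include <chrono>
#include <cstdio>

std::atomic<int> n(0);

template <class F>
double ns_per_op(F op) {
    const int iters = 100000000;
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i)
        op();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::nano>(t1 - t0).count() / iters;
}

int main() {
    std::printf("lock add: %.2f ns/op\n",
                ns_per_op([] { n.fetch_add(1, std::memory_order_seq_cst); }));
    std::printf("fence:    %.2f ns/op\n",
                ns_per_op([] { std::atomic_thread_fence(std::memory_order_seq_cst); }));
}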