I have an atomic variable containing a 16-byte member, and I would like load/store operations on it to be lock-free, since this can be achieved with cmpxchg16b. Here is a sample program:
#include <atomic>
#include <iostream>
#include <pthread.h>

// Interpose the glibc symbol so we can see whether libatomic takes a lock.
extern "C" int pthread_mutex_lock(pthread_mutex_t *mutex) {
    std::cout << "in pthread_mutex_lock" << std::endl;
    return 0;
}

int main() {
    std::atomic<__int128> var;
    std::cout << "is_lock_free: " << var.is_lock_free() << std::endl;
    var.load();
}
When compiled with g++ test.cpp -latomic, the output is
is_lock_free: 1
The instructions executed in var.load() are:
0x7ffff7bd6740 push %rbx
0x7ffff7bd6741 xor %ecx,%ecx
0x7ffff7bd6743 mov %rcx,%rbx
0x7ffff7bd6746 sub $0x20,%rsp
0x7ffff7bd674a movq $0x0,0x8(%rsp)
0x7ffff7bd6753 mov 0x8(%rsp),%rdx
0x7ffff7bd6758 mov %fs:0x28,%rax
0x7ffff7bd6761 mov %rax,0x18(%rsp)
0x7ffff7bd6766 xor %eax,%eax
0x7ffff7bd6768 movq $0x0,(%rsp)
0x7ffff7bd6770 lock cmpxchg16b (%rdi)
0x7ffff7bd6775 je 0x7ffff7bd6780
0x7ffff7bd6777 mov %rax,(%rsp)
0x7ffff7bd677b mov %rdx,0x8(%rsp)
0x7ffff7bd6780 mov 0x18(%rsp),%rsi
0x7ffff7bd6785 xor %fs:0x28,%rsi
0x7ffff7bd678e mov (%rsp),%rax
0x7ffff7bd6792 mov 0x8(%rsp),%rdx
0x7ffff7bd6797 jne 0x7ffff7bd679f
0x7ffff7bd6799 add $0x20,%rsp
0x7ffff7bd679d pop %rbx
0x7ffff7bd679e retq
and is_lock_free executes the following:
cmp $0x10,%rdi
ja 0x7ffff7bd5908 <__atomic_is_lock_free+136>
lea 0x17d3(%rip),%rax # 0x7ffff7bd7064
movslq (%rax,%rdi,4),%rdx
add %rdx,%rax
jmpq *%rax
nopw 0x0(%rax,%rax,1)
test $0x3,%sil
je 0x7ffff7bd58be <__atomic_is_lock_free+62>
and $0x7,%esi
add %rsi,%rdi
cmp $0x8,%rdi
setbe %al
retq
nopl 0x0(%rax)
test $0x1,%sil
jne 0x7ffff7bd58c8 <__atomic_is_lock_free+72>
mov $0x1,%eax
retq
nopl 0x0(%rax)
mov %rsi,%rdx
mov $0x1,%eax
and $0x3,%edx
add %rdi,%rdx
cmp $0x4,%rdx
ja 0x7ffff7bd58a6 <__atomic_is_lock_free+38>
repz retq
xchg %ax,%ax
and $0x7,%esi
sete %al
retq
nopw 0x0(%rax,%rax,1)
xor %eax,%eax
and $0xf,%esi
jne 0x7ffff7bd58dc <__atomic_is_lock_free+92>
mov 0x2047a3(%rip),%eax # 0x7ffff7dda0a0
shr $0xd,%eax
and $0x1,%eax
retq
nopl 0x0(%rax)
xor %eax,%eax
retq
But when compiled with g++ test.cpp -latomic -Wl,-z,now, the output is
is_lock_free: 1
in pthread_mutex_lock
Now the instructions executed are:
0x7ffff7bd6050 push %rbx
0x7ffff7bd6051 mov %rdi,%rbx
0x7ffff7bd6054 sub $0x10,%rsp
0x7ffff7bd6058 callq 0x7ffff7bd5910
0x7ffff7bd605d mov (%rbx),%rax
0x7ffff7bd6060 mov 0x8(%rbx),%rdx
0x7ffff7bd6064 mov %rbx,%rdi
0x7ffff7bd6067 mov %rax,(%rsp)
0x7ffff7bd606b mov %rdx,0x8(%rsp)
0x7ffff7bd6070 callq 0x7ffff7bd5930
0x7ffff7bd6075 mov (%rsp),%rax
0x7ffff7bd6079 mov 0x8(%rsp),%rdx
0x7ffff7bd607e add $0x10,%rsp
0x7ffff7bd6082 pop %rbx
Stepping into 0x7ffff7bd5910 with gdb shows that it calls pthread_mutex_lock, so load is implemented with a lock.
Why does atomic behave differently from what is_lock_free() reports? How does -Wl,-z,now cause this? And how can I ensure that 16-byte loads/stores are lock-free?
My environment is
[test@15bf6105d708 test]$> gcc --version
gcc (GCC) 8.3.1 20190311 (Red Hat 8.3.1-3)
Copyright (C) 2018 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
[test@15bf6105d708 test]$> uname -a
Linux 15bf6105d708 5.4.119-19-0009.11 #1 SMP Wed Oct 5 18:41:07 CST 2022 x86_64 x86_64 x86_64 GNU/Linux
[test@15bf6105d708 test]$> lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 1
NUMA node(s): 1
Vendor ID: AuthenticAMD
CPU family: 23
Model: 49
Model name: AMD EPYC 7K62 48-Core Processor
Stepping: 0
CPU MHz: 2595.124
BogoMIPS: 5190.24
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 4096K
L3 cache: 16384K
NUMA node0 CPU(s): 0-15
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 arat
Yes, under normal circumstances, GCC and libatomic will use cmpxchg16b for 16-byte atomic objects, provided that the runtime CPU supports it, and will not use locks. This includes all modern x86-64 CPUs. (Sufficiently recent GCC can also use 16-byte AVX loads and stores where available, which are also documented as atomic.)
The exceptions are certain early 64-bit CPUs sold around 2004-2006 which lack cmpxchg16b support. See How prevalent are old x64 processors lacking the cmpxchg16b instruction? (on SuperUser). When running on such CPUs, libatomic will fall back to a lock-based implementation that does call pthread_mutex_lock.
Your test is running into bug 60790, affecting GCC 8.3 and earlier, which would cause the fallback implementation to always be selected when -z now was used. libatomic uses indirect functions to select the correct implementations at runtime, which involve calling a "resolver function" and setting the GOT entry based on its result. Each indirect function has its own resolver, so to avoid all of them having to execute cpuid, GCC up to 8.3 had a function init_cpuid() which ran as an ELF constructor. It would execute cpuid and save the result in internal variables, which the resolvers would then consult. However, under some conditions, such as -z now, a resolver could run before the init_cpuid() constructor, and thus would see these internal variables as 0. In particular, the bit indicating cmpxchg16b support would not be set, so the resolvers would select the fallback lock-based functions.
That explains what you're observing in your test.
The fix in 8.4 and later has each resolver check whether the variable has been initialized, and initialize it if not, so the correct implementation is selected even with -z now.
There's a question of whether this means that 16-byte atomic types are "lock free". Older versions of GCC, apparently including the one you're using, would indeed report is_lock_free() == true if cmpxchg16b was available on the runtime CPU. That matches the behavior you saw in your test; the only problem was that, due to the aforementioned bug, the cmpxchg16b code wasn't actually called. (The libatomic function underlying is_lock_free() was not affected by the bug, as it wasn't implemented as an indirect function; it simply looks at the type size and the value of the cpuid result variable.)
Since then, the GCC developers decided to change this and unconditionally report is_lock_free() == false for all 16-byte types on x86-64. The feeling was that programmers expected is_lock_free() == true to imply that loads and stores are done with "fast" instructions, whereas using cmpxchg16b for stores could in principle require an unbounded number of retries under heavy contention. See Genuinely test std::atomic is lock-free or not and the relevant patch with discussion. So that's what you'll observe when testing this with modern GCC versions; but cmpxchg16b is still being used.
This change was actually made in GCC 7 and was already in place in GCC 8.3, so perhaps your GCC 8.3 compiler is linking against a libatomic from an even earlier GCC version?