Tags: c++, gcc, atomic, redhat, stdatomic

std::atomic::is_lock_free() shows true but pthread_mutex_lock() called


I have an atomic variable holding a 16-byte value, and I hope the load/store operations on it will be lock-free, since they can be implemented with cmpxchg16b.

Here is my sample code:

#include <atomic>
#include <iostream>
#include <pthread.h>

// Interpose pthread_mutex_lock so we can see whether libatomic takes a lock.
// (glibc declares it extern "C", so this definition replaces the libc one.)
int pthread_mutex_lock(pthread_mutex_t *mutex) {
  std::cout << "in pthread_mutex_lock" << std::endl;
  return 0;
}

int main() {
  std::atomic<__int128> var;
  std::cout << "is_lock_free: " << var.is_lock_free() << std::endl;
  var.load();
}

When compiled with g++ test.cpp -latomic, the output is

is_lock_free: 1

The instructions executed in var.load() are

0x7ffff7bd6740  push %rbx
0x7ffff7bd6741  xor    %ecx,%ecx
0x7ffff7bd6743  mov    %rcx,%rbx
0x7ffff7bd6746  sub    $0x20,%rsp
0x7ffff7bd674a  movq   $0x0,0x8(%rsp)
0x7ffff7bd6753  mov    0x8(%rsp),%rdx
0x7ffff7bd6758  mov    %fs:0x28,%rax
0x7ffff7bd6761  mov    %rax,0x18(%rsp)
0x7ffff7bd6766  xor    %eax,%eax
0x7ffff7bd6768  movq   $0x0,(%rsp)
0x7ffff7bd6770  lock cmpxchg16b (%rdi)
0x7ffff7bd6775  je     0x7ffff7bd6780
0x7ffff7bd6777  mov    %rax,(%rsp)
0x7ffff7bd677b  mov    %rdx,0x8(%rsp)
0x7ffff7bd6780  mov    0x18(%rsp),%rsi
0x7ffff7bd6785  xor    %fs:0x28,%rsi
0x7ffff7bd678e  mov    (%rsp),%rax
0x7ffff7bd6792  mov    0x8(%rsp),%rdx
0x7ffff7bd6797  jne    0x7ffff7bd679f
0x7ffff7bd6799  add    $0x20,%rsp
0x7ffff7bd679d  pop    %rbx
0x7ffff7bd679e  retq     

and is_lock_free() executes as follows:

cmp    $0x10,%rdi                               
ja     0x7ffff7bd5908 <__atomic_is_lock_free+136>
lea    0x17d3(%rip),%rax        # 0x7ffff7bd7064
movslq (%rax,%rdi,4),%rdx                       
add    %rdx,%rax
jmpq   *%rax    
nopw   0x0(%rax,%rax,1)                         
test   $0x3,%sil
je     0x7ffff7bd58be <__atomic_is_lock_free+62>
and    $0x7,%esi
add    %rsi,%rdi
cmp    $0x8,%rdi
setbe  %al      
retq            
nopl   0x0(%rax)
test   $0x1,%sil
jne    0x7ffff7bd58c8 <__atomic_is_lock_free+72>
mov    $0x1,%eax
retq
nopl   0x0(%rax)
mov    %rsi,%rdx
mov    $0x1,%eax
and    $0x3,%edx
add    %rdi,%rdx
cmp    $0x4,%rdx
ja     0x7ffff7bd58a6 <__atomic_is_lock_free+38>  
repz retq       
xchg   %ax,%ax  
and    $0x7,%esi
sete   %al      
retq            
nopw   0x0(%rax,%rax,1)                           
xor    %eax,%eax
and    $0xf,%esi
jne    0x7ffff7bd58dc <__atomic_is_lock_free+92>  
mov    0x2047a3(%rip),%eax        # 0x7ffff7dda0a0
shr    $0xd,%eax
and    $0x1,%eax
retq            
nopl   0x0(%rax)
xor    %eax,%eax
retq

But when compiled with g++ test.cpp -latomic -Wl,-z,now, the output is

is_lock_free: 1
in pthread_mutex_lock

Now the instructions executed are

0x7ffff7bd6050  push   %rbx
0x7ffff7bd6051  mov    %rdi,%rbx
0x7ffff7bd6054  sub    $0x10,%rsp
0x7ffff7bd6058  callq  0x7ffff7bd5910
0x7ffff7bd605d  mov    (%rbx),%rax
0x7ffff7bd6060  mov    0x8(%rbx),%rdx
0x7ffff7bd6064  mov    %rbx,%rdi
0x7ffff7bd6067  mov    %rax,(%rsp)
0x7ffff7bd606b  mov    %rdx,0x8(%rsp)
0x7ffff7bd6070  callq  0x7ffff7bd5930
0x7ffff7bd6075  mov    (%rsp),%rax
0x7ffff7bd6079  mov    0x8(%rsp),%rdx
0x7ffff7bd607e  add    $0x10,%rsp
0x7ffff7bd6082  pop    %rbx   

Stepping into 0x7ffff7bd5910 with gdb shows that it calls pthread_mutex_lock, so the load is implemented with a lock.

Why does the atomic behave differently from what is_lock_free() reports? And how does -Wl,-z,now cause this?

How can I ensure that 16-byte loads/stores are lock-free?

My environment is:

[test@15bf6105d708 test]$> gcc --version
gcc (GCC) 8.3.1 20190311 (Red Hat 8.3.1-3)
Copyright (C) 2018 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

[test@15bf6105d708 test]$> uname -a
Linux 15bf6105d708 5.4.119-19-0009.11 #1 SMP Wed Oct 5 18:41:07 CST 2022 x86_64 x86_64 x86_64 GNU/Linux

[test@15bf6105d708 test]$> lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                16
On-line CPU(s) list:   0-15
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             1
NUMA node(s):          1
Vendor ID:             AuthenticAMD
CPU family:            23
Model:                 49
Model name:            AMD EPYC 7K62 48-Core Processor
Stepping:              0
CPU MHz:               2595.124
BogoMIPS:              5190.24
Hypervisor vendor:     KVM
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              4096K
L3 cache:              16384K
NUMA node0 CPU(s):     0-15
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 arat

Solution

  • Yes, under normal circumstances, GCC and libatomic will use cmpxchg16b for 16-byte atomic objects, provided that the runtime CPU supports it, and will not use locks. This includes all modern x86-64 CPUs. (Sufficiently recent GCC can also use 16-byte AVX loads and stores where available, which are also documented as atomic.)

    The exceptions are certain early 64-bit CPUs sold around 2004-2006 which lack cmpxchg16b support. See How prevalent are old x64 processors lacking the cmpxchg16b instruction? (on SuperUser). When running on such CPUs, libatomic will fall back to a lock-based implementation that does call pthread_mutex_lock.
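    If you want to verify at runtime whether the CPU you are on has cmpxchg16b, you can query cpuid yourself. A minimal sketch using GCC's <cpuid.h> (leaf 1, ECX bit 13 is the CMPXCHG16B feature flag):

    #include <cpuid.h>
    #include <cstdio>

    int main() {
        unsigned int eax, ebx, ecx, edx;
        // bit_CMPXCHG16B is (1 << 13), defined by GCC's <cpuid.h>.
        if (__get_cpuid(1, &eax, &ebx, &ecx, &edx) && (ecx & bit_CMPXCHG16B))
            std::puts("cmpxchg16b available: libatomic can use the lock-free path");
        else
            std::puts("no cmpxchg16b: libatomic falls back to locks");
    }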

    Your test is running into bug 60790, affecting GCC 8.3 and earlier, which would cause the fallback implementation to always be selected when -z now was used. libatomic uses indirect functions to select the correct implementations at runtime, which involve calling a "resolver function" and setting the GOT entry based on its result. Each indirect function has its own resolver, so to avoid all of them having to execute cpuid, GCC up to 8.3 had a function init_cpuid() which ran as an ELF constructor. It would execute cpuid and save the result in internal variables, which the resolvers would then consult. However, under some conditions, such as -z now, the resolver would run before the init_cpuid() constructor, and thus would see these internal variables as 0. In particular, the bit indicating cmpxchg16b support would not be set, so the resolvers would select the fallback lock-based functions.
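    To make the ordering problem concrete, here is a minimal, hypothetical sketch of the ifunc pattern itself. The names are illustrative, not libatomic's actual symbols, and ifunc is a GNU extension, so this assumes g++ on GNU/Linux. Note that the timing difference only arises when the resolver lives in a shared library, as libatomic's does:

    #include <cstdio>

    static unsigned cpu_features;          // like libatomic's saved cpuid result

    __attribute__((constructor))
    static void my_init_cpuid() {
        cpu_features = 1u << 13;           // pretend cpuid reported cmpxchg16b
    }

    extern "C" void impl_cmpxchg16b() { std::puts("lock-free path"); }
    extern "C" void impl_locked()     { std::puts("mutex fallback path"); }

    // In libatomic's case the resolver and the constructor live in a shared
    // library: with lazy binding the resolver runs at the first call, after
    // constructors; with -z now it runs during relocation, before
    // my_init_cpuid(), sees cpu_features == 0, and picks the fallback.
    extern "C" void (*resolve_load16())() {
        return (cpu_features & (1u << 13)) ? impl_cmpxchg16b : impl_locked;
    }

    extern "C" void load16() __attribute__((ifunc("resolve_load16")));

    int main() { load16(); }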

    That explains what you're observing in your test.

    The fix in 8.4 and later has each resolver check whether the variable has been initialized, and initialize it if not, so the correct implementation is called even with -z now.
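    In terms of the sketch above, the fixed resolver has roughly this shape (again just an illustration of the idea, not GCC's actual code):

    extern "C" void (*resolve_load16_fixed())() {
        if (cpu_features == 0)
            my_init_cpuid();               // initialize on demand if the
                                           // constructor hasn't run yet
        return (cpu_features & (1u << 13)) ? impl_cmpxchg16b : impl_locked;
    }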


    There's a question of whether this means that 16-byte atomic types are "lock free". Older versions of GCC, apparently including the one you're using, would indeed report is_lock_free() == true if cmpxchg16b was available on the runtime CPU. That matches the behavior you saw in your test: the only problem was that due to the aforementioned bug, the cmpxchg16b code wasn't actually called. (The libatomic function underlying is_lock_free() was not affected by the bug, as it wasn't implemented as an indirect function; it would simply look at the type size and the value of the cpuid result variable.)

    Since then, the GCC developers decided to change this and unconditionally report is_lock_free() == false for all 16-byte types on x86-64. The feeling was that programmers expected is_lock_free() == true to imply that loads and stores were done with "fast" instructions, whereas using cmpxchg16b for stores could in principle require an unbounded number of loops if there is heavy contention. See Genuinely test std::atomic is lock-free or not and the relevant patch with discussion. So that's what you'll observe when testing this with modern GCC versions; but cmpxchg16b is still being used.
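    You can observe the current reporting for yourself with a small check, assuming GCC with libatomic (is_always_lock_free requires C++17, so compile with g++ -std=c++17 test.cpp -latomic):

    #include <atomic>
    #include <cstdio>

    int main() {
        std::atomic<__int128> v{};
        // On current GCC both print 0 for 16-byte types, even though the
        // loads/stores still go through cmpxchg16b when the CPU supports it.
        std::printf("is_always_lock_free: %d\n",
                    (int)std::atomic<__int128>::is_always_lock_free);
        std::printf("is_lock_free: %d\n", (int)v.is_lock_free());
    }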

    This change was actually made in GCC 7 and was already in place in GCC 8.3, so perhaps your GCC 8.3 compiler is linking against a libatomic from an even earlier GCC version?