So for the following code, I see a strange inlining/optimisation artefact.
I'm curious to know whether the repeated init checks are "necessary" in some hypothetical scenario I'm not appreciating.
Godbolt: https://godbolt.org/z/M8PW1obE7
#include <cstdint>
#include <stdio.h>

struct ThreadStateLogger
{
    static thread_local struct Instance
    {
        char TLS_byteLoc[3] {' ', 0, 0};
        uint8_t TLS_numBytes {1};
        // 4 wasted bytes here...
        char* TLS_byteLocPtr {TLS_byteLoc};

        void Log(char v)
        {
            TLS_byteLocPtr[0] = v;
        }
        void Log(char v1, char v2)
        {
            TLS_byteLocPtr[0] = v1;
            if (TLS_numBytes > 1)
                TLS_byteLocPtr[1] = v2;
        }
    } instance;

    static void Log(char v1, char v2)
    {
        instance.Log(v1, v2);
    }
    // static void Log(char v1, char v2)
    // {
    //     instance.TLS_byteLocPtr[0] = v1;
    //     if (instance.TLS_numBytes > 1)
    //         instance.TLS_byteLocPtr[1] = v2;
    // }
};

extern ThreadStateLogger theThreadStateLogger;

int main()
{
    ThreadStateLogger::Log('a', 'b');
    // printf("Hello world");
    ThreadStateLogger::Log('c', 'd');
    return 0;
}
The whole primary implementation gets inlined with -O3, which is what I want :-)
So it appears that the first Log() call correctly checks whether this TLS object needs initialising, and then obtains the resolved address via __tls_get_addr@PLT, which is all good.
The second Log() call also apparently checks whether the object needs initialising, but then uses the address cached from the first call (in rbx)! So if it did initialise at that point, couldn't that address be wrong?
Below is the output from clang 16 on godbolt, which is comparable to what gcc produces: the same re-initialisation test and cached address. It does somewhat better than my current clang 10 with -fPIC. https://godbolt.org/z/M8PW1obE7
main:                                   # @main
        push    rbx
        # check whether a TLS init function exists (non-null GOT entry)
        cmp     qword ptr [rip + _ZTHN17ThreadStateLogger8instanceE@GOTPCREL], 0
        je      .LBB0_2
        call    TLS init function for ThreadStateLogger::instance@PLT
.LBB0_2:                                # resolve the TLS address (general-dynamic)
        data16
        lea     rdi, [rip + ThreadStateLogger::instance@TLSGD]
        data16
        data16
        rex64
        call    __tls_get_addr@PLT
        mov     rbx, rax                        # cache the resolved TLS address
        mov     rax, qword ptr [rax + 8]        # load TLS_byteLocPtr
        mov     byte ptr [rax], 97              # 'a'
        cmp     byte ptr [rbx + 3], 2           # TLS_numBytes > 1 ?
        jae     .LBB0_3
        # second init check, for the second Log() call
        cmp     qword ptr [rip + _ZTHN17ThreadStateLogger8instanceE@GOTPCREL], 0
        jne     .LBB0_5
.LBB0_6:
        mov     rax, qword ptr [rbx + 8]
        mov     byte ptr [rax], 99              # 'c'
        cmp     byte ptr [rbx + 3], 2
        jae     .LBB0_7
.LBB0_8:
        xor     eax, eax
        pop     rbx
        ret
.LBB0_3:
        mov     rax, qword ptr [rbx + 8]
        mov     byte ptr [rax + 1], 98          # 'b'
        cmp     qword ptr [rip + _ZTHN17ThreadStateLogger8instanceE@GOTPCREL], 0
        je      .LBB0_6
.LBB0_5:
        call    TLS init function for ThreadStateLogger::instance@PLT
        mov     rax, qword ptr [rbx + 8]        # rbx from the first call is reused
        mov     byte ptr [rax], 99              # 'c'
        cmp     byte ptr [rbx + 3], 2
        jb      .LBB0_8
.LBB0_7:
        mov     rax, qword ptr [rbx + 8]
        mov     byte ptr [rax + 1], 100         # 'd'
        xor     eax, eax
        pop     rbx
        ret
Edits:
Removed the printf. If anything, MORE init checks have been added, but the __tls_get_addr result is still cached in rbx.
So the runtime cost is only a repeated check of the TLS-singleton-initialised flag, which we know is set. There is the code bloat as well, which is annoying.
TL;DR
So, to restate the question: why is the init check/call repeated? If it is necessary, why does the address not need to be regenerated? Or is this just an optimisation nobody has thought of implementing? Is there any way to get better code by juggling the code pattern? I have had a few goes to get to here, as you can see.
Of course, I need the functionality of a logging class with this behaviour, and encapsulating the 3 TLS variables in a class means they are all constructed together instead of one at a time, as the sketch below contrasts.
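For contrast, here is a hypothetical sketch of the alternative I'm avoiding (illustrative names, not my real code): separate thread_local variables, where each one needing dynamic per-thread initialisation gets its own guard and wrapper.
#include <cstdint>

// Hypothetical alternative with separate TLS variables (illustrative names).
static thread_local char    tls_byteLoc[3] {' ', 0, 0};   // constant init: no guard
static thread_local uint8_t tls_numBytes {1};             // constant init: no guard
static thread_local char*   tls_byteLocPtr {tls_byteLoc}; // dynamic init: gets its
                                                          // own guard and wrapper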
LLVM, and likely GCC, are unable to optimize this. Devirtualization, thread-local storage initialization, and other language features can be optimized locally within a function, but once that function is inlined into another, there is no further optimization potential.
The problem becomes clear when we look at the LLVM IR that clang outputs for this:
static void Log(char c) {
    // we suffer the same issue with one and two chars, this just makes our IR shorter
    instance.Log(c);
}
After all optimization passes, this function becomes:
define linkonce_odr void @ThreadStateLogger::Log(char)(i8 noundef signext %v1) local_unnamed_addr #1 comdat align 2 {
entry:
  br i1 icmp ne (ptr @TLS init function for ThreadStateLogger::instance, ptr null), label %0, label %TLS wrapper function for ThreadStateLogger::instance.exit

0:                                                ; preds = %entry
  call void @TLS init function for ThreadStateLogger::instance()
  br label %TLS wrapper function for ThreadStateLogger::instance.exit

TLS wrapper function for ThreadStateLogger::instance.exit: ; preds = %entry, %0
  %1 = call align 8 ptr @llvm.threadlocal.address.p0(ptr align 8 @ThreadStateLogger::instance)
  %TLS_byteLocPtr.i = getelementptr inbounds %"struct.ThreadStateLogger::Instance", ptr %1, i64 0, i32 2
  %2 = load ptr, ptr %TLS_byteLocPtr.i, align 8
  store i8 %v1, ptr %2, align 1
  ret void
}
The important thing is that this function contains:
call void @TLS init function for ThreadStateLogger::instance() (demangled)
call void @_ZTHN17ThreadStateLogger8instanceE() (mangled)
It's not a magic function, and the compiler has no way of knowing that calling it once makes future calls unnecessary. The function call is located in a branch which reads from mutable global memory:
cmp qword ptr [rip + _ZTHN17ThreadStateLogger8instanceE@GOTPCREL], 0
and there is no IR-level information that this memory must be set to != 0 by our call void @_ZTHN17ThreadStateLogger8instanceE().
If you subsequently inline ThreadStateLogger::Log(char) into main, the two branches which contain this call don't make each other redundant, and you see two instances of call TLS init function for ThreadStateLogger::instance@PLT.
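As a rough C++ analogy (a hypothetical sketch, not the actual IR), after inlining the compiler is effectively looking at something like this, and nothing tells it the second check is dead:
// Hypothetical analogy for the post-inlining shape; `tls_init` stands in
// for the weak TLS init function symbol the generated code tests.
extern void (*tls_init)();    // mutable global, may be null

void log_twice() {
    if (tls_init) tls_init(); // first guard + init call
    // ... first store through the TLS pointer ...
    if (tls_init) tls_init(); // the compiler can't prove tls_init() made
                              // this second check redundant
    // ... second store through the TLS pointer ...
}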
Why does __tls_get_addr get optimized, then?
Because __tls_get_addr is represented as @llvm.threadlocal.address.p0 in the IR, it's likely subject to optimizations that a "random" call to a _ZTHN17ThreadStateLogger8instanceE function wouldn't have.
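Conceptually (a hypothetical C++ sketch; `resolve_tls_address` is an illustrative stand-in for the intrinsic, not a real API):
// Hypothetical analogy: EarlyCSE folds repeated calls to a function it knows
// has no side effects and depends only on its arguments, much as it treats
// the llvm.threadlocal.address intrinsic.
[[gnu::const]] char* resolve_tls_address(int key);

char demo()
{
    char* a = resolve_tls_address(0); // first call resolves the address
    *a = 'a';                         // an intervening store doesn't matter
    char* b = resolve_tls_address(0); // duplicate call: folded to reuse `a`,
    return *b;                        // matching the rbx caching seen above
}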
The call gets eliminated during the EarlyCSE pass in main. See Compiler Explorer with the LLVM Optimization Pipeline view.
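As for "juggling the code pattern": one possible workaround (a sketch under the assumption that the call sites can be restructured; I haven't verified it against every compiler) is to bind the thread-local instance to a local reference, so the wrapper call, and with it the init check and __tls_get_addr, happens once per scope:
// Sketch using the ThreadStateLogger from the question: one wrapper call
// up front, then plain member calls through the cached reference.
int main()
{
    auto& log = ThreadStateLogger::instance; // init check + __tls_get_addr once
    log.Log('a', 'b'); // no repeated init check through the reference
    log.Log('c', 'd');
    return 0;
}
It's worth checking on godbolt whether a given compiler version actually drops the repeated checks with this shape.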