Consider the following program, which starts multiple threads that each increment a variable that is either global or thread-local:
#include <pthread.h>
#include <stddef.h> /* for NULL */

static int final_value = 0;

#ifdef TLS_VAR
static __thread int tls_var;
#else
static int tls_var;
#endif

void __attribute__ ((noinline)) modify_tls(void) {
    tls_var++;
}

void *thread_function(void *unused) {
    const int iteration_count = 1 << 25;
    tls_var = 0;
    for (int i = 0; i < iteration_count; i++) {
        modify_tls();
    }
    final_value += tls_var; /* note: unsynchronized read-modify-write */
    return NULL;
}

int main() {
    const int thread_count = 1 << 7;
    pthread_t thread_ids[thread_count];
    for (int i = 0; i < thread_count; i++) {
        pthread_create(&thread_ids[i], NULL, thread_function, NULL);
    }
    for (int i = 0; i < thread_count; i++) {
        pthread_join(thread_ids[i], NULL);
    }
    return 0;
}
On my i7, it takes 1.308 seconds to execute with TLS_VAR defined and 8.392 seconds with it undefined, and I am unable to account for such a huge difference.
The assembly for modify_tls looks like this (I've shown only the parts that differ):
;; !defined(TLS_VAR)
movl tls_var(%rip), %eax
addl $1, %eax
movl %eax, tls_var(%rip)
;; defined(TLS_VAR)
movl %fs:tls_var@tpoff, %eax
addl $1, %eax
movl %eax, %fs:tls_var@tpoff
The TLS lookup is understandable, with a load from the TCB. But why is the tls_var load in the first case relative to %rip? Why can't it be a direct memory address which gets relocated by the loader? Is this %rip-relative load responsible for the slowness? If so, why?
Compile flags: gcc -O3 -std=c99 -Wall -Werror -lpthread
Without the __thread attribute, tls_var is simply a shared variable. Whenever one thread writes to it, the write goes first to the cache of the core on which the thread executes. But since it is a shared variable and x86 machines are cache coherent, the copies of its cache line in the other cores' caches get invalidated, and their next access has to refetch the line from the last-level cache or from main memory (in your case most likely from the last-level cache, which is the shared L3 cache on Core i7). Note that although faster than main memory, the last-level cache is not infinitely fast: it still takes many cycles to move data from there into the L2 and L1 caches private to each core.
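To see that it is really cache-line ownership that matters, you can keep the data in ordinary global memory but give each thread its own cache-line-sized slot. A minimal sketch, assuming a 64-byte cache line; all names here are mine, not from the question:

#include <pthread.h>
#include <stdio.h>

#define THREAD_COUNT (1 << 7)
#define CACHE_LINE 64 /* assumed line size on this Core i7 */

struct padded_counter {
    int value;
    char pad[CACHE_LINE - sizeof(int)]; /* keep each slot on its own line */
} __attribute__ ((aligned (CACHE_LINE)));

static struct padded_counter counters[THREAD_COUNT];

static void *count_in_own_line(void *arg) {
    struct padded_counter *slot = &counters[(long)arg];
    for (int i = 0; i < (1 << 25); i++) {
        slot->value++; /* this cache line is written by one core only */
    }
    return NULL;
}

int main(void) {
    pthread_t ids[THREAD_COUNT];
    for (long i = 0; i < THREAD_COUNT; i++) {
        pthread_create(&ids[i], NULL, count_in_own_line, (void *)i);
    }
    long total = 0;
    for (int i = 0; i < THREAD_COUNT; i++) {
        pthread_join(ids[i], NULL);
        total += counters[i].value;
    }
    printf("total = %ld\n", total);
    return 0;
}

Because no two threads ever write the same line, no invalidations ping-pong between the cores, even though the array itself is a plain global.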
With the __thread attribute, each thread gets its own copy of tls_var, located in its thread-local storage. Since these thread-local storage areas are far apart from each other in memory, no cache-coherency traffic is generated when they are modified, and the data stays in the fastest (L1) cache.
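As an aside (not from the original question): GCC's __thread predates the standard; since C11 the same per-thread storage can be spelled portably:

/* C11 spelling of a thread-local variable; _Thread_local is a keyword,
 * so no header is required. <threads.h> additionally provides a
 * thread_local convenience macro that expands to _Thread_local. */
static _Thread_local int tls_var;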
%rip-relative addressing (the default addressing mode for "near" data recommended by the System V x86-64 ABI) usually leads to faster data access, so it is not the source of the slowdown; the cache-coherency overhead is so huge that the slower TLS access is actually faster when everything is kept in the L1 cache.
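One way to confirm that coherency traffic, and not the addressing mode, dominates is to compare hardware cache counters for the two builds with Linux perf. A sketch, assuming perf is installed and the source file is named tls_test.c (the file name is a placeholder):

gcc -O3 -std=c99 -Wall -Werror -o shared tls_test.c -lpthread
gcc -O3 -std=c99 -Wall -Werror -DTLS_VAR -o tls tls_test.c -lpthread
perf stat -e cache-references,cache-misses ./shared
perf stat -e cache-references,cache-misses ./tls

The shared build should show a dramatically higher miss count, while both builds execute essentially the same number of instructions.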
This problem is hugely magnified on NUMA systems, e.g. on multiprocessor (post-)Nehalem or AMD64 boards. Not only is it much more expensive to keep the caches coherent, but the shared variable also resides in the memory attached to the socket on which the thread that first "touched" it was running. Threads running on cores in other sockets then have to perform remote memory accesses over the QPI or HT links that connect the sockets. As a visiting professor put it recently (a rough paraphrase): "Program shared-memory systems as if they were distributed-memory systems." This means making local copies of global data to work on, which is exactly what the __thread attribute achieves.
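The same "local copy" idiom works even without TLS: accumulate into a function-local variable, which lives in the thread's own stack and cache, and touch shared memory only once at the end. A minimal sketch (the mutex and function names are mine; note that -O3 may fold such a trivial loop into a constant):

#include <pthread.h>

static int final_value = 0;
static pthread_mutex_t final_lock = PTHREAD_MUTEX_INITIALIZER;

void *thread_function_local(void *unused) {
    int local = 0; /* thread-private: stays in this thread's stack and L1 */
    for (int i = 0; i < (1 << 25); i++) {
        local++;
    }
    pthread_mutex_lock(&final_lock); /* one synchronized access in total */
    final_value += local;
    pthread_mutex_unlock(&final_lock);
    return NULL;
}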
Also note that you should expect different results with and without tls_var in the TLS. With it in the TLS, modifications made by one thread are not visible to the other threads. With it as a shared variable, you have to make sure that no more than one thread accesses it at a time. This is usually achieved with a critical section or with a locked (atomic) addition, as sketched below.
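For the shared-variable version, a locked addition can be written with C11 atomics. A sketch (this makes the result well-defined, but it does not remove the cache-line ping-pong described above):

#include <stdatomic.h>

static atomic_int shared_var;

void __attribute__ ((noinline)) modify_shared(void) {
    /* Compiles to a `lock addl` on x86: the increment is atomic, but the
     * line still bounces between the cores' caches. */
    atomic_fetch_add_explicit(&shared_var, 1, memory_order_relaxed);
}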