Tags: c, multithreading, pthreads, x86-64, thread-local-storage

TLS variable lookup speed


Consider the following program, which starts multiple threads that increment a variable that is either global or thread-local:

#include <pthread.h>

static int final_value = 0;

#ifdef TLS_VAR
static int __thread tls_var;
#else
static int tls_var;
#endif

void __attribute__ ((noinline)) modify_tls(void) {
  tls_var++;
}

void *thread_function(void *unused) {
  const int iteration_count = 1 << 25;

  tls_var = 0;
  for (int i = 0; i < iteration_count; i++) {
    modify_tls();
  }
  final_value += tls_var;
  return NULL;
}

int main() {
  const int thread_count = 1 << 7;

  pthread_t thread_ids[thread_count];
  for (int i = 0; i < thread_count; i++) {
    pthread_create(&thread_ids[i], NULL, thread_function, NULL);
  }

  for (int i = 0; i < thread_count; i++) {
    pthread_join(thread_ids[i], NULL);
  }

  return 0;
}

On my i7, the program takes 1.308 seconds to execute with TLS_VAR defined and 8.392 seconds with it undefined, and I am unable to account for such a huge difference.

The assembly for modify_tls looks like this (I've only shown the parts that differ):

;; !defined(TLS_VAR)
movl tls_var(%rip), %eax
addl $1, %eax
movl %eax, tls_var(%rip)

;; defined(TLS_VAR)
movl %fs:tls_var@tpoff, %eax
addl $1, %eax
movl %eax, %fs:tls_var@tpoff

The TLS lookup is understandable, with a load from the TCB. But why is the tls_var load in the first case relative to %rip? Why can't it be a direct memory address which gets relocated by the loader? Is this %rip-relative load responsible for the slowness? If so, why?

Compile flags: gcc -O3 -std=c99 -Wall -Werror -lpthread (with -DTLS_VAR added on the command line for the thread-local build)


Solution

  • Without the __thread attribute, tls_var is simply a shared variable. Whenever one thread writes to it, the write goes first into the cache of the core the thread is running on. But since the variable is shared and x86 machines are cache coherent, the copies of its cache line in the other cores' caches get invalidated, and those cores have to re-fetch the line from the last-level cache or from main memory the next time they touch it (in your case most likely from the last-level cache, which is the shared L3 cache on Core i7). Note that although it is faster than main memory, the last-level cache is not infinitely fast - it still takes many cycles to move data from there into the L2 and L1 caches that are private to each core.

    With the __thread attribute, each thread gets its own copy of tls_var, located in its thread-local storage. Since each thread only ever touches its own copy, and the copies are far apart from each other in memory, no cache-coherency traffic is generated when they are modified and the data stays in the fast, core-private L1 cache.

    RIP-relative addressing (the default addressing mode that the System V x86-64 ABI recommends for "near" data) usually gives faster data access, but here the cache-coherency overhead is so large that the nominally slower TLS access ends up being faster, simply because everything stays in the L1 cache.

    This problem is hugely magnified on NUMA systems, e.g. on multiprocessor (post-)Nehalem or AMD64 boards. Not only is it much more expensive to keep the caches coherent, but the shared variable also resides in the memory attached to the socket where the thread that first "touched" it was running. Threads running on cores in other sockets then have to perform remote memory accesses over the QPI or HyperTransport links that connect the sockets. As a visiting professor put it recently (a rough paraphrase): "Program shared-memory systems as if they were distributed-memory systems." This means making local copies of global data to work on - exactly what the __thread attribute achieves (a sketch of that rewrite follows below).

    Also note that you should expect different results with and without tls_var in the TLS. When it is thread-local, modifications made by one thread are not visible to the others. When it is a shared variable, you have to make sure that no more than one thread modifies it at a time; this is usually achieved with a critical section or a locked (atomic) addition, also sketched below.
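
    As a rough illustration of that "local copies" idea without __thread, here is a minimal sketch of how thread_function from the question could be rewritten so that each thread accumulates into an ordinary local variable (which stays in a register or in the core's private caches) and touches the shared final_value only once, at the end. The rewrite and the use of GCC's __sync_fetch_and_add builtin are my assumptions for the example, not something the question's build requires:

      #include <pthread.h>

      static int final_value = 0;     /* shared, but now touched only once per thread */

      void *thread_function(void *unused) {
        const int iteration_count = 1 << 25;
        int local_var = 0;            /* private to this thread: no coherency traffic */

        /* For a real timing run you would keep the noinline call here,
           otherwise -O3 is free to fold the whole loop away. */
        for (int i = 0; i < iteration_count; i++) {
          local_var++;
        }

        /* Publish the result once - the only contended access per thread. */
        __sync_fetch_and_add(&final_value, local_var);
        return NULL;
      }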
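
    And if tls_var has to stay a single shared variable, a correct version of modify_tls needs either a locked addition or a critical section, roughly as sketched below (the function and mutex names are invented for the example). Both variants make the result well-defined, but every increment still bounces the cache line between the cores, so neither removes the slowdown discussed above:

      #include <pthread.h>

      static int shared_var = 0;
      static pthread_mutex_t shared_var_lock = PTHREAD_MUTEX_INITIALIZER;

      /* Locked addition: GCC emits a single LOCK-prefixed add. */
      void modify_shared_atomic(void) {
        __sync_fetch_and_add(&shared_var, 1);
      }

      /* Critical section: heavier, but generalizes to larger updates. */
      void modify_shared_locked(void) {
        pthread_mutex_lock(&shared_var_lock);
        shared_var++;
        pthread_mutex_unlock(&shared_var_lock);
      }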