I wondered if it is possible if two threads belonging to the same program with the same PCID can share the TLB entry when they are scheduled to run on the same physical CPU?
I already looked into the SDM (https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html); page 3115 (TLB and HT) does not mention any sharing mechanism. But another part of the document states that before accessing the TLB entry, the PCID value is checked, and if it is equal, the value is used. However, there is also a bit for the current thread set next to the PCID identifier.
My question: is the PCID value used with priority over the CPU-thread bit or is it necessary that both values match?
From my observations, it is not possible (at least for the dTLB
), even though it would bring performance benefits.
As suggested by Peter, I wrote a small program that consists of two worker threads that access the same heap region over and over again.
Compile with -O0
to prevent optimization.
#define _GNU_SOURCE
#include <stdio.h>
#include <pthread.h>
#include <stdlib.h>
#include <inttypes.h>
#include <err.h>
#include <sched.h>
#include <sys/mman.h>
#define PAGE_SIZE 4096
int repetitions = 1ll << 20;
uint64_t ptrsize = 1ll<<18;
uint64_t main_cpu, co_cpu ;
void pin_task_to(int pid, int cpu)
{
cpu_set_t cset;
CPU_ZERO(&cset);
CPU_SET(cpu, &cset);
if (sched_setaffinity(pid, sizeof(cpu_set_t), &cset))
err(1, "affinity");
}
void pin_to(int cpu) { pin_task_to(0, cpu); }
void *foo(void *p)
{
pin_to(main_cpu);
int value;
uint8_t *ptr = (uint8_t *)p;
printf("Running on CPU: %d\n", sched_getcpu());
for (size_t j = 0; j < repetitions; j++)
{
for (size_t i = 0; i < ptrsize; i += PAGE_SIZE)
{
value += ptr[i];
}
}
volatile int dummy = value;
pthread_exit(NULL);
}
void *boo(void *p)
{
pin_to(co_cpu);
int value;
uint8_t *ptr = (uint8_t *)p;
printf("Running on CPU: %d\n", sched_getcpu());
for (size_t j = 0; j < repetitions; j++)
{
for (size_t i = 0; i < ptrsize; i+=PAGE_SIZE)
{
value += ptr[i];
}
}
volatile int dummy = value;
pthread_exit(NULL);
}
int main(int argc, char **argv)
{
if (argc < 3){
exit(-1);
}
main_cpu = strtoul(argv[1], NULL, 16);
co_cpu = strtoul(argv[2], NULL, 16);
pthread_t id[2];
void *mptr = malloc(ptrsize);
pthread_create(&id[0], NULL, foo, mptr);
pthread_create(&id[1], NULL, boo, mptr);
pthread_join(id[0], NULL);
pthread_join(id[1], NULL);
}
I decided to sum up all the values in the memory region (obviously, the value
will overflow) to prevent the CPU from doing microarchitectural optimization.
[The other Idea was to simply dereference the memory region byte by byte and load the value in RAX
]
We go over the memory region repetitions
times to reduce the noise within one run induced by the slightly different startup time of the threads and other processes and interrupts on the system.
My machine has four physical and eight logical cores. Logical core x and x+4 are located on the same physical one (lstopo).
CPU: Intel Core i5 8250u
Running on the same logical core
Since the kernel uses PCIDs to identify TLB entries, a context switch to the other thread should not invalidate the TLBs.
> $ perf stat -e dtlb_load_misses.stlb_hit,dtlb_load_misses.miss_causes_a_walk,cycles,task-clock ./main 1 1
Running on CPU: 1
Running on CPU: 1
Performance counter stats for './main 1 1':
12,621,724 dtlb_load_misses.stlb_hit:u # 49.035 M/sec
1,152 dtlb_load_misses.miss_causes_a_walk:u # 4.475 K/sec
834,363,092 cycles:u # 3.241 GHz
257.40 msec task-clock:u # 0.997 CPUs utilized
0.258177969 seconds time elapsed
0.258253000 seconds user
0.000000000 seconds sys
Running on two different physical cores
No TLB sharing or interference whatsoever.
> $ perf stat -e dtlb_load_misses.stlb_hit,dtlb_load_misses.miss_causes_a_walk,cycles,task-clock ./main 1 2
Running on CPU: 1
Running on CPU: 2
Performance counter stats for './main 1 2':
11,740,758 dtlb_load_misses.stlb_hit:u # 45.962 M/sec
1,647 dtlb_load_misses.miss_causes_a_walk:u # 6.448 K/sec
834,021,644 cycles:u # 3.265 GHz
255.44 msec task-clock:u # 1.991 CPUs utilized
0.128304564 seconds time elapsed
0.255768000 seconds user
0.000000000 seconds sys
Running on the same physical core
If TLB sharing is possible, I would expect to have here the lowest sTLB
hits and a low number of dTLB
page walks. But instead, we have the highest number in both cases.
> $ perf stat -e dtlb_load_misses.stlb_hit,dtlb_load_misses.miss_causes_a_walk,cycles,task-clock ./main 1 5
Running on CPU: 1
Running on CPU: 5
Performance counter stats for './main 1 5':
140,040,429 dtlb_load_misses.stlb_hit:u # 291.368 M/sec
198,827 dtlb_load_misses.miss_causes_a_walk:u # 413.680 K/sec
1,596,298,827 cycles:u # 3.321 GHz
480.63 msec task-clock:u # 1.990 CPUs utilized
0.241509701 seconds time elapsed
0.480996000 seconds user
0.000000000 seconds sys
As you can see, we have the most sTLB
hits and dTLB
page walks when running on the same physical core. Thus, I would follow from it that there is no sharing mechanism for the same PCID on the same physical core. Running the process on the same logical core and two different physical cores results in roughly the same amount of misses/hits to the sTLB. This further supports the thesis that there is sharing on the same logical core but not on the physical one.
As suggested by Peter also use a linked-list approach to prevent THP and prefetching. The modified data is shown below.
Compile with -O0
to prevent optimization
#define _GNU_SOURCE
#include <stdio.h>
#include <pthread.h>
#include <stdlib.h>
#include <inttypes.h>
#include <err.h>
#include <sched.h>
#include <time.h>
#include <sys/mman.h>
#define PAGE_SIZE 4096
const int repetitions = 1ll << 20;
const uint64_t ptrsize = 1ll<< 5;
uint64_t main_cpu, co_cpu ;
void pin_task_to(int pid, int cpu)
{
cpu_set_t cset;
CPU_ZERO(&cset);
CPU_SET(cpu, &cset);
if (sched_setaffinity(pid, sizeof(cpu_set_t), &cset))
err(1, "affinity");
}
void pin_to(int cpu) { pin_task_to(0, cpu); }
void *foo(void *p)
{
pin_to(main_cpu);
uint64_t *value;
uint64_t *ptr = (uint64_t *)p;
printf("Running on CPU: %d\n", sched_getcpu());
for (size_t j = 0; j < repetitions; j++)
{
value = ptr;
for (size_t i = 0; i < ptrsize; i++)
{
value = (uint64_t *)*value;
}
}
volatile uint64_t *dummy = value;
pthread_exit(NULL);
}
void *boo(void *p)
{
pin_to(co_cpu);
uint64_t *value;
uint64_t *ptr = (uint64_t *)p;
printf("Running on CPU: %d\n", sched_getcpu());
for (size_t j = 0; j < repetitions; j++)
{
value = ptr;
for (size_t i = 0; i < ptrsize; i++)
{
value = (uint64_t *)*value;
}
}
volatile uint64_t *dummy = value;
pthread_exit(NULL);
}
int main(int argc, char **argv)
{
if (argc < 3){
exit(-1);
}
srand(time(NULL));
uint64_t *head,*tail,*tmp_ptr;
int r;
head = mmap(NULL,PAGE_SIZE,PROT_READ|PROT_WRITE,MAP_PRIVATE | MAP_ANONYMOUS,0,0);
tail = head;
for (size_t i = 0; i < ptrsize; i++)
{
r = (rand() & 0xF) +1;
// try to use differents offset to the next page to prevent microarch prefetching
tmp_ptr = mmap(tail-r*PAGE_SIZE, PAGE_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
*tail = (uint64_t)tmp_ptr;
tail = tmp_ptr;
}
printf("%Lx, %lx\n", head, *head);
main_cpu = strtoul(argv[1], NULL, 16);
co_cpu = strtoul(argv[2], NULL, 16);
pthread_t id[2];
pthread_create(&id[0], NULL, foo, head);
pthread_create(&id[1], NULL, boo, head);
pthread_join(id[0], NULL);
pthread_join(id[1], NULL);
}
Same Logical Core
> $ perf stat -e dtlb_load_misses.stlb_hit,dtlb_load_misses.miss_causes_a_walk,cycles,task-clock ./main 1 1
7feac4d90000, 7feac4d5b000
Running on CPU: 1
Running on CPU: 1
Performance counter stats for './main 1 1':
3,696 dtlb_load_misses.stlb_hit:u # 11.679 K/sec
743 dtlb_load_misses.miss_causes_a_walk:u # 2.348 K/sec
762,856,367 cycles:u # 2.410 GHz
316.48 msec task-clock:u # 0.998 CPUs utilized
0.317105072 seconds time elapsed
0.316859000 seconds user
0.000000000 seconds sys
Different Physical Cores
> $ perf stat -e dtlb_load_misses.stlb_hit,dtlb_load_misses.miss_causes_a_walk,cycles,task-clock ./main 1 2
7f59bb395000, 7f59bb34d000
Running on CPU: 1
Running on CPU: 2
Performance counter stats for './main 1 2':
15,144 dtlb_load_misses.stlb_hit:u # 49.480 K/sec
756 dtlb_load_misses.miss_causes_a_walk:u # 2.470 K/sec
770,800,780 cycles:u # 2.518 GHz
306.06 msec task-clock:u # 1.982 CPUs utilized
0.154410840 seconds time elapsed
0.306345000 seconds user
0.000000000 seconds sys
Same Physical Core / Different Logical Cores
> $ perf stat -e dtlb_load_misses.stlb_hit,dtlb_load_misses.miss_causes_a_walk,cycles,task-clock ./main 1 5
7f7d69e8b000, 7f7d69e56000
Running on CPU: 5
Running on CPU: 1
Performance counter stats for './main 1 5':
9,237,992 dtlb_load_misses.stlb_hit:u # 20.554 M/sec
789 dtlb_load_misses.miss_causes_a_walk:u # 1.755 K/sec
1,007,185,858 cycles:u # 2.241 GHz
449.45 msec task-clock:u # 1.989 CPUs utilized
0.225947522 seconds time elapsed
0.449813000 seconds user
0.000000000 seconds sys