The reason I ask this question is that, while testing the behavior of the Linux soft-dirty bit, I found that if I create a thread without touching any memory, the soft-dirty bits of all pages get set to 1 (dirty). For example: malloc() 100MB in the main thread, clear the soft-dirty bits, then create a thread that just sleeps. After the thread is created, the soft-dirty bit of every page of that 100MB chunk is set to 1.
Here is the test program I'm using:
#include <thread>
#include <iostream>
#include <vector>
#include <string>
#include <cstdint>
#include <cstdio>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <string.h>
#include <sys/types.h>
#define PAGE_SIZE_4K 0x1000
int GetDirtyBit(uint64_t vaddr) {
  int fd = open("/proc/self/pagemap", O_RDONLY);
  if (fd < 0) {
    perror("Failed open pagemap");
    exit(1);
  }
  // Each pagemap entry is 8 bytes; bit 55 is the soft-dirty flag.
  off_t offset = vaddr / PAGE_SIZE_4K * 8;
  if (lseek(fd, offset, SEEK_SET) < 0) {
    perror("Failed lseek pagemap");
    exit(1);
  }
  uint64_t entry = 0;
  if (read(fd, &entry, sizeof(entry)) != sizeof(entry)) {
    perror("Failed read pagemap");
    exit(1);
  }
  close(fd);
  return entry & (1UL << 55) ? 1 : 0;
}
void CleanSoftDirty() {
  int fd = open("/proc/self/clear_refs", O_RDWR);
  if (fd < 0) {
    perror("Failed open clear_refs");
    exit(1);
  }
  // Writing "4" clears the soft-dirty bits of all the process's pages.
  if (write(fd, "4", 1) != 1) {
    perror("Failed write clear_refs");
    exit(1);
  }
  close(fd);
}
int demo(int argc, char *argv[]) {
  int x = 1;
  // 100 MB
  uint64_t size = 1024UL * 1024UL * 100;
  char *cptr = static_cast<char *>(malloc(size));
  for (uint64_t s = 0; s < size; s += PAGE_SIZE_4K) {
    // populate pages
    memset(cptr + s, x, PAGE_SIZE_4K);
  }
  printf("Soft dirty after malloc: %d, (50MB offset)%d\n",
         GetDirtyBit(reinterpret_cast<uint64_t>(cptr)),
         GetDirtyBit(reinterpret_cast<uint64_t>(cptr + 50 * 1024 * 1024)));
  printf("ALLOCATE FINISHED\n");
  std::vector<std::thread> threads;
  while (true) {
    sleep(2);
    // Set soft dirty of all pages to 0.
    CleanSoftDirty();
    printf("Soft dirty after reset: %d, (50MB offset)%d\n",
           GetDirtyBit(reinterpret_cast<uint64_t>(cptr)),
           GetDirtyBit(reinterpret_cast<uint64_t>(cptr + 50 * 1024 * 1024)));
    // Create thread.
    threads.push_back(std::thread([]() { while (true) sleep(1); }));
    sleep(2);
    printf("Soft dirty after create thread: %d, (50MB offset)%d\n",
           GetDirtyBit(reinterpret_cast<uint64_t>(cptr)),
           GetDirtyBit(reinterpret_cast<uint64_t>(cptr + 50 * 1024 * 1024)));
    // memset the first 20MB
    memset(cptr, x++, 1024UL * 1024UL * 20);
    printf("Soft dirty after memset: %d, (50MB offset)%d\n",
           GetDirtyBit(reinterpret_cast<uint64_t>(cptr)),
           GetDirtyBit(reinterpret_cast<uint64_t>(cptr + 50 * 1024 * 1024)));
  }
  return 0;
}
int main(int argc, char *argv[]) {
  printf("PID: %d\n", getpid());
  return demo(argc, argv);
}
I print the dirty bit of the first page and of the page at offset 50 * 1024 * 1024. Here is what happens:
1. After malloc() and the initial memset(), the soft-dirty bits of both pages are 1, which is expected.
2. After clearing through clear_refs, both bits are 0.
3. After creating the first thread, both bits become 1 again, even though the thread never touches that memory.
4. After memset() of the first 20 MB, the first page's bit is (still) 1.
5. On every later iteration, creating a thread leaves both bits at 0; only the memset() sets the first page's bit to 1, and the soft-dirty bit of the page at offset 50 MB remains 0.
Here is the output:
Soft dirty after malloc: 1, (50MB offset)1
ALLOCATE FINISHED
Soft dirty after reset: 0, (50MB offset)0
Soft dirty after create thread: 1, (50MB offset)1
Soft dirty after memset: 1, (50MB offset)1
(steps 1-4 above)
(step 5 starts below)
Soft dirty after reset: 0, (50MB offset)0
Soft dirty after create thread: 0, (50MB offset)0
Soft dirty after memset: 1, (50MB offset)0
Soft dirty after reset: 0, (50MB offset)0
Soft dirty after create thread: 0, (50MB offset)0
Soft dirty after memset: 1, (50MB offset)0
Soft dirty after reset: 0, (50MB offset)0
Soft dirty after create thread: 0, (50MB offset)0
Soft dirty after memset: 1, (50MB offset)0
I thought thread creation would just mark the pages as being in a "shared" state, not modify them, so the soft-dirty bits should remain unchanged. Apparently, the behavior is different, so I wonder: does creating a thread trigger page faults on all of the pages, causing the OS to set every page's soft-dirty bit while handling them?
If that is not the case, why does creating a thread make all memory pages of the process become "dirty"? And why does only the first thread creation behave this way?
I hope I explained the question well, please let me know if more details are needed, or if anything doesn't make sense.
So, this is kind of funny and interesting. Your specific situation, as well as the behavior of the soft-dirty bits, is quite peculiar. No page faults are happening, and the soft-dirty bit is not being set on all memory pages, just on some of them (the ones you allocated through malloc()).
If you run your program under strace, you will notice a couple of things that help explain what you are observing:
[1] mmap(NULL, 104861696, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f8669b66000
...
[2] mmap(NULL, 8392704, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7f8669365000
[2] mprotect(0x7f8669366000, 8388608, PROT_READ|PROT_WRITE) = 0
[2] clone(child_stack=0x7f8669b64fb0, ...) = 97197
...
As you can see above:
Your malloc() is pretty large, so it is not served from the normal heap: the allocator reserves a dedicated memory area through an mmap syscall (line [1] above).
When you create a thread, library code sets up a stack for it through another mmap followed by an mprotect (lines [2] above).
The normal mmap behavior on Linux is to reserve memory starting from a mmap_base chosen at process creation time, subtracting the size of each request (unless a specific address is explicitly requested, in which case mmap_base is not considered). For this reason, the mmap at point 1 reserves pages right above the last shared library mapped by the dynamic loader, and the mmap at point 2 reserves pages right below the pages mapped at point 1. The mprotect then marks this second area as RW, except for its very first page, which is left inaccessible as a guard page.
Since these mappings are contiguous, both anonymous, and both with the same protections (RW), from the kernel's perspective they look like a single memory region that has grown in size. In fact, the kernel treats them as a single VMA (vm_area_struct).
Now, as we can read in the kernel documentation about the soft-dirty bit (note in particular the final sentence about expanded regions):
While in most cases tracking memory changes by #PF-s is more than enough there is still a scenario when we can lose soft dirty bits -- a task unmaps a previously mapped memory region and then maps a new one at exactly the same place. When unmap is called, the kernel internally clears PTE values including soft dirty bits. To notify user space application about such memory region renewal the kernel always marks new memory regions (and expanded regions) as soft dirty.
So the reason why you see the soft-dirty bit re-appear on the initial malloc'd chunk of memory after clearing it is a funny coincidence: a result of the not-so-intuitive "expansion" of the memory region (VMA) containing it caused by the allocation of the thread stack.
To make things clearer, we can inspect the virtual memory layout of the process through /proc/[pid]/maps at different stages. It will look something like this (taken from my machine):
Before malloc():
...
5653d8b82000-5653d8b83000 r--p 00005000 00:18 77464613 [your program]
5653d8b83000-5653d8b84000 rw-p 00006000 00:18 77464613 [your program]
5653d983f000-5653d9860000 rw-p 00000000 00:00 0 [heap]
7f866ff6c000-7f866ff79000 r--p 00000000 00:18 77146186 [shared libraries]
7f866ff79000-7f8670013000 r-xp 0000d000 00:18 77146186 [shared libraries]
...
After malloc():
...
5653d8b82000-5653d8b83000 r--p 00005000 00:18 77464613 [your program]
5653d8b83000-5653d8b84000 rw-p 00006000 00:18 77464613 [your program]
5653d983f000-5653d9860000 rw-p 00000000 00:00 0 [heap]
7f8669b66000-7f866ff6c000 rw-p 00000000 00:00 0 *** MALLOC'D MEMORY
7f866ff6c000-7f866ff79000 r--p 00000000 00:18 77146186 [shared libraries]
7f866ff79000-7f8670013000 r-xp 0000d000 00:18 77146186 [shared libraries]
...
After creating the first thread (notice how the start of the VMA changes from 7f8669b66000 to 7f8669366000, since it has grown in size):
...
5653d8b82000-5653d8b83000 r--p 00005000 00:18 77464613 [your program]
5653d8b83000-5653d8b84000 rw-p 00006000 00:18 77464613 [your program]
5653d983f000-5653d9860000 rw-p 00000000 00:00 0 [heap]
7f8669365000-7f8669366000 ---p 00000000 00:00 0 *** GUARD PAGE
7f8669366000-7f866ff6c000 rw-p 00000000 00:00 0 *** THREAD STACK + MALLOC'D MEMORY
7f866ff6c000-7f866ff79000 r--p 00000000 00:18 77146186 [shared libraries]
7f866ff79000-7f8670013000 r-xp 0000d000 00:18 77146186 [shared libraries]
...
You can clearly see that, after creating the thread, the kernel shows the two memory regions (thread stack + your malloc'd chunk) as a single VMA, given that they are contiguous, anonymous, and have the same protections (rw).
The guard page above the thread stack is treated as a separate VMA (it has different protections), and subsequent threads will mmap their stacks above it, so they will not affect the soft-dirty bits of your original memory region:
...
5653d8b82000-5653d8b83000 r--p 00005000 00:18 77464613 [your program]
5653d8b83000-5653d8b84000 rw-p 00006000 00:18 77464613 [your program]
5653d983f000-5653d9860000 rw-p 00000000 00:00 0 [heap]
7f8668363000-7f8668364000 ---p 00000000 00:00 0 *** GUARD PAGE
7f8668364000-7f8668b64000 rw-p 00000000 00:00 0 *** THREAD 3 STACK
7f8668b64000-7f8668b65000 ---p 00000000 00:00 0 *** GUARD PAGE
7f8668b65000-7f8669365000 rw-p 00000000 00:00 0 *** THREAD 2 STACK
7f8669365000-7f8669366000 ---p 00000000 00:00 0 *** GUARD PAGE
7f8669366000-7f866ff6c000 rw-p 00000000 00:00 0 *** THREAD 1 STACK + MALLOC'D MEMORY
7f866ff6c000-7f866ff79000 r--p 00000000 00:18 77146186 [shared libraries]
7f866ff79000-7f8670013000 r-xp 0000d000 00:18 77146186 [shared libraries]
...
This is why from the second thread onward you don't see anything unexpected happening.