The reason I ask this question is that, while testing the behavior of the Linux soft-dirty bit, I found that if I create a thread without touching any memory, the soft-dirty bits of all pages get set to 1 (dirty). For example: malloc() 100MB in the main thread, clear the soft-dirty bits, then create a thread that just sleeps. After the thread is created, the soft-dirty bit of every page of that 100MB chunk is set to 1.
Here is the test program I'm using:
#include <thread>
#include <iostream>
#include <vector>
#include <string>
#include <cstdint>
#include <cstdio>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <string.h>
#include <sys/types.h>
#define PAGE_SIZE_4K 0x1000
int GetDirtyBit(uint64_t vaddr) {
  int fd = open("/proc/self/pagemap", O_RDONLY);
  if (fd < 0) {
    perror("Failed open pagemap");
    exit(1);
  }
  // Each pagemap entry is 8 bytes; bit 55 is the soft-dirty flag.
  off_t offset = vaddr / PAGE_SIZE_4K * 8;
  if (lseek(fd, offset, SEEK_SET) < 0) {
    perror("Failed lseek pagemap");
    exit(1);
  }
  uint64_t entry = 0;
  if (read(fd, &entry, sizeof(entry)) != sizeof(entry)) {
    perror("Failed read pagemap");
    exit(1);
  }
  close(fd);
  return entry & (1UL << 55) ? 1 : 0;
}
void CleanSoftDirty() {
  int fd = open("/proc/self/clear_refs", O_RDWR);
  if (fd < 0) {
    perror("Failed open clear_refs");
    exit(1);
  }
  // Writing "4" clears the soft-dirty bits of all the process's pages.
  if (write(fd, "4", 1) != 1) {
    perror("Failed write clear_refs");
    exit(1);
  }
  close(fd);
}
int demo(int argc, char *argv[]) {
  int x = 1;
  // 100 MB
  uint64_t size = 1024UL * 1024UL * 100;
  char *cptr = static_cast<char *>(malloc(size));
  for (uint64_t s = 0; s < size; s += PAGE_SIZE_4K) {
    // populate pages
    memset(cptr + s, x, PAGE_SIZE_4K);
  }
  printf("Soft dirty after malloc: %d, (50MB offset)%d\n",
         GetDirtyBit(reinterpret_cast<uint64_t>(cptr)),
         GetDirtyBit(reinterpret_cast<uint64_t>(cptr + 50 * 1024 * 1024)));
  printf("ALLOCATE FINISHED\n");
  std::vector<std::thread> threads;
  while (true) {
    sleep(2);
    // Set soft dirty of all pages to 0.
    CleanSoftDirty();
    printf("Soft dirty after reset: %d, (50MB offset)%d\n",
           GetDirtyBit(reinterpret_cast<uint64_t>(cptr)),
           GetDirtyBit(reinterpret_cast<uint64_t>(cptr + 50 * 1024 * 1024)));
    // Create thread.
    threads.push_back(std::thread([]() { while (true) sleep(1); }));
    sleep(2);
    printf("Soft dirty after create thread: %d, (50MB offset)%d\n",
           GetDirtyBit(reinterpret_cast<uint64_t>(cptr)),
           GetDirtyBit(reinterpret_cast<uint64_t>(cptr + 50 * 1024 * 1024)));
    // memset the first 20MB
    memset(cptr, x++, 1024UL * 1024UL * 20);
    printf("Soft dirty after memset: %d, (50MB offset)%d\n",
           GetDirtyBit(reinterpret_cast<uint64_t>(cptr)),
           GetDirtyBit(reinterpret_cast<uint64_t>(cptr + 50 * 1024 * 1024)));
  }
  return 0;
}
int main(int argc, char *argv[]) {
  printf("PID: %d\n", getpid());
  return demo(argc, argv);
}
I print the dirty bit of the first page and of the page at offset 50 * 1024 * 1024. Here is what happens:
1. After malloc() and the initial memset(), the soft-dirty bits of both pages are 1, which is expected.
2. After clearing through clear_refs, both bits are 0.
3. After creating the first thread, both bits become 1 again, even though the thread never touches that memory.
4. After memset() of the first 20 MB, the first page's bit is (still) 1.
5. On every later iteration, creating a thread leaves both bits at 0; only the memset() sets the first page's bit to 1, and the soft-dirty bit of the page at offset 50 MB remains 0.
Here is the output:
Soft dirty after malloc: 1, (50MB offset)1
ALLOCATE FINISHED
Soft dirty after reset: 0, (50MB offset)0
Soft dirty after create thread: 1, (50MB offset)1
Soft dirty after memset: 1, (50MB offset)1
(steps 1-4 above)
(step 5 starts below)
Soft dirty after reset: 0, (50MB offset)0
Soft dirty after create thread: 0, (50MB offset)0
Soft dirty after memset: 1, (50MB offset)0
Soft dirty after reset: 0, (50MB offset)0
Soft dirty after create thread: 0, (50MB offset)0
Soft dirty after memset: 1, (50MB offset)0
Soft dirty after reset: 0, (50MB offset)0
Soft dirty after create thread: 0, (50MB offset)0
Soft dirty after memset: 1, (50MB offset)0
I thought thread creation would just mark the pages as being in a "shared" state, not modify them, so the soft-dirty bits should remain unchanged. Apparently, the behavior is different, so I wonder: does creating a thread trigger page faults on all of the pages, causing the OS to set every page's soft-dirty bit while handling them?
If that is not the case, why does creating a thread make all memory pages of the process become "dirty"? And why does only the first thread creation behave this way?
I hope I explained the question well, please let me know if more details are needed, or if anything doesn't make sense.
So, this is kind of funny and interesting. Your specific situation, as well as the behavior of the soft-dirty bits, is quite peculiar. No page faults are happening, and the soft-dirty bit is not being set on all memory pages, just on some of them (the ones you allocated through malloc()).
If you run your program under strace, you will notice a couple of things that help explain what you are observing:
[1] mmap(NULL, 104861696, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f8669b66000
...
[2] mmap(NULL, 8392704, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7f8669365000
[2] mprotect(0x7f8669366000, 8388608, PROT_READ|PROT_WRITE) = 0
[2] clone(child_stack=0x7f8669b64fb0, ...) = 97197
...
As you can see above:
Your malloc() is pretty large, so it is not served from the normal heap: the allocator reserves a dedicated memory area through an mmap syscall (line [1] above).
When you create a thread, library code sets up a stack for it through another mmap followed by an mprotect (lines [2] above).
The normal mmap behavior on Linux is to reserve memory starting from a mmap_base chosen at process creation time, subtracting the size of each request (unless a specific address is explicitly requested, in which case mmap_base is not considered). For this reason, the mmap at point 1 reserves pages right above the last shared library mapped by the dynamic loader, and the mmap at point 2 reserves pages right below the pages mapped at point 1. The mprotect then marks this second area as RW, except for its very first page, which is left inaccessible as a guard page.
Since these mappings are contiguous, both anonymous, and both with the same protections (RW), from the kernel's perspective they look like a single memory region that has grown in size. In fact, the kernel treats them as a single VMA (vm_area_struct).
Now, as we can read in the kernel documentation about the soft-dirty bit (note in particular the final sentence about expanded regions):
While in most cases tracking memory changes by #PF-s is more than enough there is still a scenario when we can lose soft dirty bits -- a task unmaps a previously mapped memory region and then maps a new one at exactly the same place. When unmap is called, the kernel internally clears PTE values including soft dirty bits. To notify user space application about such memory region renewal the kernel always marks new memory regions (and expanded regions) as soft dirty.
So the reason why you see the soft-dirty bit re-appear on the initial malloc'd chunk of memory after clearing it is a funny coincidence: a result of the not-so-intuitive "expansion" of the memory region (VMA) containing it caused by the allocation of the thread stack.
To make things clearer, we can inspect the virtual memory layout of the process through /proc/[pid]/maps at different stages. It will look something like this (taken from my machine):
Before malloc():
...
5653d8b82000-5653d8b83000 r--p 00005000 00:18 77464613 [your program]
5653d8b83000-5653d8b84000 rw-p 00006000 00:18 77464613 [your program]
5653d983f000-5653d9860000 rw-p 00000000 00:00 0 [heap]
7f866ff6c000-7f866ff79000 r--p 00000000 00:18 77146186 [shared libraries]
7f866ff79000-7f8670013000 r-xp 0000d000 00:18 77146186 [shared libraries]
...
After malloc():
...
5653d8b82000-5653d8b83000 r--p 00005000 00:18 77464613 [your program]
5653d8b83000-5653d8b84000 rw-p 00006000 00:18 77464613 [your program]
5653d983f000-5653d9860000 rw-p 00000000 00:00 0 [heap]
7f8669b66000-7f866ff6c000 rw-p 00000000 00:00 0 *** MALLOC'D MEMORY
7f866ff6c000-7f866ff79000 r--p 00000000 00:18 77146186 [shared libraries]
7f866ff79000-7f8670013000 r-xp 0000d000 00:18 77146186 [shared libraries]
...
After creating the first thread (notice how the start of the VMA changes from 7f8669b66000 to 7f8669366000, since it has grown in size):
...
5653d8b82000-5653d8b83000 r--p 00005000 00:18 77464613 [your program]
5653d8b83000-5653d8b84000 rw-p 00006000 00:18 77464613 [your program]
5653d983f000-5653d9860000 rw-p 00000000 00:00 0 [heap]
7f8669365000-7f8669366000 ---p 00000000 00:00 0 *** GUARD PAGE
7f8669366000-7f866ff6c000 rw-p 00000000 00:00 0 *** THREAD STACK + MALLOC'D MEMORY
7f866ff6c000-7f866ff79000 r--p 00000000 00:18 77146186 [shared libraries]
7f866ff79000-7f8670013000 r-xp 0000d000 00:18 77146186 [shared libraries]
...
You can clearly see that, after creating the thread, the kernel shows the two memory regions (thread stack + your malloc'd chunk) as a single VMA, given that they are contiguous, anonymous, and have the same protections (rw).
The guard page above the thread stack is treated as a separate VMA (it has different protections), and subsequent threads will mmap their stacks above it, so they will not affect the soft-dirty bits of your original memory region:
...
5653d8b82000-5653d8b83000 r--p 00005000 00:18 77464613 [your program]
5653d8b83000-5653d8b84000 rw-p 00006000 00:18 77464613 [your program]
5653d983f000-5653d9860000 rw-p 00000000 00:00 0 [heap]
7f8668363000-7f8668364000 ---p 00000000 00:00 0 *** GUARD PAGE
7f8668364000-7f8668b64000 rw-p 00000000 00:00 0 *** THREAD 3 STACK
7f8668b64000-7f8668b65000 ---p 00000000 00:00 0 *** GUARD PAGE
7f8668b65000-7f8669365000 rw-p 00000000 00:00 0 *** THREAD 2 STACK
7f8669365000-7f8669366000 ---p 00000000 00:00 0 *** GUARD PAGE
7f8669366000-7f866ff6c000 rw-p 00000000 00:00 0 *** THREAD 1 STACK + MALLOC'D MEMORY
7f866ff6c000-7f866ff79000 r--p 00000000 00:18 77146186 [shared libraries]
7f866ff79000-7f8670013000 r-xp 0000d000 00:18 77146186 [shared libraries]
...
This is why from the second thread onward you don't see anything unexpected happening.