c multithreading operating-system cpu-architecture page-fault

Why does the page fault not cause the thread to finish its execution later?

I have the below code where I'm intentionally creating a page fault in one of the threads in file.c

util.c

#include "util.h"

// to use as a fence() instruction
extern inline __attribute__((always_inline))
CYCLES rdtscp(void) {
    CYCLES cycles;
    // rdtscp also writes EDX and ECX, so they must be listed as clobbers
    asm volatile ("rdtscp" : "=a" (cycles) : : "rcx", "rdx");

    return cycles;
}

// initialize address (MAP_POPULATE pre-faults the pages at mmap time)
void init_ram_address(char* FILE_NAME){
    char *filename = FILE_NAME;
    int fd = open(filename, O_RDWR);
    if(fd == -1) {
        printf("Could not open file.\n");
        exit(EXIT_FAILURE);
    }
    void *file_address = mmap(NULL, DEFAULT_FILE_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_POPULATE, fd, 0);
    if(file_address == MAP_FAILED) {
        printf("Could not map file.\n");
        exit(EXIT_FAILURE);
    }
    ram_address = (int *) file_address;
}

// initialize address (no MAP_POPULATE, so the first access page-faults)
void init_disk_address(char* FILE_NAME){
    char *filename = FILE_NAME;
    int fd = open(filename, O_RDWR);
    if(fd == -1) {
        printf("Could not open file.\n");
        exit(EXIT_FAILURE);
    }
    void *file_address = mmap(NULL, DEFAULT_FILE_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if(file_address == MAP_FAILED) {
        printf("Could not map file.\n");
        exit(EXIT_FAILURE);
    }
    disk_address = (int *) file_address;
}

file.c

#include "util.h"

void *f1();
void *f2();

pthread_barrier_t barrier;
pthread_mutex_t mutex;

int main(int argc, char **argv)
{
    pthread_t t1, t2;

    // in ram
    init_ram_address(RAM_FILE_NAME);
    // in disk
    init_disk_address(DISK_FILE_NAME);

    pthread_create(&t1, NULL, &f1, NULL);
    pthread_create(&t2, NULL, &f2, NULL);

    
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    
    return 0;
}

void *f1()
{
    rdtscp();
    int load = *(ram_address);
    rdtscp();
    printf("Expecting this to be run first.\n");
    return NULL;
}

void *f2()
{
    rdtscp();
    int load = *(disk_address);
    rdtscp();
    printf("Expecting this to be run second.\n");
    return NULL;
}

I've used rdtscp() in the above code for fencing purposes (to ensure that the print statement gets executed only after the load operation is done).

Since t2 will incur a page fault, I expect t1 to finish executing its print statement first.

To run both the threads on the same core, I run taskset -c 10 ./file.

I see that t2 prints its statement before t1. What could be the reason for this?

Solution

I think you're expecting t2's int load = *(disk_address); to cause a context switch to t1, and since you're pinning everything to the same CPU core, that would give t1 time to win the race to take the lock for stdout.

A soft page fault doesn't need to context-switch; the kernel just updates the page tables to point at a file page that is already in the pagecache. Even though the mapping is backed by a disk file (rather than anonymous memory or copy-on-write tricks), if the file has been read or written recently it will be hot in the pagecache and the fault won't require I/O (which is what would make it a hard page fault).
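To check which kind of fault a load actually took, one option is to compare the minor/major fault counters from getrusage(2) before and after the access. This is only a sketch: load_and_count_faults and struct fault_delta are names made up for illustration, and RUSAGE_SELF counts faults for the whole process, so a fault taken by another thread at the same moment would be included.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/resource.h>

/* Hypothetical helper type: fault counts observed around one load. */
struct fault_delta {
    long minor;  /* soft faults: serviced without I/O */
    long major;  /* hard faults: required I/O */
    int  value;  /* the loaded value */
};

/* Load *addr once and report how many minor/major faults the process
   took between the two getrusage() calls. */
struct fault_delta load_and_count_faults(const volatile int *addr)
{
    struct rusage before, after;
    struct fault_delta d;

    assert(addr != NULL);
    getrusage(RUSAGE_SELF, &before);
    d.value = *addr;  /* volatile access: the compiler must perform the load */
    getrusage(RUSAGE_SELF, &after);

    d.minor = after.ru_minflt - before.ru_minflt;
    d.major = after.ru_majflt - before.ru_majflt;
    return d;
}
```

Calling it on disk_address and seeing major == 0 would suggest the page came from the pagecache (a soft fault) rather than from disk.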

Maybe try evicting disk cache before a test run, like with echo 3 | sudo tee /proc/sys/vm/drop_caches if this is Linux, so access to the mmap region without MAP_POPULATE will be a hard page fault (requiring I/O).

(See https://unix.stackexchange.com/questions/17936/setting-proc-sys-vm-drop-caches-to-clear-cache; sync first, at least on the disk file, if it was recently written, to make sure its page(s) are clean and able to be evicted, aka dropped. Dropping caches is mainly useful for benchmarking.)
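If you'd rather do this from a small helper program than from the shell, a rough C equivalent is to fsync the file and then write "3" to the control file. flush_file and request_cache_drop are hypothetical names; the control-file path is a parameter (on Linux you would pass /proc/sys/vm/drop_caches, which needs root) so the function can also be tried on an ordinary file.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Write back one file's dirty pages so they are clean and can be dropped.
   Returns 0 on success, -1 on error. Linux accepts fsync() on a
   read-only descriptor. */
int flush_file(const char *path)
{
    assert(path != NULL);
    int fd = open(path, O_RDONLY);
    if (fd == -1)
        return -1;
    int rc = fsync(fd);
    close(fd);
    return rc;
}

/* Write "3\n" to a drop_caches-style control file, like
   `echo 3 | sudo tee <control_path>`. Returns 0 on success, -1 on error. */
int request_cache_drop(const char *control_path)
{
    assert(control_path != NULL);
    int fd = open(control_path, O_WRONLY);
    if (fd == -1)
        return -1;
    ssize_t written = write(fd, "3\n", 2);
    int rc = close(fd);
    return (written == 2 && rc == 0) ? 0 : -1;
}
```

As with the shell command, this would be run before the test, e.g. flush_file(DISK_FILE_NAME) followed by request_cache_drop("/proc/sys/vm/drop_caches").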

Or programmatically, you can hint the kernel with the madvise(2) system call, like madvise(MADV_DONTNEED) on a page, encouraging it to evict it from pagecache soon. (Or at least hint that your process doesn't need it; other processes might keep it hot).

In Linux kernel 5.4 and later, MADV_COLD works as a hint to evict the specified page(s) on memory pressure. ("Deactivate" probably means remove from HW page tables, so next access will at least be a soft page fault.) Or MADV_PAGEOUT is apparently supposed to get the kernel to reclaim the specified page(s) right away, I guess before the system call returns. After that, the next access should be a hard page fault.

MADV_COLD (since Linux 5.4)
Deactivate a given range of pages. This will make the pages a more probable reclaim target should there be a memory pressure. This is a nondestructive operation. The advice might be ignored for some pages in the range when it is not applicable.

MADV_PAGEOUT (since Linux 5.4)
Reclaim a given range of pages. This is done to free up memory occupied by these pages. If a page is anonymous, it will be swapped out. If a page is file-backed and dirty, it will be written back to the backing storage. The advice might be ignored for some pages in the range when it is not applicable.

These madvise args are Linux-specific. The madvise system call itself (as opposed to posix_madvise) is not guaranteed portable, but the man page gives the impression that some other systems have their own madvise system calls supporting some standard "advice" hints to the kernel.
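A minimal sketch of the MADV_PAGEOUT hint, assuming Linux. page_base and pageout_page are hypothetical helper names, and the fallback #define is only for libc headers that predate Linux 5.4. Whether the page is actually reclaimed is still up to the kernel, so treat a 0 return as "hint accepted", not "page evicted".

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_PAGEOUT
#define MADV_PAGEOUT 21  /* value from the Linux 5.4+ uapi headers */
#endif

/* Round an address down to the start of the page that contains it. */
void *page_base(const volatile void *addr)
{
    uintptr_t page_size = (uintptr_t)sysconf(_SC_PAGESIZE);
    return (void *)((uintptr_t)addr & ~(page_size - 1));
}

/* Ask the kernel to reclaim the page containing addr.
   Returns 0 if the hint was accepted, otherwise the errno value
   (e.g. EINVAL on kernels that don't know MADV_PAGEOUT). */
int pageout_page(const volatile void *addr)
{
    assert(addr != NULL);
    size_t page_size = (size_t)sysconf(_SC_PAGESIZE);
    if (madvise(page_base(addr), page_size, MADV_PAGEOUT) != 0)
        return errno;
    return 0;
}
```

For the test in the question, this would be something like pageout_page(disk_address) right before creating the threads; the operation is nondestructive, so the data is still there on the next access.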

You haven't shown the declarations of ram_address and disk_address. If they're not pointers-to-volatile like volatile int *disk_address, the loads may be optimized away at compile time. Writes to non-escaped local vars like int load don't have to respect "memory" clobbers because nothing else could possibly have a reference to them.

If you compiled with optimization disabled (e.g. -O0), then yes, the load will still happen even without volatile.
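As a sketch of the volatile version, assuming the declaration in util.h can be changed (the real one isn't shown in the question, and load_from_disk_mapping is a made-up name):

```c
#include <assert.h>
#include <stddef.h>

/* Assumed declaration; the question doesn't show the real one from util.h. */
volatile int *disk_address;

/* Because *disk_address is a volatile access, the compiler has to emit
   the load even with optimization enabled and even if the result is unused. */
int load_from_disk_mapping(void)
{
    assert(disk_address != NULL);
    return *disk_address;
}
```

Returning the value isn't required for the load to happen, it just gives the caller something to use.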