rust · memory-management · linux-kernel · page-fault

Twice the number of page faults when accessing zero-initialized memory


Here's a simple Rust program that iterates through an array:

use std::time::Instant;

fn main() {
    let mut array = vec![0i32; 64 * 1024 * 1024];

    let start = Instant::now();
    workload(&mut array);
    let dur = start.elapsed().as_millis();

    println!("duration: {}ms", dur);
}

fn workload(arr: &mut [i32]) {
    for i in 0..arr.len() {
        arr[i] *= 3;
    }
}

Playground link

The measured time is 278ms when the program is run with the --release flag.

If I make a small change and initialize the vector with 1s instead of 0s, there's a dramatic speedup - the measured time is now reported as 17ms.

Playground link

I see a similar speedup on my own machine.

I'm not 100% sure why this speedup occurs, but running perf on this program shows roughly twice as many page faults in the zero-initialized case as in the ones-initialized one.

65K is the number of pages required for the 256MB array on my machine with 4KiB page sizes.

I can understand why there are 65K page faults, then, in the ones-initialized case: one fault per page.

Why are there 2x the number of faults in the zero-initialized case? I am aware of the zero page optimization, but shouldn't that still only incur a single minor page fault?

As a secondary question, could this be the reason for the dramatic speedup? I'm not sure how to attribute/measure time spent in handling page faults.
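One way to count the faults directly from inside the program (a sketch, assuming Linux, where minflt is the 10th field of /proc/self/stat per proc(5)) is to read the process's minor-fault counter before and after the loop:

```rust
use std::fs;
use std::time::Instant;

// minflt is the 10th field of /proc/self/stat (see proc(5)). The fields
// after the ")" closing the comm field start at field 3, so minflt is at
// whitespace-separated index 7 of the remainder.
fn minor_faults() -> u64 {
    let stat = fs::read_to_string("/proc/self/stat").unwrap();
    let after_comm = &stat[stat.rfind(')').unwrap() + 2..];
    after_comm.split_whitespace().nth(7).unwrap().parse().unwrap()
}

fn main() {
    let mut array = vec![0i32; 64 * 1024 * 1024];

    let faults_before = minor_faults();
    let start = Instant::now();
    for x in array.iter_mut() {
        *x *= 3;
    }
    let dur = start.elapsed().as_millis();
    let faults = minor_faults() - faults_before;

    println!("duration: {}ms, minor faults during loop: {}", dur, faults);
}
```

This attributes the fault counts to the loop, though not the time spent handling them.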


Solution

  • This is due to the way zero-initialized memory allocations are handled by the memory allocator in conjunction with the OS. There is a conflict of goals here between saving physical memory that has been requested but actually remains unused, on one hand, and runtime on the other.

    The first version (vec![0, ...]) is optimized by the Rust compiler into a call to __rust_alloc_zeroed(), which in turn will (probably) use calloc() to request memory from the underlying OS. calloc is purpose-built for allocating zero-initialized memory and allows for overcommitment: since the kernel knows the memory is supposed to be zeroed, there is no need to actually allocate and zero-initialize it up front. What userspace gets from calloc() is a pointer to a region of memory that the kernel has reserved for the process without actually backing it with physical memory.

    On the first read access to a page in that region, a page fault is triggered and handled by the kernel; the kernel handles it by backing the requested page with a single, shared, zero-initialized physical page. This means that if you read(!) a giant vec![0, ...] allocation, you are actually reading the same physical page over and over again - allowing the OS to save physical memory. This is also why reading a giant, pristine, zero-initialized region of memory is stupendously fast: no bits actually need to hit the wires from RAM to cache.

    On the first write access to a page in that region, another page fault is triggered; the kernel handles that one by backing the page with a unique physical page, so the memory contents can actually change (the one-zero-page-to-rule-them-all needs to remain zeroed, of course). This is why you see two page faults per page in your first example: once for reading zero during arr[i] *= 3, and once for writing back the result.
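    The two kinds of fault can be observed separately: a read-only pass over a pristine zero-initialized vector triggers one set of minor faults (mapping the shared zero page), and a subsequent write pass triggers a second set (copy-on-write). A sketch, assuming Linux, where minflt is the 10th field of /proc/self/stat per proc(5):

```rust
use std::fs;
use std::hint::black_box;

// minflt is the 10th field of /proc/self/stat; the fields after the ")"
// closing the comm field start at field 3, so minflt is at index 7.
fn minor_faults() -> u64 {
    let stat = fs::read_to_string("/proc/self/stat").unwrap();
    let after_comm = &stat[stat.rfind(')').unwrap() + 2..];
    after_comm.split_whitespace().nth(7).unwrap().parse().unwrap()
}

fn main() {
    let mut array = vec![0i32; 64 * 1024 * 1024]; // lazily backed by the zero page

    let base = minor_faults();
    let sum: i64 = array.iter().map(|&x| x as i64).sum(); // reads only
    black_box(sum); // keep the read loop from being optimized away
    let after_read = minor_faults();

    for x in array.iter_mut() {
        *x = 1; // first writes: copy-on-write faults
    }
    let after_write = minor_faults();

    println!("read pass:  {} minor faults", after_read - base);
    println!("write pass: {} minor faults", after_write - after_read);
}
```

    If the explanation above is right, both passes should each report on the order of one fault per 4 KiB page.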

    In the second version (vec![1, ...]) the memory is allocated without that memory-saving path and is immediately overwritten with 1s; these writes trigger the one page fault per page that you are observing. The arr[i] *= 3 step no longer triggers page faults, because by then the memory is fully backed by physical pages.
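    This shift of the faults from the workload to the initialization can be checked with the same minor-fault counter (a sketch, assuming Linux; minflt is the 10th field of /proc/self/stat per proc(5)):

```rust
use std::fs;

// minflt is the 10th field of /proc/self/stat; the fields after the ")"
// closing the comm field start at field 3, so minflt is at index 7.
fn minor_faults() -> u64 {
    let stat = fs::read_to_string("/proc/self/stat").unwrap();
    let after_comm = &stat[stat.rfind(')').unwrap() + 2..];
    after_comm.split_whitespace().nth(7).unwrap().parse().unwrap()
}

fn main() {
    let before_alloc = minor_faults();
    let mut array = vec![1i32; 64 * 1024 * 1024]; // writing the 1s faults in every page now
    let after_alloc = minor_faults();

    for x in array.iter_mut() {
        *x *= 3; // pages are already backed: few or no further faults
    }
    let after_loop = minor_faults();

    println!("during initialization: {} minor faults", after_alloc - before_alloc);
    println!("during workload:       {} minor faults", after_loop - after_alloc);
}
```

    The initialization should account for essentially all of the faults, while the workload loop runs over already-backed memory.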