Here's a simple Rust program that iterates through an array:
```rust
use std::time::Instant;

fn main() {
    let mut array = vec![0i32; 64 * 1024 * 1024];

    let start = Instant::now();
    workload(&mut array);
    let dur = start.elapsed().as_millis();

    println!("duration: {}ms", dur);
}

fn workload(arr: &mut [i32]) {
    for i in 0..arr.len() {
        arr[i] *= 3;
    }
}
```
The measured time is 278 ms when run with the `--release` flag.
If I make a small change and initialize the vector with `1`s instead of `0`s, there's a dramatic speedup: it now reports the measured time as 17 ms.
I see a similar speedup on my own machine.
I'm not 100% sure why this speedup occurs, but running `perf` on this program shows that the zero-initialized version incurs roughly twice as many page faults as the ones-initialized version (on the order of 130K vs. 65K).
65K is the number of pages required for the 256 MiB array on my machine with 4 KiB pages (64 Mi `i32` elements × 4 bytes = 256 MiB, and 256 MiB / 4 KiB ≈ 65K pages). I can understand, then, why there are 65K page faults in the ones-initialized case: one fault per page.
Why are there twice as many faults in the zero-initialized case? I am aware of the zero-page optimization, but shouldn't that still incur only a single minor page fault per page?
As a secondary question, could this be the reason for the dramatic speedup? I'm not sure how to attribute or measure the time spent handling page faults.
This is due to the way zero-initialized memory allocations are handled by the memory allocator in conjunction with the OS. There is a conflict of goals here between saving physical memory that has been requested but actually remains unused, on the one hand, and runtime on the other.
The first version (`vec![0, ...]`) will be optimized by the Rust compiler to a call to `__rust_alloc_zeroed()`, which in turn will (probably) use `calloc()` to request memory from the underlying OS. `calloc` is purpose-built for allocating zero-initialized memory and allows for overcommitment: since the kernel knows that the memory is supposed to be zeroed, there is no need to actually allocate and zero-initialize it up front. What userspace gets from `calloc()` is a pointer to a region of memory that the kernel has reserved for that process without actually backing it with physical memory.
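To see that deferred backing directly, here is a minimal sketch (the 4 KiB page size is an assumption, and the timings will vary per machine): the zeroed allocation itself returns almost instantly, while the first write to each page pays the page-fault cost.

```rust
use std::hint::black_box;
use std::time::Instant;

fn main() {
    // Requesting 256 MiB of zeroed memory returns almost immediately:
    // the allocator hands back address space that is not yet backed by RAM.
    let start = Instant::now();
    let mut v = vec![0u8; 256 * 1024 * 1024];
    println!("alloc zeroed: {:?}", start.elapsed());

    // Writing to one byte per (assumed 4 KiB) page forces the kernel to
    // actually back the pages, one page fault at a time.
    let start = Instant::now();
    for i in (0..v.len()).step_by(4096) {
        v[i] = 1;
    }
    println!("first touch:  {:?}", start.elapsed());
    black_box(&v); // keep the optimizer from discarding the writes
}
```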
On the first read access to a page in that memory region, a page fault is triggered and handled by the kernel; the kernel handles it by backing the requested page with a single, shared, zero-initialized physical page. This also means that if you only read(!) a giant `vec![0, ...]` allocation, you're actually reading the same physical page over and over again, allowing the OS to save physical memory; it is also why reading a giant, zero-initialized, pristine region of memory is stupendously fast: no bits actually need to hit the wires from RAM to cache. On the first write access to a page in that memory region, another page fault is triggered; the kernel handles that one by backing the page with its own unique physical page, so the memory contents can actually change (the one-zero-page-to-rule-them-all needs to remain zeroed, of course). This is why you see two page faults per page in your first example: once for reading zero during `arr[i] *= 3`, and once for writing back the result.
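To observe the two faults separately, a rough sketch along these lines splits the first read of each page from the first write (it assumes 4 KiB pages, so touching every 1024th `i32` hits each page exactly once):

```rust
use std::hint::black_box;
use std::time::Instant;

fn main() {
    // 64 Mi i32 elements = 256 MiB; with 4 KiB pages (assumed), every
    // 1024th element lands on a different page. black_box keeps the
    // optimizer from assuming the contents are known to be zero.
    let mut v = black_box(vec![0i32; 64 * 1024 * 1024]);

    // First read of each page: the kernel maps a shared, read-only
    // zero page (first minor fault per page).
    let start = Instant::now();
    let mut sum = 0i64;
    for i in (0..v.len()).step_by(1024) {
        sum += v[i] as i64;
    }
    println!("read pass:  {:?} (sum = {})", start.elapsed(), sum);

    // First write to each page: the store hits the read-only zero-page
    // mapping and triggers a copy-on-write fault (second minor fault).
    let start = Instant::now();
    for i in (0..v.len()).step_by(1024) {
        v[i] = 1;
    }
    println!("write pass: {:?}", start.elapsed());
    black_box(&v);
}
```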
In the second version (`vec![1, ...]`), the memory is allocated without that memory-saving path and is immediately overwritten with `1`s; this write triggers the one page fault per page that you are observing. The `arr[i] *= 3` step then no longer triggers any page faults, because the memory is fully backed by that point.
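As for the secondary question: page-fault handling is very likely where most of the time difference goes. One way to convince yourself, sketched below, is to run the workload twice over the same zero-initialized vector; the first pass pays for the faults, while the second runs over fully backed memory and should be close to the ones-initialized timing.

```rust
use std::time::Instant;

fn main() {
    let mut array = vec![0i32; 64 * 1024 * 1024];

    // First pass: each page costs a zero-page fault plus a copy-on-write
    // fault, so page-fault handling is paid on top of the arithmetic.
    let start = Instant::now();
    workload(&mut array);
    println!("first run:  {}ms", start.elapsed().as_millis());

    // Second pass: every page is now backed by private physical memory,
    // so only the arithmetic itself remains.
    let start = Instant::now();
    workload(&mut array);
    println!("second run: {}ms", start.elapsed().as_millis());
}

fn workload(arr: &mut [i32]) {
    for i in 0..arr.len() {
        arr[i] *= 3;
    }
}
```

If you want the fault counts themselves rather than the time, something like `perf stat -e page-faults,minor-faults,major-faults ./your-binary` will report them for each run.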