The code is naive:
use std::time;

fn main() {
    const NUM_LOOP: u64 = std::u64::MAX;
    let mut sum = 0u64;
    let now = time::Instant::now();
    for i in 0..NUM_LOOP {
        sum += i;
    }
    let d = now.elapsed();
    println!("{}", sum);
    println!("loop: {}.{:09}s", d.as_secs(), d.subsec_nanos());
}
The output is:
$ ./test.rs.out
9223372036854775809
loop: 0.000000060s
$ ./test.rs.out
9223372036854775809
loop: 0.000000052s
$ ./test.rs.out
9223372036854775809
loop: 0.000000045s
$ ./test.rs.out
9223372036854775809
loop: 0.000000041s
$ ./test.rs.out
9223372036854775809
loop: 0.000000046s
$ ./test.rs.out
9223372036854775809
loop: 0.000000047s
$ ./test.rs.out
9223372036854775809
loop: 0.000000045s
The program ends almost immediately. I also wrote equivalent code in C using a for loop, but it ran for a long time. I'm wondering what makes the Rust code so fast.
The C code:
#include <stdint.h>
#include <time.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

double time_elapse(struct timespec start) {
    struct timespec now;
    clock_gettime(CLOCK_MONOTONIC, &now);
    return now.tv_sec - start.tv_sec +
           (now.tv_nsec - start.tv_nsec) / 1000000000.;
}

int main() {
    const uint64_t NUM_LOOP = 18446744073709551615u;
    uint64_t sum = 0;
    struct timespec now;
    clock_gettime(CLOCK_MONOTONIC, &now);
    for (int i = 0; i < NUM_LOOP; ++i) {
        sum += i;
    }
    double t = time_elapse(now);
    printf("value of sum is: %llu\n", sum);
    printf("time elapse is: %lf sec\n", t);
    return 0;
}
The Rust code is compiled using -O and the C code is compiled using -O3. The C code is running so slow that it hasn't stopped yet.
After fixing the bug found by visibleman and Sandeep, both programs printed the same number in almost no time. When I reduced NUM_LOOP by one, the results seemed reasonable given the overflow. Moreover, with NUM_LOOP = 1000000000, neither program overflows and both produce the correct answer in no time. What optimizations are used here? I know we could use a simple formula like (0 + NUM_LOOP - 1) * NUM_LOOP / 2 to compute the result, but I didn't think compilers performed such computations when overflow is involved...
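For what it's worth, the printed value 9223372036854775809 is exactly what Gauss's formula gives under wrapping (mod 2^64) arithmetic; a minimal sketch (variable names are my own):

```rust
fn main() {
    // Sum of 0..NUM_LOOP is NUM_LOOP * (NUM_LOOP - 1) / 2 (Gauss's formula).
    // NUM_LOOP = u64::MAX is odd, so NUM_LOOP - 1 is even: dividing that
    // factor first keeps the division exact; the product then wraps mod 2^64.
    let num: u64 = u64::MAX;
    let sum = num.wrapping_mul((num - 1) / 2);
    println!("{}", sum); // prints 9223372036854775809, matching the output above
}
```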
Your Rust code (without the prints and timing) compiles down to this (on Godbolt):
movabs rax, -9223372036854775807
ret
LLVM just const-folds the whole function and calculates the final value for you.
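You can watch the const folding disappear by hiding the inputs behind std::hint::black_box (stable since Rust 1.66), which makes the compiler treat a value as unknown; a sketch, not part of the original code:

```rust
use std::hint::black_box;
use std::time::Instant;

fn main() {
    // black_box hides the bound from the optimizer, so the whole
    // computation can no longer be folded into a single constant.
    let num: u64 = black_box(1_000_000);
    let mut sum = 0u64;
    let now = Instant::now();
    for i in 0..num {
        // black_box(i) also defeats the closed-form rewrite,
        // so the loop actually executes at runtime.
        sum = sum.wrapping_add(black_box(i));
    }
    println!("sum = {}, elapsed = {:?}", sum, now.elapsed());
}
```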
Let's make the upper limit dynamic (non constant) to avoid this aggressive constant folding:
pub fn foo(num: u64) -> u64 {
    let mut sum = 0u64;
    for i in 0..num {
        sum += i;
    }
    sum
}
This results in (Godbolt):
test rdi, rdi ; if num == 0
je .LBB0_1 ; jump to .LBB0_1
lea rax, [rdi - 1]   ; rax = num - 1
lea rcx, [rdi - 2]   ; rcx = num - 2
mul rcx              ; rdx:rax = rax * rcx (full 128-bit product)
shld rdx, rax, 63    ; rdx = (rdx:rax) >> 1, i.e. product / 2
lea rax, [rdx + rdi] ; rax = rdx + num
add rax, -1          ; rax -= 1
ret
.LBB0_1:
xor eax, eax ; sum = 0
ret
As you can see, the optimizer recognized that you were summing all numbers from 0 to num and replaced your loop with the closed-form expression ((num - 1) * (num - 2)) / 2 + num - 1 (algebraically equal to num * (num - 1) / 2). As for the example above: the optimizer presumably first rewrote the loop into this closed form and then constant-folded it.
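To convince yourself that the assembly's closed form matches the loop, you can transcribe it back into Rust, using a 128-bit product for the mul and wrapping arithmetic elsewhere (this transcription is mine, not compiler output):

```rust
fn loop_sum(num: u64) -> u64 {
    let mut sum = 0u64;
    for i in 0..num {
        sum = sum.wrapping_add(i);
    }
    sum
}

// Mirror of the generated assembly: `mul` is a 64x64 -> 128-bit multiply,
// and `shld` shifts the full 128-bit product right by one bit.
fn closed_form(num: u64) -> u64 {
    if num == 0 {
        return 0; // the .LBB0_1 branch
    }
    let prod = (num.wrapping_sub(1) as u128) * (num.wrapping_sub(2) as u128);
    ((prod >> 1) as u64).wrapping_add(num).wrapping_sub(1)
}

fn main() {
    for num in [0, 1, 2, 5, 1_000_000] {
        assert_eq!(loop_sum(num), closed_form(num));
    }
    println!("closed form matches the loop");
}
```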
clang generates exactly the same assembly (unsurprisingly). However, GCC doesn't seem to know about this optimization and generates pretty much the assembly you would expect: a loop.

A note on idiomatic Rust: you can write the whole sum as (0..num).sum(). Despite this using more layers of abstraction (namely, iterators), the compiler generates exactly the same code as above.

To print a Duration in Rust, you can use the {:?} format specifier. println!("{:.2?}", d); prints the duration in the most fitting unit with a precision of 2. That's a fine way to print the time for almost all kinds of benchmarks.
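For example (the duration values here are made up for illustration):

```rust
use std::time::Duration;

fn main() {
    // {:?} picks the most fitting unit automatically; a precision
    // modifier like {:.2} applies to the fractional digits.
    let d = Duration::from_nanos(45);
    println!("{:?}", d);   // 45ns
    println!("{:.2?}", d); // 45.00ns
    let d = Duration::from_millis(1234);
    println!("{:.2?}", d); // 1.23s
}
```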