Tags: c, rust, llvm-codegen

What optimization techniques are applied to Rust code that sums up a simple arithmetic sequence?


The Rust code is naive:

use std::time;

fn main() {
    const NUM_LOOP: u64 = std::u64::MAX;
    let mut sum = 0u64;
    let now = time::Instant::now();
    for i in 0..NUM_LOOP {
        sum += i;
    }
    let d = now.elapsed();
    println!("{}", sum);
    println!("loop: {}.{:09}s", d.as_secs(), d.subsec_nanos());
}

The output is:

$ ./test.rs.out
9223372036854775809
loop: 0.000000060s
$ ./test.rs.out
9223372036854775809
loop: 0.000000052s
$ ./test.rs.out
9223372036854775809
loop: 0.000000045s
$ ./test.rs.out
9223372036854775809
loop: 0.000000041s
$ ./test.rs.out
9223372036854775809
loop: 0.000000046s
$ ./test.rs.out
9223372036854775809
loop: 0.000000047s
$ ./test.rs.out
9223372036854775809
loop: 0.000000045s

The program ends almost immediately. I also wrote equivalent code in C using a for loop, but it runs for a long time. I'm wondering what makes the Rust code so fast.

The C code:

#include <stdint.h>
#include <stdio.h>
#include <time.h>

double time_elapse(struct timespec start) {
    struct timespec now;
    clock_gettime(CLOCK_MONOTONIC, &now);
    return now.tv_sec - start.tv_sec +
           (now.tv_nsec - start.tv_nsec) / 1000000000.;
}

int main() {
    const uint64_t NUM_LOOP = 18446744073709551615u;
    uint64_t sum = 0;
    struct timespec now;
    clock_gettime(CLOCK_MONOTONIC, &now);

    for (int i = 0; i < NUM_LOOP; ++i) { /* bug (discussed below): i is an int, not uint64_t */
        sum += i;
    }

    double t = time_elapse(now);
    printf("value of sum is: %llu\n", sum);
    printf("time elapse is: %lf sec\n", t);

    return 0;
}

The Rust code is compiled with -O and the C code with -O3. The C code runs so slowly that it still hasn't finished.
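For the record, the build commands look like this (the source file names are my inference from the output above; -O and -O3 are the flags meant):

$ rustc -O test.rs -o test.rs.out
$ gcc -O3 test.c -o test.c.out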

After fixing the bug found by visibleman and Sandeep, both programs printed the same number in almost no time. When I reduced NUM_LOOP by one, the result still looked reasonable given the wrap-around overflow. Moreover, with NUM_LOOP = 1000000000, neither program overflows and both produce the correct answer in no time. What optimizations are used here? I know the result can be computed with a simple closed-form expression like (0 + NUM_LOOP - 1) * NUM_LOOP / 2, but I didn't think compilers performed that transformation when the computation overflows...
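As a sanity check (this is just my own arithmetic in Rust, not what the compiler does), evaluating the closed-form sum with wrapping (mod 2^64) arithmetic does reproduce the printed value:

fn gauss_wrapping(n: u64) -> u64 {
    // sum of 0..n is n * (n - 1) / 2; halve whichever factor is even so
    // the division is exact, then let the multiplication wrap mod 2^64
    if n % 2 == 0 {
        (n / 2).wrapping_mul(n.wrapping_sub(1))
    } else {
        ((n - 1) / 2).wrapping_mul(n)
    }
}

fn main() {
    // matches the number printed by the looping program above
    assert_eq!(gauss_wrapping(u64::MAX), 9223372036854775809);
}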


Solution

  • Your Rust code (without the printing and timing) compiles down to this (on Godbolt):

    movabs rax, -9223372036854775807   ; == 9223372036854775809 as u64
    ret
    

    LLVM just const-folds the whole function and calculates the final value for you.

    Let's make the upper limit dynamic (non-constant) to prevent this aggressive constant folding:

    pub fn foo(num: u64) -> u64 {
        let mut sum = 0u64;
        for i in 0..num {
            sum += i;
        }
    
        sum
    }
    

    This results in (Godbolt):

      test rdi, rdi            ; if num == 0
      je .LBB0_1               ; jump to .LBB0_1 (return 0)
      lea rax, [rdi - 1]       ; rax = num - 1
      lea rcx, [rdi - 2]       ; rcx = num - 2
      mul rcx                  ; rdx:rax = (num - 1) * (num - 2), full 128-bit product
      shld rdx, rax, 63        ; rdx = low 64 bits of that product / 2
      lea rax, [rdx + rdi]     ; rax = rdx + num
      add rax, -1              ; rax -= 1
      ret
    .LBB0_1:
      xor eax, eax             ; sum = 0
      ret
    

    As you can see, the optimizer understood that you summed all numbers from 0 to num and replaced your loop with the closed-form expression ((num - 1) * (num - 2)) / 2 + num - 1 (evaluated with wrapping arithmetic, which is why the overflowing case still produces the value above). As for the first example: the optimizer probably performed this loop-to-formula transformation first and then constant-folded the result.
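    To double-check that formula, here is a small verification sketch of my own (not compiler output) that mirrors the assembly, using a 128-bit intermediate product like the mul/shld pair does, and compares it against the naive loop:

    fn loop_sum(num: u64) -> u64 {
        (0..num).fold(0u64, |s, i| s.wrapping_add(i))
    }

    fn llvm_formula(num: u64) -> u64 {
        if num == 0 {
            return 0;
        }
        // 128-bit product of (num - 1) and (num - 2), both wrapping mod 2^64,
        // halved (the shld step), then + num - 1, truncated back to 64 bits
        let prod = (num.wrapping_sub(1) as u128) * (num.wrapping_sub(2) as u128);
        ((prod >> 1) as u64).wrapping_add(num).wrapping_sub(1)
    }

    fn main() {
        for num in [0, 1, 2, 10, 12345] {
            assert_eq!(loop_sum(num), llvm_formula(num));
        }
    }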

    Additional notes