rust, code-generation, compiler-optimization

How is Rust able to optimize Option::is_some_and so effectively?


Looking at the codegen of a check inside a for loop, I wanted to see whether there was an optimization opportunity in outlining is_some_and, but both versions produced the same codegen.

struct V {
    len: Option<u64>,
}

pub fn check(num: u64, v: &V, vec: Vec<u64>) -> u64 {
    for i in 0..num {
        if v.len.is_some_and(|m| vec[i as usize] > m) {
            return 20;
        }
    }
    100
}

pub fn check_out(num: u64, v: &V, vec: Vec<u64>) -> u64 {
    let vlen = v.len;
    let vlen_is_some = v.len.is_some();
    for i in 0..num {
        if vlen_is_some && (vec[i as usize] > vlen.unwrap()) {
            return 20;
        }
    }
    100
}

In both cases the loop has the same codegen.

// check
.LBB0_3:
        cmp     rsi, rcx
        je      .LBB0_14
        cmp     qword ptr [rbx + 8*rcx], rax
        ja      .LBB0_16
        inc     rcx
        cmp     rdi, rcx
        jne     .LBB0_3



// check_out
.LBB1_3:
        cmp     rsi, rcx
        je      .LBB1_4
        cmp     qword ptr [rbx + 8*rcx], rax
        ja      .LBB1_16
        inc     rcx
        cmp     rdi, rcx
        jne     .LBB1_3

https://godbolt.org/z/P6e89Wcrn for reference.

How is Rust (the compiler/type system) able to do that, and is there anything that can still be improved?


Solution

  • There are multiple possible ways that the Rust compiler could work this optimisation out, and I don't know which specific one it's using, but here's one of the simplest it could be using:

    LLVM (the optimiser backend used by Rust) likes to move things from inside loops to outside when possible (an optimisation known as loop-invariant code motion), because this is almost always faster. So, v.len can be read once before the loop rather than each time around it.

    The next thing to note is that is_some_and is generic – it takes an impl FnOnce(T) -> bool as its argument, which makes it generic on the type of the closure. Each closure has a different, anonymous type; let's call this type {closure#1}. So, after desugaring the impl FnOnce into a generic parameter, you are calling what is in effect Option::<u64>::is_some_and::<{closure#1}>. This is a very specific type – the closure in question doesn't appear anywhere else in the program – meaning that there is only one call to is_some_and with this particular type. Rust generates a separate copy of each generic function for each set of types it is used with (a process called monomorphisation), and so this particular instance of is_some_and ends up being called from exactly one place. Because it's a function that's only called once, there is no reason not to inline it (inlining normally hurts code size by duplicating the function body, which is a reason not to do it everywhere, but a function with a single caller needs no additional copies). So, the compiler will in effect insert the body of is_some_and, with the closure hardcoded, into your code.
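    For reference, the standard library's is_some_and is essentially just a match (roughly match self { None => false, Some(x) => f(x) }), so a sketch of what check is equivalent to after that inlining – my reconstruction, not actual compiler output – looks like this:

    ```rust
    struct V {
        len: Option<u64>,
    }

    // Roughly what `check` looks like once `is_some_and` has been inlined
    // and the closure body substituted in. (A reconstruction, not output
    // from the compiler.)
    pub fn check_inlined(num: u64, v: &V, vec: Vec<u64>) -> u64 {
        for i in 0..num {
            let cond = match v.len {
                None => false,
                Some(m) => vec[i as usize] > m,
            };
            if cond {
                return 20;
            }
        }
        100
    }

    fn main() {
        let v = V { len: Some(5) };
        assert_eq!(check_inlined(3, &v, vec![1, 2, 9]), 20);
        assert_eq!(check_inlined(3, &v, vec![1, 2, 3]), 100);
        assert_eq!(check_inlined(3, &V { len: None }, vec![9, 9, 9]), 100);
        println!("ok");
    }
    ```

    From here the match on v.len is plainly the same on every iteration, which is what the next optimisation steps exploit.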

    With the hardcoded is_some_and, the Rust compiler is then able to see that the data it's looking at (the None/Some-ness of the option, and the value inside) doesn't change across iterations of the loop. (When you write code that handles Options, you don't normally know or care how they're represented internally – but the compiler knows, and is able to read uninitialised data from the Some variant of an Option<u64> even while it holds None. That trick doesn't work for every Option instantiation, but it does for this one specifically. The representation of Option<u64> might change in the future, but the optimiser will of course be working with the current representation.) At this point, the loop would, if it were written in Rust rather than in LLVM's internal representation, look something like this:

    use std::mem::MaybeUninit;

    let vlen = v.len;
    let vlen_is_some = vlen.is_some();
    let mut vlen_unwrapped: MaybeUninit<u64> = MaybeUninit::uninit();
    if let Some(l) = vlen { vlen_unwrapped.write(l); }
    for i in 0..num {
        if vlen_is_some && (vec[i as usize] > unsafe { vlen_unwrapped.assume_init() }) {
            return 20;
        }
    }
    

    This is now close enough to your second version that the two optimise the same way from there (the optimisation sequence continues by moving the if vlen_is_some outside the loop because it is the same on every iteration).
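    In the same Rust-as-pseudocode style, the result of that unswitching step might be sketched as follows (again my reconstruction; note that in the None case the condition short-circuits before the indexing ever happens, so the loop body has no effect at all and the whole loop is dead):

    ```rust
    struct V {
        len: Option<u64>,
    }

    // Sketch of the loop after unswitching: the is_some test runs once, up
    // front, leaving a specialised loop for each case. (A reconstruction.)
    pub fn check_unswitched(num: u64, v: &V, vec: Vec<u64>) -> u64 {
        match v.len {
            Some(m) => {
                for i in 0..num {
                    if vec[i as usize] > m {
                        return 20;
                    }
                }
            }
            // With len == None the condition can never be true and nothing
            // in the loop body runs, so the loop is removed entirely.
            None => {}
        }
        100
    }

    fn main() {
        assert_eq!(check_unswitched(3, &V { len: Some(5) }, vec![1, 2, 9]), 20);
        assert_eq!(check_unswitched(3, &V { len: None }, vec![9, 9, 9]), 100);
        println!("ok");
    }
    ```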

    The basic rules of Rust that make all this possible, though, are a) the targets of shared references don't change, except through types with interior mutability such as Cell, and b) a separate copy of each generic function is generated for each set of type parameters it's used with, which for functions that accept closures normally means the function gets inlined at its call site.
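    Rule a) is easiest to see in the negative: wrap the field in a Cell (the VCell type below is my hypothetical variant of V, not from the question) and the blanket guarantee disappears – anyone else holding a &VCell could change the value, so the type system no longer promises the read is loop-invariant. (LLVM may still manage to hoist the read in this particular function via alias analysis, but it has to prove it rather than getting it for free from &T.)

    ```rust
    use std::cell::Cell;

    // Hypothetical variant of V with interior mutability; the name is mine.
    struct VCell {
        len: Cell<Option<u64>>,
    }

    pub fn check_cell(num: u64, v: &VCell, vec: Vec<u64>) -> u64 {
        for i in 0..num {
            // Conceptually this must be re-read every iteration: `&VCell`
            // no longer guarantees that `len` can't change under our feet.
            if v.len.get().is_some_and(|m| vec[i as usize] > m) {
                return 20;
            }
        }
        100
    }

    fn main() {
        let v = VCell { len: Cell::new(Some(5)) };
        assert_eq!(check_cell(3, &v, vec![1, 2, 9]), 20);
        v.len.set(None);
        assert_eq!(check_cell(3, &v, vec![1, 2, 9]), 100);
        println!("ok");
    }
    ```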

    (Note that even a C compiler, which would have less information, might be able to get this one right – it wouldn't be able to use the reasoning above, but it might instead see that the loop never writes to memory and thus *len can't change, and might decide that is_some_and is simple enough to be worth inlining even if it were used more than once. This sort of code isn't actually difficult for modern optimisers to handle well.)