I need to perform 128-bit by 64-bit divisions in Rust. The x86-64 ISA provides a native DIV instruction for exactly this purpose, but my compiled test code doesn't use it.
Test code:
pub fn div(hi: u64, lo: u64, divisor: u64) -> u64 {
    assert!(hi < divisor);
    let dividend = ((hi as u128) << 64) + lo as u128;
    (dividend / divisor as u128) as u64
}
Compiler explorer output:
example::div:
        push rax
        cmp rdi, rdx
        jae .LBB0_1
        mov rax, rdi
        mov rdi, rsi
        mov rsi, rax
        xor ecx, ecx
        call qword ptr [rip + __udivti3@GOTPCREL]
        pop rcx
        ret
.LBB0_1:
        ...
Instead, an inefficient 128-bit by 128-bit division is performed via __udivti3. This is presumably because the DIV instruction raises a CPU exception when the quotient does not fit into 64 bits. In my case, however, that is impossible: from hi < divisor and lo <= 2^64 - 1 it follows that

dividend = hi * 2^64 + lo <= (divisor - 1) * 2^64 + 2^64 - 1 = divisor * 2^64 - 1,

and therefore

dividend / divisor <= 2^64 - 1/divisor < 2^64.
How can I force the compiler to use the native instruction?
Your only real option is inline assembly. There might be an obscure combination of compiler flags that coerces LLVM into performing the optimization itself, but I don't think trying to find it is worth the effort. With inline assembly, it would look like this:
use std::arch::asm;
pub fn div(hi: u64, lo: u64, divisor: u64) -> u64 {
    assert!(hi < divisor);
    #[cfg(target_arch = "x86_64")]
    unsafe {
        let mut quot = lo; // rax: low half of the dividend, receives the quotient
        let mut _rem = hi; // rdx: high half of the dividend, receives the remainder
        asm!(
            // DIV divides rdx:rax by the operand; quotient -> rax, remainder -> rdx.
            // The assert above guarantees the quotient fits in 64 bits, so no fault.
            "div {divisor}",
            divisor = in(reg) divisor,
            inout("rax") quot,
            inout("rdx") _rem,
            options(pure, nomem, nostack)
        );
        quot
    }
    #[cfg(not(target_arch = "x86_64"))]
    {
        let dividend = ((hi as u128) << 64) + lo as u128;
        (dividend / divisor as u128) as u64
    }
}
On x86_64, this compiles the division down to a little register shuffling followed by a div, and falls back to the __udivti3 call on other targets. It also shouldn't get in the optimizer's way too much, since the asm block is marked pure.
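As a quick sanity check, the results can be compared against values that are easy to compute by hand. The snippet below uses a hypothetical portable helper div_ref that mirrors the non-x86_64 fallback path, so it runs on any target:

```rust
// Hypothetical portable reference mirroring the u128 fallback path;
// used only to sanity-check results against hand-computed quotients.
fn div_ref(hi: u64, lo: u64, divisor: u64) -> u64 {
    assert!(hi < divisor); // precondition: quotient fits in 64 bits
    let dividend = ((hi as u128) << 64) + lo as u128;
    (dividend / divisor as u128) as u64
}

fn main() {
    // 2^64 / 2 = 2^63
    assert_eq!(div_ref(1, 0, 2), 1u64 << 63);
    // A plain 64-bit division: hi = 0
    assert_eq!(div_ref(0, 100, 7), 14);
    // (3 * 2^64 + 5) / 10
    assert_eq!(div_ref(3, 5, 10), 5_534_023_222_112_865_485);
    println!("ok");
}
```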
It's definitely worth benchmarking your application to see whether this actually helps. It's a lot easier for LLVM to reason about integer division than about inline assembly, and missed optimizations elsewhere could easily make this version run slower than the default one.
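As a rough starting point, a std-only timing harness might look like the sketch below. It times the portable u128 path; swap in the asm version on x86_64 to compare. The loop bounds and input pattern are arbitrary, and this is no substitute for measuring your real workload:

```rust
use std::time::Instant;

// Portable u128 version to time as a baseline (hypothetical name).
fn div_baseline(hi: u64, lo: u64, divisor: u64) -> u64 {
    assert!(hi < divisor);
    let dividend = ((hi as u128) << 64) + lo as u128;
    (dividend / divisor as u128) as u64
}

fn main() {
    let start = Instant::now();
    let mut checksum = 0u64;
    for i in 0..1_000_000u64 {
        // hi = i % 1000 < 1000 <= 1000 + i, so the precondition holds
        checksum = checksum.wrapping_add(div_baseline(i % 1000, i, 1000 + i));
    }
    // Print the checksum so the compiler can't optimize the loop away.
    println!("checksum = {checksum}, elapsed = {:?}", start.elapsed());
}
```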