I need to perform 128-bit by 64-bit divisions in Rust. The x86-64 ISA provides a native DIV instruction for exactly this purpose, but my compiled test code doesn't use it.
Test code:
pub fn div(hi: u64, lo: u64, divisor: u64) -> u64 {
    assert!(hi < divisor);
    let dividend = ((hi as u128) << 64) + lo as u128;
    (dividend / divisor as u128) as u64
}
Compiler explorer output:
example::div:
        push rax
        cmp rdi, rdx
        jae .LBB0_1
        mov rax, rdi
        mov rdi, rsi
        mov rsi, rax
        xor ecx, ecx
        call qword ptr [rip + __udivti3@GOTPCREL]
        pop rcx
        ret
.LBB0_1:
        ...
Instead, an inefficient 128-bit by 128-bit division is performed via __udivti3. This is presumably because the DIV instruction raises a CPU exception when the quotient does not fit into 64 bits. In my case, however, that is impossible: from hi < divisor and lo <= 2^64 - 1 it follows that

dividend = hi * 2^64 + lo <= (divisor - 1) * 2^64 + 2^64 - 1 = divisor * 2^64 - 1,

and therefore

dividend / divisor <= 2^64 - 1/divisor < 2^64.
How can I force the compiler to use the native instruction?
Your only real option is inline assembly. There might be an obscure combination of compiler flags that coerces LLVM into performing the optimization itself, but I don't think trying to find it is worth the effort. With inline assembly, it would look like this:
use std::arch::asm;
pub fn div(hi: u64, lo: u64, divisor: u64) -> u64 {
    assert!(hi < divisor);
    #[cfg(target_arch = "x86_64")]
    unsafe {
        let mut quot = lo; // rax: low half of the dividend, receives the quotient
        let mut _rem = hi; // rdx: high half of the dividend, receives the remainder
        asm!(
            // DIV divides rdx:rax by the operand; quotient -> rax, remainder -> rdx.
            // The assert above guarantees the quotient fits in 64 bits, so no fault.
            "div {divisor}",
            divisor = in(reg) divisor,
            inout("rax") quot,
            inout("rdx") _rem,
            options(pure, nomem, nostack)
        );
        quot
    }
    #[cfg(not(target_arch = "x86_64"))]
    {
        let dividend = ((hi as u128) << 64) + lo as u128;
        (dividend / divisor as u128) as u64
    }
}
On x86_64, this compiles the division down to a little register shuffling followed by a div, and falls back to the __udivti3 call on other targets. It also shouldn't get in the optimizer's way too much, since the asm block is marked pure.
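As a quick sanity check, the results can be compared against values that are easy to compute by hand. The snippet below uses a hypothetical portable helper div_ref that mirrors the non-x86_64 fallback path, so it runs on any target:

```rust
// Hypothetical portable reference mirroring the u128 fallback path;
// used only to sanity-check results against hand-computed quotients.
fn div_ref(hi: u64, lo: u64, divisor: u64) -> u64 {
    assert!(hi < divisor); // precondition: quotient fits in 64 bits
    let dividend = ((hi as u128) << 64) + lo as u128;
    (dividend / divisor as u128) as u64
}

fn main() {
    // 2^64 / 2 = 2^63
    assert_eq!(div_ref(1, 0, 2), 1u64 << 63);
    // A plain 64-bit division: hi = 0
    assert_eq!(div_ref(0, 100, 7), 14);
    // (3 * 2^64 + 5) / 10
    assert_eq!(div_ref(3, 5, 10), 5_534_023_222_112_865_485);
    println!("ok");
}
```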
It's definitely worth benchmarking your application to see whether this actually helps. It's a lot easier for LLVM to reason about integer division than about inline assembly, and missed optimizations elsewhere could easily make this version run slower than the default one.
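As a rough starting point, a std-only timing harness might look like the sketch below. It times the portable u128 path; swap in the asm version on x86_64 to compare. The loop bounds and input pattern are arbitrary, and this is no substitute for measuring your real workload:

```rust
use std::time::Instant;

// Portable u128 version to time as a baseline (hypothetical name).
fn div_baseline(hi: u64, lo: u64, divisor: u64) -> u64 {
    assert!(hi < divisor);
    let dividend = ((hi as u128) << 64) + lo as u128;
    (dividend / divisor as u128) as u64
}

fn main() {
    let start = Instant::now();
    let mut checksum = 0u64;
    for i in 0..1_000_000u64 {
        // hi = i % 1000 < 1000 <= 1000 + i, so the precondition holds
        checksum = checksum.wrapping_add(div_baseline(i % 1000, i, 1000 + i));
    }
    // Print the checksum so the compiler can't optimize the loop away.
    println!("checksum = {checksum}, elapsed = {:?}", start.elapsed());
}
```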