I want to use RDTSC in Rust to benchmark how many ticks my function takes.
There's a built-in std::arch::x86_64::_rdtsc, but alas it always compiles to:
rdtsc
shl rdx, 32
or rax, rdx
Since the output always goes into the EDX:EAX (or RDX:EAX) register pair, even on x64 platforms, the intrinsic glues the two halves together. I don't need that extra work, as it affects the measurement. I want a plain RDTSC.
I tried to hint to the compiler that keeping the halves separate is fine for me, hoping the optimizer would understand what I want:
fn tick_count() -> (u64, u64) {
    let x = unsafe { _rdtsc() };
    (x & 0xFFFFFFFF, x >> 32)
}
Alas, it didn't work:
rdtsc
shl rdx, 32
or rdx, rax
mov eax, edx
shr rdx, 32
ret
Basically, it just plays ping-pong: first it shifts and joins the halves, then separates them again with the exact inverse operations.

Why does this happen, and is there any way to write this in plain Rust without asm!?
Try it out on Godbolt.
First of all, as @user555045 commented, the very short intervals where this overhead would matter are very hard to microbenchmark. If this overhead is a problem, your measurements probably won't mean much even if you do avoid it, unless you know exactly what you're doing and what you're trying to learn about the CPU you're running on. It's not usually a useful way to answer a "which is faster" question.
If you can use just the low 32 bits of _rdtsc (e.g. for measuring intervals much less than 2^32 reference cycles), not using the high half of the result at all does let the shift/or optimize away. If end - start fits in 32 bits, it doesn't change the result if end and start are truncated to 32 bits first. (Carry/borrow only propagates from low to high.)
Use wrapping subtraction on the u32 values, end.wrapping_sub(start), since the full TSC could have crossed a 32-bit boundary between start and end.
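A concrete sketch of the truncation argument (the TSC values here are made-up numbers, not real rdtsc output):

```rust
// Sketch: interval measurement using only the low 32 bits of the TSC.
// end.wrapping_sub(start) gives the right delta even when the full 64-bit
// counter crossed a 2^32 boundary between the two samples, as long as the
// true interval fits in 32 bits (carry only propagates from low to high).
fn interval_low32(start: u32, end: u32) -> u32 {
    end.wrapping_sub(start)
}

fn main() {
    // Hypothetical full 64-bit TSC samples straddling a 2^32 boundary:
    let start_tsc: u64 = 0x0000_0001_FFFF_FF00;
    let end_tsc: u64 = 0x0000_0002_0000_0100;
    let delta = interval_low32(start_tsc as u32, end_tsc as u32);
    assert_eq!(delta as u64, end_tsc - start_tsc); // both are 0x200 = 512
    println!("{delta}");
}
```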
It seems _rdtsc() uses a fixed recipe to produce a single u64 which LLVM isn't able to optimize, except to remove it entirely if the result is unused. It's as if the shift/OR were a non-volatile asm statement which the optimizer doesn't understand, except as a black box whose result is either used or not. The LLVM-IR for it is %x = tail call noundef i64 @llvm.x86.rdtsc(), producing a single 64-bit integer return value.
use std::arch::x86_64::_rdtsc;

pub fn tick_low() -> u32 {
    let x = unsafe { _rdtsc() };
    x as u32
}
(Godbolt, for this and the later examples)
# rustc 1.89 -O
example::tick_low:
rdtsc
ret
If you ask for it zero-extended to 64-bit with x as u32 as u64 or x & 0xFFFFFFFF, the compiler unfortunately adds a mov eax, eax instruction to zero-extend EAX into RAX, even though it is already zero-extended from rdtsc writing EAX.
Using u32 halves instead of u64 in your tick_count function avoids the mov eax, edx you got, but otherwise it's the same mess:
pub fn tick_count() -> (u32, u32) {
    let x = unsafe { _rdtsc() };
    (x as u32, (x >> 32) as u32) // low 32 in EAX, high 32 in EDX
}
example::tick_count:
rdtsc
shl rdx, 32
or rdx, rax
# mov eax, edx for the low half is *not* present, unlike with u64
shr rdx, 32
ret
There doesn't seem to be much hope for doing anything with the high half without having LLVM pack and unpack it, instead of emitting just rdtsc / mov eax, edx:
pub fn tick_high() -> u32 {
    let x = unsafe { _rdtsc() };
    (x >> 32) as u32
}
example::tick_high:
rdtsc
shl rdx, 32
or rax, rdx
shr rax, 32
ret
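For completeness: if asm! turns out to be acceptable after all, a minimal sketch (my addition, not one of the intrinsic-based approaches above, and x86_64-only) can read the halves straight out of EDX:EAX with no shift/OR:

```rust
use std::arch::asm;

// Sketch: read the TSC halves directly from the registers RDTSC writes,
// with no packing instructions at all. x86_64 only.
#[cfg(target_arch = "x86_64")]
fn rdtsc_halves() -> (u32, u32) {
    let lo: u32;
    let hi: u32;
    unsafe {
        // RDTSC puts the low 32 bits in EAX and the high 32 bits in EDX.
        asm!("rdtsc", out("eax") lo, out("edx") hi, options(nomem, nostack));
    }
    (lo, hi)
}

fn main() {
    #[cfg(target_arch = "x86_64")]
    {
        let (lo1, _hi1) = rdtsc_halves();
        let (lo2, _hi2) = rdtsc_halves();
        // Back-to-back reads on one core: the truncated delta is tiny.
        println!("low-half delta = {}", lo2.wrapping_sub(lo1));
    }
}
```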
LLVM has no problem with this optimization normally, so it's definitely something about the canned sequence of operations that the optimizer doesn't see into at the stages of compilation where this kind of optimization gets looked for.
#[inline(never)]
pub fn combine_and_unpack(lo :u32, hi :u32) -> u64 {
let combined = lo as u64 | ((hi as u64)<<32);
return combined >> 32; // LLVM can do this optimization normally
}
example::combine_and_unpack:
mov eax, esi
ret
If you're timing a single short fragment with rdtsc: you at least need lfence to block instruction reordering of the work wrt. rdtsc (or use rdtscp at the end). And what you get is basically the time for the instructions to retire, which is neither throughput nor latency, the two things that are normally useful to measure about a tiny fragment of code.
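A hedged sketch of what that fencing might look like with the intrinsics (assuming an x86_64 target; note that lfence only serializes execution on Intel unconditionally, and on AMD only when an MSR bit is set, which mainstream OSes do set for Spectre mitigation, so treat this as best-effort):

```rust
// Sketch: bracket the TSC read with lfence so the surrounding work can't
// reorder across it. x86_64 only.
#[cfg(target_arch = "x86_64")]
fn fenced_rdtsc() -> u64 {
    use std::arch::x86_64::{_mm_lfence, _rdtsc};
    unsafe {
        _mm_lfence(); // don't let earlier work drain past the read
        let t = _rdtsc();
        _mm_lfence(); // don't let later work start before the read
        t
    }
}

fn main() {
    #[cfg(target_arch = "x86_64")]
    {
        let a = fenced_rdtsc();
        let b = fenced_rdtsc();
        // The TSC counts monotonically upward on a given core.
        println!("delta = {}", b.wrapping_sub(a));
    }
}
```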
It's usually much better to put something in a repeat loop, either with or without a data dependency from the previous output to the next input, and time across many iterations. (Or count core clock cycles with perf stat, especially for code which doesn't do I/O or spend a lot of its time stalled waiting for memory, so it usually runs in the same number of core clocks regardless of clock speed. TSC reference cycles are wall-clock time on CPUs from the past couple of decades, not core clocks.)
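A portable sketch of the repeat-loop idea (work() here is a made-up stand-in for the code under test, and std::hint::black_box keeps the compiler from optimizing the loop away):

```rust
use std::hint::black_box;
use std::time::Instant;

// Made-up stand-in for the code under test.
fn work(x: u64) -> u64 {
    x.wrapping_mul(0x9E37_79B9_7F4A_7C15).rotate_left(13)
}

// Time n dependent iterations (a latency chain: each call consumes the
// previous result, so the calls can't overlap) and return ns per iteration.
fn ns_per_iter(n: u64) -> f64 {
    let mut acc = 1u64;
    let t0 = Instant::now();
    for _ in 0..n {
        acc = work(black_box(acc));
    }
    black_box(acc); // keep the result "used"
    t0.elapsed().as_nanos() as f64 / n as f64
}

fn main() {
    println!("~{:.2} ns per dependent iteration", ns_per_iter(10_000_000));
}
```

Dropping the dependency chain (feeding a fresh input each iteration instead of acc) would measure throughput rather than latency.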
See also this Q&A about measuring addsd throughput or latency. The code in that question had a silly bug, but their intended methodology was also flawed.