I want to use RDTSC in Rust to benchmark how many ticks my function takes.
There's a built-in std::arch::x86_64::_rdtsc, but alas it always compiles to:
rdtsc
shl rdx, 32
or rax, rdx
Since the output always goes into the EDX:EAX (or RDX:EAX) register pair, even on x64 platforms, the intrinsic glues the two halves together. I don't need that extra work, as it affects the measurement. I want a plain RDTSC.
I tried to hint to the compiler that keeping the halves separate is fine for me, hoping the optimizer would understand what I want:
fn tick_count() -> (u64, u64) {
    let x = unsafe { _rdtsc() };
    (x & 0xFFFFFFFF, x >> 32)
}
Alas, it didn't work:
rdtsc
shl rdx, 32
or rdx, rax
mov eax, edx
shr rdx, 32
ret
Basically, it just plays ping-pong: first it shifts and joins the halves, then separates them again with the exact inverse operations.

Why does this happen, and is there any way to write this in plain Rust without asm!?
Try it out on Godbolt.
First of all, as @user555045 commented, the very short intervals where this overhead would matter are very hard to microbenchmark. If this overhead is a problem, your measurements probably won't mean much even if you do avoid it, unless you know exactly what you're doing and what you're trying to learn about the CPU you're running on. It's not usually a useful way to answer a "which is faster" question.
If you can use just the low 32 bits of _rdtsc (e.g. for measuring intervals much less than 2^32 reference cycles), not using the high half of the result at all does let the shift/or optimize away. If end - start fits in 32 bits, it doesn't change the result if end and start are truncated to 32 bits first. (Carry/borrow only propagates from low to high.)
Use wrapping subtraction on the u32 values, end.wrapping_sub(start), since the full TSC could have crossed a 32-bit boundary between start and end.
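A concrete sketch of the truncation argument (the TSC values here are made-up numbers, not real rdtsc output):

```rust
// Sketch: interval measurement using only the low 32 bits of the TSC.
// end.wrapping_sub(start) gives the right delta even when the full 64-bit
// counter crossed a 2^32 boundary between the two samples, as long as the
// true interval fits in 32 bits (carry only propagates from low to high).
fn interval_low32(start: u32, end: u32) -> u32 {
    end.wrapping_sub(start)
}

fn main() {
    // Hypothetical full 64-bit TSC samples straddling a 2^32 boundary:
    let start_tsc: u64 = 0x0000_0001_FFFF_FF00;
    let end_tsc: u64 = 0x0000_0002_0000_0100;
    let delta = interval_low32(start_tsc as u32, end_tsc as u32);
    assert_eq!(delta as u64, end_tsc - start_tsc); // both are 0x200 = 512
    println!("{delta}");
}
```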
It seems _rdtsc() uses a fixed recipe to produce a single u64 which LLVM isn't able to optimize, except to remove it entirely if the result is unused. It's as if the shift/OR were a non-volatile asm statement which the optimizer doesn't understand, except as a black box whose result is either used or not. The LLVM-IR for it is %x = tail call noundef i64 @llvm.x86.rdtsc(), producing a single 64-bit integer return value.
use std::arch::x86_64::_rdtsc;

pub fn tick_low() -> u32 {
    let x = unsafe { _rdtsc() };
    x as u32
}
(Godbolt, for this and the later examples)
# rustc 1.89 -O
example::tick_low:
rdtsc
ret
If you ask for it zero-extended to 64-bit with x as u32 as u64 or x & 0xFFFFFFFF, the compiler unfortunately adds a mov eax, eax instruction to zero-extend EAX into RAX, even though it is already zero-extended from rdtsc writing EAX.
Using u32 halves instead of u64 in your tick_count function avoids the mov eax, edx you got, but otherwise it's the same mess:
pub fn tick_count() -> (u32, u32) {
    let x = unsafe { _rdtsc() };
    (x as u32, (x >> 32) as u32) // low 32 in EAX, high 32 in EDX
}
example::tick_count:
rdtsc
shl rdx, 32
or rdx, rax
# mov eax, edx for the low half is *not* present, unlike with u64
shr rdx, 32
ret
There doesn't seem to be much hope for doing anything with the high half without having LLVM pack and unpack it, instead of emitting just rdtsc / mov eax, edx:
pub fn tick_high() -> u32 {
    let x = unsafe { _rdtsc() };
    (x >> 32) as u32
}
example::tick_high:
rdtsc
shl rdx, 32
or rax, rdx
shr rax, 32
ret
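For completeness: if asm! turns out to be acceptable after all, a minimal sketch (my addition, not one of the intrinsic-based approaches above, and x86_64-only) can read the halves straight out of EDX:EAX with no shift/OR:

```rust
use std::arch::asm;

// Sketch: read the TSC halves directly from the registers RDTSC writes,
// with no packing instructions at all. x86_64 only.
#[cfg(target_arch = "x86_64")]
fn rdtsc_halves() -> (u32, u32) {
    let lo: u32;
    let hi: u32;
    unsafe {
        // RDTSC puts the low 32 bits in EAX and the high 32 bits in EDX.
        asm!("rdtsc", out("eax") lo, out("edx") hi, options(nomem, nostack));
    }
    (lo, hi)
}

fn main() {
    #[cfg(target_arch = "x86_64")]
    {
        let (lo1, _hi1) = rdtsc_halves();
        let (lo2, _hi2) = rdtsc_halves();
        // Back-to-back reads on one core: the truncated delta is tiny.
        println!("low-half delta = {}", lo2.wrapping_sub(lo1));
    }
}
```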
LLVM has no problem with this optimization normally, so it's definitely something about the canned sequence of operations that the optimizer doesn't see into at the stages of compilation where this kind of optimization gets looked for.
#[inline(never)]
pub fn combine_and_unpack(lo :u32, hi :u32) -> u64 {
let combined = lo as u64 | ((hi as u64)<<32);
return combined >> 32; // LLVM can do this optimization normally
}
example::combine_and_unpack:
mov eax, esi
ret
If you're timing a single short fragment with rdtsc: you at least need lfence to block instruction reordering of the work wrt. rdtsc (or use rdtscp at the end). And what you get is basically the time for the instructions to retire, which is neither throughput nor latency, the two things that are normally useful to measure about a tiny fragment of code.
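A hedged sketch of what that fencing might look like with the intrinsics (assuming an x86_64 target; note that lfence only serializes execution on Intel unconditionally, and on AMD only when an MSR bit is set, which mainstream OSes do set for Spectre mitigation, so treat this as best-effort):

```rust
// Sketch: bracket the TSC read with lfence so the surrounding work can't
// reorder across it. x86_64 only.
#[cfg(target_arch = "x86_64")]
fn fenced_rdtsc() -> u64 {
    use std::arch::x86_64::{_mm_lfence, _rdtsc};
    unsafe {
        _mm_lfence(); // don't let earlier work drain past the read
        let t = _rdtsc();
        _mm_lfence(); // don't let later work start before the read
        t
    }
}

fn main() {
    #[cfg(target_arch = "x86_64")]
    {
        let a = fenced_rdtsc();
        let b = fenced_rdtsc();
        // The TSC counts monotonically upward on a given core.
        println!("delta = {}", b.wrapping_sub(a));
    }
}
```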
It's usually much better to put something in a repeat loop, either with or without a data dependency from the previous output to the next input, and time across many iterations. (Or count core clock cycles with perf stat, especially for code which doesn't do I/O or spend a lot of its time stalled waiting for memory, so it usually runs in the same number of core clocks regardless of clock speed. TSC reference cycles are wall-clock time on CPUs from the past couple of decades, not core clocks.)
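A portable sketch of the repeat-loop idea (work() here is a made-up stand-in for the code under test, and std::hint::black_box keeps the compiler from optimizing the loop away):

```rust
use std::hint::black_box;
use std::time::Instant;

// Made-up stand-in for the code under test.
fn work(x: u64) -> u64 {
    x.wrapping_mul(0x9E37_79B9_7F4A_7C15).rotate_left(13)
}

// Time n dependent iterations (a latency chain: each call consumes the
// previous result, so the calls can't overlap) and return ns per iteration.
fn ns_per_iter(n: u64) -> f64 {
    let mut acc = 1u64;
    let t0 = Instant::now();
    for _ in 0..n {
        acc = work(black_box(acc));
    }
    black_box(acc); // keep the result "used"
    t0.elapsed().as_nanos() as f64 / n as f64
}

fn main() {
    println!("~{:.2} ns per dependent iteration", ns_per_iter(10_000_000));
}
```

Dropping the dependency chain (feeding a fresh input each iteration instead of acc) would measure throughput rather than latency.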
See also this Q&A about measuring addsd throughput or latency. The code in that question had a silly bug, but their intended methodology was also flawed.