performanceassemblyx86performancecounterintel-pmu

rdpmc: surprising behavior


I'm trying to understand the rdpmc instruction. As such I have the following asm code:

segment .text
global _start

_start:
    xor eax, eax
    mov ebx, 10
.loop:
    dec ebx
    jnz .loop

    mov ecx, 1<<30
    ; calling rdpmc with ecx = (1<<30) gives number of retired instructions
    rdpmc
    ; but only if you do a bizarre incantation: (Why u do dis Intel?)
    shl rdx, 32
    or  rax, rdx

    mov rdi, rax ; return number of instructions retired.
    mov eax, 60
    syscall

(The implementation is a translation of rdpmc_instructions().) I count that this code should execute 2*ebx+3 instructions before hitting the rdpmc instruction, so I expect (in this case) that I should get a return status of 23.

If I run perf stat -e instruction:u ./a.out on this binary, perf tells me that I've executed 30 instructions, which looks about right. But if I execute the binary, I get a return status of 58, or 0, not deterministic.

What have I done wrong here?


Solution

  • The fixed counters don't count all the time, only when software has enabled them. Normally (the kernel side of) perf does this, along with resetting them to zero before starting a program.

    The fixed counters (like the programmable counters) have bits that control whether they count in user, kernel, or user+kernel (i.e. always). I assume Linux's perf kernel code leaves them set to count neither when nothing is using them.

    If you want to use raw RDPMC yourself, you need to either program / enable the counters (by setting the corresponding bits in the IA32_PERF_GLOBAL_CTRL and IA32_FIXED_CTR_CTRL MSRs), or get perf to do it for you by still running your program under perf. e.g. perf stat ./a.out

    If you use perf stat -e instructions:u ./perf ; echo $?, the fixed counter will actually be zeroed before entering your code so you get consistent results from using rdpmc once. Otherwise, e.g. with the default -e instructions (not :u) you don't know the initial value of the counter. You can fix that by taking a delta, reading the counter once at start, then once after your loop.

    The exit status is only 8 bits wide, so this little hack to avoid printf or write() only works for very small counts.

    It also means its pointless to construct the full 64-bit rdpmc result: the high 32 bits of the inputs don't affect the low 8 bits of a sub result because carry propagates only from low to high. In general, unless you expect counts > 2^32, just use the EAX result. Even if the raw 64-bit counter wrapped around during the interval you measured, your subtraction result will still be a correct small integer in a 32-bit register.


    Simplified even more than in your question. Also note indenting the operands so they can stay at a consistent column even for mnemonics longer than 3 letters.

    segment .text
    global _start
    
    _start:
        mov   ecx, 1<<30      ; fixed counter: instructions
        rdpmc
        mov   edi, eax        ; start
    
        mov   edx, 10
    .loop:
        dec   edx
        jnz   .loop
    
        rdpmc               ; ecx = same counter as before
    
        sub   eax, edi       ; end - start
    
        mov   edi, eax
        mov   eax, 231
        syscall             ; sys_exit_group(rdpmc).  sys_exit isn't wrong, but glibc uses exit_group.
    

    Running this under perf stat ./a.out or perf stat -e instructions:u ./a.out, we always get 23 from echo $? (instructions:u shows 30, which is 1 more than the actual number of instructions this program runs, including syscall)

    23 instructions is exactly the number of instructions strictly after the first rdpmc, but including the 2nd rdpmc.

    If we comment out the first rdpmc and run it under perf stat -e instructions:u, we consistently get 26 as the exit status, and 29 from perf. rdpmc is the 24th instruction to be executed. (And RAX starts out initialized to zero because this is a Linux static executable, so the dynamic linker didn't run before _start). I wonder if the sysret in the kernel gets counted as a "user" instruction.

    But with the first rdpmc commented out, running under perf stat -e instructions (not :u) gives arbitrary values as the starting value of the counter isn't fixed. So we're just taking (some arbitrary starting point + 26) mod 256 as the exit status.

    But note that RDPMC is not a serializing instruction, and can execute out of order. In general you maybe need lfence, or (as John McCalpin suggests in the thread you linked) giving ECX a false dependency on the results of instructions you care about. e.g. and ecx, 0 / or ecx, 1<<30 works, because unlike xor-zeroing, and ecx,0 is not dependency-breaking.

    Nothing weird happens in this program because the front-end is the only bottleneck, so all the instructions execute basically as soon as they're issued. Also, the rdpmc is right after the loop, so probably a branch mispredict of the loop-exit branch prevents it from being issued into the OoO back-end before the loop finishes.


    PS for future readers: one way to enable user-space RDPMC on Linux without any custom modules beyond what perf requires is documented in perf_event_open(2):

    echo 2 | sudo tee /sys/devices/cpu/rdpmc    # enable RDPMC always, not just when a perf event is open