I'm trying to understand the rdpmc instruction. As such I have the following asm code:
segment .text
global _start
_start:
    xor     eax, eax
    mov     ebx, 10
.loop:
    dec     ebx
    jnz     .loop
    mov     ecx, 1<<30
    ; calling rdpmc with ecx = (1<<30) gives number of retired instructions
    rdpmc
    ; but only if you do a bizarre incantation: (Why u do dis Intel?)
    shl     rdx, 32
    or      rax, rdx
    mov     rdi, rax        ; return number of instructions retired.
    mov     eax, 60
    syscall
(The implementation is a translation of rdpmc_instructions().)
I count that this code should execute 2*ebx+3 instructions before hitting the rdpmc
instruction, so I expect (in this case) that I should get a return status of 23.
If I run perf stat -e instructions:u ./a.out on this binary, perf tells me that I've executed 30 instructions, which looks about right. But if I execute the binary directly, I get a return status of 58, or 0; the result is not deterministic.
What have I done wrong here?
The fixed counters don't count all the time, only when software has enabled them. Normally (the kernel side of) perf does this, along with resetting them to zero before starting a program.
The fixed counters (like the programmable counters) have bits that control whether they count in user, kernel, or user+kernel (i.e. always). I assume Linux's perf kernel code leaves them set to count neither when nothing is using them.
If you want to use raw RDPMC yourself, you need to either program / enable the counters (by setting the corresponding bits in the IA32_PERF_GLOBAL_CTRL and IA32_FIXED_CTR_CTRL MSRs), or get perf to do it for you by still running your program under perf, e.g. perf stat ./a.out
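As a rough sketch of which bits are involved when programming the counters by hand: the MSR addresses and bit layout below follow Intel's architectural perfmon documentation as I understand it, the helper names are made up for illustration, and the limit of three fixed counters is an assumption that doesn't hold on every CPU. This only computes the values; actually writing them needs ring 0 (wrmsr), and the kernel's perf code may reprogram the MSRs behind your back.

```c
#include <assert.h>
#include <stdint.h>

/* MSR addresses per Intel's architectural perfmon documentation. */
#define MSR_IA32_FIXED_CTR_CTRL   0x38Du
#define MSR_IA32_PERF_GLOBAL_CTRL 0x38Fu

/* Enable bits inside each 4-bit per-counter field of IA32_FIXED_CTR_CTRL. */
#define FIXED_EN_OS  0x1u   /* count while in ring 0 */
#define FIXED_EN_USR 0x2u   /* count while in ring > 0 */

/* Hypothetical helper: IA32_FIXED_CTR_CTRL bits for fixed counter n. */
static uint64_t fixed_ctr_ctrl_bits(unsigned n, unsigned enable_mask)
{
    assert(n < 3);          /* assumption: three fixed counters */
    return (uint64_t)(enable_mask & 0x3u) << (4 * n);
}

/* Hypothetical helper: IA32_PERF_GLOBAL_CTRL bit enabling fixed counter n. */
static uint64_t global_ctrl_fixed_bit(unsigned n)
{
    assert(n < 3);          /* assumption: three fixed counters */
    return 1ULL << (32 + n);
}
```

For user-only counting of retired instructions (fixed counter 0), that would mean OR-ing fixed_ctr_ctrl_bits(0, FIXED_EN_USR) into IA32_FIXED_CTR_CTRL and global_ctrl_fixed_bit(0) into IA32_PERF_GLOBAL_CTRL.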
If you use perf stat -e instructions:u ./a.out ; echo $?, the fixed counter will actually be zeroed before entering your code, so you get consistent results from using rdpmc once. Otherwise, e.g. with the default -e instructions (not :u), you don't know the initial value of the counter. You can fix that by taking a delta: read the counter once at the start, then once more after your loop.
The exit status is only 8 bits wide, so this little hack to avoid printf or write() only works for very small counts.
It also means it's pointless to construct the full 64-bit rdpmc result: the high 32 bits of the inputs don't affect the low 8 bits of a sub result, because carry propagates only from low to high. In general, unless you expect counts > 2^32, just use the EAX result. Even if the raw 64-bit counter wrapped around its low 32 bits during the interval you measured, your subtraction result will still be a correct small integer in a 32-bit register.
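A small sketch in C of that arithmetic, using made-up counter values rather than real RDPMC reads (the function names are just for illustration):

```c
#include <stdint.h>

/* Subtract using only the low 32 bits of each read, i.e. what RDPMC
 * leaves in EAX. Unsigned arithmetic wraps modulo 2^32. */
static uint32_t delta32(uint64_t start, uint64_t end)
{
    return (uint32_t)end - (uint32_t)start;
}

/* What the shell sees from echo $?: only the low 8 bits of the count. */
static uint8_t exit_status(uint64_t count)
{
    return (uint8_t)count;
}
```

With a hypothetical start value of 0xFFFFFFF0 and an end value of 0x100000007, the low 32 bits wrapped in between, yet delta32() still returns 23, and the low 8 bits match those of the full 64-bit difference.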
Here is a version simplified even further than the code in your question. Also note the indenting of the operands, so they can stay at a consistent column even for mnemonics longer than 3 letters.
segment .text
global _start
_start:
    mov     ecx, 1<<30      ; fixed counter: instructions
    rdpmc
    mov     edi, eax        ; start
    mov     edx, 10
.loop:
    dec     edx
    jnz     .loop
    rdpmc                   ; ecx = same counter as before
    sub     eax, edi        ; end - start
    mov     edi, eax
    mov     eax, 231
    syscall                 ; sys_exit_group(rdpmc). sys_exit isn't wrong, but glibc uses exit_group.
Running this under perf stat ./a.out or perf stat -e instructions:u ./a.out, we always get 23 from echo $? (instructions:u shows 30, which is 1 more than the actual number of instructions this program runs, including syscall).
23 instructions is exactly the number of instructions strictly after the first rdpmc, up to and including the 2nd rdpmc.
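To make that count explicit, here is a tiny model in C of the instruction tally for the listing above. The function names are mine, and n stands for the loop count loaded into EDX (10 in the listing):

```c
#include <stdint.h>

/* Instructions counted between the two rdpmc reads:
 * mov edi,eax + mov edx,n + n * (dec, jnz) + the 2nd rdpmc itself. */
static uint64_t counted_between_reads(uint64_t n)
{
    return 2 + 2 * n + 1;
}

/* Whole program: 4 instructions before the loop (mov ecx, rdpmc,
 * mov edi, mov edx), 2n in the loop, 5 after it (rdpmc, sub, mov,
 * mov, syscall). */
static uint64_t whole_program(uint64_t n)
{
    return 4 + 2 * n + 5;
}
```

For n = 10 this gives 23 and 29. For n = 127 the delta would be 257, which the 8-bit exit status would show as 1.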
If we comment out the first rdpmc and run it under perf stat -e instructions:u, we consistently get 26 as the exit status, and 29 from perf. rdpmc is the 24th instruction to be executed. (And RAX starts out initialized to zero because this is a Linux static executable, so the dynamic linker didn't run before _start.) I wonder if the sysret in the kernel gets counted as a "user" instruction.
But with the first rdpmc commented out, running under perf stat -e instructions (not :u) gives arbitrary values, as the starting value of the counter isn't fixed. So we're just taking (some arbitrary starting point + 26) mod 256 as the exit status.
But note that RDPMC is not a serializing instruction, and can execute out of order. In general you may need lfence, or (as John McCalpin suggests in the thread you linked) to give ECX a false dependency on the results of the instructions you care about. e.g. and ecx, 0 / or ecx, 1<<30 works, because unlike xor-zeroing, and ecx,0 is not dependency-breaking.
Nothing weird happens in this program because the front-end is the only bottleneck, so all the instructions execute basically as soon as they're issued. Also, the rdpmc is right after the loop, so probably a branch mispredict of the loop-exit branch prevents it from being issued into the OoO back-end before the loop finishes.
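If you end up reading the counter from C instead of asm, the lfence option might look roughly like the sketch below. This assumes GNU C inline asm on x86; rdpmc_fenced and fixed_counter_selector are hypothetical names, and the limit of three fixed counters is an assumption. Calling rdpmc_fenced faults (SIGSEGV on Linux) unless user-space RDPMC is enabled.

```c
#include <assert.h>
#include <stdint.h>

/* ECX selector for fixed-function counter n: bit 30 set, index in the
 * low bits. n = 0 is the instructions-retired counter used above. */
static uint32_t fixed_counter_selector(unsigned n)
{
    assert(n < 3);          /* assumption: three fixed counters */
    return (1u << 30) | n;
}

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
/* Hypothetical wrapper: fence on both sides so the read is ordered
 * with respect to the surrounding instructions. */
static inline uint64_t rdpmc_fenced(uint32_t selector)
{
    uint32_t lo, hi;
    __asm__ volatile("lfence\n\t"
                     "rdpmc\n\t"
                     "lfence"
                     : "=a"(lo), "=d"(hi)
                     : "c"(selector)
                     : "memory");
    return ((uint64_t)hi << 32) | lo;
}
#endif
```

The fences add overhead of their own, so for very short regions the delta includes some of that cost.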
PS for future readers: one way to enable user-space RDPMC on Linux without any custom modules beyond what perf requires is documented in perf_event_open(2):
echo 2 | sudo tee /sys/devices/cpu/rdpmc # enable RDPMC always, not just when a perf event is open