I always thought that instructions for killing dependencies, e.g. xor reg, reg, do not have to be executed and are ready for retirement as soon as the Renamer moves them to the Re-order Buffer.
I just measured the number of micro-operations getting into the RS with the event uops_issued.any and was surprised by the number. All the xor reg, reg instructions used for killing dependencies were counted by the perf event.
Why not just put dependency-killing instructions into the ROB, without needlessly disturbing the Reservation Station?
They don't enter the RS, but AFAIK there is no unfused-domain front-end counter. There is a workaround, though: if you don't have branch mispredicts that cause uops to be discarded from the RS after issue but before execution, it doesn't matter where in the pipeline you count.
To count RS uops, use uops_executed.thread, which counts uops that have successfully(?) executed. I haven't checked whether replays of eagerly-dispatched uops count for uops_executed on every attempted dispatch, or only for uops_dispatched_port.port_[0..7].
See Can x86's MOV really be "free"? Why can't I reproduce this at all? for an example of using perf to sort out eliminated vs. non-eliminated uops, and front-end fused domain vs. back-end unfused domain.
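As a rough illustration of that kind of measurement, here is a minimal sketch of a test loop made of xor-zeroing idioms. The function name run_zeroing_loop is hypothetical, the inline asm uses GCC/Clang syntax, and the asm is only emitted when compiling for x86. The loop overhead itself (counter increment, compare, branch) will also show up in both counters, so only the difference between the two events is meaningful.

```c
#include <stdint.h>

/* Hypothetical helper: runs `iters` iterations, each containing four
 * xor-zeroing idioms. Returns the number of iterations completed.
 * On non-x86 targets the asm is skipped and only the loop remains. */
uint64_t run_zeroing_loop(uint64_t iters)
{
    uint64_t done = 0;
    for (uint64_t i = 0; i < iters; i++) {
#if defined(__x86_64__) || defined(__i386__)
        /* Each xor same,same is a dependency-breaking zeroing idiom. */
        __asm__ volatile(
            "xorl %%eax, %%eax\n\t"
            "xorl %%ecx, %%ecx\n\t"
            "xorl %%edx, %%edx\n\t"
            "xorl %%eax, %%eax\n\t"
            : /* no outputs */
            : /* no inputs */
            : "eax", "ecx", "edx", "cc");
#endif
        done++;
    }
    return done;
}
```

Calling this from a small main with a large iteration count and running the binary under something like perf stat -e uops_issued.any,uops_executed.thread should, on a Sandybridge-family CPU, show the zeroing idioms contributing to the first event but not the second, if the description in this answer holds.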
I just measure the number of microoperations getting into the RS with the event uops_issued.any
That event counts fused-domain uops issued into the ROB. It counts 1 for micro-fused uops like add eax, [rdi] or a mov al, [rsi] that merges into the low byte of RAX (even though those count as 2 uops_executed), and it counts 1 for eliminated uops like mov reg,reg and xor same,same (0 uops_executed).
perf list does misleadingly describe it like this (on Skylake), so the confusion is understandable:
uops_issued.any
[Uops that Resource Allocation Table (RAT) issues to Reservation Station (RS)]
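To make the fused-domain vs. unfused-domain accounting concrete, here is a small sketch that just adds up the per-instruction counts stated in this answer. The struct, the table example_mix, and the function names are illustrative assumptions, not anything perf provides, and the numbers assume that mov-elimination succeeds and that the CPU is Sandybridge-family.

```c
#include <stddef.h>

/* Per-instruction uop counts as described in this answer. */
struct uop_cost {
    const char *insn;
    unsigned issued;    /* fused domain: what uops_issued.any counts */
    unsigned executed;  /* unfused domain: what uops_executed.thread counts */
};

/* Illustrative instruction mix (hypothetical). */
const struct uop_cost example_mix[] = {
    { "add eax, [rdi]", 1, 2 },  /* micro-fused load + add */
    { "mov al, [rsi]",  1, 2 },  /* micro-fused load + merge into RAX */
    { "mov ecx, edx",   1, 0 },  /* eliminated mov reg,reg */
    { "xor eax, eax",   1, 0 },  /* zeroing idiom, no back-end uop */
    { "xor edx, edx",   1, 0 },  /* zeroing idiom, no back-end uop */
};

/* Sum of fused-domain uops for a mix of n instructions. */
unsigned total_issued(const struct uop_cost *mix, size_t n)
{
    unsigned sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += mix[i].issued;
    return sum;
}

/* Sum of unfused-domain executed uops for a mix of n instructions. */
unsigned total_executed(const struct uop_cost *mix, size_t n)
{
    unsigned sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += mix[i].executed;
    return sum;
}
```

For this five-instruction mix the model gives 5 issued uops and 4 executed uops: the eliminated instructions still occupy an issue slot each, while the micro-fused ones expand to two uops in the back end.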
I always thought that instructions for killing dependencies, e.g xor reg, reg do not have to be executed and are ready for retirement as soon as the Renamer moves them to the Re-order Buffer.
Yes, that's what I think, too, that they enter the ROB marked as already executed, and don't touch the RS.
Only Sandybridge-family does this (including Skylake/IceLake); other microarchitectures (like Zen AFAIK) do need a back-end uop to actually write the zero. See What is the best way to set a register to zero in x86 assembly: xor, mov or and?
AMD does do mov-elimination for vector moves (since Bulldozer) and for GP-integer moves (since Zen), so those are presumably handled like Intel xor-zeroing or mov.
One guess at the mechanism on Sandybridge is that xor-zeroing (of GP-integer or XMM/YMM registers) renames onto an internal zero register. http://blog.stuffedcow.net/2013/05/measuring-rob-capacity/ tested this and found that xor-zeroing instructions don't consume an extra PRF entry for writing the destination register.