assembly x86 intel cpu-architecture energy

Energy consumption per x86 instruction?

I am aware of a few tools and interfaces that measure the power consumption of programs, such as PowerTOP and RAPL.

However, I was wondering whether there exists some kind of benchmark, like Agner Fog's instruction tables for CPUs (https://www.agner.org/optimize/instruction_tables.pdf), that measures the energy consumption per instruction.

Let's say I have the following instructions

    movq    %rdi, -8(%rbp)
    movq    %rsi, -16(%rbp)
    movq    -8(%rbp), %rdx
    movq    -16(%rbp), %rax
    cmpq    %rax, %rdx
    setb    %al

and I want to look only at the individual instructions, such as movq, cmpq and setb, to estimate the power consumption of the program. I am on an Intel Core i5-10400 processor, but ideally I am looking for broader benchmarks covering different microarchitectures.
Is this even possible?

Solution

Out-of-order exec, and cache access vs. store-forwarding, may each take significant power, so you can't usefully model power by assigning one number to each opcode and addressing mode. Every cycle the CPU isn't asleep costs significantly more power than an integer ALU execution unit uses, so you need to model performance.
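
A back-of-the-envelope sketch of why run time tends to dominate, in Python. All wattages and times below are made-up placeholders, not measurements of any real CPU. The only idea is that energy over a fixed window is active power times active time, plus sleep power times the remainder.

```python
def window_energy_j(active_w, active_s, sleep_w, window_s):
    """Energy in joules over a fixed window: run the work, then sleep."""
    if active_s > window_s:
        raise ValueError("work does not fit in the window")
    return active_w * active_s + sleep_w * (window_s - active_s)

# Hypothetical numbers, chosen only for illustration.
SLEEP_W = 0.5    # assumed core power in a deep sleep state
WINDOW_S = 1.0   # compare both versions over the same wall-clock second

# Slow version: lower power while running, but runs four times as long.
slow = window_energy_j(active_w=8.0, active_s=0.8, sleep_w=SLEEP_W, window_s=WINDOW_S)
# Fast version: draws more power while running, but finishes much sooner.
fast = window_energy_j(active_w=12.0, active_s=0.2, sleep_w=SLEEP_W, window_s=WINDOW_S)

print(f"slow: {slow:.2f} J, fast: {fast:.2f} J")
```

With these placeholder values the faster version uses less than half the energy despite its higher power draw while running, which is why a per-opcode table without a performance model would mislead.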

There are many other factors, too, like uop cache hits reducing energy usage in the front-end. (Legacy decode costs power.) IDK how much it matters whether the ROB or RS are nearly full or nearly empty; I could imagine a nearly-empty RS is cheaper to scan for instructions ready to execute. See the block diagram of a single core in https://www.realworldtech.com/haswell-cpu/6/ and note how much stuff there is apart from the execution units.

"Race to sleep" is a key concept: more efficient code can finish sooner and let the whole core go back into a sleep state.

Related reading:
"What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?" (a related Q&A)
http://www.lighterra.com/papers/modernmicroprocessors/ is essential reading.
https://en.wikichip.org/wiki/File:Intel_Architecture,_Code_Name_Skylake_Deep_Dive-_A_New_Architecture_to_Manage_Power_Performance_and_Energy_Efficiency.pdf - slides from the IDF2015 talk on power management and efficiency, with lots of details about Skylake frequency/voltage considerations. The relative power of different things is probably pretty similar at different voltage/frequency levels, except that at lower voltage, static power (leakage current) is a larger fraction of total power.

That doesn't mean it's impossible to say anything, though:

Energy per cycle does increase with IPC (more execution units active, and more logic dispatching uops to execution units and bypass-forwarding results to physical registers).
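
One way to picture this, as a toy model rather than anything measured: assume each cycle costs a fixed overhead (clocking, leakage, front-end, scheduler) plus a small amount per uop executed, and assume one uop per instruction. Both constants below are invented for illustration.

```python
def energy_per_cycle_nj(ipc, overhead_nj=1.0, per_uop_nj=0.1):
    """Toy model: fixed per-cycle overhead plus a cost per executed uop."""
    return overhead_nj + per_uop_nj * ipc

def energy_per_instruction_nj(ipc, overhead_nj=1.0, per_uop_nj=0.1):
    """Same model, divided by the instructions executed in that cycle."""
    return energy_per_cycle_nj(ipc, overhead_nj, per_uop_nj) / ipc

for ipc in (0.5, 1.0, 2.0, 4.0):
    print(f"IPC {ipc:>3}: {energy_per_cycle_nj(ipc):.2f} nJ/cycle, "
          f"{energy_per_instruction_nj(ipc):.2f} nJ/instruction")
```

In this model energy per cycle rises with IPC, but energy per instruction falls, because the fixed per-cycle overhead is shared by more instructions.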

But there's probably very little difference between different simple ALU uops like setcc vs. sub vs. cmp. sub and cmp are literally the same ALU operation, just with cmp only writing FLAGS vs. sub also writing an integer register. An integer physical register-file entry can hold both an integer reg value and the FLAGS produced by the same instruction, which makes sense as a design choice because most x86 integer instructions write FLAGS.

Some scalar integer ALU instructions might use a bit more energy, like imul and maybe some other 3-cycle latency instructions that only run on port 1 (popcnt, pdep, maybe lzcnt/tzcnt). IDK how efficient a barrel shifter is vs. an adder-subtractor, but 64-bit shifts might use a little bit more.

I'd expect differences when you're executing more back-end uops, e.g. a memory-source add decodes to a micro-fused uop for the front-end and ROB, but in the RS it's separate load and add uops for execution ports. (See the related Q&A "Micro fusion and addressing modes".)
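
To make the front-end vs. back-end distinction concrete, here is a small tally for the six instructions in the question. The per-instruction uop counts are assumptions for Skylake-family cores, following commonly published instruction tables (a store is one micro-fused uop in the front-end but separate store-address and store-data uops in the back-end). Treat them as illustrative, not authoritative.

```python
# (instruction, fused-domain uops, unfused-domain uops) -- assumed values
UOPS = [
    ("movq %rdi, -8(%rbp)",  1, 2),  # store: store-address + store-data
    ("movq %rsi, -16(%rbp)", 1, 2),  # store
    ("movq -8(%rbp), %rdx",  1, 1),  # pure load
    ("movq -16(%rbp), %rax", 1, 1),  # pure load
    ("cmpq %rax, %rdx",      1, 1),  # simple ALU op, writes only FLAGS
    ("setb %al",             1, 1),  # simple ALU op, reads FLAGS
]

fused = sum(f for _, f, _ in UOPS)      # what the front-end and ROB track
unfused = sum(u for _, _, u in UOPS)    # what the execution ports run

print(f"fused-domain uops: {fused}, unfused-domain uops: {unfused}")
```

Under these assumptions the back-end executes more uops than the front-end issues, and all of the difference comes from the two stores.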

Different forms of mov (load, store, reg-to-reg) are obviously very different, with mov-elimination helping some with power for 32- or 64-bit reg-reg moves.

SIMD is where some instructions really start to cost significantly more energy

This is especially true when the SIMD multipliers are active. The highest-power workload on a Skylake-family CPU like yours is 2x 256-bit FMAs per clock (e.g. the Prime95 stress test), probably with some cache-hit loads/stores happening, e.g. as memory source operands.

Between different 1-cycle-latency integer ALU instructions, there's probably very little difference, likely not measurable if the same number of instructions per cycle is executing. Of course, anti-optimized debug builds like the one you're showing are full of store/reload bottlenecks that kill IPC.
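
If you want to try measuring this yourself on Linux, the RAPL package-energy counter is exposed through the powercap sysfs interface. The sketch below assumes the package domain is at /sys/class/powercap/intel-rapl:0 (the index can differ per machine, and reading energy_uj may require root on recent kernels). The helper names are hypothetical. It measures whole-package energy, including other cores and the uncore, so it can at best compare long-running loops, not single instructions.

```python
from pathlib import Path

# Assumed location of the package-0 RAPL domain; check your own system.
RAPL_DIR = Path("/sys/class/powercap/intel-rapl:0")

def read_uj(name):
    """Read one integer value (microjoules) from the RAPL sysfs directory."""
    return int((RAPL_DIR / name).read_text())

def energy_delta_uj(before, after, max_range):
    """Difference between two counter readings, allowing one wraparound."""
    if after >= before:
        return after - before
    # The counter wrapped past max_range between the two readings.
    return after + (max_range - before)

def measure_uj(workload):
    """Run workload() and return the package energy it spanned, in microjoules."""
    max_range = read_uj("max_energy_range_uj")
    before = read_uj("energy_uj")
    workload()
    after = read_uj("energy_uj")
    return energy_delta_uj(before, after, max_range)
```

A plausible use is to call measure_uj on two loops that differ only in the instruction under test, each running for a second or more on an otherwise idle machine, and compare the results over many repetitions.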