Out-of-order exec, and cache access vs. store-forwarding, can take significant power. You can't usefully model power by assigning one number to each opcode and addressing mode. Every cycle the CPU isn't asleep costs significantly more power than an integer ALU execution unit does, so you need to model performance, too.
There are many other factors, too, like uop cache hits reducing energy usage in the front-end. (Legacy decode costs power.) IDK how much it matters whether the ROB or RS are nearly full or nearly empty; I could imagine a nearly-empty RS is cheaper to scan for instructions ready to execute. See the block diagram of a single core in https://www.realworldtech.com/haswell-cpu/6/ and note how much stuff there is apart from the execution units.
"Race to sleep" is a key concept: more efficient code can finish sooner and let the whole core go back into a sleep state.
That doesn't mean it's impossible to say anything, though:
Energy per cycle does increase with IPC (more execution units active, and more logic dispatching uops to execution units, bypass-forwarding results, and writing them back to physical registers).
But between different instructions, there's probably very little difference between different ALU uops like `setcc` vs. `sub` vs. `cmp`. `sub` and `cmp` are literally the same ALU operation, just with `cmp` only writing FLAGS vs. `sub` also writing an integer register. An integer physical register-file entry can hold both an integer reg value and the FLAGS produced by the same instruction, which makes sense as a design choice because most x86 integer instructions write FLAGS.
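As a minimal sketch in C (the function names are just for illustration), both of these compile to the same ALU work with a typical compiler; the only difference is what gets written back:

```c
/* Hypothetical example: a typical compiler emits something like
     cmp edi, esi / setl al        for less_than()  (FLAGS only, then a bool)
     mov eax, edi / sub eax, esi   for diff()       (FLAGS plus an integer reg). */
int less_than(int a, int b) { return a < b; }
int diff(int a, int b)      { return a - b; }
```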
Some scalar integer ALU instructions might use a bit more energy, like `imul` and maybe some other 3-cycle-latency instructions that only run on port 1 (`popcnt`, `pdep`, maybe `lzcnt`/`tzcnt`). IDK how efficient a barrel shifter is vs. an adder-subtractor, but 64-bit shifts might use a little bit more.
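For example (a hypothetical kernel; assumes a Skylake-family core with POPCNT/BMI1/BMI2, e.g. compiled with `-march=skylake`), all of these map to single uops that compete for port 1:

```c
#include <stdint.h>
#include <immintrin.h>   /* _mm_popcnt_u64, _pdep_u64, _tzcnt_u64 */

/* Sketch: on Skylake-family, each of these runs only on port 1 with 3-cycle
   latency, so they plausibly cost a bit more energy per uop than simple
   1-cycle ALU ops like add/sub/and. */
uint64_t port1_heavy(uint64_t x, uint64_t mask)
{
    uint64_t a = (uint64_t)_mm_popcnt_u64(x);  /* popcnt */
    uint64_t b = _pdep_u64(x, mask);           /* pdep   */
    uint64_t c = _tzcnt_u64(x);                /* tzcnt  */
    return a * b + c;                          /* imul r64 */
}
```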
I'd expect differences when you're executing more back-end uops, e.g. a memory-source `add` decodes to a single micro-fused uop for the front-end and ROB, but in the RS it's separate load and add uops for the execution ports. (See: Micro fusion and addressing modes.)
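A sketch of where that shows up from C (hypothetical function; assumes auto-vectorization is disabled so the compiler keeps the scalar memory-source add):

```c
/* With something like gcc -O2 -fno-tree-vectorize, the loop body typically
   becomes  add eax, [rdi + rcx*4] : one micro-fused uop for the front-end
   and ROB, but a separate load uop and ALU uop at the execution ports once
   it's sitting in the RS. */
int sum_array(const int *p, long n)
{
    int total = 0;
    for (long i = 0; i < n; i++)
        total += p[i];
    return total;
}
```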
Different forms of `mov` (load, store, reg-to-reg) are obviously very different, with mov-elimination helping some with power for 32- or 64-bit reg-reg moves.
SIMD is where some instructions really start to cost significantly more energy, especially when the SIMD multipliers are active. The highest-power workload on a Skylake-family CPU like yours is 2x 256-bit FMAs per clock, probably with some cache-hit loads/stores happening, e.g. as memory source operands (as in the Prime95 stress test).
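A minimal sketch of that kind of loop (not Prime95's actual code; assumes AVX2+FMA, e.g. compiled with `-O2 -march=skylake`), using eight independent accumulators to cover the 4-cycle FMA latency at 2 FMAs per clock:

```c
#include <immintrin.h>

/* Sketch of a max-power FMA loop: eight independent dependency chains are
   enough to keep both 256-bit FMA units busy every cycle on Skylake
   (4-cycle latency x 2 FMAs per clock = 8 in flight). */
__m256 fma_power_virus(__m256 a, __m256 b, long iters)
{
    __m256 acc0 = a, acc1 = b, acc2 = a, acc3 = b;
    __m256 acc4 = a, acc5 = b, acc6 = a, acc7 = b;
    for (long i = 0; i < iters; i++) {
        acc0 = _mm256_fmadd_ps(acc0, a, b);
        acc1 = _mm256_fmadd_ps(acc1, a, b);
        acc2 = _mm256_fmadd_ps(acc2, a, b);
        acc3 = _mm256_fmadd_ps(acc3, a, b);
        acc4 = _mm256_fmadd_ps(acc4, a, b);
        acc5 = _mm256_fmadd_ps(acc5, a, b);
        acc6 = _mm256_fmadd_ps(acc6, a, b);
        acc7 = _mm256_fmadd_ps(acc7, a, b);
    }
    /* Combine the chains so the compiler can't optimize any of them away. */
    acc0 = _mm256_add_ps(_mm256_add_ps(acc0, acc1), _mm256_add_ps(acc2, acc3));
    acc4 = _mm256_add_ps(_mm256_add_ps(acc4, acc5), _mm256_add_ps(acc6, acc7));
    return _mm256_add_ps(acc0, acc4);
}
```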
Between different 1-cycle-latency integer ALU instructions, there's probably very little difference, likely not measurable if the same number of instructions per cycle is executing. Of course, anti-optimized debug builds like the one you're showing are full of store/reload bottlenecks that kill IPC.