Energy consumption per x86 instruction?

I am aware of a few tools that measure the power consumption of programs, such as PowerTOP, RAPL, and the like.

However, I was wondering whether there exists some kind of benchmark, similar to Agner Fog's instruction tables for CPUs (https://www.agner.org/optimize/instruction_tables.pdf), that measures the energy consumption per instruction.

Let's say I have the following instructions:

    movq    %rdi, -8(%rbp)      # spill first argument to the stack
    movq    %rsi, -16(%rbp)     # spill second argument
    movq    -8(%rbp), %rdx      # reload first argument
    movq    -16(%rbp), %rax     # reload second argument
    cmpq    %rax, %rdx          # compare: first - second
    setb    %al                 # al = (first < second), unsigned

and I only wish to look at instructions such as movq, cmpq, and setb to estimate the power consumption of the program. I am on an Intel i5-10400 processor, but I may also be interested in broader benchmarks covering different microarchitectures.
Is this even possible?
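
For context, RAPL (mentioned above) can be read directly from sysfs on Linux, so whole-program or per-region energy is already measurable. Below is a minimal sketch in C, assuming the intel_rapl powercap driver is loaded; the sysfs path and the wrap-around handling are illustrative assumptions, not something verified on this exact machine:

    /* rapl_region.c - read the RAPL package-energy counter around a code region.
     * Assumes Linux with the intel_rapl powercap driver; paths can differ per
     * system, and reading energy_uj may require root. */
    #include <stdio.h>

    static unsigned long long read_uj(const char *path) {
        FILE *f = fopen(path, "r");
        unsigned long long v = 0;
        if (f) {
            if (fscanf(f, "%llu", &v) != 1)
                v = 0;
            fclose(f);
        }
        return v;
    }

    int main(void) {
        const char *energy = "/sys/class/powercap/intel-rapl:0/energy_uj";
        const char *range  = "/sys/class/powercap/intel-rapl:0/max_energy_range_uj";

        unsigned long long wrap   = read_uj(range);   /* counter wraps at this value */
        unsigned long long before = read_uj(energy);

        /* ... run the code under test here ... */

        unsigned long long after = read_uj(energy);
        unsigned long long delta = (after >= before) ? after - before
                                                     : after + wrap - before;
        printf("package energy: %llu uJ\n", delta);
        return 0;
    }

The counter only updates on the order of a millisecond, so this is only meaningful for regions that run long enough, and it measures the whole package rather than individual instructions.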

Forespent asked 12/9, 2022 at 16:00 Comment(7)
Isn't the i5-10400 a desktop processor? Wouldn't it make way more sense to optimize for power consumption on laptop or other mobile processors? – Fatten
@JosephSible-ReinstateMonica That is a good point. Yes, however, I am trying to estimate the power consumption of programs on a desktop (a theoretical one). – Forespent
@JosephSible-ReinstateMonica: Intel uses the same core microarchitecture across ULV-mobile / server / desktop. I'd expect power as a fraction of max TDP to be similar across different frequency/voltage operating points for the same workload, ignoring leakage current (static power, which doesn't depend on frequency). – Cinthiacintron
The real problem is that out-of-order exec and cache access vs. store-forwarding may take significant power. You can't usefully model power by assigning one number to each opcode and addressing mode. Every cycle the CPU isn't asleep costs power, too, so you need to model performance, as well as uop-cache hits, which reduce energy usage in the front-end. (Legacy decode costs power.) IDK how much it matters whether the ROB or RS are nearly full or nearly empty; I could imagine a nearly-empty RS is cheaper to scan for instructions ready to execute. – Cinthiacintron
Related: What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand? and realworldtech.com/sandy-bridge re: performance. Also, lighterra.com/papers/modernmicroprocessors is essential reading, as is the concept of "race to sleep" (more efficient code can finish sooner and get back to sleep). – Cinthiacintron
@PeterCordes Thank you for your input. I fully agree -- analyzing opcodes by assigning some value to each is nowhere near a realistic model of the problem at hand. However, I wish to attempt this and combine it with benchmarks. I plan to use perf to measure cache misses, TLB misses, and the like, and mount a device which can accurately measure the power drawn from an edge device. However, I wish to get some sense of the power needed to run the program before actually running it. – Forespent
Power per cycle does increase with IPC, but especially when SIMD multipliers are active. The highest-power workload on Skylake-family CPUs is 2x 256-bit FMAs per clock, probably with some cache-hit loads/stores happening, e.g. as memory source operands (e.g. the Prime95 stress test). Between different 1-cycle-latency integer ALU instructions, there's probably very little difference, likely not measurable if the same number of instructions per cycle are executing. Of course, anti-optimized debug builds like you're showing are full of store/reload bottlenecks that kill IPC. – Cinthiacintron

Out-of-order exec and cache access vs. store-forwarding may take significant power. You can't usefully model power by assigning 1 number to each opcode and addressing mode. Every cycle the CPU isn't asleep costs significantly more power than an integer ALU execution unit, so you need to model performance.

There are many other factors, too, like uop cache hits reducing energy usage in the front-end. (Legacy decode costs power.) IDK how much it matters whether the ROB or RS are nearly full or nearly empty; I could imagine a nearly-empty RS is cheaper to scan for instructions ready to execute. See the block diagram of a single core in https://www.realworldtech.com/haswell-cpu/6/ and note how much stuff there is apart from the execution units.

"Race to sleep" is a key concept: more efficient code can finish sooner and let the whole core go back into a sleep state.

That doesn't mean it's impossible to say anything, though:

Energy per cycle does increase with IPC (more execution units active, and more logic dispatching uops to execution units and bypass-forwarding results to physical registers).

But between different simple ALU uops like setcc vs. sub vs. cmp, there's probably very little difference. sub and cmp are literally the same ALU operation, just with cmp only writing FLAGS vs. sub also writing an integer register. An integer physical register-file entry can hold both an integer reg value and the FLAGS produced by the same instruction, which makes sense as a design choice because most x86 integer instructions write FLAGS.

Some scalar integer ALU instructions might use a bit more energy, like imul and maybe some other 3-cycle latency instructions that only run on port 1 (popcnt, pdep, maybe lzcnt/tzcnt). IDK how efficient a barrel shifter is vs. an adder-subtractor, but 64-bit shifts might use a little bit more.

I'd expect differences when you're executing more back-end uops, e.g. a memory-source add decodes to a micro-fused uop for the front-end and ROB, but in the RS it's separate load and add uops for execution ports. (Micro fusion and addressing modes)
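
For illustration, a reduction loop is the kind of code where compilers commonly emit a memory-source add; the function name and the asm in the comment are only a sketch, since exact code generation depends on the compiler and options:

    /* Sum an array. Built with something like gcc -O2 (scalar, no
     * auto-vectorization), the hot loop often looks roughly like:
     *     .loop:
     *         addq  (%rdi,%rcx,8), %rax   # load+add: one fused uop for the
     *                                     # front-end/ROB, separate load and
     *                                     # add uops at the execution ports
     *         incq  %rcx
     *         cmpq  %rsi, %rcx
     *         jne   .loop
     * (exact output varies by compiler and version) */
    long sum_array(const long *a, long n) {
        long sum = 0;
        for (long i = 0; i < n; i++)
            sum += a[i];
        return sum;
    }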

Different forms of mov (load, store, reg-to-reg) are obviously very different, with mov-elimination helping somewhat with power for 32- or 64-bit reg-reg moves.

SIMD is where some instructions really start to cost significantly more energy.

That's especially true when the SIMD multipliers are active. The highest-power workload on a Skylake-family CPU like yours is 2x 256-bit FMAs per clock, probably with some cache-hit loads/stores happening, e.g. as memory source operands (e.g. the Prime95 stress test).
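
As a concrete (hedged) sketch of that kind of workload, here is an ALU-only FMA loop using AVX2+FMA intrinsics; the accumulator count and iteration count are assumptions chosen for illustration, not a tuned stress test:

    /* fma_power.c - a sketch of the kind of FMA-heavy loop that pushes power
     * toward the maximum on Skylake-family cores (2x 256-bit FMAs per clock).
     * Assumed build: gcc -O2 -mavx2 -mfma fma_power.c
     * Eight independent accumulators cover ~4-cycle FMA latency at 2 per clock;
     * both numbers are assumptions for illustration. */
    #include <immintrin.h>
    #include <stdio.h>

    int main(void) {
        __m256 a0 = _mm256_set1_ps(0.1f), a1 = _mm256_set1_ps(0.2f),
               a2 = _mm256_set1_ps(0.3f), a3 = _mm256_set1_ps(0.4f),
               a4 = _mm256_set1_ps(0.5f), a5 = _mm256_set1_ps(0.6f),
               a6 = _mm256_set1_ps(0.7f), a7 = _mm256_set1_ps(0.8f);
        const __m256 m = _mm256_set1_ps(1.0000001f);
        const __m256 c = _mm256_set1_ps(1e-8f);

        for (long i = 0; i < 500000000L; i++) {
            a0 = _mm256_fmadd_ps(a0, m, c);
            a1 = _mm256_fmadd_ps(a1, m, c);
            a2 = _mm256_fmadd_ps(a2, m, c);
            a3 = _mm256_fmadd_ps(a3, m, c);
            a4 = _mm256_fmadd_ps(a4, m, c);
            a5 = _mm256_fmadd_ps(a5, m, c);
            a6 = _mm256_fmadd_ps(a6, m, c);
            a7 = _mm256_fmadd_ps(a7, m, c);
        }

        /* Keep the results live so the loop isn't optimized away. */
        __m256 s = _mm256_add_ps(_mm256_add_ps(a0, a1), _mm256_add_ps(a2, a3));
        s = _mm256_add_ps(s, _mm256_add_ps(_mm256_add_ps(a4, a5),
                                           _mm256_add_ps(a6, a7)));
        float out[8];
        _mm256_storeu_ps(out, s);
        printf("%f\n", out[0]);
        return 0;
    }

Measuring a loop like this under RAPL (or perf's power/energy-pkg/ event) against a scalar-integer loop at similar IPC should show the difference the wide multipliers make.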

Between different 1-cycle-latency integer ALU instructions, probably very little difference, likely not measurable if the same number of instructions per cycle are executing. Of course, anti-optimized debug builds like you're showing are full of store/reload bottlenecks that kill IPC.
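
For instance, the asm in the question looks like an -O0-style build of a simple unsigned compare; below is a hedged sketch of what the source might be and what an optimized build typically reduces it to (the function name is made up, and exact output is compiler-dependent):

    /* The question's snippet stores both arguments to the stack and reloads them
     * before comparing -- typical unoptimized code. At -O2 the same function
     * usually shrinks to just a compare and setb, with no stores/reloads:
     *     cmpq  %rsi, %rdi
     *     setb  %al
     *     ret
     * (exact code generation varies by compiler) */
    _Bool is_below(unsigned long a, unsigned long b) {
        return a < b;   /* unsigned "below" -> setb */
    }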

Cinthiacintron answered 12/9, 2022 at 18:25 Comment(0)
