Do SSE instructions consume more power/energy?

Very simple question, probably difficult answer:

Does using SSE instructions (for example, for parallel sum/min/max/average operations) consume more power than using ordinary scalar instructions (e.g. a single scalar sum)?

I couldn't find any information on this, for example on Wikipedia.

The only hint of an answer I could find is here, but it's a bit generic and doesn't reference any published material.
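
To make the comparison concrete, here is roughly the kind of code I have in mind (a minimal sketch using SSE intrinsics; the function names are just for illustration, and I'm assuming n is a multiple of 4 and the data is 16-byte aligned):

    #include <xmmintrin.h>  /* SSE intrinsics */

    /* Plain scalar sum: one addition per loop iteration. */
    float sum_scalar(const float *a, int n)
    {
        float s = 0.0f;
        for (int i = 0; i < n; i++)
            s += a[i];
        return s;
    }

    /* SSE sum: four additions per instruction. */
    float sum_sse(const float *a, int n)
    {
        __m128 acc = _mm_setzero_ps();
        for (int i = 0; i < n; i += 4)
            acc = _mm_add_ps(acc, _mm_load_ps(a + i));

        /* Reduce the four partial sums to one scalar result. */
        float tmp[4];
        _mm_storeu_ps(tmp, acc);
        return tmp[0] + tmp[1] + tmp[2] + tmp[3];
    }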

Cyclostome asked 1/11, 2013 at 8:5 Comment(4)
I think if you manage to keep the CPU maximally busy, it needs more power. But it's trickier than just SSE vs. scalar: there are several SSE units you can keep busy at the same time, your code shouldn't wait too much on memory access, etc. And if you do the same amount of work with scalar vs. SIMD instructions, the SIMD instructions will probably be faster, reducing the total energy consumption.Crowns
It really depends on how optimised the code is - heavily optimised code keeps more execution units busy and spends less time waiting on e.g. load stalls, and therefore takes more power. I've seen big increases in power consumption (and CPU temperature!) when running heavily optimised SIMD code relative to normal scalar code.Incommunicado
@CodesInChaos, I doubt the same amount of work is done for scalar vs. SIMD. SIMD uses more transistors to do the work faster in parallel. It's possible that being four times faster uses more than four times as much power. It's also possible that SIMD shares some logic for parallel calculations and is more optimized than, say, x87 logic, so it uses less power. But that's a separate point.Yazbak
I'm surprised none of the answers mentioned the concept of race to sleep: finish the computation quickly so you can spend more time in low-power idle. Low-power idle saves more power than the difference between code with lots of branch-mispredicts and cache-misses vs. code that saturates the 256b FMA units. The more transistors switch every clock cycle, the more power your CPU uses. It's such a big deal that Haswell Xeons have a lower max clock speed when AVX code is active.Desiccator

I actually did a study on this a few years ago. The answer depends on what exactly your question is:

In today's processors, power consumption is determined not so much by the type of instruction (scalar vs. SIMD) as by everything else, such as:

  1. Memory/cache
  2. Instruction decoding
  3. Out-of-order execution (OOE), register file
  4. Lots of other things

So if the question is:

All other things being equal: does a SIMD instruction consume more power than a scalar instruction?

For this, I dare to say yes.

One of my graduate school projects eventually became this answer: A side-by-side comparison of SSE2 (2-way SIMD) and AVX (4-way SIMD) did in fact show that AVX had a noticeably higher power consumption and higher processor temperatures. (I don't remember the exact numbers though.)

This is because the code was identical between the SSE and AVX versions; only the width of the instructions differed, so the AVX version did double the work per instruction.
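
To give an idea of what "identical code, only wider" means, the kernels looked something like the following (a simplified sketch of the idea, not the actual benchmark code; a plain add loop is used just for brevity, and the arrays are assumed to be suitably aligned with n a multiple of 4):

    #include <emmintrin.h>  /* SSE2 */
    #include <immintrin.h>  /* AVX  */

    /* SSE2 version: 2 doubles per instruction. */
    void add_sse2(double *c, const double *a, const double *b, int n)
    {
        for (int i = 0; i < n; i += 2) {
            __m128d va = _mm_load_pd(a + i);
            __m128d vb = _mm_load_pd(b + i);
            _mm_store_pd(c + i, _mm_add_pd(va, vb));
        }
    }

    /* AVX version: same structure, but 4 doubles per instruction. */
    void add_avx(double *c, const double *a, const double *b, int n)
    {
        for (int i = 0; i < n; i += 4) {
            __m256d va = _mm256_load_pd(a + i);
            __m256d vb = _mm256_load_pd(b + i);
            _mm256_store_pd(c + i, _mm256_add_pd(va, vb));
        }
    }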

But if the question is:

Will vectorizing my code to use SIMD consume more power than a scalar implementation?

There are numerous factors involved here, so I'll avoid a direct answer:

Factors that reduce power consumption:

  • We need to remember that the point of SIMD is to improve performance. And if you can improve performance, your app will take less time to run, thus saving you power.

  • Depending on the application and the implementation, SIMD will reduce the number of instructions that are needed to do a certain task. That's because you're doing several operations per instruction.

Factors that increase power consumption:

  • As mentioned earlier, SIMD instructions do more work and can use more power than scalar equivalents.
  • Use of SIMD introduces overhead not present in scalar code, such as shuffle and permute instructions (see the sketch below). These also need to go through the instruction execution pipeline.
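
As an example of that overhead, here is what the horizontal reduction at the end of a vectorized sum can look like (a sketch; the shuffles are pure SIMD bookkeeping with no counterpart in scalar code):

    #include <xmmintrin.h>  /* SSE */

    /* Horizontal sum of the 4 floats in v. */
    static float hsum_ps(__m128 v)
    {
        __m128 hi   = _mm_movehl_ps(v, v);            /* upper 2 floats into the lower lanes */
        __m128 sum2 = _mm_add_ps(v, hi);              /* [v0+v2, v1+v3, ...]                 */
        __m128 odd  = _mm_shuffle_ps(sum2, sum2,
                                     _MM_SHUFFLE(0, 0, 0, 1));  /* element 1 into lane 0     */
        return _mm_cvtss_f32(_mm_add_ss(sum2, odd));  /* (v0+v2) + (v1+v3)                   */
    }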

Breaking it down:

  • Fewer instructions -> less overhead for issuing and executing them -> less power
  • Faster code -> run less time -> less power
  • SIMD takes more power to execute -> more power

So SIMD saves you power by making your app take less time. But while it's running, it consumes more power per unit time. Which effect wins depends on the situation.

From my experience, for applications that get a worthwhile speedup from SIMD (or any other method), the former usually wins and the power consumption goes down.

That's because run-time tends to be the dominant factor in power consumption for modern PCs (laptops, desktops, servers). The reason is that most of the power consumption is not in the CPU, but rather in everything else: motherboard, RAM, hard drives, monitors, idle video cards, etc., most of which have a relatively fixed power draw.

For my computer, just keeping it on (idle) already draws more than half of what it can draw under an all-core SIMD load such as prime95 or Linpack. So if I can make an app 2x faster by means of SIMD/parallelization, I've almost certainly saved power.
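
As a back-of-the-envelope illustration (the wattages and run times below are hypothetical, just in the ballpark of the figures above):

    #include <stdio.h>

    int main(void)
    {
        /* Whole-machine power draw: the fixed baseline dominates, so
           finishing sooner saves energy even though the CPU works harder. */
        const double scalar_watts = 320.0;  /* machine running the scalar version */
        const double simd_watts   = 350.0;  /* machine running the SIMD version   */
        const double scalar_secs  = 100.0;
        const double simd_secs    = 50.0;   /* 2x speedup from SIMD               */

        printf("scalar: %.0f J\n", scalar_watts * scalar_secs);  /* 32000 J */
        printf("simd:   %.0f J\n", simd_watts * simd_secs);      /* 17500 J */

        /* Even if the machine must stay powered on afterwards, idling for the
           saved 50 s at ~280 W adds 14000 J, which still leaves the SIMD run
           ahead (31500 J vs. 32000 J). */
        return 0;
    }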

Paring answered 1/11, 2013 at 8:18 Comment(9)
Very interesting experiment! In my case, using SSE instructions I would redesign my algorithm to perform MORE operations than the scalar version, but seemingly in less time. So, as I feared, staying low on CPU usage (defined as the percentage of time I am using it) doesn't mean I will consume less power, and I might be draining my battery faster. I'll have to implement and benchmark it.Cyclostome
@Mystical, it depends on whether the speed-up is really linear. You could use the same analogy for a car getting between points A and B: we know that the friction is not linear (wind resistance goes more like velocity^2), so going slower saves more power (which is why they lowered speed limits to 55 MPH in the 70s). Especially since the parallel speed-up is probably not 100%, it's quite possible that an algorithm which is 3.7 times quicker with SSE ends up using more power than the slower algorithm without SSE.Yazbak
@Cyclostome You will probably find that run-time of the algorithm will be the primary factor of power consumption. The reason is that the difference between an idle machine and one under full load is rarely over 100W on a typical desktop. Whereas the motherboard + power supply + hard drives + everything else tends to add up to much more than that. On my machine for example, the baseline cost of keeping the computer powered on is 280W. Under full SIMD load on all cores, it barely gets up to 350W. So more than half the power usage is overhead.Paring
@Paring If we move to devices other than a desktop PC or a server, my guess is that in a machine designed for low power consumption (like a smartphone or a tablet), the CPU, together with the screen, is the most prominent energy consumer. Am I far from reality?Cyclostome
@Cyclostome That's definitely reasonable. I have no experience with non-desktop/servers. So I can't say anything about embedded devices.Paring
Small correction. I did not mean if the speed up is linear. I meant if the power use vs. the speed up is linear.Yazbak
"So if I can make an app 2x faster by means of SIMD/parallelization, I've almost certainly saved power." You don't seem to be using power correctly here, you aren't saving power, you are saving energy (power*time). Energy is almost certainly what we care about here though. If it is running for an extended period of time, the frequency/voltage of the processor will scale down to reduce the power consumption, resulting in similar power levels but reduced run-time and reduced energy consumption.Blancheblanchette
@Blancheblanchette You are right. I have a habit of (incorrectly) using the words "energy" and "power" interchangeably.Paring
I think the right value to look at is Performance per watt (en.wikipedia.org/wiki/Performance_per_watt). Or rate of computation per watt. It's often measured in FLOPS/watt.Yazbak

As Mystical's answer suggests, SIMD code tends to draw slightly more power, but if the problem is amenable to vectorization, well-written SIMD code runs significantly faster. The speedup is almost always larger than the increase in power, which results in a decrease in the amount of energy (the integral of power over time) consumed. For example, if vectorization raises the power draw by 30% but cuts the run time to a third, you use only about 1.3/3 ≈ 43% of the energy.

This is broadly true not only for SIMD vectorization, but for nearly all optimization. Faster code is not just faster, but (almost universally) more energy efficient.

A nit about terminology: people frequently talk about "power" when they really mean "energy". Power consumption in computing is really only relevant if you are engineering power supplies (obvious reasons) or engineering enclosures (because you want to know how much power you need to be able to dissipate as heat). 99.999% of people aren't engaged in either one of those activities, and thus they really want to be keeping energy in mind (as computation / energy is the correct measure of how efficient a program is).

Terryterrye answered 4/11, 2013 at 3:27 Comment(6)
I think the common metric that is used is Computation rate/power or FLOPS/watt. However, FLOPS is computations/second. And watt is energy/second so FLOPS/watt is really computations/energy.Yazbak
Exactly as you say, the time terms in FLOP/S/Watt cancel, leaving you with computation / energy. It’s unfortunate and slightly confusing that FLOP/S/Watt is so widely quoted; Ops/Joule is a better (but rarely used) way to name the same unit.Terryterrye
Yes, but FLOPS/watt (computations/joule) does not tell you anything about speed, it only tells you about energy use. If the goal is to lower the energy use then FLOPS/watt would suggest lowering the frequency (or voltage) as much as possible (only considering the core and ignoring all other costs of energy). That's why I'm surprised that FLOP/watt (or FLOPS/joule) is not a common metric. I mean I think the rate of computation/joule is more interesting than the computations/joule.Yazbak
If you are sticking to one ISA (say x86 or ARM), a good metric is EPI or Energy per Instruction. A good paper on the subject by Ed Grochowski of Intel: intel.com/pressroom/kits/core2duo/pdf/epi-trends-final2.pdfBlancheblanchette
Your nit is misguided. As microprocessors only get their energy via electrical power, and "waste" heat is measured in Watts, it makes no sense to ignore the time (speed) element necessary to perform said operations by focusing on Joules vs. Watts. Otherwise the abacus wins. That might change for optical rather than electrical, or quantum computing, but for electronic semiconductors, power is the rational unit.Englishman
@mctylr: Waste heat is measured in Joules. The instantaneous rate of waste heat transfer is measured in Watts. The proper measure of efficiency is "how much energy is required to perform a fixed computation". Contrary to expectations, this favors getting the computation done faster because of static power considerations. The abacus does not win.Terryterrye

This really depends on what you want to know. Let me answer this question from the point of view of a processor designer who may not care about all the other power consumption (e.g. main memory) but only wants to know the power consumed in his/her piece of logic in a single core. I then have two answers.

1.) For a fixed frequency, a core with SIMD (which gives a faster result) likely uses more energy than a scalar core, due to the extra complexity (circuit logic) of implementing SIMD.

2.) If the frequency is allowed to vary so that the scalar core finishes in the same time as the SIMD core, I would argue that the SIMD core uses much less energy.

Edit: I changed the word power to energy, since power is energy/time. I think the proper thing to compare is something like FLOPS/watt.

Let me explain. The power of a processor goes as C*V^2*f, where C is capacitance, V is voltage, and f is frequency. If you read the paper Optimizing Power using Transformations, you can show that using two cores at half the frequency uses only 40% of the power of a single core at full frequency to do the same calculation in the same amount of time.
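
A rough sketch of that arithmetic (the 60% voltage figure is an assumption chosen to land near the paper's roughly-40% result; the paper also accounts for some extra capacitance from duplicating the hardware, which I ignore here):

    #include <stdio.h>

    int main(void)
    {
        /* Dynamic power: P = C * V^2 * f (normalized units). */
        const double C = 1.0, V = 1.0, f = 1.0;
        const double p_single = C * V * V * f;

        /* Two cores at half the frequency; assume halving f lets the
           supply voltage drop to about 60% of its original value. */
        const double V2 = 0.6 * V, f2 = 0.5 * f;
        const double p_dual = 2.0 * (C * V2 * V2 * f2);

        printf("dual/single power ratio: %.2f\n", p_dual / p_single);  /* 0.36 */
        return 0;
    }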

I would argue that the same logic applies to other parallel methods such as SIMD and ILP (super-scalar). So instead of increasing the frequency of a scalar core, if SIMD is implemented, the same computation can be done in the same amount of time using much less energy (on the other hand, it makes the programming a lot more difficult).

GPU developers have used the principle of that paper to put them a few years ahead of Intel (by Moore's law) in processing potential. They run at lower frequencies than CPUs and use far more "cores" so for the same amount of electrical energy they get more potential processing power.

Yazbak answered 1/11, 2013 at 12:26 Comment(4)
I think it's a nice paper, but it dates back to 1995.Cyclostome
@Antonio, do you have a more modern paper you can recommend? I think I found that paper from the OpenCL book.Yazbak
No, I don't. But sometimes you can try checking among the papers citing it... scholar.google.de/…Cyclostome
Comparing a vectorized with non-vectorized implementation of the same algorithm on the same CPU: if the vectorized one runs in 1/4 the time (perfect SIMD speedup), it probably uses more power while computing, but not 4x as much. So it uses less total energy for the computation. If that's all the work there was, then it spends the rest of the time in a low-power sleep state (race-to-sleep) while the core running the scalar code is still using fairly high power tracking the out-of-order execution of all the scalar operations.Desiccator
