SIMD instructions lowering CPU frequency

I read this article, which talks about how AVX-512 instructions can lower CPU frequency:

Intel’s latest processors have advanced instructions (AVX-512) that may cause the core, or maybe the rest of the CPU to run slower because of how much power they use.

I think Agner's blog also mentioned something similar (but I can't find the exact post).

I wonder which other instructions supported by Skylake have a similar effect of lowering the frequency. Is it all the v-prefixed instructions (such as vmovapd, vmulpd, vaddpd, vsubpd, vfmadd213pd)?

I am trying to compile a list of instructions to avoid when compiling my C++ application for Xeon Skylake.

Anoxia answered 2/7, 2019 at 12:45 Comment(16)
instructions to avoid in order to accomplish what exactly?Verlinevermeer
@500-InternalServerError In order to avoid jitter in the system. Think of a laser arm experiencing jitter.Anoxia
Travis Downs (aka BeeOnRope on SO) wrote about this in the comments in this post and continued the discussion here. He found that each tier (scalar, AVX/AVX2, AVX-512) has "cheap" (no FP, simple operations) instructions and "heavy" instructions. Cheap instructions drop the frequency to that of the next higher tier (e.g. cheap AVX-512 instructions use the AVX/AVX2 tier) even if used sparsely. Heavy instructions must be used more than once every ...Othilie
... two cycles and drop the frequency according to their own tier (e.g. AVX-512 heavy instructions drop the frequency to the AVX-512 base). Travis also shared the code he used to test here. You can find the behaviour of each instruction with a bit of patience or by his rule of thumb. Finally, note that this frequency scaling is a problem only if the ratio of vector to scalar instructions is low enough that the drop in frequency is not balanced by the bigger width at which data is processed. Check the final binary to see if you really gained anything.Othilie
@MargaretBloom thanks for sharing your thoughts and all the links. I also read BeeOnRope's post about the penalty in ld. Given my ld is very old, I think it is best for me to avoid AVX and AVX-512 related instructions. And as you pointed out, the ratio of vector to scalar is also important. Given that I write high-level C++ code, it is hard to figure out the ratio unless I check the assembly output each time, slowing down development...Anoxia
@Anoxia You can make three builds, one without AVX, one with AVX/AVX2 and one with AVX-512 (if applicable) and profile them. Then take the fastest one.Othilie
@Anoxia - you can avoid the ld-related penalty by issuing a vzeroupper at the start of your program.Dolf
@Dolf Based on your answer, is there any way to tell GCC not to generate any AVX-512 instructions or 256-bit heavy instructions, while all other instructions are okay?Anoxia
Peter mentioned the -mprefer-vector-width=256 option. I don't know if it prevents gcc from ever producing AVX-512 instructions (outside of direct intrinsic use), but it is certainly possible. I am not aware of any option which distinguishes between "heavy" and "light" instructions, however. Usually this isn't a problem, since if you turn off AVX-512 and don't have a bunch of FP ops, you are probably targeting L0 anyway, and AVX-512 light is still L1.Dolf
Try those options and then check if any L1/L2 instructions pop up using the performance counter events for L1 and L2 licenses.Dolf
I can try it now. But is there a way to check whether L1/L2 instructions are in the binary?Anoxia
I tried to compile with -march=skylake-avx512 -mtune=skylake-avx512 -mprefer-vector-width=128, then disassembled the binary with objdump -d my binary > binary.asm, and then ran grep -i ymm binary.asm. I guess it is safe to conclude that it doesn't use any 256- or 512-bit registers and so no 256-bit AVX or AVX-512 instructions are emitted? @Dolf Though, I still see many vzeroupper instructions. I thought they were only used with ymm registers. No?Anoxia
Yeah, that's a reasonable way to check the binary. Keep in mind at runtime you'll likely use other libraries, at a minimum libc - and these have 256-bit instructions, e.g. in their memcpy implementation. So you really have to do a runtime check to be sure you aren't executing any "forbidden" instructions. I don't think the 256-bit instructions in libc are likely to be a problem wrt the licenses since they are light.Dolf
Yeah, vzeroupper makes more sense after using ymm registers, to avoid transition penalties for "dirty uppers", and probably isn't needed for xmm-only code. I think there is a flag to turn its emission off.Dolf
@Dolf you brought up an interesting point -- "other libraries, at a minimum libc - and these have 256-bit instructions". I thought most libraries that come with Linux distros were not compiled for a specific x86 CPU, and some x86 CPUs don't have 256-bit AVX support, so a library like libc shouldn't have any 256-bit instructions. No?Anoxia
@Anoxia important routines in libc are generally compiled multiple times for different ISAs and then the version appropriate for the current CPU is selected at runtime using the dynamic loader's IFUNC capability. So you'll usually get a version optimized for your CPU (unless your libc is quite old and your CPU quite new).Dolf
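The objdump-plus-grep check discussed in these comments can be scripted. A minimal sketch (the sample disassembly lines and the file name binary.asm below are made up for illustration; in practice you would generate the file with objdump -d on your real binary):

```shell
# Stand-in for `objdump -d mybinary > binary.asm` (sample lines are made up):
cat > binary.asm <<'EOF'
  401000:  vmovdqu64 (%rdx),%xmm0
  401005:  vmulpd %ymm1,%ymm2,%ymm0
  40100a:  vfmadd213pd %zmm1,%zmm2,%zmm3
  40100f:  vzeroupper
EOF

# Any %ymm hit means 256-bit instructions were emitted; %zmm means 512-bit.
grep -c '%ymm' binary.asm    # count of lines using 256-bit registers
grep -c '%zmm' binary.asm    # count of lines using 512-bit registers
```

As noted above, a clean static scan of your own binary is not the whole story: shared libraries selected at runtime (e.g. libc's memcpy) can still execute wider instructions.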

On Intel chips, the frequency impact and the specific frequency transition behavior depends on both the width of the operation and the specific instruction used.

As far as instruction-related frequency limits go, there are three frequency levels – so-called licenses – from fastest to slowest: L0, L1 and L2. L0 is the "nominal" speed you'll see written on the box: when the chip says "3.5 GHz turbo", they are referring to the single-core L0 turbo. L1 is a lower speed sometimes called AVX turbo or AVX2 turbo⁵, originally associated with AVX and AVX2 instructions¹. L2 is a lower speed than L1, sometimes called "AVX-512 turbo".

The exact speeds for each license also depend on the number of active cores. For up-to-date tables, you can usually consult WikiChip. For example, the table for the Xeon Gold 5120 is here:

Xeon Gold 5120 Frequencies

The Normal, AVX2 and AVX512 rows correspond to the L0, L1 and L2 licenses respectively. Note that the relative slowdown for the L1 and L2 licenses generally gets worse as the number of active cores increases: for 1 or 2 active cores the L1 and L2 speeds are 97% and 91% of L0, but for 13 or 14 cores they are 85% and 62% respectively. This varies by chip, but the general trend is usually the same.

Those preliminaries out of the way, let's get to what I think you are asking: which instructions cause which licenses to be activated?

Here's a table, showing the implied license for instructions based on their width and their categorization as light or heavy:

   Width    Light   Heavy  
 --------- ------- ------- 
  Scalar    L0      N/A
  128-bit   L0      L0     
  256-bit   L0      L1*    
  512-bit   L1      L2*

*soft transition (see below)

So we immediately see that all scalar (non-SIMD) instructions and all 128-bit wide instructions² always run at full speed in the L0 license.

256-bit instructions will run in L0 or L1, depending on whether they are light or heavy, and 512-bit instructions will run in L1 or L2 on the same basis.

So what is this light and heavy thing?

Light vs Heavy

It's easiest to start by explaining heavy instructions.

Heavy instructions are all SIMD instructions that need to run on the FP/FMA unit. Basically, that's the majority of the FP instructions (those usually ending in ps or pd, like addpd), as well as the integer multiplication instructions, which largely start with vpmul or vpmadd, since SIMD integer multiplication actually runs on the FMA unit; vplzcnt(q|d) apparently also runs on the FMA unit.

Given that, light instructions are everything else. In particular, integer arithmetic other than multiplication, logical instructions, shuffles/blends (including FP) and SIMD load and store are light.

Transitions

The L1 and L2 entries in the Heavy column are marked with an asterisk, like L1*. That's because these instructions cause a soft transition when they occur. The other L1 entry (for 512-bit light instructions) causes a hard transition. Here we'll discuss the two transition types.

Hard Transition

A hard transition occurs immediately, as soon as any instruction with the given license executes⁴. The CPU stops, takes some number of halted cycles, and enters the new mode.

Soft Transition

Unlike hard transitions, a soft transition doesn't occur immediately as soon as any instruction is executed. Rather, the instructions initially execute with a reduced throughput (as slow as 1/4 their normal rate), without changing the frequency. If the CPU decides that "enough" heavy instructions are executing per unit time, and a specific threshold is reached, a transition to the higher-numbered license occurs.

That is, the CPU understands that if only a few heavy instructions arrive, or even if many arrive but they aren't dense when considering other non-heavy instructions, it may not be worth reducing the frequency.

Guidelines

Given the above, we can establish some reasonable guidelines. You never have to be scared of 128-bit instructions, since they never cause license-related³ downclocking.

Furthermore, you never have to be worried about light 256-bit wide instructions either, since they also don't cause downclocking. If you aren't using a lot of vectorized FP math, you aren't likely to be using heavy instructions, so this would apply to you. Indeed, compilers already liberally insert 256-bit instructions when you use the appropriate -march option, especially for data movement and auto-vectorized loops.

Using heavy AVX/AVX2 instructions and light AVX-512 instructions is trickier, because you will run in the L1 licenses. If only a small part of your process (say 10%) can take advantage, it probably isn't worth slowing down the rest of your application. The penalties associated with L1 are generally moderate - but check the details for your chip.

Using heavy AVX-512 instructions is even trickier, because the L2 license comes with serious frequency penalties on most chips. On the other hand, it is important to note that only FP and integer multiply instructions fall into the heavy category, so as a practical matter a lot of integer 512-bit wide use will only incur the L1 license.


¹ Although, as we'll see, this is a bit of a misnomer because AVX-512 instructions can set the speed to this license, and some AVX/AVX2 instructions don't.

² 128-bit wide means using xmm registers, regardless of which instruction set they were introduced in - mainstream AVX-512 contains 128-bit variants of most/all new instructions.

³ Note the weasel clause license-related - you may certainly suffer other causes of downclocking, such as thermal, power or current limits, and it is possible that 128-bit instructions could trigger this, but I think it is fairly unlikely on a desktop or server system (low-power, small-form-factor devices are another matter).

⁴ Evidently, we are talking only about transitions to a higher-numbered license, e.g., from L0 to L1 when a hard-transition L1 instruction executes. If you are already in L1 or L2, nothing happens: there is no transition if you are already at the same level, and you transition back to lower-numbered levels not based on any specific instruction, but by running for a certain time without any instructions of the higher-numbered level.

⁵ Of the two, AVX2 turbo is more common, which I never really understood, because 256-bit instructions are as much associated with AVX as with AVX2, and most of the heavy instructions that actually trigger AVX turbo (the L1 license) are FP instructions from AVX, not AVX2. The only exception is AVX2 integer multiplies.

Dolf answered 3/7, 2019 at 0:2 Comment(15)
Comments are not for extended discussion; this conversation has been moved to chat.Severen
Interesting. vplzcntd/q on the FMA unit makes sense, though: it needs bit-scan hardware to renormalize the results of FP math by finding the MSB of the significand result.Kortneykoruna
@PeterCordes - yeah I saw this here which links a comprehensive test for all AVX-512 instructions. There is something weird about it though, as described in the comments on that tweet: although the 256-bit version is clearly "heavy", the 512-bit version seems to be mostly light according to this test. However, the test may simply not be triggering L2 because the instructions aren't dense enough.Dolf
Interestingly, the dumps pointed by the Twitter post seem to suggest that all integer multiplies are actually 'light', except for VPMULLD - am I reading it right?Trichinize
When did these licenses first appear? I don't remember this issue with Sandy Bridge or Ivy Bridge. Did it exist with Haswell? Maybe "AVX2 turbo" is used because no AVX-only system had separate SSE and AVX frequencies?Unguent
I don't think the kind of instruction is a sufficient metric for determining the license (frequency level). I wrote code to indirectly measure the frequency. I noticed that for AVX and AVX-512 the frequency only scaled down if the ports had sufficient load. For example, if you have a dependency chain which is latency-bound and therefore only does one AVX-512 FMA every 5 clock cycles (or whatever the latency of FMA is), then the frequency does not scale down, i.e. it stays in license L0. See the update to my answer here https://mcmap.net/q/14892/-how-can-i-programmatically-find-the-cpu-frequency-with-cUnguent
@Zboson: That's the difference between "hard" and "soft" transitions described above. Your results seem to show the AVX512 running at L1 or L2 depending on load, not L0.Trichinize
@zinga, you're right, I should have read the answer more carefully. I got confused by heavy vs. hard and light vs. soft. "the CPU understands that if only a few heavy instructions arrive, or even if many arrive but they aren't dense when considering other non-heavy instructions, it may not be worth reducing the frequency."Unguent
@Zboson - yeah, it would be interesting to explore exactly when the transition takes place. I think @ Mysticial has said that the transition function isn't that smart, i.e., it decides to make a transition to a slower speed even when the steady-state code will objectively run slower after the transition (e.g., code with a 50/50 mix of FMA and non-FMA would be better off not transitioning: since you only need ~1 FMA, you could stay in the faster license, but instead it transitions).Dolf
@Zboson - I think it first showed up in the Haswell server chips, i.e., Haswell-EP or whatever it was called. The name AVX2 turbo speed never made much sense to me: it mostly affects FP instructions from the AVX set, not AVX2 which was mostly integer (integer mul is an exception). Intel themselves use AVX, not AVX2 in early documents. People seem to like to call it AVX2 though, maybe because it came out in Haswell where AVX2 was the new ISA?Dolf
@Dolf Might be changing for Sapphire Rapids. There don't appear to be any license transition events anymore.Unvalued
@Unvalued these licences don't exist anymore?Tumbledown
@Tumbledown not AFAICT (the old link is dead, they moved the files), but I still don't see any license transition events for SPR on their github page. Don't have an SPR machine on hand so can't test.Unvalued
@Tumbledown also FWIW, I know we plan to enable AVX-512 by default for SPR in glibc because the freq throttling isn't a concern, although that's only for the "light" instructions.Unvalued
@Unvalued - that's pretty interesting, though it's not 100% clear whether the events disappearing means that the licenses truly are gone.Dolf

It's not the specific instruction mnemonic that matters; it's using 512-bit vector width at all that matters.

You can use the 256-bit version of AVX-512VL instructions, e.g. vpternlogd ymm0, ymm1, ymm2 without incurring the AVX-512 turbo penalty.

Related: Dynamically determining where a rogue AVX-512 instruction is executing is about a case where one AVX-512 instruction in glibc init code or something left a dirty upper ZMM that gimped max turbo for the rest of the process lifetime. (Or until a vzeroupper maybe)

Although there can be other turbo impacts from light / heavy use of 256-bit FP math instructions, and some of that is due to heat. But usually 256-bit is worth it on modern CPUs.

Anyway, this is why gcc -march=skylake-avx512 defaults to -mprefer-vector-width=256. For any given workload, it's worth trying -mprefer-vector-width=512 and maybe also 128, depending on how much or how little of the work can usefully auto-vectorize.

Tell GCC to tune for your CPU (e.g. -march=native) and it will hopefully make good choices. Although on a desktop Skylake-X, the turbo penalty is smaller than a Xeon. And if your code does actually benefit from 512-bit vectorization, it can be worth it to pay the penalty.

(Also beware the other major effect of Skylake-family CPUs going into 512-bit vector mode: the vector ALUs on port 1 shut down, so only scalar instructions like popcnt or add can use port 1. So vpand and vpaddb etc. throughput drops from 3 to 2 per clock. And if you're on an SKX with two 512-bit FMA units, the extra one on port 5 powers up, so then FMAs compete with shuffles.)

Kortneykoruna answered 2/7, 2019 at 20:34 Comment(21)
I have been using -march=generic for a long time for my binary. So I think even -march=skylake-avx512 -mprefer-vector-width=128 would make some optimization kick in without the heavy penalty from 256-bit AVX (as I ask for 128). Thoughts?Anoxia
@HCSF: Well sure, skylake + width=128 should be strictly better than generic for running on SKX. GCC could do worse if it bloats the code-size with AVX512 EVEX-encoded instructions unnecessarily (e.g. vmovdqu64 xmm instead of vmovdqu xmm, when not using xmm16..31), and generally compare-into-mask should be good vs. the SSE/AVX way of compare-into-vector and blend. But you should definitely test with the default width=256, too, in case the turbo penalty is worth it for your code. Doing twice as much work per uop is very good, and the big penalties only kick in with 512-bit vectors.Kortneykoruna
I actually see what you just mentioned -- vmovdqu64 (%rdx),%xmm0, vmovdqu64 0x10(%rsi),%xmm6, etc when I compiled with -march=skylake-avx512 -mprefer-vector-width=128. It seems like GCC 8.2 isn't doing it right (or not what you expected)?Anoxia
@HCSF: Yes, that's a missed optimization in GCC that hurts code size, but otherwise isn't a problem. If GCC isn't getting any benefit from AVX512 features like more registers or masking, or new instructions like vpternlogd xmm, then try -mno-avx512f as well to see if the code-size effect makes a difference. But most instructions have a SIMD element size, so there's no separate mnemonic for the EVEX version that allows per-element masking. Thus the assembler can assemble vpaddd %xmm to the VEX version, and GCC can't shoot itself in the foot. (except by using xmm16..31)Kortneykoruna
Tried -march=skylake-avx512 -mprefer-vector-width=128 -mno-avx512f doesn't even change the size of my binary by 1 byte (I used strip command to remove text stuffs first)Anoxia
@HCSF: It might slightly change code layout inside some functions; function entry points are still padded to 16 bytes. But yeah if you don't have a lot of vector mov instructions or other cases for this missed optimization, GCC's .p2align directives are going to pad that space back out unless you happen to shrink across an alignment boundary. So no large-scale L1i cache pressure effect, and probably no uop-cache or other front-end effect either. Actually it might be just vmovdq[au]64 where this happens: gcc still uses vpand not vpandq gcc.godbolt.org/z/j_qysCKortneykoruna
I actually diff between the assembly code of the two binaries. And you are right that with -mno-avx512f, vmovdqu64 isn't used anymore. I guess it is better to set -mno-avx512f -mno-avx512pf -mno-avx512er -mno-avx512cd -mno-avx512vl -mno-avx512bw -mno-avx512dq -mno-avx512ifma -mno-avx512vbmi -mno-avx512vbmi2 -mno-avx512bf16 -mno-avx512bitalg -mno-avx512vpopcntdq -mno-avx512vp2intersect -mno-avx5124fmaps -mno-avx512vnni -mno-avx5124vnniw to avoid all unnecessary avx512 related instructions? Thought?Anoxia
@HCSF: lol. All of those other extensions depend on AVX512F, so the way GCC works is that -mno-avx512f will disable them all. Just like -mno-avx disables AVX2 and FMA instructions. But anyway, depending on your code having AVX512VL available can help, e.g. for vpternlogd or for masked instructions. It would be a mistake to always use -mno-avx512f along with -mprefer-vector-width=128, without letting the compiler take a stab at using AVX512VL. (AVX512 Vector Length is the extension that provides 128 and 256-bit versions of instructions. It's separate because Xeon Phi lacks it)Kortneykoruna
It seems that if I allow AVX512VL, it might use some 256-bit registers? Though it sounds like even if I use -mprefer-vector-width=128, gcc might still use 256-bit registers? I just checked, and I don't see any ymm registers in my disassembled code.Anoxia
@HCSF: huh? AVX2 allows gcc to use 256-bit registers if it wants to. But -mprefer-vector-width=128 makes it not want to. You almost always want AVX + AVX2 enabled even if only using 128-bit vectors, for unaligned memory operands and for 3-operand non-destructive stuff. Disabling AVX512F is just to stop gcc from using longer EVEX instructions sometimes, not to stop it from using 256 or 512-bit instructions.Kortneykoruna
oh, you mean I shouldn't always use -mno-avx512f along with -mprefer-vector-width=128 because that would disable AVX512VL completely. However, with -mprefer-vector-width=128 alone, AVX512VL is still possible but that it won't use 256 bits?Anoxia
@HCSF: Right. You usually want to give gcc the option of using any 128-bit SIMD instructions your CPU supports, even if they're only available with AVX512 encodings. Again vpternlogd is really really good if you ever have boolean functions, and unsigned or 64-bit <-> float or double are much more efficient with AVX512. And vpermt2d is a really powerful 2-input shuffle, better than shufps. The only reason for disabling AVX512 is in case gcc shoots itself in the foot with mask regs or code-size, not because of 256-bit vectors.Kortneykoruna
Does your answer imply that AVX2 instructions using only xmm registers have none of the penalty that ymm-operand instructions can have?Saint
If so, would compiling manually vectorized code optimized for using xmm registers with the -mavx2 flag have no penalty compared to being compiled with -msse4.2?Saint
@xiver77: Yes, same for AVX-512VL instructions like vpternlogd xmm. AFAIK, it's only ever the width (and light vs. heavy) that matters, not what ISA extension introduced it. If you're worried about turbo penalties even from YMM, use -march=native -mprefer-vector-width=128 so the compiler won't use YMM for copying or initializing structs, either.Kortneykoruna
@PeterCordes Given Intel have recently removed AVX512 from the consumer lineup, do you think they will remove these power/downclock constraints? Or they can't for technical reasons?Tumbledown
@user997112: There will probably still be cases where the L0 license is higher than the L1 license (see BeeOnRope's answer to this question), so there will be CPUs that need to downclock slightly for "heavy" 256-bit instructions, even before reaching thermal limits. AVX downclocking has been a thing since Haswell, IIRC, although the effect was much stronger with 512-bit instructions in some CPUs.Kortneykoruna
@PeterCordes (or Travis) got a related question I can't post on SO. Do you know of any Sapphire Rapids architecture documentation? The usual Intel documents and Agner Fog both seem to focus on Alderlake. Googling is only returning the same sales powerpoint slides. I'm interested how the cores communicate and the inter-processor communication. I'd also like to go through the specifics- L1/2/3/local/remote/TLB latencies, execution ports etc.Tumbledown
@user997112: I haven't looked into any changes in the interconnect since Ice Lake, although chipsandcheese.com/2023/03/12/a-peek-at-sapphire-rapids did some cache and memory latency testing back in March. The cores and L1 cache should be identical to Alder Lake P-cores (except with a second 512-bit FMA unit in some models), so uops.info and Agner Fog's stuff should be applicable for execution ports. Intel's optimization guide might have some stuff.Kortneykoruna
@PeterCordes That looks interesting. According to this jprahman.substack.com/p/sapphire-rapids-core-to-core-latency the inter-processor latency is 400-700ns.Tumbledown
@user997112: I'm skeptical of that microbenchmark result. chipsandcheese.com/2023/11/07/… reports worst-case core-to-core latency of 81ns within one socket, and it looks like 155 ns worst-case across sockets. (Average 59 and 138 ns). Maybe they're testing differently, like with only those 2 cores active at once, vs. with all cores actively spamming inter-core traffic creating more contention in the mesh interconnect and the links between sockets? (Real workloads don't spend all their time on inter-core transfers.)Kortneykoruna

© 2022 - 2024 — McMap. All rights reserved.