Why did Intel change the static branch prediction mechanism over these years?

From here I know Intel has implemented several static branch prediction mechanisms over the years:

  • 80486 era: always not-taken

  • Pentium 4 era: backwards taken / forwards not-taken

  • Newer CPUs like Ivy Bridge and Haswell: the behavior has become increasingly hard to pin down; see Matt G's experiment here.

And Intel seems unwilling to talk about it any more: the latest material I could find in Intel's documentation was written about ten years ago.

I know static branch prediction is (far?) less important than dynamic prediction, but in quite a few situations the CPU will be completely lost, and the programmer (with the compiler) is usually the best guide. Of course these situations are usually not performance bottlenecks, because once a branch is executed frequently, the dynamic predictor will capture it.

Since Intel no longer clearly states the static prediction mechanism in its documentation, GCC's __builtin_expect() can do nothing more than move the unlikely branch out of the hot path.
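
As an illustration of what that hint does, here is a minimal sketch (my own example, not from GCC's or Intel's documentation; the likely/unlikely macro names and the process() function are invented). The hint only affects code layout, i.e. which side of the branch becomes the fall-through path:

    #include <stdio.h>

    /* Conventional wrapper macros around the GCC builtin (invented names). */
    #define likely(x)   __builtin_expect(!!(x), 1)
    #define unlikely(x) __builtin_expect(!!(x), 0)

    int process(int value)
    {
        if (unlikely(value < 0)) {
            /* The compiler typically moves this block out of the straight-line
             * path (e.g. after the common return), so the hot path simply
             * falls through the conditional branch. */
            fprintf(stderr, "bad value %d\n", value);
            return -1;
        }
        /* Hot path: laid out as the fall-through (not-taken) side. */
        return value * 2;
    }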

I am not familiar with CPU design and I don't know exactly what mechanism Intel uses nowadays for its static predictor, but I still feel the best approach for Intel would be to clearly document, for each CPU, "where I plan to go when the dynamic predictor fails, forward or backward", because the programmer is usually the best guide at that point.

Update:
I find the topics you mention gradually go beyond my knowledge. Some dynamic prediction mechanisms and CPU internals are involved here that I can't learn within two or three days, so allow me to step out of the discussion temporarily and recharge.
Any answer is still welcome here; it may help more people.

Intergrade answered 13/8, 2018 at 12:46 Comment(0)

The primary reason why static prediction is not favored in modern designs, to the point where it is perhaps not even present, is that static predictions occur too late in the pipeline compared to dynamic predictions. The basic issue is that branch directions and target locations must be known before fetching can continue past the branch, but static predictions can only be made after decode (which comes after fetch).

In more detail...

CPU Pipelining

Briefly, to execute a program the CPU needs to fetch instructions from memory, decode those instructions and then execute them1. On a high-performance CPU, these stages will be pipelined, meaning that they will all generally be happening in parallel - but for different instructions at any given moment. You could read a bit about this on Wikipedia, but keep in mind that modern CPUs are more complex, generally with many more stages.

On a modern x86, with a complex-to-decode variable-length instruction set, there may be many pipeline "stages" involved simply in fetching and decoding instructions, perhaps a half-dozen or more. Such chips are also superscalar, capable of executing several instructions at once. This implies that when executing at peak efficiency, there will be many instructions in flight, in various stages of being fetched, decoded, executed and so on.

Redirecting Fetch

The effect of a taken branch is felt on the entire initial portion (usually called the front-end) of the pipeline: when you jump to a new address, you need to fetch from that new address, decode from that new address, etc. We say that a taken branch needs to redirect fetch. This puts certain restrictions on the information that branch prediction can use in order to perform efficiently.

Consider how static prediction works: it looks at the instruction and, if it is a branch, compares its target with the branch's own address to see if the branch goes "forwards" or "backwards". All this must happen largely after decoding has occurred, since that's when the actual instruction is known. However, if a branch is detected and is predicted taken (e.g., a backwards jump), the predictor needs to redirect fetch, which is many pipeline stages earlier. By the time fetch gets redirected after decoding instruction N, there are already many subsequent instructions that were fetched and decoded on the wrong (not-taken) path. Those have to be thrown away. We say that a bubble is introduced in the front-end.

The upshot of all of this is that even if static prediction is 100% correct, it is very inefficient in the taken branch case since the front-end pipelining is defeated. If there are 6 pipeline stages between fetch and the end of decode, every taken branch causes a 6-cycle bubble in the pipeline, with the generous assumption that the prediction itself and flushing bad-path instructions take "zero cycles".
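
A rough back-of-the-envelope model of that cost (my own simplification, with made-up parameter values; nothing here is an Intel figure): if the front-end is W instructions wide, a taken branch occurs every N instructions, and each taken branch costs a B-cycle bubble, the sustained fetch rate collapses accordingly:

    #include <stdio.h>

    /* Toy model: instructions delivered per cycle when one taken branch per
     * N instructions inserts a B-cycle front-end bubble into an otherwise
     * W-wide front-end. */
    double frontend_ipc(double N, double W, double B)
    {
        return N / (N / W + B);   /* instructions / (fetch cycles + bubble cycles) */
    }

    int main(void)
    {
        /* e.g. 4-wide front-end, a taken branch every 8 instructions,
         * 6-cycle bubble: about 1 instruction per cycle instead of 4. */
        printf("%.2f\n", frontend_ipc(8, 4, 6));
        return 0;
    }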

Dynamic Prediction to the Rescue

Modern x86 CPUs, however, are able to execute taken branches at up to 1 per cycle, much better than the limit even with perfect static prediction. To achieve this, the predictor usually cannot use information that only becomes available after decoding. It must be able to redirect fetch every cycle and use only inputs available with a latency of one cycle after the last prediction. Essentially, this means the predictor is a self-contained process that uses only its own output as input for the next cycle's prediction.

This is the dynamic predictor on most CPUs. It predicts where to fetch from next cycle, and then based on that prediction it predicts where to fetch from the cycle after that, and so on. It doesn't use any information about the decoded instructions, but only past behavior of the branches. It does eventually get feedback from the execution units about the actual direction of the branch, and updates its predictions based on that, but this all happens essentially asynchronously, many cycles after the relevant instruction has passed through the predictor.
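
To make the "self-contained process" idea concrete, here is a deliberately tiny toy model in C (my own sketch, not Intel's design; real predictors also hash in taken/not-taken history, TAGE-style). The point is only that next_fetch() consumes nothing from decode, and execution feedback arrives much later, purely as training:

    #include <stdint.h>
    #include <stdio.h>

    #define ENTRIES     1024
    #define FETCH_WIDTH 16    /* bytes fetched per cycle (illustrative) */

    /* Toy next-fetch predictor: indexed by fetch-block address, no tags. */
    uint64_t predicted_next[ENTRIES];

    /* Called every cycle: uses only the predictor's own state. */
    uint64_t next_fetch(uint64_t fetch_addr)
    {
        uint64_t pred = predicted_next[(fetch_addr / FETCH_WIDTH) % ENTRIES];
        return pred ? pred : fetch_addr + FETCH_WIDTH;  /* default: fall through */
    }

    /* Called many cycles later, when execution resolves a branch. */
    void train(uint64_t fetch_addr, uint64_t actual_target)
    {
        predicted_next[(fetch_addr / FETCH_WIDTH) % ENTRIES] = actual_target;
    }

    int main(void)
    {
        train(0x1000, 0x2000);  /* a taken branch in block 0x1000 was resolved */
        printf("%#llx\n", (unsigned long long)next_fetch(0x1000)); /* 0x2000 */
        printf("%#llx\n", (unsigned long long)next_fetch(0x1010)); /* 0x1020: falls through */
        return 0;
    }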

Adding It Up

All of this serves to chip away at the usefulness of static prediction.

First, the prediction comes too late, so even when working perfectly it implies a bubble of 6-8 cycles on modern Intel for taken branches (indeed, these are observed figures from so-called "front-end resteers" on Intel). This dramatically changes the cost/benefit equation for making a prediction at all. When you have a dynamic predictor before fetch making a prediction, you more-or-less want to make some prediction and if it has even 51% accuracy it will probably pay off.

For static predictions, however, you need to have high accuracy if you ever want to make a "taken" prediction. Consider, for example, an 8-cycle front-end resteer cost, versus a 16 cycle "full mispredict" cost. Let's say in some program that cold backwards branches are taken twice as often as not taken. This should be a win for static branch prediction that predicts backwards-taken, right (compared to a default strategy of always "predicting"2 not-taken)?

Not so fast! If you assume an 8-cycle re-steer cost and a 16-cycle full mispredict cost, the two strategies end up with the same blended cost of about 10.67 cycles - because even in the correctly predicted taken case there is an 8-cycle bubble, whereas the fall-through case costs the no-static-prediction strategy nothing.
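
Spelling out that arithmetic (the 2:1 taken ratio and the 8/16-cycle costs are just the assumed numbers from above, nothing more):

    #include <stdio.h>

    int main(void)
    {
        const double resteer = 8.0, mispredict = 16.0;
        /* Cold backwards branch: taken 2/3 of the time, not taken 1/3. */
        double static_bt  = (2.0 * resteer    + 1.0 * mispredict) / 3.0; /* right -> resteer, wrong -> mispredict */
        double no_predict = (2.0 * mispredict + 1.0 * 0.0)        / 3.0; /* wrong -> mispredict, right -> free */
        printf("static backwards-taken: %.2f cycles\n", static_bt);   /* 10.67 */
        printf("no static prediction:   %.2f cycles\n", no_predict);  /* 10.67 */
        return 0;
    }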

Add to that the fact that the no-static-prediction case already gets the other half of static prediction right (the forward-branches not-taken case), and the utility of static prediction is not as large as one might imagine.

Why the change now? Perhaps because the front-end part of the pipeline has lengthened compared to the other parts, or because the increasing performance and memory of the dynamic predictors means that fewer cold branches are eligible for static prediction at all. Improving performance of dynamic predictors also means that the backwards-taken prediction becomes less valuable for cold branches, because loops (which are the reason for the backwards-taken rule) are more often remembered by the dynamic predictor.

Saving Dynamic Prediction Resources

The change could also be because of an interaction with dynamic prediction: one design for a dynamic predictor is not to use any branch prediction resources at all for a branch that is never observed to be taken. Since such branches are common, this can save a lot of history table and BTB space. However, such a scheme is inconsistent with a static predictor that predicts backwards branches as taken: if a backwards branch is never taken, you don't want the static predictor to pick it up, predict it as taken, and so mess up your strategy of saving resources for not-taken branches.
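
A hedged sketch of that allocation policy (a toy model I made up to illustrate the idea, not a description of any real Intel structure): an entry is allocated only the first time a branch is actually observed taken, so never-taken branches cost no table space and keep getting the implicit not-taken default.

    #include <stdbool.h>
    #include <stdint.h>

    #define SLOTS 256

    /* Toy tagged BTB that only holds branches observed taken at least once. */
    struct slot { uint64_t tag; uint64_t target; bool valid; };
    struct slot btb[SLOTS];

    /* Prediction: a miss means "not taken" (fall through) by default. */
    bool predict_taken(uint64_t pc, uint64_t *target)
    {
        struct slot *s = &btb[pc % SLOTS];
        if (s->valid && s->tag == pc) { *target = s->target; return true; }
        return false;
    }

    /* Update on branch resolution: never-taken branches never allocate. */
    void update(uint64_t pc, bool taken, uint64_t target)
    {
        if (!taken)
            return;
        struct slot *s = &btb[pc % SLOTS];
        s->valid = true; s->tag = pc; s->target = target;
    }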


1 ... and also then do more stuff like retire them - but what happens after execute mostly isn't important for our purposes here.

2 I put "predicting" in scare-quotes here because in a way it's not even predicting: not-taken is the default behavior of fetch and decode in the absence of any prediction to the contrary, so it's what you get if you don't put in any static prediction at all, and your dynamic predictor doesn't tell you otherwise.

Elfriedeelfstan answered 14/8, 2018 at 19:19 Comment(8)
Slow jmp-instruction has an interesting example of a small or large block of jmp +0 instructions which run much slower once there are too many. Presumably because the BTB runs out of space and can no longer predict them correctly before they decode. (And it shows jmp +0 isn't special-cased to be treated as not-taken or a nop.)Rhyton
I always assumed the fetch stage had a much simpler decoder which could only compute the instruction length and detect branch instructions. So that is not true?Mcghee
@Mcghee - I don't think there's an absolute answer because different chips might work in different ways, but yeah, I don't think fetch on modern x86 has a decoder. The entire decode pipeline is something like 4 stages, so probably even a shorter version is too long for a fetch engine that needs to fetch a line every cycle. More importantly, something like the L1I cache would have a latency of several cycles, so if you have fetch-decode-check-for-branches in the fetch loop you'll only be able to do one fetch every several cycles!Elfriedeelfstan
This leads to the conclusion that the fetch loop probably uses only branch prediction hardware, at least some of which has a 1-cycle iteration time.Elfriedeelfstan
@PeterCordes Unlike call +0 funnily enough.Ppi
@fuz: I know call +0 is special-cased for the return-address predictor. Is it also special-cased to not even be a jump for the front-end either, just a push rip? That's testable: alternating jmp +0 / call +0 could run at 2 IPC if that was correct. (With an occasional reset of RSP to get L1d hits to avoid a store bandwidth bottleneck.)Rhyton
@fuz: I tried it on my i7-6700k Skylake: A NASM experiment (godbolt.org/z/qcf66K548 with perf stat output) found that jmp rel32=+0 / cs call +0 runs twice as fast as jmp +0/call +1/padding. So yeah, seems you're right. Neat! (Also less than 1 IPC using eb 00 jmp rel8=0 in an earlier mix; godbolt.org/z/KK5K4YMe8. But pure jmps without any calls run even slower than that.)Rhyton
It mostly runs from legacy decode (due to the JCC erratum, which actually isn't limited to conditional branches, and because an unconditional branch ends a uop cache line). And it bottlenecks on the front-end not back-end (low counts for resource_stalls.any), even though legacy decode should be able to keep up. Removing the calls, keeping just 2-byte jumps, drops performance to 0.19 IPC, so probably branch prediction has a hard time when branches are packed so tight. Anyway, 1.0 IPC vs. 0.46 IPC just from changing call +0 to call +1 (over something) is definitive I think.Rhyton

Static branch prediction as discussed in Section 3.4.1.3 of the Intel Optimization Manual is as follows:

  • Predict unconditional branches to be taken.
  • Predict conditional forward branches to be not taken.
  • Predict conditional backward branches to be taken.
  • Predict indirect branches to be not taken.

Compilers can organize the code accordingly. The same section says the following:

The Intel Core microarchitecture does not use the static prediction heuristic. However, to maintain consistency across Intel 64 and IA-32 processors, software should maintain the static prediction heuristic as the default.

This statement indicates that Section 3.4.1.3 has not been updated for many years.
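
Still, the listed heuristics describe the layout that compilers (and __builtin_expect, as the question mentions) aim for. The function below is an invented example, and the exact layout naturally depends on the compiler and optimization level:

    #include <stddef.h>

    /* Laid out so that the heuristics above come out right with typical
     * compilers at -O2:
     *  - the error check usually compiles to a conditional forward branch
     *    that is rarely taken (forward not-taken rule);
     *  - the loop's back-edge is a conditional backward branch taken on
     *    every iteration but the last (backward taken rule). */
    long sum(const int *a, size_t n)
    {
        long total = 0;
        if (a == NULL)                    /* forward branch, rarely taken */
            return -1;
        for (size_t i = 0; i < n; i++)    /* backward branch, usually taken */
            total += a[i];
        return total;
    }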

If the dynamic predictor fails to predict that there is a branch instruction among the fetched bytes, or if it suffers a miss in its buffers, the fetch unit will just continue fetching sequentially because there is no other meaningful choice, effectively making a static prediction of Not Taken.

However, if it turns out, in the Instruction Queue Unit, that there is a conditional or indirect branch instruction in the fetched byte stream, then it would make sense at this point to make a static prediction that is potentially better than Not Taken - in particular, predicting conditional direct backward branches as Taken. This can reduce the penalty of the failure of the dynamic predictor and of the Not-Taken fetch unit, especially since the performance of the frontend is so critical. To my knowledge, there is no clear statement in the optimization manual that such a static predictor exists at the IQU and applies to modern processors. However, as I discuss in my other answer, the description of some performance counters seems to imply that there may be such a static predictor at the IQU.

Overall, I think this is an implementation detail that Intel no longer documents.

Compiler-assisted dynamic branch prediction techniques do exist and can be very useful as you suggested, but they are not used in current Intel processors.

Tabbatha answered 13/8, 2018 at 22:10 Comment(12)
Hi, this is the Intel document I found, and I didn't see the prediction behavior you listed in section 4.1.3.3; can you give me a link? As Agner's guide, section 3.5, describes, Intel didn't use static prediction in PM and Core2. And Matt G's experiment also indicates that newer Intel CPUs had no BT/FNT static prediction.Intergrade
@Intergrade It's 3.4.1.3, not 4.1.3.3.Tabbatha
@Intergrade Matt's article does not say or imply that newer processors don't support static prediction, it only says that it's difficult to see the effect of static prediction on newer processors.Tabbatha
Are you sure this section of the manual applies to SnB-family? Some parts of the optimization manual are written as if they apply to everything, but were actually written back in the P4 era and never updated when it stopped being generally true. Those "coding rule" entries like 3.4.1.3 are often stale. As long as they don't actually hurt modern CPUs, Intel often doesn't bother to update them. (e.g. add is still always recommended over inc, but the real situation is more complicated. INC instruction vs ADD 1: Does it matter?.)Rhyton
@PeterCordes I'll try to confirm this from another source. But I expect all modern high-perf processors to support static branch prediction.Tabbatha
@HadiBrais Yes, it's section 3.4.1.3, I am sorry. According to Matt G's experiment, I can't tell the details of Intel's implementation of static prediction, but the minimal conclusion is that BT/FNT is no longer applied. The newer Intel CPUs no longer have a certain and clear behavior on static prediction, and this is what I care about, because such a design leaves no chance for the programmer (compiler) to write static-prediction-friendly code. I am curious to know whether Intel implemented a more efficient and mysterious static algorithm, or just discarded the static predictor, thinking it not worth it.Intergrade
@Intergrade Static prediction, by definition, is always straightforward. The challenge that Matt faced was to make the dynamic predictor fail so as to trigger the static predictor. According to the article, the dynamic predictor used to be simple enough in older CPUs that this was easy to do. But now Matt shows that this has become much more difficult. This makes sense because newer microarchitectures expand the history (buffers) of the predictor and improve the algorithm, which reduces the impact of the static predictor.Tabbatha
@HadiBrais: To support static prediction, a branch predictor needs an "invalid" state, or a tag-check to detect aliasing. Without either of those, you just index and find an entry, and use it, even if the prediction data in it came from other branches (because you can't detect that). Especially before Spectre was discovered, simply letting branches alias each other instead of evicting an entry for another branch makes sense: the branch that runs more often will dominate and be predicted better. (And TAGE indexes based on recent history so it's even less simple: one branch has many entries.)Rhyton
@PeterCordes TAGE uses (partial) tags and BTBs are often (partially) tagged (to allow associativity). If there is a BTB miss, a prediction that a branch is taken may be suspect (a static prediction could be made at the same time the target address would be available). Incidentally, luke-warm branches may be frequent enough as a class and individually sufficiently statically biased to make static prediction useful. (SPEC CPU is notorious for small branch footprint; even gcc might not have as many active branches as some common code. Benchmarks guide products.)Intrados
@PaulA.Clayton: Oh right, I was only thinking of taken/not-taken prediction. But yes of course even direct branches still need target prediction before they're decoded. Static prediction would be much more useful than a stale target.Rhyton
@Intergrade Updated my answer after getting feedback.Tabbatha
@PeterCordes Updated my answer after getting feedback.Tabbatha

My understanding is that with current designs, modern TAGE branch direction predictors always index to an entry, using the taken/not-taken history of recent branches. (This potentially spreads the state for a single branch out over a lot of internal state, making it possible to predict very complex patterns like a 10 element BubbleSort.)

The CPU doesn't try to detect aliasing and just uses the prediction it finds to decide taken/not-taken for conditional branches. i.e. branch-direction prediction is always dynamic, never static.

But a target prediction is still needed before the branch is even decoded to keep the front-end from stalling. The Branch Target Buffer is normally tagged, because the target of some other branch that aliased is unlikely to be useful.

As @Paul A Clayton points out, a BTB miss could let the CPU decide to use static prediction instead of whatever it found in the dynamic taken / not-taken predictor. We might just be seeing that it's much harder to make the dynamic predictor miss often enough to measure static prediction.

(I might be distorting things. Modern TAGE predictors can predict complex patterns for indirect branches too, so I'm not sure if they even try to predict in terms of taken/not-taken or if the first step is always just to try to predict the next address, whether or not that's the next instruction. Indexed branch overhead on X86 64 bit mode.)


Not-taken branches are still slightly cheaper in the correctly-predicted case, because the front-end can more easily fetch earlier and later instructions in the same cycle from the uop cache. (The uop cache in Sandybridge-family is not a trace cache; a uop-cache line can only cache uops from a contiguous block of x86 machine code.) In high-throughput code, taken branches could be a minor front-end bottleneck. They also typically spread the code out over more L1i and uop-cache lines.


For indirect branches, the "default" branch-target address is still next-instruction, so it can be useful to put a ud2 or something after a jmp rax to prevent mis-speculation (especially into non-code), if you can't simply put one of the real branch targets as the next instruction. (Especially the most common one.)


Branch prediction is kind of the "secret sauce" that CPU vendors don't publish details about.

Intel actually publishes instruction throughput / latency / execution-port info themselves (through IACA and some documents), but it's fairly straightforward to test experimentally (like https://agner.org/optimize/ and http://instlatx64.atw.hu/ have done) so it's not like Intel could keep that secret even if they wanted to.

Branch-prediction success rate is easy to measure with perf counters, but knowing why one specific branch was mispredicted or not on one specific execution is very hard; even measuring is hard for a single execution of one branch, unless you instrument your code with rdtsc or rdpmc or something.
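
As a crude illustration of that last point, timestamp instrumentation in plain C looks something like the sketch below (my own example; it measures a whole region, so mispredictions, cache misses and everything else are blended together, which is exactly why attributing cost to one branch is hard):

    #include <stdio.h>
    #include <x86intrin.h>   /* __rdtsc() on GCC/Clang for x86 */

    int data[4096];          /* whatever the branch actually depends on */

    int main(void)
    {
        long long sum = 0;
        unsigned long long t0 = __rdtsc();
        for (int i = 0; i < 4096; i++)
            if (data[i] >= 0)            /* the branch we would like to study */
                sum += data[i];
        unsigned long long t1 = __rdtsc();
        /* Reference cycles for the whole region, not for any single branch. */
        printf("region: %llu reference cycles, sum=%lld\n", t1 - t0, sum);
        return 0;
    }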

Rhyton answered 13/8, 2018 at 12:59 Comment(7)
Although I've said the same thing before, I don't think it's correct to just say that Intel's (probably TAGE-like) predictors just use whatever prediction the history hashes to without an aliasing check. After all the T in TAGE stands for "tagged" - some tag based on the current hash is used to select predictor table entries that with high probability map to the current history. That's the basis for how TAGE choose which history length to use the in the first place: the longest history that gets a tag match. It's possible that the zero-length predictor that is used if all longer ...Elfriedeelfstan
... histories fail to match doesn't do a tag check, however (which would kind of give the random(ish) behavior that would be suggested by "no aliasing check"). You mention that a static prediction could be used if the BTB lookup misses, but that's not really feasible since this is all happening before decode (on Intel, probably at least a half dozen pipeline stages before the end of decode). Later on, after decode, it is possible that static prediction could kick in and redirect the front-end, but this is much less profitable (especially when you consider the chance of a wrong prediction).Elfriedeelfstan
@BeeOnRope: you're right, if the predictors can predict the presence of a branch before decode, they probably have something to say about the target. I knew while I was writing this that it felt too hand-wavy. And thanks for the extra details about TAGE. IDK enough details to fix this answer; feel free to edit it significantly or copy parts into your own answer if you have any good ideas.Rhyton
@PeterCordes This SO answer's description of the BPU_CLEARS.EARLY event makes it seem like (assuming correct prediction / same cache level) not-taken only outperforms taken branches if they are not in the "fast" BTB. This article has some data on the speed of contiguous jumps on AMD, but there do seem to be two spikes, possibly one where the expensive early circuitry of the BTB is used up and another when the full BTB overflows.Mountbatten
@Mountbatten - I don't think that's true. Even the fast path is probably limited to 1 jump per cycle (possibly 2 on Ice Lake), but not-taken branches have no particular throughput limitation in the FE at all since they are the default: they act just like any other instruction. Of course, not-taken branches still have throughput limitations at execute, e.g. due to two branch ports.Elfriedeelfstan
@Elfriedeelfstan That's fair. I was more trying to say something along the lines of: I think it's possible for a taken branch to not eat the ~2-3 cycles of a BPU_CLEARS.EARLY, i.e. to fetch the next 16 bytes correctly the first time. Not that it doesn't have its own additional bottlenecks. So if the branches are sparse and predicted early, it's possible for taken to have no extra overhead compared to not-taken. But that may not be true (even if predicted early) if the branches are at all dense.Mountbatten
@Noah, yeah definitely. They can be as fast as 1 taken branch per cycle, which is pretty fast. At this speed the FE may not be the bottleneck if there are at least a few instructions (on average) between each jump (although "may not be the bottleneck" is also true for slower taken branch throughputs too: you just need larger basic blocks). You definitely don't get a BPU_CLEARS early every time there is a taken branch.Elfriedeelfstan
