Does a branch misprediction flush the entire pipeline, even for a very short if-statement body?

Everything I've read seems to indicate that a branch misprediction always results in the entire pipeline being flushed, which means a lot of wasted cycles. I never hear anyone mention any exceptions for short if-conditions.

This seems like it would be really wasteful in some cases. For example, suppose you have a lone if-statement with a very simple body that is compiled down to one CPU instruction. The if-clause would be compiled into a conditional jump forward by one instruction. If the CPU predicts the branch as not taken, then it will begin executing the if-body instruction, and can immediately begin executing the following instructions. Once evaluation of the if-condition reaches the end of the pipeline, which could be, say, 12 cycles later, the CPU knows whether its prediction was right or wrong. If it mispredicted, and the branch was actually taken, then the CPU really only has to discard one instruction from the pipeline (the one in the if-body). However, if it flushes the entire pipeline, then all the work that was done on the following instructions is wasted as well and will have to be repeated for no reason. That's a lot of wasted cycles on a deeply pipelined architecture.
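
For concreteness, here is a minimal sketch of the kind of code I mean (the assembly in the comment is only illustrative, not the output of any particular compiler):

/* A lone if-statement whose body compiles to a single instruction.
 * A compiler might emit something roughly like (x86, illustrative):
 *     test  edi, edi
 *     jne   .skip              ; conditional jump forward by one instruction
 *     add   dword ptr [y], 1   ; the entire if-body
 * .skip:
 */
int y;

void maybe_increment(int x)
{
    if (x == 0)
        y++;
}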

So do modern CPUs have any mechanism to discard only the few instructions inside a short if-body? Or do they really flush the entire pipeline? If it's the latter, then I suppose using a conditional move instruction would get better performance. As an aside, does anyone know whether modern compilers are good at converting short if-statements into cmov instructions?
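
For example, I would expect something like the following to be a cmov candidate (whether a given compiler actually emits cmov here depends on the compiler, target, and optimization level):

int clamp_to_zero(int x)
{
    /* Branch-free candidate: compilers commonly turn this ternary
     * into a conditional move instead of a conditional jump. */
    return (x < 0) ? 0 : x;
}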

Bierce answered 8/4, 2015 at 18:22 Comment(1)
One technique for accomplishing this is called dynamic predication (usually only for hammock branches). For a one-instruction forward branch, this is actually implemented in POWER7. ("Wish branches" were proposed to provide a hint to hardware for branches that might use dynamic predication). The tradeoffs are complex (especially for out-of-order processors). The special handling is not free, so if the branch prediction accuracy is high using prediction rather than predication makes sense. (Might write up an answer later.)Sigmoid

Most general purpose processors do flush the pipeline on a branch misprediction. The negative performance impact of conditional branches has motivated proposals for eager execution (where both paths are executed and the correct path selected later) and dynamic predication (where instructions in the branch shadow are predicated) in addition to extensive research on branch prediction (as well as other techniques). (Mark Smotherman's page on eager execution provides some details and references. I would add Hyesoon Kim et al.'s "Wish Branches: Combining Conditional Branching and Predication for Adaptive Predicated Execution", 2005, as a significant paper.)

IBM's POWER7 seems to be the first mainstream processor to implement anything more sophisticated than prefetching an alternate path (i.e., eager fetch), and it only handles the single instruction case. (POWER7 uses a branch prediction confidence estimate to choose whether to predicate or use prediction.)

Eager execution has the obvious problem of exploding resource use. Even with selective eagerness based on branch prediction confidence, speculation depth, and resource availability (information available to the front-end), it can easily be more effective to speculate deeper down a single path. Discovering the joining points of multiple paths and avoiding excessive redundant computation can also add complexity. (Ideally, control independent operations would only be executed once and joining and data flow would be optimized, but such optimization adds complexity.)

For a deeply pipelined in-order processor, it may seem attractive to predict short forward branches as not taken and only flush backward in the pipeline to the instruction targeted by the taken branch when the branch is actually taken. If only one such branch is allowed in the pipeline at a time (other branches use prediction), adding a single bit to each instruction could control whether it is converted to a nop or executed. (If only the case of a single instruction being branched over is handled, allowing multiple branches in the pipeline might not be especially complex.)
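
(As a rough illustration only, here is a toy software model of that squash-bit idea; it is not a description of any real design:)

#include <stdbool.h>

/* Toy model: pipe[] holds the in-flight instructions, indexed by
 * stage, with younger instructions at lower indices.  When a short
 * forward branch at branch_stage resolves as taken, only the single
 * branched-over instruction behind it is converted to a nop; nothing
 * else is flushed.  If it resolves as not taken, the fall-through
 * fetch was already correct and nothing needs to happen. */
struct slot {
    unsigned id;   /* which instruction occupies this stage */
    bool     nop;  /* squash bit: execute this slot as a no-op */
};

void resolve_short_forward_branch(struct slot pipe[],
                                  int branch_stage, bool taken)
{
    if (taken && branch_stage > 0)
        pipe[branch_stage - 1].nop = true;  /* kill just the one
                                               instruction in the
                                               branch shadow */
}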

This would be similar to annul-if-taken branch delay slots. MIPS has "Branch Likely" instructions whose delay slot is annulled if the branch is not taken; these are marked as obsolete in Revision 2.62 of the architecture. While some of the justification is presumably to separate implementation from interface and to recover instruction encoding space, this decision also hints that the concept has some issues.

If this were done for all short forward branches, it would throw away an instruction whenever the branch would have been correctly predicted as taken. (Note that this penalty could be smaller if taken branches always experience a delay in fetch redirection, which would be more likely with a multi-cycle instruction cache access in a deeply pipelined processor. In that case, fetching as if there were no branch could have the same performance as a correctly predicted taken branch. However, one could argue that the processor should special-case such short taken branches to minimize such fetch bubbles.)

As an example, consider a scalar pipeline (non-branch instructions per cycle equal to 1.0) with branch resolution at the end of the eighth stage and no fetch redirection penalty on correctly predicted taken branches, handling single-instruction branch-overs. Assume 75% branch predictor accuracy (unbiased by direction) for such short forward branches (2% of instructions, taken 30% of the time) and 93% accuracy for other branches (18% of instructions). Eight cycles would be saved for short branches that would otherwise be mispredicted as taken (17.5% of such branches; 0.35% of instructions), seven cycles would be saved when mispredicted as not taken (7.5%; 0.15%), and one cycle would be lost when correctly predicted as taken (22.5%; 0.45%). In total, 0.034 cycles per instruction would be saved. Without this optimization, the cycles per instruction would be approximately 1.1408.
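
(Spelling out that arithmetic under the stated assumptions:)

\[
\begin{aligned}
\text{saved} &= 0.02\,(0.175 \times 8 + 0.075 \times 7 - 0.225 \times 1)\\
             &= 0.028 + 0.0105 - 0.0045 = 0.034\ \text{cycles/instruction}\\
\text{baseline CPI} &= 1 + 8\,(0.02 \times 0.25 + 0.18 \times 0.07) = 1.1408
\end{aligned}
\]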

(While the above numbers are just for example, they are probably not far from reality except for the 1.0 IPC for non-branch instructions. Providing a small loop cache would reduce the misprediction penalty (and save power in short loops) because instruction cache access would probably be three of the eight cycles. Adding the effect of cache misses would further reduce the percentage improvement from this branch optimization. Avoiding the overhead for predicted "strongly taken" short branches might be worthwhile.)

In-order designs tend to use narrower and shallower pipelines and to prefer simplicity (for lower design, power, and area costs). Since the instruction set is likely to support branchless code for many short-branch cases, the incentive to optimize this aspect is further decreased.

For out-of-order implementations, the potentially branched-over instructions would have to be predicated, since the processor would want to be able to execute later non-dependent instructions. Predication introduces an additional data dependency which must be checked for scheduling. It is common for instruction schedulers to provide only two comparators per instruction and to split a conditional move (a simple instruction with only three data-flow operands: the old value, the alternative value, and the condition) into two simpler operations; a predicated register-register add would have four operands. (There are alternative ways of addressing this issue, but this answer is already long.)

An out-of-order implementation would also not stall when a branch condition is not available. This is a tradeoff between a control dependency and a data dependency. With accurate branch prediction a control dependency is extremely inexpensive, but a data dependency can hold up forward progress waiting on data operands. (Of course, with a boolean data dependency, value prediction becomes somewhat more attractive. Using predicate prediction might be desirable in some cases and would have the advantage over simple predication of using dynamic cost and confidence estimates.)

(It is perhaps telling that ARM chose to drop extensive predication in 64-bit AArch64. While a large part of this is for instruction encoding, the benefit of predication for high-performance implementations is presumably relatively low.)

Compiler issues

The performance of branchless versus branching code depends on the predictability of the branch and other factors (including, if taken, any penalty for redirecting fetch), but it is difficult for a compiler to determine the predictability of a branch. Even profile data typically provides only branch frequencies, which can give a pessimistic view of predictability since they do not account for the branch predictor using local or global history. A compiler is also not perfectly aware of the timing of data availability and other dynamic aspects. If the condition is available later than the operands used for the computation, then replacing a control dependence (branch prediction) with a data dependence (predication) could degrade performance. Branchless code may also introduce more live values, potentially adding register spill and fill overhead.
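
(A minimal illustration of the two source-level forms a compiler chooses between; which one wins depends on exactly the predictability and timing factors above:)

/* Branching form: nearly free if well predicted, costly if not. */
int max_branch(int a, int b)
{
    if (a > b)
        return a;
    return b;
}

/* Branchless form: replaces the control dependence with a data
 * dependence; a compiler would typically use a conditional move. */
int max_branchless(int a, int b)
{
    return (a > b) ? a : b;
}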

Complicating this further, most instruction sets that provide only conditional move or select instructions do not provide a conditional store. While this can be worked around by using a conditional move to select a safe, ignored storage location, that seems an unattractive complication. In addition, conditional move instructions are often more expensive than simple arithmetic instructions; an addition plus a conditional move might take three cycles, where a correctly predicted branch plus addition would take zero cycles (if the addition is branched over) or one.
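
(A sketch of that workaround in plain C, ignoring multithreading concerns; the names here are made up for illustration:)

/* Branchless "conditional store": always perform the store, but aim
 * it at a harmless dummy slot when the condition is false. */
static int dummy;  /* the safe, ignored storage location */

void store_if(int cond, int *p, int value)
{
    int *target = cond ? p : &dummy;  /* selected via conditional move */
    *target = value;                  /* unconditional store */
}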

A further complication is that predicated operations are generally ignored by the branch predictor. If a later retained branch correlates with the condition of the removed branch, the branch misprediction rate may increase for that later branch. (Predicate prediction could be used to retain the predictor effects of such removed branches.)

With the increased emphasis on vectorization, the use of branchless code becomes even more significant since branch-based code constrains the ability to use operations on an entire vector.
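
(For example, a per-element conditional written in branchless form maps naturally onto per-lane select/blend operations; a sketch that auto-vectorizers generally handle well:)

/* Each lane computes both alternatives and blends, rather than
 * branching per element; vectorizers turn this into a masked
 * select (or here, a vector max). */
void clamp_negatives(float *a, int n)
{
    for (int i = 0; i < n; i++)
        a[i] = (a[i] < 0.0f) ? 0.0f : a[i];
}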

Sigmoid answered 11/4, 2015 at 2:20 Comment(1)
Sorry for the length. I did not cover some of the things that might be interesting and did not provide a thorough explanation of the tradeoffs (especially for out-of-order implementations), but it seemed that getting a not-too-untimely answer was better than a more complete and better organized answer possibly sometime within the next few years.Sigmoid

Modern high-performance out-of-order CPUs usually do not flush the entire pipeline0 on a misprediction, but the recovery doesn't really depend on the distance of the branch, nor does it work as you suggest.

They generally use something similar to the strategy of flushing the branch instruction and all younger instructions. The front-end is flushed, since it will be full of instructions from the mispredicted path, but beyond the front-end modern cores may have more than 100 instructions in flight at once, only some of which may be younger than the branch.

This means that the cost of a branch is at least partly related to the surrounding instructions: if the branch condition can be checked early, the impact of a misprediction can be limited or even zero1. On the other hand, if the branch condition is resolved late, after considerable resources have been spent on the wrong path, the cost can be large (e.g., larger than the 12-20 cycle "published" branch misprediction penalty you'll often see).


0 The exact terminology is up for debate here: the meaning of flushing the pipeline isn't entirely clear for out-of-order processors. Here I mean that the CPU does not flush all in-flight-but-possibly-not-executed instructions.

1 In particular, the limiting factor for some sequence of instructions could be a dependency chain whose current execution is far enough behind the leading edge of the instruction window that the misprediction doesn't flush any of those instructions and doesn't slow down the code at all.
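
(A toy illustration of footnote 1, not a measured benchmark: the loop-carried division chain below limits throughput, and the unpredictable branch touches only independent younger work, so its mispredictions may be entirely hidden:)

/* The division chain is the bottleneck; a misprediction on the
 * flags[i] branch flushes only younger work that was not on the
 * critical path, so it may not slow the loop down at all. */
double chained(const double *a, const int *flags, int n)
{
    double acc = 1.0;
    int count = 0;
    for (int i = 0; i < n; i++) {
        acc = acc / a[i] + 1.0;  /* long-latency dependency chain */
        if (flags[i])            /* possibly unpredictable branch */
            count++;             /* cheap, independent work */
    }
    return acc + count;
}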

Criseldacrisey answered 4/3, 2018 at 2:50 Comment(13)
Yup, mispredicted branches have special handling, unlike other exceptions which do flush the pipeline, because branch misses are common. CPUs have a rollback buffer that snapshots the register-renaming / other architectural state at every conditional / indirect branch. (Using it for every instruction that potentially could trap, like loads/stores, would fill it up too quickly.) IDK if having this buffer fill up ever limits correctly-predicted branch throughput, if the predictions can't be checked quickly. It seems to be rarely mentioned in discussions of microarchitecture.Mcgruder
I'm not sure if all CPUs use a checkpoint architecture. Are you sure these mechanisms aren't also used for other types of speculation failure such as memory ordering speculation failure?Criseldacrisey
I'm pretty sure that's why memory-ordering mis-speculation is a machine-nuke but a branch miss isn't. I'm not sure exactly what the internal mechanism is, but I assume it has the same effect as a checkpoint of RAT state. According to ieice.org/proceedings/ITC-CSCC2008/pdf/p233_D3-4.pdf, the current methods are checkpointing or waiting for the mis-predicted branch to reach the head of the ROB (to get the in-order state at that point), but the method without checkpoints can be much slower. (The paper goes on to propose a new idea, but I haven't read it yet.)Mcgruder
patents.google.com/patent/US6799268 Intel's patent for a Branch Order Buffer from 2000/06/30Mcgruder
I think this patent was for P4 (using a PRF instead of a separate retirement register file). They mention a prior patent for a CPU with a separate retirement register file, and how that might need copying while rolling back. Anyway, instead of an actual copy of the RAT, I think it's saving pointers so it can replay from the ROB and re-create the right RAT state, or something like that. So it still takes time. They don't mention doing it for memory-order mis-speculation. They talk about detecting / marking when the instruction is a branch instruction specifically.Mcgruder
If the rollback is not performed until the commit of the branch (even if fetch is stopped after the mispredicted branch is executed), then the pipeline is fully flushed. (Such delayed rollback can simplify the design both in removing instructions from the scheduler (a simple full flush of all ops owned by a thread) and potentially in establishing a correct RAT (a commit RAT can simply be copied to the rename RAT).)Sigmoid
Even with ROB-based renaming (in which the committed values are copied to an architectural register file so the RAT can be mapped to the arch. registers), the schedulers will have dead instructions. These can be "harmlessly" executed simply by delaying the freeing of their destinations and letting them be scheduled as usual. Alternatively, fast execution could be implemented for misprediction recovery with every op producing a "result" signal immediately (1 cycle execution delay), potentially even avoiding some structural hazards. This seems related to replay storms.Sigmoid
Other mechanisms could be conceived to re-establish the RAT and remove dead ops from the scheduler.Sigmoid
@PaulA.Clayton: We know that current x86 CPUs definitely don't just wait until the mispredicted branch is ready to retire. I think they do discard the stale uops from the schedulers; maybe w/ that fast-execution mechanism. (Related: SnB can discard one of the flag-merging uops from a variable-count shl eax, cl if the flag-result is overwritten without being read, without ever using an execution unit on it. I quoted Intel's optimization manual 3.5.1.6 in this answer. The front-end bandwidth to issue/rename it can't be recovered, of course.)Mcgruder
@PaulA.Clayton - I guess a lot of it comes down to how you define "flush the pipeline" for OoO architectures. I think everyone pretty much agrees on what it means for a classic 5-stage in-order pipeline, or almost any in-order pipeline, but once you have something like a ROB with 100s of instructions in-flight, do those count as "in the pipeline"? My answer here uses the definition that "flushing the pipeline" means flushing all in-flight/not-committed state, including ROB entries (exceptions or interrupts usually do this, AFAIK).Criseldacrisey
In that scenario, I think it's fair to say "flushing the pipeline" doesn't occur on at least modern x86 designs, since older in-progress instructions are often not flushed (regardless of how the recovery occurs). Other definitions are possible! It is clear that almost all architectures flush the newer instructions, after all they are "all wrong" (POWER7 dynamic predication being an interesting counter-example that you mention). I added this answer mostly to answer just the "pipeline flush?" part of the question title, since your answer covered more optimizations for the short branch case.Criseldacrisey
@Criseldacrisey Well, the classic 5-stage pipeline does not have branch mispredictions (assuming a branch delay slot)☺. Even an in-order scalar processor would not kill executing older (high latency) instructions, but one would say that the pipeline "behind" the branch was flushed. A skewed (in-order) pipeline that executed branches late might have to flush wrong path instructions that were already completed. (In-order designs can use a future file or the forwarding network to support deferred in-order commitment of results with out-of-order completion.)Sigmoid
@PaulA.Clayton - well it depends on the implementation (as you allude to with branch-delay slot caveat)! That's my point: any in-order pipeline has an obvious location that you can point to and say "everything older than this is flushed" and sometimes a few things newer too as an implementation convenience. The details aren't really super important for in-order because the differences mean a couple of extra cycles/instructions in almost any case. For OoO, however, it all of a sudden becomes very important (and definitions matter) because we could be talking about 100s of instructions.Criseldacrisey

"If it mispredicted, and the branch was actually taken, then the CPU really only has to discard 1 instruction from the pipeline (the one in the if-body)."

That's not as easy as you make it sound. Instructions modify various pieces of architectural state on which other instructions rely (register results, condition flags, memory, etc.). By the time you realize you've mispredicted, you could have many instructions in the pipeline that have started executing based on state changed by that instruction and by subsequent instructions in the pipeline... Not to mention instructions that can raise faults/exceptions.

A simple example:

b = 0;
if (a == 0) {
    b = 1;
}
c = b * 10;
if (b == 0)
    printf("\nc = %d.", c);
foo(b);
etc..

To undo that "one simple instruction" would take a lot of work.

For simple branches with poor predictability, predication/cmov/etc. is preferred.
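
For instance, the first branch in the snippet above has a natural branchless form (a sketch; the second if could similarly become a conditional move):

#include <stdio.h>

void example(int a)
{
    int b = (a == 0);  /* branchless: 1 when a == 0, else 0 */
    int c = b * 10;
    if (b == 0)
        printf("\nc = %d.", c);
}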

Wicket answered 8/4, 2015 at 19:22 Comment(0)

At least with most processors a mispredicted branch does flush the entire pipeline.

This is a large part of why many (most?) current processors also provide predicated instructions.

On 32-bit ARM, most instructions are predicated, meaning the instruction itself can include a condition saying, in essence, "do X, but only if the following condition is true."

Likewise, x86/x64 includes some predicated instructions, such as CMOV (conditional move), which works the same way--the move is carried out only if the condition is met.

These do not flush the pipeline--the instruction itself always just flows through the pipeline. If the condition isn't met, the instruction simply has no effect. The downside is that the instructions take execution time even when they have no effect.

So, in a case like the one you're talking about (an if-statement with a tiny body that can be implemented in only a few instructions), you can implement those instructions as predicated instructions.
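
(A sketch of that case; the ARM assembly in the comment illustrates predication and is not actual compiler output:)

/* Tiny if-body as a predication candidate.  On 32-bit ARM a
 * compiler might emit something like:
 *     CMP   r0, #0
 *     ADDEQ r1, r1, #1   ; add executes only if the compare set EQ
 * No branch, so nothing to mispredict. */
int predicated_increment(int x, int y)
{
    if (x == 0)
        y++;
    return y;
}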

If the body takes enough instructions (roughly the depth of the instruction pipeline, multiplied by some constant factor), it starts to make more sense to use a conditional jump instead.

Osorio answered 8/4, 2015 at 18:31 Comment(0)
