Summary
Evaluating a branch condition always takes some work, even if perfectly predicted, but because of the internal parallelism in modern CPUs that extra work doesn't necessarily add to the cost of a particular instruction sequence.
Details
I think part of the confusion lies in the mental performance model many people have for the execution of CPU instructions. Yes, every instruction requires some work, so that should imply that every instruction has some cost, however small, when measured in execution time, right?
Well, that would be true if the total cost of execution was simply additive in the work for each instruction - you just add together all the work and get the final cost. Because of the large amount of parallelism in modern CPUs it doesn't work like that.
Think of it like organizing a birthday party. You may have to buy flour, which takes 10 minutes, then bake a cake, which takes 60 minutes, and also go pick up a special gift which is 30 minutes away. Those timings are all the "work" required for the activities. However, someone can go pick up the gift while the flour is being bought and the cake is being baked. You can't bake the cake without the flour, however. So you have two dependency chains: the 70 minute buy flour -> bake cake chain, and the 30 minute pick up gift chain. With unlimited parallelism, only the 70 minute cake-related chain contributes to the time at which everything is ready. Picking up the gift is 30 minutes of work, but it ends up costing no time (not delaying completion of all the tasks), because other work that takes longer (aka the critical path) happens in parallel.
More extra tasks can be done in parallel until you run out of people to assign them to. (At that point, execution throughput limits start to increase latency, and this is called a resource conflict. A resource conflict only increases the overall time if it delays the critical path, rather than one of the shorter dependency chains - and CPUs don't know which dependency chain is, or will be, the critical path, so their scheduling doesn't prioritize it the way smart humans would in this planning analogy.)
For a less abstract and more practical look at how this stuff applies directly to CPUs, see A Whirlwind Introduction to Dataflow Graphs.
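If you'd rather see the same idea in code than in cake, here's a minimal sketch (my own illustration, not one of the benchmarks below) of a loop with two loop-carried dependency chains. The chain of dependent multiplies is the critical path; the accumulation into sum is extra work that hides behind it on an out-of-order CPU, so removing it would barely change the running time:

long two_chains(long n, long x) {
    long sum = 0;
    for (long i = 0; i < n; i++) {
        x *= 111;        // long chain: each multiply waits for the previous result (~3 cycles)
        sum += (x & 1);  // short chain: consumes x but feeds nothing back into it,
                         // so it hides behind the multiply latency
    }
    return x ^ sum;      // use both values so neither chain is optimized away
}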
Once we have this new mental model where the cost of an instruction sequence is often dominated by some critical path through the sequence, we can start to see why well-predicted branches are often very low or zero cost:
- Branch instructions have no output register and no memory output1. This means they can't participate in typical dependency chains except as the final node - they always end a dependency chain. So branches don't participate in the formation of long dependency chains and thus are in some sense "out of line" and free to be calculated in parallel with other results.
- The actual execution of branch instructions generally needs very little work: on modern x86 they can execute on two ports, with 1 cycle latency. Furthermore, branch instructions can be fused with a previous ALU operation, and the resulting operation still executes in 1 cycle - so in some sense the branch can sometimes be folded into a prior operation for no additional work at execution2. This obviously helps the "near zero cost" argument, but also helps the "truly zero cost" argument, since needing fewer resources means that it is less likely to trigger a throughput bottleneck that would disturb a zero cost execution schedule.
Those factors combine to make most predicted branch instructions zero cost or nearly zero cost.
You don't have to take my word for it - let's look at a real example:
int mul1(int count, int x) {
    do {
        x *= 111;
    } while (--count);
    return x;
}
Given a count and a starting value x, it multiplies x by 111 count times and returns the result. The loop assembles to 3 instructions: one for the multiply, one for the --count, and a branch to check the count value:
.L2:
imul eax, eax, 111
sub edi, 1
jne .L2
Now here's the same loop, but with an additional branch:
int mul2(int count, int x) {
    do {
        x *= 111;
        if (x == 0) {
            abort();
        }
    } while (--count);
    return x;
}
This assembles to 5 instructions. The extra two are for the test of x and the branch taken if the test shows that x is zero:
.L7:
imul eax, eax, 111
test eax, eax
je .L12 ; ends up calling abort
sub edi, 1
jne .L7
So what is the cost of adding 60% more instructions, including a branch? Zero, at least to 4 significant digits3:
Running benchmarks groups using timer libpfc
** Running benchmark group stackoverflow tests **
Benchmark                 Cycles
No branch                  3.000
Added test-branch          3.000
The loop takes 3 cycles per iteration, because it is limited by the dependency chain involving the 3-cycle multiply. The additional instructions and branch didn't cost anything because they didn't add to this dependency chain and were able to execute "out of line", hiding behind the latency of the critical path.
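If you want to reproduce something similar without libpfc, here's a rough, self-contained sketch (not the harness used above) using the timestamp counter instead - it assumes GCC or Clang on x86, and __rdtsc counts reference cycles rather than core cycles, so the absolute numbers won't line up exactly with the table, but the two loops should still come out essentially identical per iteration:

#include <stdio.h>
#include <x86intrin.h>

// Same loop as mul1 above; noinline prevents the compiler from interleaving
// the measured loop with the timing code.
__attribute__((noinline)) static int mul1(int count, int x) {
    do {
        x *= 111;
    } while (--count);
    return x;
}

int main(void) {
    enum { N = 100 * 1000 * 1000 };
    volatile int sink;                      // keep the result live
    unsigned long long start = __rdtsc();
    sink = mul1(N, 1);
    unsigned long long end = __rdtsc();
    (void)sink;
    printf("~%.3f reference cycles per iteration\n", (double)(end - start) / N);
    return 0;
}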
1 Conceptually, branch instructions write the "rip" register, but this isn't treated like the other registers at all: its progression is predicted ahead of time, so the dependency is broken by the predictor.
2 Of course, there is still additional work to decode and fuse the instruction in the first place, but this is often not the bottleneck so may be "free" in cost terms, and things like uop caches mean that it may not even be performed frequently. Also, on x86, while a fused branch instruction has the same latency as an ALU op, it is less flexible in terms of which ports it can execute on, so depending on port pressure it may be the case that a fused instruction has some cost compared to the bare ALU op.
3 In fact, if you go to "infinite" significant digits and look at raw cycle counts, you see that additional iterations of this loop cost exactly 3 cycles in both cases. The no-branch case does usually end up 1 cycle shorter overall (a difference that goes to 0 in a relative sense as the iterations increase), perhaps because the initial non-steady-state iteration takes an additional cycle, or the misprediction recovery takes an additional cycle on the final iteration.