Summary
Evaluating a branch condition always takes some work, even if perfectly predicted, but because of the internal parallelism in modern CPUs that extra work doesn't necessarily add to the cost of a particular instruction sequence.
Details
I think part of the confusion lies in the mental performance model many people have for the execution of CPU instructions. Yes, every instruction requires some work, so that should imply that every instruction has some cost, however small, when measured in execution time, right?
Well, that would be true if the total cost of execution was simply additive in the work for each instruction - you just add together all the work and get the final cost. Because of the large amount of parallelism in modern CPUs it doesn't work like that.
Think of it like organizing a birthday party. You may have to buy flour, which takes 10 minutes, then bake a cake, which takes 60 minutes, and also go pick up a special gift which is 30 minutes away. Those timings are all the "work" required for the activities. However, someone can go pick up the gift while the flour is being bought and the cake is being baked. You can't bake the cake without the flour, however. So you have two dependency chains: the 70 minute buy flour -> bake cake chain, and the 30 minute pick up gift chain. With unlimited parallelism, only the 70 minute cake-related chain contributes to the time at which everything is ready. Picking up the gift is 30 minutes of work, but it ends up costing no time (not delaying completion of all the tasks), because other work that takes longer (aka the critical path) happens in parallel.
More extra tasks can be done in parallel until you run out of people to assign them to. (At that point, execution throughput limits start to increase latency, and this is called a resource conflict. A resource conflict only increases the overall time if it delays the critical path, rather than one of the shorter dependency chains - and CPUs don't know which dependency chain is, or will be, the critical path, so their scheduling doesn't prioritize it the way smart humans would in this planning analogy.)
For a less abstract and more practical look at how this stuff applies directly to CPUs, see A Whirlwind Introduction to Dataflow Graphs.
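If you'd rather see the same idea in code than in cake, here's a minimal sketch (my own illustration, not one of the benchmarks below) of a loop with two loop-carried dependency chains. The chain of dependent multiplies is the critical path; the accumulation into sum is extra work that hides behind it on an out-of-order CPU, so removing it would barely change the running time:

long two_chains(long n, long x) {
    long sum = 0;
    for (long i = 0; i < n; i++) {
        x *= 111;        // long chain: each multiply waits for the previous result (~3 cycles)
        sum += (x & 1);  // short chain: consumes x but feeds nothing back into it,
                         // so it hides behind the multiply latency
    }
    return x ^ sum;      // use both values so neither chain is optimized away
}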
Once we have this new mental model where the cost of an instruction sequence is often dominated by some critical path through the sequence, we can start to see why well-predicted branches are often very low or zero cost:
- Branch instructions have no output register and no memory output1. This means they can't participate in typical dependency chains except as the final node - they always end a dependency chain. So branches don't participate in the formation of long dependency chains and thus are in some sense "out of line" and free to be calculated in parallel with other results.
- The actual execution of branch instructions generally needs very little work: on modern x86 they can execute on two ports, with 1 cycle latency. Furthermore, branch instructions can be fused with a previous ALU operation, and the resulting operation still executes in 1 cycle - so in some sense the branch can sometimes be folded into a prior operation for no additional work at execution2. This obviously helps the "near zero cost" argument, but also helps the "truly zero cost" argument, since needing fewer resources means that it is less likely to trigger a throughput bottleneck that would disturb a zero cost execution schedule.
Those factors combine to make most predicted branch instructions zero cost or nearly zero cost.
You don't have to take my word for it - let's look at a real example:
int mul1(int count, int x) {
    do {
        x *= 111;
    } while (--count);
    return x;
}
Given a count and a starting value x, it multiplies x by 111 count times and returns the result. The loop assembles to 3 instructions: one for the multiply, one for the --count, and a branch to check the count value:
.L2:
imul eax, eax, 111
sub edi, 1
jne .L2
Now here's the same loop, but with an additional branch:
int mul2(int count, int x) {
    do {
        x *= 111;
        if (x == 0) {
            abort();
        }
    } while (--count);
    return x;
}
This assembles to 5 instructions. The extra two are for the test of x and the branch taken if the test shows that x is zero:
.L7:
imul eax, eax, 111
test eax, eax
je .L12 ; ends up calling abort
sub edi, 1
jne .L7
So what is the cost of adding 60% more instructions, including a branch? Zero, at least to 4 significant digits3:
Running benchmarks groups using timer libpfc
** Running benchmark group stackoverflow tests **
Benchmark                 Cycles
No branch                  3.000
Added test-branch          3.000
The loop takes 3 cycles per iteration, because it is limited by the dependency chain involving the 3-cycle multiply. The additional instructions and branch didn't cost anything because they didn't add to this dependency chain and were able to execute "out of line", hiding behind the latency of the critical path.
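If you want to reproduce something similar without libpfc, here's a rough, self-contained sketch (not the harness used above) using the timestamp counter instead - it assumes GCC or Clang on x86, and __rdtsc counts reference cycles rather than core cycles, so the absolute numbers won't line up exactly with the table, but the two loops should still come out essentially identical per iteration:

#include <stdio.h>
#include <x86intrin.h>

// Same loop as mul1 above; noinline prevents the compiler from interleaving
// the measured loop with the timing code.
__attribute__((noinline)) static int mul1(int count, int x) {
    do {
        x *= 111;
    } while (--count);
    return x;
}

int main(void) {
    enum { N = 100 * 1000 * 1000 };
    volatile int sink;                      // keep the result live
    unsigned long long start = __rdtsc();
    sink = mul1(N, 1);
    unsigned long long end = __rdtsc();
    (void)sink;
    printf("~%.3f reference cycles per iteration\n", (double)(end - start) / N);
    return 0;
}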
1 Conceptually, branch instructions write the "rip" register, but this isn't treated like the other registers at all: its progression is predicted ahead of time, so the dependency is broken by the predictor.
2 Of course, there is still additional work to decode and fuse the instruction in the first place, but this is often not the bottleneck so may be "free" in cost terms, and things like uop caches mean that it may not even be performed frequently. Also, on x86, while a fused branch instruction has the same latency as an ALU op, it is less flexible in terms of which ports it can execute on, so depending on port pressure it may be the case that a fused instruction has some cost compared to the bare ALU op.
3 In fact, if you go to "infinite" significant digits and look at raw cycle counts, you see that additional iterations of this loop cost exactly 3 cycles in both cases. The no-branch case does usually end up 1 cycle shorter overall (a difference that goes to 0 in a relative sense as the iterations increase), perhaps because the initial non-steady-state iteration takes an additional cycle, or the misprediction recovery takes an additional cycle on the final iteration.