As Matteo Italia already noted, this avoidance of conditional-move instructions is a quirk of GCC version 6. What he didn't notice, though, is that it applies only when optimizing for Intel processors.
With GCC 6.3, when targeting AMD processors (i.e., -march=
any of k8
, k10
, opteron
, amdfam10
, btver1
, bdver1
, btver2
, btver2
, bdver3
, bdver4
, znver1
, and possibly others), you get exactly the code you want:
mov esi, DWORD PTR [rdi]
mov ecx, DWORD PTR [rdi+4]
xor eax, eax
jmp .L2
.L7:
lea edx, [rax+rsi]
mov r8, QWORD PTR [rdi+8]
shr edx
mov r9d, edx
movss xmm1, DWORD PTR [r8+r9*4]
ucomiss xmm1, xmm0
cmovbe eax, edx
cmova esi, edx
.L2:
dec ecx
cmp ecx, -1
jne .L7
rep ret
When optimizing for any generation of Intel processor, GCC 6.3 avoids conditional moves, preferring an explicit branch:
mov r9d, DWORD PTR [rdi]
mov ecx, DWORD PTR [rdi+4]
xor eax, eax
.L2:
sub ecx, 1
cmp ecx, -1
je .L6
.L8:
lea edx, [rax+r9]
mov rsi, QWORD PTR [rdi+8]
shr edx
mov r8d, edx
vmovss xmm1, DWORD PTR [rsi+r8*4]
vucomiss xmm1, xmm0
ja .L4
sub ecx, 1
mov eax, edx
cmp ecx, -1
jne .L8
.L6:
ret
.L4:
mov r9d, edx
jmp .L2
The likely justification for this optimization decision is that conditional moves are fairly inefficient on Intel processors. CMOV
has a latency of 2 clock cycles on Intel processors compared to a 1-cycle latency on AMD. Additionally, while CMOV
instructions are decoded into multiple µops (at least two, with no opportunity for µop fusion) on Intel processors because of the requirement that a single µop has no more than two input dependencies (a conditional move has at least three: the two operands and the condition flag), AMD processors can implement a CMOV
with a single macro-operation since their design has no such limits on the input dependencies of a single macro-op. As such, the GCC optimizer is replacing branches with conditional moves only on AMD processors, where it might be a performance win—not on Intel processors and not when tuning for generic x86.
(Or, maybe the GCC devs just read Linus's infamous rant. :-)
Intriguingly, though, when you tell GCC to tune for the Pentium 4 processor (and you can't do this for 64-bit builds for some reason—GCC tells you that this architecture doesn't support 64-bit, even though there were definitely P4 processors that implemented EMT64), you do get conditional moves:
push edi
push esi
push ebx
mov esi, DWORD PTR [esp+16]
fld DWORD PTR [esp+20]
mov ebx, DWORD PTR [esi]
mov ecx, DWORD PTR [esi+4]
xor eax, eax
jmp .L2
.L8:
lea edx, [eax+ebx]
shr edx
mov edi, DWORD PTR [esi+8]
fld DWORD PTR [edi+edx*4]
fucomip st, st(1)
cmovbe eax, edx
cmova ebx, edx
.L2:
sub ecx, 1
cmp ecx, -1
jne .L8
fstp st(0)
pop ebx
pop esi
pop edi
ret
I suspect this is because branch misprediction is so expensive on Pentium 4, due to its extremely long pipeline, that the possibility of a single mispredicted branch outweighs any minor gains you might get from breaking loop-carried dependencies and the tiny amount of increased latency from CMOV
. Put another way: mispredicted branches got a lot slower on P4, but the latency of CMOV
didn't change, so this biases the equation in favor of conditional moves.
Tuning for later architectures, from Nocona to Haswell, GCC 6.3 goes back to its strategy of preferring branches over conditional moves.
So, although this looks like a major pessimization in the context of a tight inner loop (and it would look that way to me, too), don't be so quick to dismiss it out of hand without a benchmark to back up your assumptions. Sometimes, the optimizer is not as dumb as it looks. Remember, the advantage of a conditional move is that it avoids the penalty of branch mispredictions; the disadvantage of a conditional move is that it increases the length of a dependency chain and may require additional overhead because, on x86, only register→register or memory→register conditional moves are allowed (no constant→register). In this case, everything is already enregistered, but there is still the length of the dependency chain to consider. Agner Fog, in his Optimizing Subroutines in Assembly Language, gives us the following rule of thumb:
[W]e can say that a conditional jump is faster than a conditional move if the code is part of a dependency chain and the prediction rate is better than 75%. A conditional jump is also preferred if we can avoid a lengthy calculation ... when the other operand is chosen.
The second part of that doesn't apply here, but the first does. There is definitely a loop-carried dependency chain here, and unless you get into a really pathological case that disrupts branch prediction (which normally has a >90% accuracy), branching may actually be faster. In fact, Agner Fog continues:
Loop-carried dependency chains are particularly sensitive to the disadvantages of conditional moves. For example, [this code]
// Example 12.16a. Calculate pow(x,n) where n is a positive integer
double x, xp, power;
unsigned int n, i;
xp=x; power=1.0;
for (i = n; i != 0; i >>= 1) {
if (i & 1) power *= xp;
xp *= xp;
}
works more efficiently with a branch inside the loop than with a conditional move, even if the branch is poorly predicted. This is because the floating point conditional move adds to the loop-carried dependency chain and because the implementation with a conditional move has to calculate all the power*xp
values, even when they are not used.
Another example of a loop-carried dependency chain is a binary search in a sorted list. If the items to search for are randomly distributed over the entire list then the branch prediction rate will be close to 50% and it will be faster to use conditional moves. But if the items are often close to each other so that the prediction rate will be better, then it is more efficient to use conditional jumps than conditional moves because the dependency chain is broken every time a correct branch prediction is made.
If the items in your list are actually random or close to random, then you'll be the victim of repeated branch-prediction failure, and conditional moves will be faster. Otherwise, in what is probably the more common case, branch prediction will succeed >75% of the time, such that you will experience a performance win from branching, as opposed to a conditional move that would extend the dependency chain.
It's hard to reason about this theoretically, and it's even harder to guess correctly, so you need to actually benchmark it with real-world numbers.
If your benchmarks confirm that conditional moves really would be faster, you have a couple of options:
- Upgrade to a later version of GCC, like 7.1, that generate conditional moves in 64-bit builds even when targeting Intel processors.
- Tell GCC 6.3 to optimize your code for AMD processors. (Maybe even just having it optimize one particular code module, so as to minimize the global effects.)
Get really creative (and ugly and potentially non-portable), writing some bit-twiddling code in C that does the comparison-and-set operation branchlessly. This might get the compiler to emit a conditional-move instruction, or it might get the compiler to emit a series of bit-twiddling instructions. You'd have to check the output to be sure, but if your goal is really just to avoid branch misprediction penalties, then either will work.
For example, something like this:
inline uint32 ConditionalSelect(bool condition, uint32 value1, uint32 value2)
{
const uint32 mask = condition ? static_cast<uint32>(-1) : 0;
uint32 result = (value1 ^ value2); // get bits that differ between the two values
result &= mask; // select based on condition
result ^= value2; // condition ? value1 : value2
return result;
}
which you would then call inside of your inner loop like so:
hi = ConditionalSelect(z < xi[mid], mid, hi);
lo = ConditionalSelect(z < xi[mid], lo, mid);
GCC 6.3 produces the following code for this when targeting x86-64:
mov rdx, QWORD PTR [rdi+8]
mov esi, DWORD PTR [rdi]
test edx, edx
mov eax, edx
lea r8d, [rdx-1]
je .L1
mov r9, QWORD PTR [rdi+16]
xor eax, eax
.L3:
lea edx, [rax+rsi]
shr edx
mov ecx, edx
mov edi, edx
movss xmm1, DWORD PTR [r9+rcx*4]
xor ecx, ecx
ucomiss xmm1, xmm0
seta cl // <-- begin our bit-twiddling code
xor edi, esi
xor eax, edx
neg ecx
sub r8d, 1 // this one's not part of our bit-twiddling code!
and edi, ecx
and eax, ecx
xor esi, edi
xor eax, edx // <-- end our bit-twiddling code
cmp r8d, -1
jne .L3
.L1:
rep ret
Notice that the inner loop is entirely branchless, which is exactly what you wanted. It may not be quite as efficient as two CMOV
instructions, but it will be faster than chronically mispredicted branches. (It goes without saying that GCC and any other compiler will be smart enough to inline the ConditionalSelect
function, which allows us to write it out-of-line for readability purposes.)
However, what I would definitely not recommend is that you rewrite any part of the loop using inline assembly. All of the standard reasons apply for avoiding inline assembly, but in this instance, even the desire for increased performance isn't a compelling reason to use it. You're more likely to confuse the compiler's optimizer if you try to throw inline assembly into the middle of that loop, resulting in sub-par code worse than what you would have gotten otherwise if you'd just left the compiler to its own devices. You'd probably have to write the entire function in inline assembly to get good results, and even then, there could be spill-over effects from this when GCC's optimizer tried to inline the function.
What about MSVC? Well, different compilers have different optimizers and therefore different code-generation strategies. Things can start to get really ugly really quickly if you have your heart set on cajoling all target compilers to emit a particular sequence of assembly code.
On MSVC 19 (VS 2015), when targeting 32-bit, you can write the code the way you did to get conditional-move instructions. But this doesn't work when building a 64-bit binary: you get branches instead, just like with GCC 6.3 targeting Intel.
There is a nice solution, though, that works well: use the conditional operator. In other words, if you write the code like this:
hi = (z < xi[mid]) ? mid : hi;
lo = (z < xi[mid]) ? lo : mid;
then VS 2013 and 2015 will always emit CMOV
instructions, whether you're building a 32-bit or 64-bit binary, whether you're optimizing for size (/O1
) or speed (/O2
), and whether you're optimizing for Intel (/favor:Intel64
) or AMD (/favor:AMD64
).
This does fail to result in CMOV
instructions back on VS 2010, but only when building 64-bit binaries. If you needed to ensure that this scenario also generated branchless code, then you could use the above ConditionalSelect
function.
-O2
vs-O3
,-march=native
, etc.? What flags are you using? – Schererlikely
/unlikely
macros; don't know what's the VS equivalent). But even if on your current setting it causescmov
to be used, it's neither guaranteed to be so once you change your code/compiler/mood/karma, nor be better than the code generated without the hint. – Bipartisan-O3
usescmovbe
/cmova
(4.4 instead uses a branch). OTOH, as it has been said, it's all very heuristic-dependent.likely
/unlikely
probably won't help, given thatcmov
is usually convenient exactly when it's unknown/random whether the branch will be taken. – Dilatation__builtin_expect
is called__assume
. – Struvecmov
s, while the gcc 6 series uses the branch. Probably it's either some regression, or they altered the heuristics in some way just for those versions. – Dilatation__builtin_expect
will be useful here, just that it might affect the compiler's heuristics to pick the predicated instructions. Btw,__builtin_expect
might not be of any help even if branches almost always go the same way - on such cases the branch predictor will guess correctly anyway. – Bipartisanxi
then CMOV should be faster, but there's any locality or pattern in the searches then branches could be faster. – Strongarm