No, it's not a branch; that's the whole point of cmovcc.
It's an ALU select that has a data dependency on both inputs, not a control dependency. (With a memory source, it unconditionally loads the memory source, unlike ARM predicated load instructions that are truly NOPed. So you can't use it with maybe-bad pointers for branchless bounds or NULL checks. That's maybe the clearest illustration that it's definitely not a branch.)
But anyway, it's not predicted or speculated in any way; as far as the CPU scheduler is concerned it's just like an adc instruction: 2 integer inputs + FLAGS, and 1 integer output. (The only difference from adc/sbb is that it doesn't write FLAGS. And of course it runs on an execution unit with different internals.)
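As a rough illustration of the "ALU select" idea in C (a sketch only; the function names here are made up for this example, and a compiler is free to turn the ternary into either a cmov or a branch): the first function computes the select with plain ALU operations, so the result depends on both inputs. The second shows the usual workaround for maybe-NULL pointers, which is to select the address first and only then load, since a cmov with a memory source would load unconditionally.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Select a or b using only ALU operations. mask must be all-ones (pick a)
 * or all-zeros (pick b). Both inputs feed the result, so this is a data
 * dependency on a and b, with nothing to predict. */
uint32_t select_u32(uint32_t mask, uint32_t a, uint32_t b)
{
    assert(mask == 0 || mask == UINT32_MAX);
    return (a & mask) | (b & ~mask);
}

/* Safe with a NULL pointer: choose between two valid addresses first,
 * then do one load from whichever was chosen. The choice of address is
 * something a compiler may implement with cmov (not guaranteed); the
 * load itself is never from the bad pointer. */
int load_or_default(const int *maybe_null, int fallback)
{
    const int *src = maybe_null ? maybe_null : &fallback;
    return *src;
}
```

The thing you can't do is the reverse order: load from the maybe-NULL pointer and then select the value, because that load happens regardless of the condition.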
Whether that's good or bad entirely depends on the use-case. See also gcc optimization flag -O3 makes code slower than -O2 for much more about cmov upside / downside.
Note that repne scasb is not fast. "Fast Strings" only works for rep stos / movs.
repne scasb runs about 1 count per clock cycle on modern CPUs, i.e. typically about 16x worse than a simple SSE2 pcmpeqb/pmovmskb/test+jnz loop. And with clever optimization you can go even faster, up to 2 vectors per clock saturating the load ports.
(e.g. see glibc's memchr for ORing pcmpeqb results for a whole cache line together to feed one pmovmskb, IIRC. Then go back and sort out where the actual hit was.)
repne scasb also has startup overhead, but microcode branching is different from regular branching: it's not branch-predicted on Intel CPUs. So this can't mispredict, but is total garbage for performance with anything but very small buffers.
SSE2 is baseline for x86-64, and efficient unaligned loads + pmovmskb make it a no-brainer for memchr, where you can check for length >= 16 to avoid crossing into an unmapped page.
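A minimal sketch of the pcmpeqb / pmovmskb approach in C with SSE2 intrinsics, assuming an explicit-length search like memchr. This is not glibc's code; memchr_sse2_sketch is a hypothetical name, and the 64-byte loop only imitates the "OR the compare results together, then go back and find the hit" idea described above. Every 16-byte load is guarded by a length check, so nothing is read past buf + len.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper: returns a pointer to the first byte equal to c in
 * buf[0..len), or NULL if there is none. */
const void *memchr_sse2_sketch(const void *buf, int c, size_t len)
{
    assert(buf != NULL || len == 0);
    const unsigned char *p = buf;
    const unsigned char ch = (unsigned char)c;

#if defined(__SSE2__)
    const __m128i needle = _mm_set1_epi8((char)ch);

    /* 64 bytes per iteration: four pcmpeqb results ORed together feed a
     * single pmovmskb, so the loop only learns whether there was a hit. */
    while (len >= 64) {
        __m128i e0 = _mm_cmpeq_epi8(_mm_loadu_si128((const __m128i *)(p + 0)), needle);
        __m128i e1 = _mm_cmpeq_epi8(_mm_loadu_si128((const __m128i *)(p + 16)), needle);
        __m128i e2 = _mm_cmpeq_epi8(_mm_loadu_si128((const __m128i *)(p + 32)), needle);
        __m128i e3 = _mm_cmpeq_epi8(_mm_loadu_si128((const __m128i *)(p + 48)), needle);
        __m128i any = _mm_or_si128(_mm_or_si128(e0, e1), _mm_or_si128(e2, e3));
        if (_mm_movemask_epi8(any) != 0)
            break;              /* hit somewhere in this block; located below */
        p += 64;
        len -= 64;
    }

    /* 16 bytes per iteration: the simple pcmpeqb / pmovmskb / test loop.
     * The len >= 16 check keeps each load inside the buffer. */
    while (len >= 16) {
        __m128i eq = _mm_cmpeq_epi8(_mm_loadu_si128((const __m128i *)p), needle);
        unsigned mask = (unsigned)_mm_movemask_epi8(eq);
        if (mask != 0) {
            /* Index of the lowest set bit; real code would use bsf/tzcnt. */
            unsigned idx = 0;
            while ((mask & 1u) == 0) {
                mask >>= 1;
                idx++;
            }
            return p + idx;
        }
        p += 16;
        len -= 16;
    }
#endif

    /* Scalar tail (and the whole search when SSE2 is unavailable). */
    for (; len > 0; p++, len--) {
        if (*p == ch)
            return p;
    }
    return NULL;
}
```

A tuned implementation would handle the sub-16-byte tail without a byte loop and would likely use aligned loads in the main loop; the sketch keeps those parts simple on purpose.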
Fast strlen:
cld; all the standard calling conventions guarantee/require DF=0 on call/ret. Also, movzbl %sil, %eax would be more efficient than merging into the low byte of RAX. Or just mov %esi, %eax is good, except if your caller only wrote AL on a P6-family CPU. – Bangs
"is cmov a branch" has several hits which all make it obvious, including Why is a conditional move not vulnerable for Branch Prediction Failure? (which is a possible duplicate). I don't think there's any real way to improve the question. Including any specific wrong claims or misleading sources would just lead to a more bloated answer that refutes them. – Bangs