Is CMOVcc considered a branching instruction?

I have this memchr code that I'm trying to make non-branching:

.globl memchr
memchr:
        mov %rdx, %rcx          # RCX = count for the rep prefix
        mov %sil, %al           # AL = byte to search for
        cld                     # DF = 0 so scasb moves forward
        repne scasb             # scan until AL matches [RDI] or RCX hits 0
        lea -1(%rdi), %rax      # RDI ends one past the last byte examined
        test %rcx, %rcx
        cmove %rcx, %rax        # count exhausted: return NULL instead
        ret

I'm unsure whether or not cmove is a branching instruction. Is it? If so, how do I rearrange my code so it doesn't branch?

Reprisal answered 16/8, 2019 at 12:8 Comment(2)
You don't need cld; all the standard calling conventions guarantee/require DF=0 on call/ret. Also, movzbl %sil, %eax would be more efficient than merging into the low byte of RAX. Or just mov %esi, %eax is good except if your caller only wrote AL on a P6-family CPU.Bangs
I assume downvoted for lack of research effort. e.g. google for is cmov a branch has several hits which all make it obvious, including Why is a conditional move not vulnerable for Branch Prediction Failure? (which is a possible duplicate). I don't think there's any real way to improve the question. Including any specific wrong claims or misleading sources would just lead to a more bloated answer that refutes them.Bangs

No, it's not a branch; that's the whole point of cmovcc.

It's an ALU select that has a data dependency on both inputs, not a control dependency. (With a memory source, it unconditionally loads the memory source, unlike ARM predicated load instructions that are truly NOPed. So you can't use it with maybe-bad pointers for branchless bounds or NULL checks. That's maybe the clearest illustration that it's definitely not a branch.)

But anyway, it's not predicted or speculated in any way; as far as the CPU scheduler is concerned it's just like an adc instruction: 2 integer inputs + FLAGS, and 1 integer output. (The only difference from adc/sbb is that it doesn't write FLAGS, and of course it runs on an execution unit with different internals.)
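To make the "select, not jump" point concrete, here's a minimal sketch in the same AT&T syntax (a hypothetical standalone max function, not part of your memchr): the cmov always executes and just picks which of two already-computed values ends up in the destination.

# Hypothetical example: branchless max(a, b) with a in %edi, b in %esi.
branchless_max:
        mov   %edi, %eax        # result = a
        cmp   %esi, %edi        # computes a - b, sets FLAGS (no jump anywhere)
        cmovl %esi, %eax        # if a < b (signed), result = b; else keep a
        ret

Both mov and cmp have already produced their results before cmovl reads them, so the only thing the condition controls is which value gets written, never which instructions run.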

Whether that's good or bad depends entirely on the use-case. See also gcc optimization flag -O3 makes code slower than -O2 for much more about cmov upsides and downsides.


Note that repne scasb is not fast. "Fast Strings" only works for rep stos / movs.

repne scasb runs at about 1 count (byte) per clock cycle on modern CPUs, i.e. typically about 16x worse than a simple SSE2 pcmpeqb/pmovmskb/test+jnz loop. And with clever optimization you can go even faster, up to 2 vectors per clock, saturating the load ports.

(e.g. see glibc's memchr, which IIRC ORs pcmpeqb results for a whole cache line together to feed one pmovmskb, then goes back and sorts out where the actual hit was.)

repne scasb also has startup overhead, but microcode branching is different from regular branching: it's not branch-predicted on Intel CPUs. So this can't mispredict, but it's total garbage for performance with anything but very small buffers.

SSE2 is baseline for x86-64, and efficient unaligned loads + pmovmskb make it a no-brainer for memchr, where you can check for length >= 16 up front to avoid reading past the buffer into an unmapped page.
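As a rough illustration of that simple loop (my own sketch, not glibc's code; the label and function names are invented, and it assumes len is a nonzero multiple of 16, so a real memchr would still need head/tail handling for other sizes):

# Sketch only: memchr_sse2_sketch(buf=%rdi, byte=%esi, len=%rdx),
# assuming len is a nonzero multiple of 16.
memchr_sse2_sketch:
        movd      %esi, %xmm0
        punpcklbw %xmm0, %xmm0        # broadcast the search byte ...
        punpcklwd %xmm0, %xmm0        # ... to all 16 lanes of %xmm0
        pshufd    $0, %xmm0, %xmm0
.Lloop:
        movdqu    (%rdi), %xmm1       # unaligned 16-byte load
        pcmpeqb   %xmm0, %xmm1        # matching bytes become 0xFF
        pmovmskb  %xmm1, %eax         # one mask bit per byte
        test      %eax, %eax
        jnz       .Lfound
        add       $16, %rdi
        sub       $16, %rdx
        jnz       .Lloop
        xor       %eax, %eax          # no match: return NULL
        ret
.Lfound:
        bsf       %eax, %eax          # bit index = byte offset of first match
        add       %rdi, %rax          # pointer to the matching byte
        ret

The only branches left are the loop control and the "found" check, which predict almost perfectly, so you pay for a mispredict roughly once per call rather than once per byte.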

Fast strlen:

Bangs answered 16/8, 2019 at 12:13 Comment(9)
"Note that repne scasb is not fast." -- I know. I'm planning on replacing it with something faster later, but for now it's small and it works.Reprisal
I also don't have any experience with any SSE/2/3/4.1 stuff.Reprisal
@JL2210: memchr is a pretty good way to learn about using SSE2 for simple element-match searching; knowing the size as a function arg (instead of implicit length) makes it significantly simpler than strchr.Bangs
@JL2210 You might be interested in this answer, which discusses ways of optimizing strlen using basic x86 instructions (i.e., without using any SSE). A similar approach could be used for memchr, except that you can't optimize so heavily around searching for a 0-byte. Certainly, as Peter says, SSE is the way to make it really fast. I've written and benchmarked that code, too. If you want to know more, ask more questions about it. I could expand that answer another 10 pages, but I'd eventually run out of room and folks would tire of reading.Isomerize
@CodyGray Thanks for the offer.Reprisal
@JL2210: Updated my answer with some links to SIMD strlen as described in this answer. memchr using the zero-byte-in-word bithack usually just does an xor with the byte being searched for. You can broadcast it with c * 0x01010101UL or whatever.Bangs
@JL2210: x86-64 guarantees availability of SSE2 for pcmpeqb. If you need 4-byte or 8-byte chunks, movd or movq+pcmpeqd should beat a scalar bithack even in the no-false-positive case. The scalar bithack is also pointless in 32-bit code if you know SSE2 is available, so I updated my answer to say "with SSE2".Bangs
@PeterCordes Say I made a C version of that same code. Would it still be reasonably more efficient than a byte-for-byte loop?Reprisal
@JL2210: Yes, for all but the shortest strings, because current compilers (except ICC) can't auto-vectorize search loops. But this is an assembly-language question. To make an unsigned long version without violating strict aliasing in C, you'd need memcpy for the aliasing loads, or you could use Intel SIMD intrinsics, or GNU C typedef unsigned long aliasing_long __attribute((may_alias)). That would also be useful for a portable version: Why does glibc's strlen need to be so complicated to run quickly?Bangs
