How to force the use of cmov in gcc and VS

I have this simple binary search member function, where lastIndex, nIter and xi are class members:

uint32 scalar(float z) const
{
    uint32 lo = 0;
    uint32 hi = lastIndex;
    uint32 n = nIter;
    while (n--) {
        int mid = (hi + lo) >> 1;
        // defining this if-else assignment as below causes VS2015
        // to generate two cmov instructions instead of a branch
        if( z < xi[mid] ) 
            hi = mid;
        if ( !(z < xi[mid]) )
            lo = mid;
    }
    return lo;
}

Both gcc and VS 2015 compile the inner loop using a conditional branch:

000000013F0AA778  movss       xmm0,dword ptr [r9+rax*4]  
000000013F0AA77E  comiss      xmm0,xmm1  
000000013F0AA781  jbe         Tester::run+28h (013F0AA788h) 
000000013F0AA783  mov         r8d,ecx  
000000013F0AA786  jmp         Tester::run+2Ah (013F0AA78Ah)  
000000013F0AA788  mov         edx,ecx  
000000013F0AA78A  mov         ecx,r8d

Is there a way, without writing assembler inline, to convince them to use exactly 1 comiss instruction and 2 cmov instructions?

If not, can anybody suggest how to write a gcc assembler template for this?

Please note that I am aware that there are variations of the binary search algorithm where it is easy for the compiler to generate branch free code, but this is beside the question.

Thanks

Creight answered 26/6, 2017 at 6:43 Comment(12)
Have you tried -O2 vs -O3, -march=native, etc.? What flags are you using?Scherer
In addition to optimization settings, the branch assembly generated by GCC can also be affected by the use of __builtin_expect (commonly wrapped in likely/unlikely macros; I don't know what the VS equivalent is). But even if it causes cmov to be used with your current settings, that's neither guaranteed to keep being the case once you change your code/compiler/mood/karma, nor guaranteed to be better than the code generated without the hint.Bipartisan
A quick survey at gcc.godbolt.org shows that every gcc version from 4.5 onwards at -O3 uses cmovbe/cmova (4.4 instead uses a branch). OTOH, as it has been said, it's all very heuristic-dependent. likely/unlikely probably won't help, given that cmov is usually convenient exactly when it's unknown/random whether the branch will be taken.Dilatation
The VS equivalent of __builtin_expect is called __assume.Struve
@Brett Hale, The compiler options I use are: -std=c++11 -msse4.2 -O3.Creight
@eran, I do not think __builtin_expect is of any help here. The probability of each branch is exactly 50%.Creight
@Matteo, I am working with gcc 6.3. If there was just 1 assignment, it would generate a cmov. With 2 assignments, it does not.Creight
@Fabio: that's bizarre, I just checked on gcc.godbolt.org and both gcc 5.4 and gcc 7.1 (and most other versions I checked) generate the two cmovs, while the gcc 6 series uses the branch. Probably it's either some regression, or they altered the heuristics in some way just for those versions.Dilatation
@Fabio, I'm not suggesting __builtin_expect will be useful here, just that it might affect the compiler's heuristics to pick the predicated instructions. Btw, __builtin_expect might not be of any help even if branches almost always go the same way - on such cases the branch predictor will guess correctly anyway.Bipartisan
Are you sure that CMOV instructions would actually be faster? In your example code they would have the serious disadvantage of making each loop iteration dependent on the result of the previous iteration. When the branches can be predicted correctly, the loop iterations can run in parallel. If the nature of what you're searching for results in a random walk through the elements of xi then CMOV should be faster, but if there's any locality or pattern in the searches then branches could be faster.Strongarm
@Ross, it is a random walk. I am sure cmov would help.Creight
Convert GCC Inline Assembly CMOV to Visual Studio AssemblerOviduct

As said in the comments, there's no easy way to force what you are asking, although it seems that recent (>4.4) versions of gcc already optimize it like you said. Edit: interestingly, the gcc 6 series seems to use a branch, unlike both the gcc 5 and gcc 7 series, which use two cmov.

The usual __builtin_expect probably cannot do much to push gcc towards using cmov, given that cmov is generally convenient when it's difficult to predict the result of a comparison, while __builtin_expect tells the compiler what the likely outcome is - so you would just be pushing it in the wrong direction.

Still, if you find that this optimization is extremely important, that your compiler version typically gets it wrong, and that for some reason you cannot help it with PGO, the relevant gcc assembly template should be something like:

    __asm__ (
        "comiss %[xi_mid],%[z]\n"
        "cmovb %[mid],%[hi]\n"
        "cmovae %[mid],%[lo]\n"
        : [hi] "+r"(hi), [lo] "+r"(lo)
        : [mid] "rm"(mid), [xi_mid] "xm"(xi[mid]), [z] "x"(z)
        : "cc"
    );

The constraints used are:

  • hi and lo are in the "write" variables list, with the +r constraint, since cmov can only have a register as its target operand and we are conditionally overwriting just one of them (we cannot use =, as it implies that the value is always overwritten, so the compiler would be free to give us a different target register than the current one, and use it to refer to that variable after our asm block);
  • mid is in the "read" list, with rm, as cmov can take either a register or a memory operand as its input value;
  • xi[mid] and z are in the "read" list;
    • z has the special x constraint, which means "any SSE register" (comiss requires its destination operand to be an XMM register);
    • xi[mid] has xm, as the source operand of comiss can also be a memory operand; given the choice between z and xi[mid], I chose the latter as the better candidate to be taken directly from memory, since z is already in a register (due to the System V calling convention - and it is going to be cached between iterations anyway) while xi[mid] is used just in this comparison;
  • cc (the FLAGS register) is in the "clobber" list - we do clobber the flags and nothing else.
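
For reference, here is a minimal sketch of how this template could be dropped into the loop from the question (just a sketch, assuming the same uint32/float class members as in the original code, and not verified against any particular build configuration):

    uint32 scalar(float z) const
    {
        uint32 lo = 0;
        uint32 hi = lastIndex;
        uint32 n = nIter;
        while (n--) {
            uint32 mid = (hi + lo) >> 1;
            // note: when compiling with AVX enabled (-mavx/-mavx2), vcomiss should
            // be used here instead of comiss, to avoid SSE/AVX transition penalties
            __asm__ (
                "comiss %[xi_mid],%[z]\n"
                "cmovb  %[mid],%[hi]\n"
                "cmovae %[mid],%[lo]\n"
                : [hi] "+r"(hi), [lo] "+r"(lo)
                : [mid] "rm"(mid), [xi_mid] "xm"(xi[mid]), [z] "x"(z)
                : "cc"
            );
        }
        return lo;
    }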
Dilatation answered 26/6, 2017 at 8:47 Comment(9)
Are you sure about that asm? By using "=r", you are saying "ignore the current contents of the variable cuz the asm is going to overwrite it." But then you don't always set values for both outputs, since only 1 of those cmovs is going to execute. AAR, wouldn't one of these values always be undefined? Perhaps you meant "+r"? I might also use symbolic names to make it easier to read.Chaumont
@DavidWohlferd: aaand that's why I should always think twice when writing extended asm. Yes, that's definitely +r. As for symbolic names, I just learned of their existence; I'll definitely look them up, since with 5 parameters things get unwieldy quickly.Dilatation
This seems to be what I need, although getting it to work is driving me crazy. I run the function several hundred thousand times in two nested loops. In the first iteration of the outer loop the inner loop is now fast and completes in the time I was expecting; in the second iteration it is excruciatingly slow. I think something might be getting corrupted. Still, the results of the function are all correct. I am trying to narrow this down to a minimal example.Creight
@Fabio: what was the problem? Did I mess up something or it was something else?Dilatation
After a full rebuild of the project it started to work as expected. No problem with your code.Creight
@Fabio: oh ok, it's the usual mysterious full rebuild.Dilatation
@Matteo, the problem started to show up again. It was not a rebuild issue. Sometimes I compile with -msse2 and sometimes with -mavx2. Care must be taken to ensure the relevant comiss instruction is used, otherwise there is a massive penalty for mixing SSE and AVX instructions. So I added an #ifdef __AVX2__ to choose between comiss and vcomiss.Creight
The ternary operator is usually step 1 when trying to hand-hold a compiler into making branchless code, rather than making it convert ifs. Probably be good to suggest that before inline asm.Hombre
you should generalize this code and make it look like a compiler intrinsic, like this: __conditional_move(dst, condition, src); which translates to if (condition) { dst = src }Stome

As Matteo Italia already noted, this avoidance of conditional-move instructions is a quirk of GCC version 6. What he didn't notice, though, is that it applies only when optimizing for Intel processors.

With GCC 6.3, when targeting AMD processors (i.e., -march= any of k8, k10, opteron, amdfam10, btver1, bdver1, btver2, bdver2, bdver3, bdver4, znver1, and possibly others), you get exactly the code you want:

    mov     esi, DWORD PTR [rdi]
    mov     ecx, DWORD PTR [rdi+4]
    xor     eax, eax
    jmp     .L2
.L7:
    lea     edx, [rax+rsi]
    mov     r8, QWORD PTR [rdi+8]
    shr     edx
    mov     r9d, edx
    movss   xmm1, DWORD PTR [r8+r9*4]
    ucomiss xmm1, xmm0
    cmovbe  eax, edx
    cmova   esi, edx
.L2:
    dec     ecx
    cmp     ecx, -1
    jne     .L7
    rep ret

When optimizing for any generation of Intel processor, GCC 6.3 avoids conditional moves, preferring an explicit branch:

    mov      r9d, DWORD PTR [rdi]
    mov      ecx, DWORD PTR [rdi+4]
    xor      eax, eax
.L2:
    sub      ecx, 1
    cmp      ecx, -1
    je       .L6
.L8:
    lea      edx, [rax+r9]
    mov      rsi, QWORD PTR [rdi+8]
    shr      edx
    mov      r8d, edx
    vmovss   xmm1, DWORD PTR [rsi+r8*4]
    vucomiss xmm1, xmm0
    ja       .L4
    sub      ecx, 1
    mov      eax, edx
    cmp      ecx, -1
    jne      .L8
.L6:
    ret
.L4:
    mov      r9d, edx
    jmp      .L2

The likely justification for this optimization decision is that conditional moves are fairly inefficient on Intel processors. CMOV has a latency of 2 clock cycles on Intel processors, compared to a 1-cycle latency on AMD. Additionally, on Intel processors CMOV instructions are decoded into multiple µops (at least two, with no opportunity for µop fusion), because of the requirement that a single µop have no more than two input dependencies, whereas a conditional move has at least three: the two operands and the condition flag. AMD processors, whose design places no such limit on the input dependencies of a single macro-operation, can implement a CMOV as a single macro-op. As such, the GCC optimizer replaces branches with conditional moves only on AMD processors, where it might be a performance win, but not on Intel processors and not when tuning for generic x86.

(Or, maybe the GCC devs just read Linus's infamous rant. :-)

Intriguingly, though, when you tell GCC to tune for the Pentium 4 processor (and you can't do this for 64-bit builds for some reason: GCC tells you that this architecture doesn't support 64-bit, even though there were definitely P4 processors that implemented EM64T), you do get conditional moves:

    push    edi
    push    esi
    push    ebx
    mov     esi, DWORD PTR [esp+16]
    fld     DWORD PTR [esp+20]
    mov     ebx, DWORD PTR [esi]
    mov     ecx, DWORD PTR [esi+4]
    xor     eax, eax
    jmp     .L2
.L8:
    lea     edx, [eax+ebx]
    shr     edx
    mov     edi, DWORD PTR [esi+8]
    fld     DWORD PTR [edi+edx*4]
    fucomip st, st(1)
    cmovbe  eax, edx
    cmova   ebx, edx
.L2:
    sub     ecx, 1
    cmp     ecx, -1
    jne     .L8
    fstp    st(0)
    pop     ebx
    pop     esi
    pop     edi
    ret

I suspect this is because branch misprediction is so expensive on Pentium 4, due to its extremely long pipeline, that the possibility of a single mispredicted branch outweighs any minor gains you might get from breaking loop-carried dependencies and the tiny amount of increased latency from CMOV. Put another way: mispredicted branches got a lot slower on P4, but the latency of CMOV didn't change, so this biases the equation in favor of conditional moves.

Tuning for later architectures, from Nocona to Haswell, GCC 6.3 goes back to its strategy of preferring branches over conditional moves.
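
(To reproduce this, assuming the code lives in a hypothetical search.cpp, you can compare the assembly GCC 6.3 emits for, say, g++-6 -std=c++11 -O3 -march=znver1 -S search.cpp against g++-6 -std=c++11 -O3 -march=haswell -S search.cpp, or simply flip the -march= value on gcc.godbolt.org.)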

So, although this looks like a major pessimization in the context of a tight inner loop (and it would look that way to me, too), don't be so quick to dismiss it out of hand without a benchmark to back up your assumptions. Sometimes, the optimizer is not as dumb as it looks. Remember, the advantage of a conditional move is that it avoids the penalty of branch mispredictions; the disadvantage of a conditional move is that it increases the length of a dependency chain and may require additional overhead because, on x86, only register→register or memory→register conditional moves are allowed (no constant→register). In this case, everything is already enregistered, but there is still the length of the dependency chain to consider. Agner Fog, in his Optimizing Subroutines in Assembly Language, gives us the following rule of thumb:

[W]e can say that a conditional jump is faster than a conditional move if the code is part of a dependency chain and the prediction rate is better than 75%. A conditional jump is also preferred if we can avoid a lengthy calculation ... when the other operand is chosen.

The second part of that doesn't apply here, but the first does. There is definitely a loop-carried dependency chain here, and unless you get into a really pathological case that disrupts branch prediction (which normally has a >90% accuracy), branching may actually be faster. In fact, Agner Fog continues:

Loop-carried dependency chains are particularly sensitive to the disadvantages of conditional moves. For example, [this code]

// Example 12.16a. Calculate pow(x,n) where n is a positive integer
double x, xp, power;
unsigned int n, i;
xp=x; power=1.0;
for (i = n; i != 0; i >>= 1) {
    if (i & 1) power *= xp;
    xp *= xp;
}

works more efficiently with a branch inside the loop than with a conditional move, even if the branch is poorly predicted. This is because the floating point conditional move adds to the loop-carried dependency chain and because the implementation with a conditional move has to calculate all the power*xp values, even when they are not used.

Another example of a loop-carried dependency chain is a binary search in a sorted list. If the items to search for are randomly distributed over the entire list then the branch prediction rate will be close to 50% and it will be faster to use conditional moves. But if the items are often close to each other so that the prediction rate will be better, then it is more efficient to use conditional jumps than conditional moves because the dependency chain is broken every time a correct branch prediction is made.

If the items in your list are actually random or close to random, then you'll be the victim of repeated branch-prediction failure, and conditional moves will be faster. Otherwise, in what is probably the more common case, branch prediction will succeed >75% of the time, such that you will experience a performance win from branching, as opposed to a conditional move that would extend the dependency chain.

It's hard to reason about this theoretically, and it's even harder to guess correctly, so you need to actually benchmark it with real-world numbers.
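
If it helps, here is a rough, hypothetical micro-benchmark sketch for doing exactly that. Everything in it (the free-function signatures, the table size, the number of queries, and the timing method) is an illustrative assumption rather than something taken from the question; treat it as a starting point, not a finished harness.

    // Hypothetical benchmark sketch: times a branchy binary-search step against an
    // always-assign (ternary) variant on uniformly random queries.
    #include <algorithm>
    #include <chrono>
    #include <cstdint>
    #include <cstdio>
    #include <random>
    #include <vector>

    static uint32_t search_branchy(const std::vector<float>& xi, uint32_t nIter, float z)
    {
        uint32_t lo = 0, hi = static_cast<uint32_t>(xi.size() - 1);
        while (nIter--) {
            uint32_t mid = (hi + lo) >> 1;
            if (z < xi[mid]) hi = mid; else lo = mid;  // compiler may emit a branch here
        }
        return lo;
    }

    static uint32_t search_ternary(const std::vector<float>& xi, uint32_t nIter, float z)
    {
        uint32_t lo = 0, hi = static_cast<uint32_t>(xi.size() - 1);
        while (nIter--) {
            uint32_t mid = (hi + lo) >> 1;
            hi = (z < xi[mid]) ? mid : hi;  // both variables written every iteration,
            lo = (z < xi[mid]) ? lo  : mid; // which tends to encourage cmov
        }
        return lo;
    }

    template <typename F>
    static double time_searches(F f, const std::vector<float>& xi, uint32_t nIter,
                                const std::vector<float>& queries, uint64_t& sink)
    {
        const auto t0 = std::chrono::steady_clock::now();
        for (float z : queries) sink += f(xi, nIter, z);
        const auto t1 = std::chrono::steady_clock::now();
        return std::chrono::duration<double>(t1 - t0).count();
    }

    int main()
    {
        const uint32_t size = 1u << 20;  // 2^20 sorted elements -> 20 bisection steps
        const uint32_t nIter = 20;
        std::mt19937 rng(42);
        std::uniform_real_distribution<float> dist(0.0f, 1.0f);

        std::vector<float> xi(size);
        for (float& v : xi) v = dist(rng);
        std::sort(xi.begin(), xi.end());

        std::vector<float> queries(1000000);
        for (float& q : queries) q = dist(rng);  // random queries -> ~50% branch outcome

        uint64_t sink = 0;  // keeps the searches from being optimized away
        const double tb = time_searches(search_branchy, xi, nIter, queries, sink);
        const double tt = time_searches(search_ternary, xi, nIter, queries, sink);
        std::printf("branchy: %.3f s   ternary: %.3f s   (sink=%llu)\n",
                    tb, tt, static_cast<unsigned long long>(sink));
        return 0;
    }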

If your benchmarks confirm that conditional moves really would be faster, you have a couple of options:

  1. Upgrade to a later version of GCC, like 7.1, that generates conditional moves in 64-bit builds even when targeting Intel processors.
  2. Tell GCC 6.3 to optimize your code for AMD processors. (Maybe even have it do so for just one particular code module, so as to minimize the global effects.)
  3. Get really creative (and ugly and potentially non-portable), writing some bit-twiddling code in C that does the comparison-and-set operation branchlessly. This might get the compiler to emit a conditional-move instruction, or it might get the compiler to emit a series of bit-twiddling instructions. You'd have to check the output to be sure, but if your goal is really just to avoid branch misprediction penalties, then either will work.

    For example, something like this:

    inline uint32 ConditionalSelect(bool condition, uint32 value1, uint32 value2)
    {
       const uint32 mask = condition ? static_cast<uint32>(-1) : 0;
       uint32 result = (value1 ^ value2);   // get bits that differ between the two values
       result &= mask;                      // select based on condition
       result ^= value2;                    // condition ? value1 : value2
       return result;
    }
    

    which you would then call inside of your inner loop like so:

    hi = ConditionalSelect(z < xi[mid], mid, hi);
    lo = ConditionalSelect(z < xi[mid], lo, mid);
    

    GCC 6.3 produces the following code for this when targeting x86-64:

        mov     rdx, QWORD PTR [rdi+8]
        mov     esi, DWORD PTR [rdi]
        test    edx, edx
        mov     eax, edx
        lea     r8d, [rdx-1]
        je      .L1
        mov     r9, QWORD PTR [rdi+16]
        xor     eax, eax
    .L3:
        lea     edx, [rax+rsi]
        shr     edx
        mov     ecx, edx
        mov     edi, edx
        movss   xmm1, DWORD PTR [r9+rcx*4]
        xor     ecx, ecx
        ucomiss xmm1, xmm0
        seta    cl               // <-- begin our bit-twiddling code
        xor     edi, esi
        xor     eax, edx
        neg     ecx
        sub     r8d, 1           // this one's not part of our bit-twiddling code!
        and     edi, ecx
        and     eax, ecx
        xor     esi, edi
        xor     eax, edx         // <-- end our bit-twiddling code
        cmp     r8d, -1
        jne     .L3
    .L1:
        rep ret
    

    Notice that the inner loop is entirely branchless, which is exactly what you wanted. It may not be quite as efficient as two CMOV instructions, but it will be faster than chronically mispredicted branches. (It goes without saying that GCC and any other compiler will be smart enough to inline the ConditionalSelect function, which allows us to write it out-of-line for readability purposes.)

However, what I would definitely not recommend is that you rewrite any part of the loop using inline assembly. All of the standard reasons apply for avoiding inline assembly, but in this instance, even the desire for increased performance isn't a compelling reason to use it. You're more likely to confuse the compiler's optimizer if you try to throw inline assembly into the middle of that loop, resulting in sub-par code that's worse than what you would have gotten if you'd just left the compiler to its own devices. You'd probably have to write the entire function in inline assembly to get good results, and even then, there could be spill-over effects when GCC's optimizer tried to inline the function.


What about MSVC? Well, different compilers have different optimizers and therefore different code-generation strategies. Things can start to get really ugly really quickly if you have your heart set on cajoling all target compilers to emit a particular sequence of assembly code.

On MSVC 19 (VS 2015), when targeting 32-bit, you can write the code the way you did to get conditional-move instructions. But this doesn't work when building a 64-bit binary: you get branches instead, just like with GCC 6.3 targeting Intel.

There is a nice solution, though, that works well: use the conditional operator. In other words, if you write the code like this:

hi = (z < xi[mid]) ? mid : hi;
lo = (z < xi[mid]) ? lo  : mid;

then VS 2013 and 2015 will always emit CMOV instructions, whether you're building a 32-bit or 64-bit binary, whether you're optimizing for size (/O1) or speed (/O2), and whether you're optimizing for Intel (/favor:Intel64) or AMD (/favor:AMD64).
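
For concreteness, here is a sketch of the full member function from the question rewritten this way (same class members as in the original question; a sketch, not verified output from any particular compiler version):

    uint32 scalar(float z) const
    {
        uint32 lo = 0;
        uint32 hi = lastIndex;
        uint32 n = nIter;
        while (n--) {
            uint32 mid = (hi + lo) >> 1;
            hi = (z < xi[mid]) ? mid : hi;  // both variables are assigned on every
            lo = (z < xi[mid]) ? lo  : mid; // iteration, which encourages CMOV
        }
        return lo;
    }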

This approach does fail to produce CMOV instructions on the older VS 2010, but only when building 64-bit binaries. If you needed to ensure that this scenario also generated branchless code, then you could use the above ConditionalSelect function.

Crosslink answered 26/6, 2017 at 12:39 Comment(9)
+1 for the VS advice. I will test it tomorrow. On gcc, unfortunately I am stuck with the version shipped with Cygwin. I cannot use the bitwise operations because I am running a one-off test to measure the impact of cmov instructions in certain algorithms, which obviously I cannot do if I do not use them.Creight
If it's a one-off test, then sure—use inline assembly. In fact, just write the whole darn routine in assembly language so you have precise control over which instructions are generated and how. That's what I would do, and have done when benchmarking code. Start with the compiler's output so you don't waste a bunch of time writing the asm; all you need to do is tweak it. @CreightCrosslink
Confirmed, it works beautifully on VS2015. Thank you. Unfortunately I can accept only one answer.Creight
and you can't do this for 64-bit builds for some reason: gcc -march=nocona is 64-bit P4-Prescott. -march=pentium4 means Northwood / Willamette, I think, and -march=prescott is still 32-bit only.Hombre
The ternary operator should be higher up: always writing the destination variable is often the trick to convince compilers to use cmov (or branchless in general) if they're "reluctant". Or to allow auto-vectorization with read / modify / write instead of having to avoid writing elements the C abstract machine doesn't write. But it still sometimes applies even for locals, I think even with gcc sometimes. As you say, gcc can convert if to cmov in many cases, though.Hombre
on x86, only register-register conditional moves are allowed -- aren't conditional memory loads allowed? (since P6/Pentium II)Watchword
Not with the cmov instruction, @ZachB. That's register-to-register only. Are you perhaps thinking of non-temporal loads that SSE introduced (e.g., MOVNTDQA)? That's a whole different can of worms.Crosslink
@CodyGray am I reading this wrong? felixcloutier.com/x86/cmovcc These instructions can move 16-bit, 32-bit or 64-bit values from memory to a general-purpose register or from one general-purpose register to another.Watchword
@Watchword Ah, yes. You are correct. Not sure what I was thinking yesterday in the comment! Answer updated. Thanks for the persistence. In the original answer, what I was thinking about was not reg-mem moves, but rather, reg-const moves. The latter would be much nicer, because it would mean not having to use MOV instructions to set up the common uses of CMOV to create branchless code, and would also mean not having to consume so many registers (of which there are already a very limited number on x86).Crosslink