Why C compilers optimize switch and if differently
Asked Answered
S

4

11

I was working on a personal project recently when I stumbled across an odd issue.

In a very tight loop I have an integer with a value between 0 and 15. I need to get -1 for values 0, 1, 8, and 9 and 1 for values 4, 5, 12, and 13.

I turned to godbolt to check a few options and was surprised that it seemed the compiler couldn't optimize a switch statement the same way as an if chain.

The link is here: https://godbolt.org/z/WYVBFl

The code is:

const int lookup[16] = {-1, -1, 0, 0, 1, 1, 0, 0, -1, -1, 0, 0, 1, 1, 0, 0};

int a(int num) {
    return lookup[num & 0xF];
}

int b(int num) {
    num &= 0xF;

    if (num == 0 || num == 1 || num == 8 || num == 9) 
        return -1;

    if (num == 4 || num == 5 || num == 12 || num == 13)
        return 1;

    return 0;
}

int c(int num) {
    num &= 0xF;
    switch (num) {
        case 0: case 1: case 8: case 9: 
            return -1;
        case 4: case 5: case 12: case 13:
            return 1;
        default:
            return 0;
    }
}

I would have thought that b and c would yield the same results, and I was hoping that I could read the bit-hacks to come up with an efficient implementation myself since my solution (the switch statement - in another form) was fairly slow.

Oddly, b compiled to bit-hacks while c was either pretty much un-optimized or reduced to a different case of a depending on target hardware.

Can anybody explain why there is this discrepancy? What is the 'correct' way to optimize this query?

EDIT:

Clarification

I want the switch solution to be the fastest, or a similarly "clean" solution. However when compiled with optimizations on my machine the if solution is significantly faster.

I wrote a quick program to demonstrate and TIO has the same results as I find locally: Try it online!

With static inline the lookup table speeds up a bit: Try it online!

Sixth answered 10/10, 2019 at 22:52 Comment(5)
I suspect the answer is "Compilers don't always make sane choices". I just compiled your code to an object with GCC 8.3.0 with -O3, and it compiled c to something likely worse than a or b (c had two conditional jumps plus a few bit manipulations, vs. only one conditional jump and simpler bit manip for b), but still better than naive item by item tests. I'm not sure what you're really asking for here; the simple fact is that an optimizing compiler can turn any of these into any of the others if it so chooses, and there are no hard and fast rules for what it will or will not do.Pahl
My issue is that I need it to be fast, but the if solution is not overly maintainable. Is there any way to get the compiler to optimize a cleaner solution sufficiently? Can anybody explain why it can't do so in this case?Sixth
I would start by defining at least the functions as static, or -even better- inlining them.Levis
@Levis does speed it up, but if still beats switch (oddly lookup becomes even faster) [TIO to follow]Sixth
@Sixth There's no way to tell a compiler to optimise in a specific way. You'll note that clang and msvc generate completely different code for these. If you don't care and just want whatever works best on gcc, then pick that. Compiler optimisations are based on heuristics, and those don't yield the optimal solution in all cases; They're trying to be good in the average case, not optimal in all cases.Golightly
P
8

If you explicitely enumerate all the cases, gcc is very efficient :

int c(int num) {
    num &= 0xF;
    switch (num) {
        case 0: case 1: case 8: case 9: 
            return -1;
        case 4: case 5: case 12: case 13:
            return 1;
            case 2: case 3: case 6: case 7: case 10: case 11: case 14: case 15: 
        //default:
            return 0;
    }
}

is just compiled in a simple indexed branch :

c:
        and     edi, 15
        jmp     [QWORD PTR .L10[0+rdi*8]]
.L10:
        .quad   .L12
        .quad   .L12
        .quad   .L9
        .quad   .L9
        .quad   .L11
        .quad   .L11
        .quad   .L9
        .quad   .L9
        .quad   .L12
etc...

Note that if default: is uncommented, gcc turns back to its nested branch version.

Platinic answered 10/10, 2019 at 23:56 Comment(1)
@Sixth You should consider unaccepting my answer and accepting this one, because modern Intel CPUs can do two parallel indexed memory reads/cycle whereas the throughput of my trick is probably 1 lookup/cycle. On the flip side, perhaps my hack is more amenable to 4-way vectorization with SSE2 pslld/psrad or their 8-way AVX2 equivalents. Much depends on the other particularities of your code.Verina
D
4

C compilers have special cases for switch, because they expect programmers to understand the idiom of switch and exploit it.

Code like:

if (num == 0 || num == 1 || num == 8 || num == 9) 
    return -1;

if (num == 4 || num == 5 || num == 12 || num == 13)
    return 1;

would not pass review by competent C coders; three or four reviewers would simultaneously exclaim "this should be a switch!"

It's not worth it for C compilers to analyze the structure of if statements for conversion to a jump table. The conditions for that have to be just right, and the amount of variation that is possible in a bunch of if statements is astronomical. The analysis is both complicated and likely to come up negative (as in: "no, we can't convert these ifs to a switch").

Downatheel answered 10/10, 2019 at 23:1 Comment(9)
I know, that is why I started with the switch. However, the if solution is significantly faster in my case. I'm basically asking if there is a way to convince the compiler to use a better solution for the switch, since it was able to find the pattern in the ifs, but not the switch. (I don't like the ifs specifically because they aren't as clear or maintainable)Sixth
Upvoted but not accepted since the sentiment is exactly the reason why I made this question. I want to use the switch, but it is too slow in my case, I want to avoid the if if at all possible.Sixth
@LambdaBeta: Is there some reason to avoid the lookup table? Make it static, and use C99 designated initializers if you want to make it a little more clear what you're assigning, and it's clearly perfectly fine.Pahl
I would start off at least discarding the low bit so that there's less work for the optimizer to have to do.Mcghee
@Pahl Unfortunately that still is slower than the if (see edit). @R.. I worked out the full bitwise solution for the compiler, which is what I am using for now. Unfortunately in my case these are enum values, not naked integers, so bitwise hacks aren't very maintainable.Sixth
@LambdaBeta: You can add a static assert for the enum values. Or, if they're actually free to change and this function is the bottleneck, you can redefine their values to make the transformation trivial, e.g. (x>>1)-1Mcghee
I thought about that, but this is the most complex case. The values are already well optimized such that the other similar tests I perform are trivial.Sixth
@LambdaBeta: The lookup solution a is slower in a tight loop? That's... unexpected. Perhaps make the array an array of int8_ts, to ensure the whole table fits in a cache line (on x86, the read and sign extension remains a single instruction)? Random memory loads may be relatively expensive normally, but a small table referenced repeatedly in a tight loop should almost never actually need to read RAM.Pahl
@Pahl I think the reason in my case it doesn't work to well is that I have about 6 different 'lookups' in the same function. Its a fairly memory-heavy function. Normally I'd just leave the lookup as the best solution. Which I may just do anyways.Sixth
V
4

The following code will compute your lookup branchfree, LUT-free, in ~3 clock cycles, ~4 useful instructions and ~13 bytes of highly-inline-able x86 machine code.

It depends on a 2's complement integer representation.

You must, however, ensure that the u32 and s32 typedefs really point to 32-bit unsigned and signed integer types. stdint.h types uint32_t and int32_t would have been suitable but I have no idea if the header is available to you.

const int lookup[16] = {-1, -1, 0, 0, 1, 1, 0, 0, -1, -1, 0, 0, 1, 1, 0, 0};

int a(int num) {
    return lookup[num & 0xF];
}


int d(int num){
    typedef unsigned int u32;
    typedef signed   int s32;

    // const int lookup[16]     = {-1, -1, 0, 0, 1, 1, 0, 0, -1, -1, 0, 0, 1, 1, 0, 0};
    // 2-bit signed 2's complement: 11 11 00 00 01 01 00 00 11 11 00 00 01 01 00 00
    // Hexadecimal:                   F     0     5     0     F     0     5     0
    const u32 K = 0xF050F050U;

    return (s32)(K<<(num+num)) >> 30;
}

int main(void){
    for(int i=0;i<16;i++){
        if(a(i) != d(i)){
            return !0;
        }
    }
    return 0;
}

See for yourself here: https://godbolt.org/z/AcJWWf


On the selection of the constant

Your lookup is for 16 very small constants between -1 and +1 inclusive. Each fits within 2 bits and there are 16 of them, which we may lay out as follows:

// const int lookup[16]     = {-1, -1, 0, 0, 1, 1, 0, 0, -1, -1, 0, 0, 1, 1, 0, 0};
// 2-bit signed 2's complement: 11 11 00 00 01 01 00 00 11 11 00 00 01 01 00 00
// Hexadecimal:                   F     0     5     0     F     0     5     0
u32 K = 0xF050F050U;

By placing them with index 0 nearest the most-significant bit, a single shift of 2*num will place the sign bit of your 2-bit number into the sign bit of the register. Shifting right the 2-bit number by 32-2=30 bits sign-extends it to a full int, completing the trick.

Verina answered 10/10, 2019 at 23:29 Comment(3)
This might just be the cleanest way to do it with a magic comment explaining how to regenerate it. Could you explain how you came up with it?Sixth
Accepted since this can be made 'clean' while also being fast. (via some preprocessor magic :) <xkcd.com/541 >)Sixth
Beats my branchless attempt: !!(12336 & (1<<x))-!!(771 & (1<<x));Rival
I
0

You can create the same effect using only arithmetic:

// produces : -1 -1 0 0 1 1 0 0 -1 -1 0 0 1 1 0 0 ...
int foo ( int x )
{
    return 1 - ( 3 & ( 0x46 >> ( x & 6 ) ) );
}

Even though, technically, this is still a (bitwise) lookup.

If the above seems too arcane, you can also do:

int foo ( int x )
{
    int const y = x & 6;
    return (y == 4) - !y;
}
Inaptitude answered 11/10, 2019 at 18:56 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.