Is 'switch' faster than 'if'?

Is a switch statement actually faster than an if statement?

I ran the code below on Visual Studio 2010's x64 C++ compiler with the /Ox flag:

#include <stdlib.h>
#include <stdio.h>
#include <time.h>

#define MAX_COUNT (1 << 29)
size_t counter = 0;

size_t testSwitch()
{
    clock_t start = clock();
    size_t i;
    for (i = 0; i < MAX_COUNT; i++)
    {
        switch (counter % 4 + 1)
        {
            case 1: counter += 4; break;
            case 2: counter += 3; break;
            case 3: counter += 2; break;
            case 4: counter += 1; break;
        }
    }
    return 1000 * (clock() - start) / CLOCKS_PER_SEC;
}

size_t testIf()
{
    clock_t start = clock();
    size_t i;
    for (i = 0; i < MAX_COUNT; i++)
    {
        const size_t c = counter % 4 + 1;
        if (c == 1) { counter += 4; }
        else if (c == 2) { counter += 3; }
        else if (c == 3) { counter += 2; }
        else if (c == 4) { counter += 1; }
    }
    return 1000 * (clock() - start) / CLOCKS_PER_SEC;
}

int main()
{
    printf("Starting...\n");
    printf("Switch statement: %u ms\n", testSwitch());
    printf("If     statement: %u ms\n", testIf());
}

and got these results:

Switch statement: 5261 ms
If statement: 5196 ms

From what I've learned, switch statements apparently use jump tables to optimize the branching.

Questions:

  1. What would a basic jump table look like, in x86 or x64?

  2. Is this code using a jump table?

  3. Why is there no performance difference in this example? Is there any situation in which there is a significant performance difference?


Disassembly of the code:

testIf:

13FE81B10 sub  rsp,48h 
13FE81B14 call qword ptr [__imp_clock (13FE81128h)] 
13FE81B1A mov  dword ptr [start],eax 
13FE81B1E mov  qword ptr [i],0 
13FE81B27 jmp  testIf+26h (13FE81B36h) 
13FE81B29 mov  rax,qword ptr [i] 
13FE81B2E inc  rax  
13FE81B31 mov  qword ptr [i],rax 
13FE81B36 cmp  qword ptr [i],20000000h 
13FE81B3F jae  testIf+0C3h (13FE81BD3h) 
13FE81B45 xor  edx,edx 
13FE81B47 mov  rax,qword ptr [counter (13FE835D0h)] 
13FE81B4E mov  ecx,4 
13FE81B53 div  rax,rcx 
13FE81B56 mov  rax,rdx 
13FE81B59 inc  rax  
13FE81B5C mov  qword ptr [c],rax 
13FE81B61 cmp  qword ptr [c],1 
13FE81B67 jne  testIf+6Dh (13FE81B7Dh) 
13FE81B69 mov  rax,qword ptr [counter (13FE835D0h)] 
13FE81B70 add  rax,4 
13FE81B74 mov  qword ptr [counter (13FE835D0h)],rax 
13FE81B7B jmp  testIf+0BEh (13FE81BCEh) 
13FE81B7D cmp  qword ptr [c],2 
13FE81B83 jne  testIf+89h (13FE81B99h) 
13FE81B85 mov  rax,qword ptr [counter (13FE835D0h)] 
13FE81B8C add  rax,3 
13FE81B90 mov  qword ptr [counter (13FE835D0h)],rax 
13FE81B97 jmp  testIf+0BEh (13FE81BCEh) 
13FE81B99 cmp  qword ptr [c],3 
13FE81B9F jne  testIf+0A5h (13FE81BB5h) 
13FE81BA1 mov  rax,qword ptr [counter (13FE835D0h)] 
13FE81BA8 add  rax,2 
13FE81BAC mov  qword ptr [counter (13FE835D0h)],rax 
13FE81BB3 jmp  testIf+0BEh (13FE81BCEh) 
13FE81BB5 cmp  qword ptr [c],4 
13FE81BBB jne  testIf+0BEh (13FE81BCEh) 
13FE81BBD mov  rax,qword ptr [counter (13FE835D0h)] 
13FE81BC4 inc  rax  
13FE81BC7 mov  qword ptr [counter (13FE835D0h)],rax 
13FE81BCE jmp  testIf+19h (13FE81B29h) 
13FE81BD3 call qword ptr [__imp_clock (13FE81128h)] 
13FE81BD9 sub  eax,dword ptr [start] 
13FE81BDD imul eax,eax,3E8h 
13FE81BE3 cdq       
13FE81BE4 mov  ecx,3E8h 
13FE81BE9 idiv eax,ecx 
13FE81BEB cdqe      
13FE81BED add  rsp,48h 
13FE81BF1 ret       

testSwitch:

13FE81C00 sub  rsp,48h 
13FE81C04 call qword ptr [__imp_clock (13FE81128h)] 
13FE81C0A mov  dword ptr [start],eax 
13FE81C0E mov  qword ptr [i],0 
13FE81C17 jmp  testSwitch+26h (13FE81C26h) 
13FE81C19 mov  rax,qword ptr [i] 
13FE81C1E inc  rax  
13FE81C21 mov  qword ptr [i],rax 
13FE81C26 cmp  qword ptr [i],20000000h 
13FE81C2F jae  testSwitch+0C5h (13FE81CC5h) 
13FE81C35 xor  edx,edx 
13FE81C37 mov  rax,qword ptr [counter (13FE835D0h)] 
13FE81C3E mov  ecx,4 
13FE81C43 div  rax,rcx 
13FE81C46 mov  rax,rdx 
13FE81C49 inc  rax  
13FE81C4C mov  qword ptr [rsp+30h],rax 
13FE81C51 cmp  qword ptr [rsp+30h],1 
13FE81C57 je   testSwitch+73h (13FE81C73h) 
13FE81C59 cmp  qword ptr [rsp+30h],2 
13FE81C5F je   testSwitch+87h (13FE81C87h) 
13FE81C61 cmp  qword ptr [rsp+30h],3 
13FE81C67 je   testSwitch+9Bh (13FE81C9Bh) 
13FE81C69 cmp  qword ptr [rsp+30h],4 
13FE81C6F je   testSwitch+0AFh (13FE81CAFh) 
13FE81C71 jmp  testSwitch+0C0h (13FE81CC0h) 
13FE81C73 mov  rax,qword ptr [counter (13FE835D0h)] 
13FE81C7A add  rax,4 
13FE81C7E mov  qword ptr [counter (13FE835D0h)],rax 
13FE81C85 jmp  testSwitch+0C0h (13FE81CC0h) 
13FE81C87 mov  rax,qword ptr [counter (13FE835D0h)] 
13FE81C8E add  rax,3 
13FE81C92 mov  qword ptr [counter (13FE835D0h)],rax 
13FE81C99 jmp  testSwitch+0C0h (13FE81CC0h) 
13FE81C9B mov  rax,qword ptr [counter (13FE835D0h)] 
13FE81CA2 add  rax,2 
13FE81CA6 mov  qword ptr [counter (13FE835D0h)],rax 
13FE81CAD jmp  testSwitch+0C0h (13FE81CC0h) 
13FE81CAF mov  rax,qword ptr [counter (13FE835D0h)] 
13FE81CB6 inc  rax  
13FE81CB9 mov  qword ptr [counter (13FE835D0h)],rax 
13FE81CC0 jmp  testSwitch+19h (13FE81C19h) 
13FE81CC5 call qword ptr [__imp_clock (13FE81128h)] 
13FE81CCB sub  eax,dword ptr [start] 
13FE81CCF imul eax,eax,3E8h 
13FE81CD5 cdq       
13FE81CD6 mov  ecx,3E8h 
13FE81CDB idiv eax,ecx 
13FE81CDD cdqe      
13FE81CDF add  rsp,48h 
13FE81CE3 ret       

Update:

Interesting results here. Not sure why one is faster and one is slower, though.

Jaine answered 24/7, 2011 at 5:0 Comment(21)
Your longer example is flawed: the compiler and optimizer can (and apparently trivially do) prove that the cases outside the range 1-4 cannot happen, and so these are elided, at least in the if case.Dethrone
@Hasturkun: You're absolutely right, I didn't see that happening. And furthermore, it's not just that -- even if I change it to 20, it's still the same. But once I change it to 21, the performance of switch beats that of if dramatically. It seems like the lack of a "default" case really affects this.Jaine
The mod 21 case is faster for the switch because it (at least on my compiler) does a single range comparison and skips over the jump table if out of range, while the if always does the comparisons in order.Dethrone
I would be surprised if the compiler did generate a jump table. It would just introduce a table of big 64 bit pointers and a load of pointless boilerplate to branch to/from the target.Unassailable
@Hasturkun: You've seen the ASM proving this? Post it if you have. Like @PacketScience I would be very surprised if that kind of an optimization is still in common use. (Though @PacketScience, it wouldn't be 64 bit pointers; on x86 these jumps would probably be indirect (needing only a char of space for the offset))Ihram
What on Earth are the people voting to close this thinking? Are they such believers in the notion of the perfectly optimizing compiler that any thought of its generating less than ideal code is heresy? Does the very idea of any optimization anywhere offend them?Wanids
What exactly is wrong with this question?Influence
The "cmp qword ptr [rsp+30h],1" and "je testSwitch+87h (13FE81C87h)" etc. lines are clear as day... cmp is compare, je is jump if equal... the compiler obviously hasn't generated a jump table in this case. The time difference you measure is a random error while comparing if/else machine code to itself. Did you really try to read the assembly?Leahleahey
@BillyONeal: I was talking specifically about the mod 21 with 20 cases, which is actually worse than I thought originally, since after the first step, counter = 20, which makes it skip the switch (and its jump table) entirely since 21 > 20. (I am using x86, btw, not a 64-bit platform.) The if code ends up checking 1-20, making it slower. The mod 20 case only ever uses the first case, so it's hot code and caches well.Dethrone
To anyone wondering what is wrong with this question: For starters, it is not a question, it is 3 questions, which means that many of the answers now address different issues. This means that it will be hard to accept any answer that answers everything. Additionally, the typical knee-jerk reaction to the above question is to close it as "not really all that interesting" mostly due to the fact that at this level of optimization, you're almost always prematurely optimizing. Lastly, 5196 vs. 5261 shouldn't be enough to actually care. Write the logical code that makes sense.Edithe
The Close as Not Constructive description: "This question is not a good fit to our Q&A format. We expect answers to generally involve facts, references, or specific expertise; this question will likely solicit opinion, debate, arguments, polling, or extended discussion"Wheat
@Lasse: Would you really have preferred me to post three questions on SO instead? Also: 5196 vs. 5261 shouldn't be enough to actually care --> I'm not sure if you misunderstood the question or if I misunderstand your comment, but isn't the whole point of my question to ask why there isn't a difference? (Did I ever claim that's a significant difference to care about?)Jaine
@Robert: Yes, I can indeed read the FAQ. :) But which part of that are you referring to? (Am I really "polling" people here? If not, which parts do you mean?)Jaine
Everything else but polling. Opinion, debate, arguments and extended discussion. The question came onto the mod radar because one answer has more than 20 comments on it.Wheat
I would've preferred it if you asked 1 question so that the question fit the SO paradigm. Also, we're butting up against so many unknown factors here, like cpu pipelines, jump-prediction, cache handling, etc. that there is likely nobody that can really answer the question, except the compiler. And that's why I say you shouldn't care, even if you have numbers that show that they're equal or that they're slightly different.Edithe
@Robert: Well it only has more than 20 comments on it because they're meta-comments. There's only 7 comments actually related to the question here. Opinion: I don't see how there's "opinion" here. There's a reason that I'm not seeing a performance difference, no? Is it just taste? Debate: Maybe, but it looks like a healthy kind of debate to me, like I've seen on other places on SO (let me know if there's anything counter to that). Arguments: I don't see anything argumentative here (unless you're taking it as a synonym for 'debate'?). Extended discussion: If you include these meta-comments.Jaine
I'm just trying to tell you why your question was closed the first time. The FAQ states that "You should only ask practical, answerable questions based on actual problems that you face." While this question might be interesting, it's hard to imagine how it qualifies as a practical question based on an actual problem.Wheat
Which is why you don't see us competing to close it either. There is good content in this question, but for the future, try to avoid asking more than one question, make the other questions derivatives (ie. if the one question you asked is answered, the answers of the unasked questions are implied), or similar, that way it will be more of a fit with the question-answer model of SO. Right now you end up with a sort-of poll where people vote on the different answers, but you can still only accept one of them.Edithe
@Robert: The actual problem is using switch versus if. I've run across that problem tons of times (e.g. for a lexer I'm making for D, if you really need a specific example) and I'm trying to see why I would or wouldn't see a performance difference. Of course, the example is more general because the point isn't to make a lexer (that was just one situation, out of many... another included a matrix multiply optimization I had, etc.). Just because it's hard to imagine it being practical doesn't mean it isn't!Jaine
@Lasse: "for the future, try to avoid asking more than one question" --> Definitely; it might be hard to avoid it all the time but I'll try. Thanks for the advice! :)Jaine
I'm surprised none of icc/gcc/clang -O3 (godbolt.org/g/ovVJHU) notice that they could implement this as counter += 4 - (counter%4), i.e. counter += 4 - (counter&3). Also, that once we reach the counter += 4; case, it always repeats so you can just do counter += 4 * i; break;.Jemappes

There are several optimizations a compiler can make on a switch. I don't think the oft-mentioned "jump-table" is a very useful one though, as it only works when the input can be bounded some way.

C pseudocode for a "jump table" would be something like the sketch below -- note that in practice the compiler would need to insert some form of if test around the table to ensure that the input is valid for the table. Note also that it only works in the specific case where the input is a run of consecutive numbers.
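
A minimal sketch of the idea (assuming a dense, zero-based set of cases; the handler names here are made up purely for illustration):

typedef void (*handler_t)(void);

void handle0(void) { /* case 0 */ }
void handle1(void) { /* case 1 */ }
void handle2(void) { /* case 2 */ }
void handle3(void) { /* case 3 */ }

static handler_t jump_table[4] = { handle0, handle1, handle2, handle3 };

void dispatch(unsigned x)
{
    if (x < 4)              /* the bounds check the compiler has to add */
        jump_table[x]();    /* one indirect call instead of N comparisons */
    /* else: fall through to whatever the default case does */
}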

If the number of branches in a switch is extremely large, a compiler can do things like using a binary search on the values of the switch, which (in my mind) would be a much more useful optimization, as it significantly increases performance in some scenarios, is as general as a switch is, and does not result in greater generated code size. But your test code would need a LOT more branches for that to show any difference.
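
Written out by hand, such a binary-search lowering might look roughly like this (the case values and handler names are invented purely for illustration):

void handle10(void)    { /* ... */ }
void handle55(void)    { /* ... */ }
void handle300(void)   { /* ... */ }
void handle1024(void)  { /* ... */ }
void handle90000(void) { /* ... */ }

void dispatch_sparse(unsigned x)
{
    /* Compare against a middle value first, then only search the half that
       can still match: worst case is about log2(N) comparisons, not N. */
    if (x < 300) {
        if (x == 10)           handle10();
        else if (x == 55)      handle55();
    } else {
        if (x == 300)          handle300();
        else if (x == 1024)    handle1024();
        else if (x == 90000)   handle90000();
    }
    /* anything else: the default case */
}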

To answer your specific questions:

  1. Clang generates one that looks like this:

    test_switch(char):                       # @test_switch(char)
            movl    %edi, %eax
            cmpl    $19, %edi
            jbe     .LBB0_1
            retq
    .LBB0_1:
            jmpq    *.LJTI0_0(,%rax,8)
            jmp     void call<0u>()         # TAILCALL
            jmp     void call<1u>()         # TAILCALL
            jmp     void call<2u>()         # TAILCALL
            jmp     void call<3u>()         # TAILCALL
            jmp     void call<4u>()         # TAILCALL
            jmp     void call<5u>()         # TAILCALL
            jmp     void call<6u>()         # TAILCALL
            jmp     void call<7u>()         # TAILCALL
            jmp     void call<8u>()         # TAILCALL
            jmp     void call<9u>()         # TAILCALL
            jmp     void call<10u>()        # TAILCALL
            jmp     void call<11u>()        # TAILCALL
            jmp     void call<12u>()        # TAILCALL
            jmp     void call<13u>()        # TAILCALL
            jmp     void call<14u>()        # TAILCALL
            jmp     void call<15u>()        # TAILCALL
            jmp     void call<16u>()        # TAILCALL
            jmp     void call<17u>()        # TAILCALL
            jmp     void call<18u>()        # TAILCALL
            jmp     void call<19u>()        # TAILCALL
    .LJTI0_0:
            .quad   .LBB0_2
            .quad   .LBB0_3
            .quad   .LBB0_4
            .quad   .LBB0_5
            .quad   .LBB0_6
            .quad   .LBB0_7
            .quad   .LBB0_8
            .quad   .LBB0_9
            .quad   .LBB0_10
            .quad   .LBB0_11
            .quad   .LBB0_12
            .quad   .LBB0_13
            .quad   .LBB0_14
            .quad   .LBB0_15
            .quad   .LBB0_16
            .quad   .LBB0_17
            .quad   .LBB0_18
            .quad   .LBB0_19
            .quad   .LBB0_20
            .quad   .LBB0_21
    
  2. I can say that it is not using a jump table -- 4 comparison instructions are clearly visible:

    13FE81C51 cmp  qword ptr [rsp+30h],1 
    13FE81C57 je   testSwitch+73h (13FE81C73h) 
    13FE81C59 cmp  qword ptr [rsp+30h],2 
    13FE81C5F je   testSwitch+87h (13FE81C87h) 
    13FE81C61 cmp  qword ptr [rsp+30h],3 
    13FE81C67 je   testSwitch+9Bh (13FE81C9Bh) 
    13FE81C69 cmp  qword ptr [rsp+30h],4 
    13FE81C6F je   testSwitch+0AFh (13FE81CAFh) 
    

    A jump-table-based solution does not compare against each case at all; at most there is a single range check (as in the Clang output above), followed by one indirect jump.

  3. Either not enough branches to cause the compiler to generate a jump table, or your compiler simply doesn't generate them. I'm not sure which.

EDIT 2014: There has been some discussion elsewhere from people familiar with the LLVM optimizer saying that the jump table optimization can be important in many scenarios; e.g. in cases where there is an enumeration with many values and many cases against values in said enumeration. That said, I stand by what I said above in 2011 -- too often I see people thinking "if I make it a switch, it'll be the same time no matter how many cases I have" -- and that's completely false. Even with a jump table you get the indirect jump cost and you pay for entries in the table for each case; and memory bandwidth is a Big Deal on modern hardware.

Write code for readability. Any compiler worth its salt is going to see an if / else if ladder and transform it into equivalent switch or vice versa if it would be faster to do so.

Ihram answered 24/7, 2011 at 5:9 Comment(14)
+1 for actually answering the question, and for useful info. :-) However, a question: From what I understand, a jump table uses indirect jumps; is that correct? If so, isn't that usually slower due to more difficult prefetching/pipelining?Jaine
@Mehrdad: Yes, it uses indirect jumps. However, one indirect jump (with the pipeline stall it comes with) may be less than hundreds of direct jumps. :)Ihram
I see... so I'd probably need more than 4 comparisons, eh? :) Interesting!Jaine
The results are pretty interesting apparently! Seems like the if was even faster when there were more options. Any ideas?Jaine
@Mehrdad: No, unfortunately. :( I'm glad I'm in the camp of people who always think the IF is more readable! :)Ihram
Few quips - "[switches] only works when the input can be bounded some way" "need to insert some form of if test around the table to ensure that the input was valid in the table. Note also that it only works in the specific case that the input is a run of consecutive numbers.": it's entirely possible to have a sparsely populated table, where the potential pointer is read and only if non-NULL is a jump performed, otherwise the default case if any is jumped to, then the switch exits. Soren's said several other things I wanted to say after reading this answer.Leahleahey
@Tony: True; you could populate missing parts of the jump table with pointers pointing after the body of the switch. But the cost in code size is almost certainly greater than the IF tests in such cases. Memory is the bottleneck on modern CPUs, not execution time.Ihram
Accepting, since this clearly answers most parts of my question, even though it's missing some minor parts. Thanks!Jaine
"Any compiler worth its salt is going to see an if / else if ladder and transform it into equivalent switch or vice versa" - any support for this assertion? a compiler might assume that the order of your if clauses has already been hand-tuned to match frequency and relative performance needs, where as a switch is traditionally seen as an open invitation to optimise however the compiler chooses. Good point re jumping past switch :-). Code size depends on cases/range - could be better. Finally, some enums, bit fields and char scenarios are inherently valid/bounded & overhead free.Leahleahey
@TonyD: If the compiler would assume that, then there's no reason it wouldn't make the same assumption about the order of case statements in your switch and refuse to reorder there. That said, compilers usually have in-source hints or profile guided optimization tools to tell the optimizer you think a case is likely rather than relying on source code order. Sure, there are plenty of scenarios where the input can be bound. But these cases are less common. I don't intend to say "jump tables are useless" -- I intend to say "jump tables are not a magic bullet you get by using switch"Ihram
@TonyD: (Note that enums are typically not cases where the compiler can bound the input because enum by default in C and C++ is just an alias to int)Ihram
@BillyONeal: for C++, that starts from C++11 - before that the underlying type was unspecified but required to be large enough to cover 0 through to the number formed by taking the largest enumeration and turning all the less-significant bits on (e.g. 0xD -> 0xF) - bit different when there were negative enumerations and I can't remember off the top of my head. Even with C++11, I'm not sure if knowing the underlying type is an int requires defined behaviour when set outside the "old range"... interesting question.Leahleahey
@BillyONeal: In C++11, for an enum without a fixed underlying type "the values of the enumeration are the values in the range bmin to bmax", and per 5.2.9/10 when setting it to anything else "the resulting value is unspecified (and might not be in that range)", so only if an implementation happens to keep it in the range (e.g. bitwise-AND enums to keep them <= bmax) then the integral value would be bounded without any action by switch.Leahleahey
@TonyD: I know its been a few years but note that Clang generates the same code for the switch vs the if: goo.gl/VSi2afIhram

To your questions:

1. What would a basic jump table look like, in x86 or x64?

A jump table is a region of memory that holds pointers to labels, laid out like an array. The following example will help you understand how a jump table is laid out:

00B14538  D8 09 AB 00 D8 09 AB 00 D8 09 AB 00 D8 09 AB 00  Ø.«.Ø.«.Ø.«.Ø.«.
00B14548  D8 09 AB 00 D8 09 AB 00 D8 09 AB 00 00 00 00 00  Ø.«.Ø.«.Ø.«.....
00B14558  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00B14568  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................


Here 00B14538 is the address of the jump table, and each value like D8 09 AB 00 is a label (code) pointer.

2. Is this code using a jump table?

No, not in this case.

3. Why is there no performance difference in this example?

There is no performance difference because the generated instructions look the same in both cases; there is no jump table.

4. Is there any situation in which there is a significant performance difference?

If you have a very long sequence of if checks, then a jump table improves performance (branch/jmp instructions are expensive if they are not predicted near-perfectly), but it comes at the cost of memory.

The code for all the compare instructions has some size, too, so especially with 32-bit pointers or offsets, a single jump table lookup might not cost a lot more size in an executable.

Conclusion: the compiler is smart enough to handle such cases and generate appropriate instructions. :)

Lerma answered 24/7, 2011 at 7:14 Comment(1)
(edit: nvm, Billy's answer already has what I was suggesting. I guess this is a nice supplement.) It would be good to include gcc -S output: a sequence of .long L1 / .long L2 table entries is more meaningful than a hexdump, and more useful to someone who wants to learn how to look at compiler output. (Although I guess you'd just look at the switch code to see if it was an indirect jmp or a bunch of jcc).Jemappes

The compiler is free to compile the switch statement as code equivalent to an if-statement, or to create a jump table. It will likely choose one or the other based on what will execute fastest or generate the smallest code, depending somewhat on what you have specified in your compiler options -- so in the worst case it will be the same speed as the if-statements.

I would trust the compiler to do the best choice and focus on what makes the code most readable.

If the number of cases becomes very large, a jump table will be much faster than a series of ifs. However, if the gaps between the values are very large, the jump table can become large, and the compiler may choose not to generate one.

Kus answered 24/7, 2011 at 5:8 Comment(7)
I don't think this answers the OP's question. At all.Ihram
The basic question was which is faster?Kus
@Soren: If that was the "basic question" then I wouldn't have bothered with the 179 other lines in the question, it'd have just been 1 line. :-)Jaine
@Soren: I see at least 3 numbered sub-questions as part of the OP's question. You've merely trumpeted the exact same answer which applies to all "performance" questions -- namely, that you have to measure first. Consider that maybe Mehrdad has already measured, and has isolated this piece of code to be a hot spot. In such cases, your answer is worse than worthless, it is noise.Ihram
There is a blurred line between what is a jump table and what is not depending on your definition. I have provided information on sub-question part 3.Kus
@Billy ONeal: Yes. The reason it's trumpeted is because it's the only right answer.Precipitate
@wnoise: If it's the only right answer then there would never be a reason to ever ask any performance question. However, there are some of us in the real world who do measure our software, and we sometimes don't know how to make a piece of code faster once it has been measured. It's obvious that Mehrdad put some effort into this question before asking it; and I think his specific questions are more than answerable.Ihram

How do you know your computer was not performing some task unrelated to the test during the switch test loop, and performing fewer tasks during the if test loop? Your test results do not show anything, because:

  1. the difference is very small
  2. there is only one result, not a series of results
  3. there are too few cases

My results:

I added:

printf("counter: %u\n", counter);

to the end so that the loop would not be optimised away -- counter was never used in your example, so why would the compiler perform the loop at all? Once I did that, the switch always won, even with such a micro-benchmark.

The other problem with your code is:

switch (counter % 4 + 1)

in your switch loop, versus

const size_t c = counter % 4 + 1; 

in your if loop. It makes a very big difference if you fix that. I believe that putting the expression inside the switch statement provokes the compiler into feeding the value directly into a CPU register rather than putting it on the stack first. This is therefore in favour of the switch statement and not a balanced test.

Oh and I think you should also reset counter between tests. In fact, you probably should be using some kind of random number instead of +1, +2, +3 etc, as it will probably optimise something there. By random number, I mean a number based on the current time, for example. Otherwise, the compiler could turn both of your functions into one long math operation and not even bother with any loops.

I have modified Ryan's code just enough to make sure the compiler couldn't figure things out before the code had run:

#include <stdlib.h>
#include <stdio.h>
#include <time.h>

#define MAX_COUNT (1 << 26)
size_t counter = 0;

long long testSwitch()
{
    clock_t start = clock();
    size_t i;
    for (i = 0; i < MAX_COUNT; i++)
    {
        const size_t c = rand() % 20 + 1;

        switch (c)
        {
                case 1: counter += 20; break;
                case 2: counter += 33; break;
                case 3: counter += 62; break;
                case 4: counter += 15; break;
                case 5: counter += 416; break;
                case 6: counter += 3545; break;
                case 7: counter += 23; break;
                case 8: counter += 81; break;
                case 9: counter += 256; break;
                case 10: counter += 15865; break;
                case 11: counter += 3234; break;
                case 12: counter += 22345; break;
                case 13: counter += 1242; break;
                case 14: counter += 12341; break;
                case 15: counter += 41; break;
                case 16: counter += 34321; break;
                case 17: counter += 232; break;
                case 18: counter += 144231; break;
                case 19: counter += 32; break;
                case 20: counter += 1231; break;
        }
    }
    return 1000 * (long long)(clock() - start) / CLOCKS_PER_SEC;
}

long long testIf()
{
    clock_t start = clock();
    size_t i;
    for (i = 0; i < MAX_COUNT; i++)
    {
        const size_t c = rand() % 20 + 1;
        if (c == 1) { counter += 20; }
        else if (c == 2) { counter += 33; }
        else if (c == 3) { counter += 62; }
        else if (c == 4) { counter += 15; }
        else if (c == 5) { counter += 416; }
        else if (c == 6) { counter += 3545; }
        else if (c == 7) { counter += 23; }
        else if (c == 8) { counter += 81; }
        else if (c == 9) { counter += 256; }
        else if (c == 10) { counter += 15865; }
        else if (c == 11) { counter += 3234; }
        else if (c == 12) { counter += 22345; }
        else if (c == 13) { counter += 1242; }
        else if (c == 14) { counter += 12341; }
        else if (c == 15) { counter += 41; }
        else if (c == 16) { counter += 34321; }
        else if (c == 17) { counter += 232; }
        else if (c == 18) { counter += 144231; }
        else if (c == 19) { counter += 32; }
        else if (c == 20) { counter += 1231; }
    }
    return 1000 * (long long)(clock() - start) / CLOCKS_PER_SEC;
}

int main()
{
    srand(time(NULL));
    printf("Starting...\n");
    printf("Switch statement: %lld ms\n", testSwitch()); fflush(stdout);
    printf("counter: %d\n", counter);
    counter = 0;
    srand(time(NULL));
    printf("If     statement: %lld ms\n", testIf()); fflush(stdout);
    printf("counter: %d\n", counter);
} 

switch: 3740
if: 3980

(similar results over multiple attempts)

I also reduced the number of cases/ifs to 5 and the switch function still won.

Blenheim answered 24/7, 2011 at 6:19 Comment(20)
Idk, I can't prove it; do you get different results?Jaine
+1: Benchmarking is difficult, and you really can't draw any conclusions from a small time difference on a single run on a normal computer. You may try to run a large number of tests and do some statistics on the results. Or counting processor cycles on controlled execution in an emulator.Classroom
Er, where exactly did you add the print statement? I added it at the end of the entire program and saw no difference. I also don't understand what the "problem" with the other one is... mind explaining what the "very big difference" is?Jaine
@BobTurbo: And more importantly: What were your timings?Jaine
I added the printf in the same place you did. It is probably a compiler difference if you don't see anything. I also made sure that c was declared before using it in the switch. No idea why that sped things up, but it made a very large difference. I also increased switches/ifs to 10, and the results were: 45ms for switch, 45983493ms or so for if.Blenheim
@BobTurbo: 45983493 is over 12 hours. Was that a typo?Quince
@BobTurbo: take a look at: ideone.com/P6ybm. This shows the run on IDEone.com, where the if block is faster even with your changes.Churchyard
@BobTurbo: You made the classic mistake of overflowing your arithmetic (I did that a bunch of times, too). Double-check your code.Jaine
no it is a typo as I couldn't be bothered typing in the exact number.Blenheim
actually maybe I did overflow the output.... I didn't bother to check that part of the code.Blenheim
great, now I have to go do it again :)Blenheim
@Ryan Gross: I am getting the same results as you on your code, but I will make some changes to ensure it is actually testing the switch vs if.. back in a sec.Blenheim
The problem with this is that you've made the time taken effectively random. For a true test of the differences here you need to initialize the random number generator with srand to the same seed before each test. Also, I'm not sure this works as a reasonable test with rand being run inside the loop. At least with the same seed, though, its overhead should be consistent between runs.Ihram
1. the values generated are irrelevant. 2. if I have introduced a random variable, that means it is in fact still perfectly valid as the randomness cancels itself out over multiple tests. But as each test produces the same result every single time (switch wins), your comment is not true.Blenheim
It is not ideal to use rand as the random number generator isn't really random, but it is enough to fool the compiler.Blenheim
rand() should have a random spread so % performance will be fairly constant. This could be improved by making the cases/ifs a power of 2, which will make the % more likely to be a constant operation due to optimisations. Either way, the randomness is cancelled out (if there is any). In fact, you could just remove the % and replace it with & and a power-of-2 number of ifs and cases... but I can't be bothered and it will result in the same thing - switch is faster than if unless the compiler can optimise it into something else due to, for example, the values being known before runtime.Blenheim
@BobTurbo: The values generated are relevant because they control which branch of the switch or if/else ladder gets taken. Lower results (from the mod operation) will make whichever setup happened to get lower mods look better, because fewer comparisons are being performed. The values you posted are so close I doubt the compiler is doing huge (jump table like) transformations. (My guess is that it did do the binary search optimization but that's not too helpful with only 20 cases). If you want to benchmark comparisons you need to compare the same thing. :)Ihram
@Billy ONeal, I think that the percentage each if/case is hit will be close enough to even that it won't affect the results. Either way, as I said, the test is repeated over and over again, each time with the switch winning. If both were equally fast, this would have a close to 1/infinity chance of occurring by accident.Blenheim
@Bob: You didn't show any kind of statistical analysis of these tests... I don't see anything showing consistency. As for the small percentages, you're showing a six percent difference. If the difference was larger I could see making the assumption that switch is always faster, but with differences that small I suspect there's little to no difference in real world cases.Ihram
@BobTurbo: Better late than never? As Billy pointed out some comments ago, you've got rand() inside the loop. Shouldn't you generate the random numbers into an array before you start timing? As it is, you are comparing "cost of rand plus if" against "cost of rand plus switch". If rand takes a lot more time than either the if or the switch, you could be seriously diluting the comparison.Umbilication

A good optimizing compiler such as MSVC can generate:

  1. a simple jump table if the cases are arranged in a nice long range
  2. a sparse (two-level) jump table if there are many gaps
  3. a series of ifs if the number of cases is small or the values are not close together
  4. a combination of the above if the cases represent several groups of closely-spaced ranges.

In short, if the switch looks to be slower than a series of ifs, the compiler might just convert it to one. And it's likely to be not just a sequence of comparisons for each case, but a binary search tree. See here for an example.

Fomentation answered 25/7, 2011 at 7:59 Comment(1)
Actually, a compiler is also able to replace it with a hash and jump, which performs better than the sparse two-level solution you propose.Calia

Here are some results from the old (now hard to find) bench++ benchmark:

Test Name:   F000003                         Class Name:  Style
CPU Time:       0.781  nanoseconds           plus or minus     0.0715
Wall/CPU:        1.00  ratio.                Iteration Count:  1677721600
Test Description:
 Time to test a global using a 2-way if/else if statement
 compare this test with F000004

Test Name:   F000004                         Class Name:  Style
CPU Time:        1.53  nanoseconds           plus or minus     0.0767
Wall/CPU:        1.00  ratio.                Iteration Count:  1677721600
Test Description:
 Time to test a global using a 2-way switch statement
 compare this test with F000003

Test Name:   F000005                         Class Name:  Style
CPU Time:        7.70  nanoseconds           plus or minus      0.385
Wall/CPU:        1.00  ratio.                Iteration Count:  1677721600
Test Description:
 Time to test a global using a 10-way if/else if statement
 compare this test with F000006

Test Name:   F000006                         Class Name:  Style
CPU Time:        2.00  nanoseconds           plus or minus     0.0999
Wall/CPU:        1.00  ratio.                Iteration Count:  1677721600
Test Description:
 Time to test a global using a 10-way switch statement
 compare this test with F000005

Test Name:   F000007                         Class Name:  Style
CPU Time:        3.41  nanoseconds           plus or minus      0.171
Wall/CPU:        1.00  ratio.                Iteration Count:  1677721600
Test Description:
 Time to test a global using a 10-way sparse switch statement
 compare this test with F000005 and F000006

What we can see from this is that (on this machine, with this compiler -- VC++ 9.0 x64), each if test takes about 0.7 nanoseconds. As the number of tests goes up, the time scales almost perfectly linearly.

With the switch statement, there's almost no difference in speed between a 2-way and a 10-way test, as long as the values are dense. The 10-way test with sparse values takes about 1.6x as much time as the 10-way test with dense values -- but even with sparse values, still better than twice the speed of a 10-way if/else if.

Bottom line: using only a 4-way test won't really show you much about the performance of switch vs if/else. If you look at the numbers from this code, it's pretty easy to interpolate the fact that for a 4-way test, we'd expect the two to produce pretty similar results (~2.8 nanoseconds for an if/else, ~2.0 for switch).

Eagan answered 26/7, 2012 at 3:20 Comment(1)
Bit hard to know what to make of that if we don't know whether the test deliberately seeks a value not matched by or only matched at the end of the if/else chain vs. scattering them etc.. Can't find the bench++ sources after 10 minutes googling.Leahleahey

I'll answer 2) and make some general comments. 2) No, there is no jump table in the assembly code you've posted. A jump table is a table of jump destinations, and one or two instructions to jump directly to an indexed location from the table. A jump table would make more sense when there are many possible switch destinations. Maybe the optimiser knows that simple if else logic is faster unless the number of destinations is greater than some threshold. Try your example again with say 20 possibilities instead of 4.

Mako answered 24/7, 2011 at 5:12 Comment(1)
+1 thanks for the answer to #2! :) (Btw, here are the results with more possibilities.)Jaine

I was intrigued, and took a look at what I could change about your example to get it to run the switch statement faster.

If you get to 40 if statements, and add a 0 case, then the if block will run slower than the equivalent switch statement. I have the results here: https://www.ideone.com/KZeCz.

The effect of removing the 0 case can be seen here: https://www.ideone.com/LFnrX.

Churchyard answered 24/7, 2011 at 16:21 Comment(1)
Your links are broken.Flip

Note that when a switch is NOT compiled to a jump table, you can very often write ifs more efficiently than the switch...

(1) If the cases have an ordering, then rather than testing all N in the worst case, you can write your ifs to first test whether the value is in the upper or lower half, then repeat within each half, binary-search style, so the worst case is log N tests rather than N (see the sketch after these two points).

(2) If certain cases/groups are far more frequent than other cases, then designing your ifs to isolate those cases first can reduce the average time through.
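
As a tiny illustration of (1), using the four cases from the question (where c is known to be 1 to 4), a hand-rolled binary-search ladder needs at most two tests instead of up to four:

/* c is 1..4, as in the question's example */
if (c <= 2) {
    if (c == 1) counter += 4;
    else        counter += 3;   /* c == 2 */
} else {
    if (c == 3) counter += 2;
    else        counter += 1;   /* c == 4 */
}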

Arrowhead answered 1/10, 2011 at 9:22 Comment(4)
This is markedly untrue; compilers are more than capable of making BOTH of these optimizations.Calia
Alice, how is a compiler supposed to know which cases will occur more commonly than other cases in your expected workloads? (A: It can't possibly know, so it can't possibly do such an optimization.)Arrowhead
(1) can be done easily, and is done in some compilers, by simply doing a binary search. (2) can be predicted in a variety of ways, or indicated to the compiler. Have you never used GCC's "likely" or "unlikely"?Calia
And some compilers allow you to run the program in a mode that gathers statistics, and then optimize using that information.Kiger

No. These are if-then-jump, else if-then-jump, else... A jump table would have a table of addresses, or use a hash, or something like that.

Faster or slower depends on usage. You could, for example, have case 1 be the last thing instead of the first, and if your test program or real-world program used case 1 most of the time, the code would be slower with this implementation. So just rearranging the case list can, depending on the implementation, make a big difference.

If you had used cases 0-3 instead of 1-4, the compiler might have used a jump table, although it should have figured out how to remove your +1 anyway. Perhaps it was the small number of items. Had you made it 0-15 or 0-31, for example, it may have implemented it with a table or used some other shortcut. The compiler is free to choose how it implements things so long as it meets the functionality of the source code. And this gets into compiler differences, version differences, and optimization differences. If you want a jump table, make a jump table; if you want an if-then-else tree, make an if-then-else tree. If you want the compiler to decide, use a switch/case statement.
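
For what it's worth, in the question's particular example you don't even need a table of jump targets; a plain data table indexed by the selector gives the same result (a sketch of the idea, not something any particular compiler is claimed to emit):

#include <stddef.h>

/* increment for counter % 4 == 0, 1, 2, 3 -- matching cases 1..4 */
static const size_t increments[4] = { 4, 3, 2, 1 };

static void step(size_t *counter)
{
    *counter += increments[*counter % 4];   /* no branches at all */
}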

Merous answered 24/7, 2011 at 14:0 Comment(0)

Not sure why one is faster and one is slower, though.

That is actually not too hard to explain... If you remember that mispredicted branches are tens to hundreds of times more expensive than correctly predicted branches.

In the % 20 version, the first case/if is always the one that hits. Modern CPUs "learn" which branches are usually taken and which are not, so they can easily predict how this branch will behave on almost every iteration of the loop. That explains why the "if" version flies; it never has to execute anything past the first test, and it (correctly) predicts the result of that test for most of the iterations. Obviously the "switch" is implemented slightly differently -- perhaps even a jump table, which can be slow thanks to the computed branch.

In the % 21 version, the branches are essentially random. So not only do many of them execute every iteration, the CPU cannot guess which way they will go. This is the case where a jump table (or other "switch" optimization) is likely to help.

It is very hard to predict how a piece of code is going to perform with a modern compiler and CPU, and it gets harder with every generation. The best advice is "don't even bother trying; always profile". That advice gets better -- and the set of people who can ignore it successfully gets smaller -- every year.

All of which is to say that my explanation above is largely a guess. :-)

Blossom answered 25/7, 2011 at 2:33 Comment(5)
I don't see where hundreds of times slower can come from. Worst case of a mispredicted branch is a pipeline stall, which would be ~20 times slower on most modern CPUs. Not hundreds of times. (Okay, if you're using an old NetBurst chip it might be 35x slower...)Ihram
@Billy: OK, so I am looking ahead a little. On Sandy Bridge processors, "Each mispredicted branch will flush the entire pipeline, losing the work of up to a hundred or so in-flight instructions". The pipelines really do get deeper with every generation, in general...Blossom
Not true. The P4 (NetBurst) had 31 pipeline stages; Sandy Bridge has significantly fewer stages. I think the "losing the work of 100 or so instructions" is under the assumption that the instruction cache gets invalidated. For a general indirect jump that does in fact happen, but for something like a jump table it's likely the target of the indirect jump lies somewhere in the instruction cache.Ihram
@Billy: I do not think we disagree. My statement was: "Mispredicted branches are tens to hundreds of times more expensive than correctly predicted branches". A slight exaggeration, perhaps... But there is more going on than just hits in the I-cache and execution pipeline depth; from what I have read, the queue for decode alone is ~20 instructions.Blossom
If the branch prediction hardware mispredicts the execution path, the uops from the incorrect path which are in the instruction pipeline are simply removed where they are, without stalling execution. I have no idea how this is possible (or whether I'm misinterpreting it), but apparently there are no pipeline stalls with mispredicted branches in Nehalem? (Then again, I don't have an i7; I have an i5, so this doesn't apply to my case.)Jaine

Neither. In most of the particular cases where you go down to the assembler and make real measurements of performance, your question is simply the wrong one. For the given example, your thinking definitely falls short, since

counter += (4 - counter % 4);

looks to me to be the correct increment expression that you should be using: for counter % 4 equal to 0, 1, 2, or 3, the switch selects cases 1 through 4 and adds 4, 3, 2, or 1 respectively, which is exactly 4 - counter % 4.

Tinder answered 24/7, 2011 at 7:55 Comment(0)
