C++: Performance impact of if inside loops
I need to iterate over lots of (2D) data and only sometimes handle special cases. For my application speed is the most critical factor.

The options that quickly come to (my) mind are:

Option A:

  • more readable
  • performance hit due to comparisons inside loops?
void ifInLoop(bool specialCase, MyClass &acc) {
  for (auto i = 0; i < n; ++i) {
    for (auto j = 0; j < n; ++j) {
      if (specialCase) {
        acc.foo();
      } else {
        acc.bar();
      }
    }
  }
}

Option B:

  • Code duplication
void loopsInIf(bool specialCase, MyClass &acc) {
  if (specialCase) {
    for (auto i = 0; i < n; ++i) {
      for (auto j = 0; j < n; ++j) {
        acc.foo();
      }
    }
  } else {
    for (auto i = 0; i < n; ++i) {
      for (auto j = 0; j < n; ++j) {
        acc.bar();
      }
    }
  }
}

Option C:

  • Templates
  • Ugly to call
  • basically same as B?
template <bool specialCase> 
void templateIf(MyClass &acc) {
  for (auto i = 0; i < n; ++i) {
    for (auto j = 0; j < n; ++j) {
      if (specialCase) {
        acc.foo();
      } else {
        acc.bar();
      }
    }
  }
}

I know this falls under premature optimization. However, from a theoretical point of view, I would be interested in the differences between these snippets when compiled with -O3 (GCC / Clang), with respect to produced assembly and speed.

(There already exists a similar question about this in Perl but I would like to know about C++ specifically.)

(EDIT) Is specialCase known at compile time?

Not really. The call itself is in another loop and some iterations are treated differently. So something like this (the pattern is not necessarily equidistant, but it is independent of user input):

for (int i = 0; i < m; ++i) {
  ifInLoop(i % 10, acc);
}

How would I use Option C here? It introduces an extra if, so I would expect it to be very similar to B.

for (int i = 0; i < m; ++i) {
  if (i % 10)
    templateIf<true>(acc);
  else
    templateIf<false>(acc);
}
Bravura answered 13/7, 2019 at 10:17 Comment(12)
Profile it. There's no silver bullet. That said, the bottleneck of your application is unlikely to be that one.Shaunta
This site lets you see the resulting assembly code: godbolt.orgReactant
"However, from a theoretical point of view, I would be interested in the differences of these snippets when compiled with -O3 (GCC / Clang) with respect to produced assembly and speed." What's stopping you from compiling the code with relevant flags, and looking at the differences in assembly?Alejandraalejandrina
You can use option B without code duplication by extracting the looping behaviour into a template<typename Fnc> void apply(Fnc fnc) and using it like apply([&](){acc.foo();}) or apply([&](){acc.bar();})Mufinella
The first option shouldn't have any performance problems. 1) Any branch prediction algorithm will quickly learn the pattern. 2) Compiler may optimize it into option B (or may not). 3) First measure performance, then optimize; 99% it won't be a bottleneck => prefer more readable option.Tobiastobie
It doesn't necessarily fall under premature optimisation, which is often quoted without context. You could make specialCase const in your first option, which might encourage the compiler to optimise sensibly. Then I'd be tempted to go with a variation on option B: you could assign the right function to a function pointer and call through the pointer in your loop. Also, if performance is important, depending on the sizes of i and j in your data, you could make it a 1d array and access the elements with i * n + j.Polysyllable
Option D) maintain two separate containers. One with all the "specialCase" objects, the other with the rest. Then just loop over the correct container.Lycanthrope
@dyukha: predicting perfectly doesn't get rid of the front-end throughput cost of the instructions that check the branch condition, if the compiler doesn't hoist the if and generate two versions of the loop. There could be significant simplifications / optimizations possible if the compiler knows that only one of the if or else bodies runs every iteration, but checking at runtime and branching misses out on those optimizations even if it predicts perfectly.Ofris
@MichaelChourdakis: profiling one test case won't tell you the cost in general. Which optimizations you miss out on, and whether the compiler is willing to split the loop for you and hoist the branch, depend on the details of the loop (perhaps including how complex the loop body is). Looking at the asm for your specific case with the compilers you care about is the only way to be sure.Ofris
what are the typical sizes of your arrays? Small ones may already be optimized away; for large ones, you need tooling, and then try parallelization and vectorization (at least check your code is well vectorized by the compiler)Bicycle
@Seriously: updated my answer with a way to help the compiler use constant specialCase values for a repeating pattern.Ofris
I found this question today because I have a loop in my code where the if inside the loop is taking 25% of total time, so pulling the test outside of the loop is definitely a win in my (time-critical) code. What I'm wondering is: why is transforming from A to B not something that compilers/optimizers can do automatically? Is there a mainstream compiler (clang, g++) that will do this if given some kind of #pragma hint? If not, why not? I'd love a way to express "please replicate the loop below into two separate cases based on flag" without nasty template syntax, etc.Stove
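The apply-a-lambda idea from the comments can be sketched like this. This is a hypothetical illustration, not code from the question: applyLoop and the counter fields in MyClass are invented names so the example is self-contained; the real foo()/bar() bodies are whatever your application does.

```cpp
#include <cassert>

// Sketch of the comment's suggestion: write the nested loops once and pass
// the chosen body in as a lambda. Names (applyLoop, the counter fields) are
// illustrative stand-ins, not from the original code.
constexpr int n = 8;

struct MyClass {
    int foos = 0, bars = 0;
    void foo() { ++foos; }
    void bar() { ++bars; }
};

template <typename Fnc>
void applyLoop(Fnc fnc) {
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            fnc();  // the lambda body inlines here
}

void loopsInIf(bool specialCase, MyClass &acc) {
    // Branch once, outside the loops, with no duplicated loop structure:
    // each applyLoop instantiation gets one fixed body.
    if (specialCase)
        applyLoop([&] { acc.foo(); });
    else
        applyLoop([&] { acc.bar(); });
}
```

Each lambda has its own type, so this produces two distinct instantiations of applyLoop: effectively option B's two specialized loops with roughly option A's readability.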

If this function can inline into a caller that passes a compile-time-constant bool, you're fine with option A (as long as the function is small enough to inline). i.e. if a template arg is possible, you don't usually actually need it. Except when that forces you to write if(var) { foo<true>(arg); } else { foo<false>(arg); } to encourage the compiler to make asm with 2 versions of the loop.

All modern compilers are smart enough to inline small functions and then completely optimize away an if(constant). Inlining + constant-propagation are what make modern C++ possible to compile efficiently.
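A minimal sketch of that effect (the MyClass body and n are stand-ins so the example is self-contained; whether a given compiler actually performs the inlining depends on its heuristics, so check the asm, e.g. on godbolt.org):

```cpp
#include <cassert>

constexpr int n = 8;

struct MyClass {
    int foos = 0, bars = 0;
    void foo() { ++foos; }
    void bar() { ++bars; }
};

// Small enough to inline: when a caller passes a literal bool, inlining plus
// constant propagation turns if (specialCase) into dead-code elimination,
// leaving a loop that contains only the surviving call.
inline void ifInLoop(bool specialCase, MyClass &acc) {
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            if (specialCase) acc.foo();
            else             acc.bar();
        }
    }

void caller(MyClass &acc) {
    ifInLoop(true, acc);   // after inlining: effectively a foo()-only loop
    ifInLoop(false, acc);  // after inlining: effectively a bar()-only loop
}
```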


But if the bool value is not known at compile time, option B is likely more efficient. (If the function doesn't run often, its speed might not matter in the big picture, and the difference might be small.)

It's a tradeoff between static code size (I-cache footprint) vs. dynamic instruction count. Or if the special case rarely runs, that version of the loop might stay cold in cache.


for (int i = 0; i < m; ++i) {
  ifInLoop(i % 10, acc);
}

If you really do have a repeating pattern like this, the compiler might decide to unroll this loop for you so the bool becomes a compile-time constant.

Or you could hand-hold the compiler into maybe making nicer asm if it doesn't decide to invent new inner-loops itself, and an unroll by 10 of a loop containing another whole loop is too much for the compiler's heuristics.

int i;
for (i = 0; i < m-9; i+=10) {   // potentially runs zero times if signed m <= 9
    ifInLoop(false, acc);    // this is the j=0
    for (int j=1; j<10 ; j++)   // j = i%10
       ifInLoop(true, acc);     // original i = i+j  in case you need it
}
// cleanup loop:
for ( ; i < m ; i++) {
    ifInLoop(i % 10, acc);
}

Predicting perfectly doesn't get rid of front-end + back-end throughput cost of the instructions that check the branch condition, if the compiler doesn't hoist the if and generate two versions of the loop.

There could be significant simplifications / optimizations possible if the compiler knows that only one of the if or else bodies runs every iteration, but checking at runtime and branching misses out on those optimizations even if it predicts perfectly.


The usual Stack Overflow response of "profile it" is not as useful as most people seem to think. First of all, microbenchmarking is hard. It's very easy to measure the wrong thing entirely, or draw nonsense conclusions because you don't know enough about what could possibly matter and what doesn't. (Make sure to warm up your CPU to max turbo frequency and init the memory first so you don't have CoW mapping to the zero page and the first timed pass isn't paying page-fault + TLB-miss costs. Compile with optimization enabled, and check that performance scales linearly with your repeat count.)

Profiling one test case won't tell you the cost in general. Which optimizations you miss out on, and whether the compiler is willing to split the loop for you and hoist the branch, depend on the details of the loop (perhaps including how complex the loop body is).

Looking at the asm for your specific case with the compilers you care about is the only way to be sure.

Different compilers (or different versions of the same compiler, or with different tuning options like gcc -mtune=generic vs. gcc -mtune=skylake) can certainly make a difference in whether the compiler decides to invert/split the loop to select once between two loops. Tune options set heuristic constants for decisions like this and loop unrolling where there's a tradeoff between static code-size and dynamic instruction-count.

Part of that may depend on how much work is outside the if() and would have to get duplicated unchanged when splitting.

Ofris answered 13/7, 2019 at 10:49 Comment(1)
Idiomatic way of performance evaluation? has more about how to microbenchmark.Ofris

For this kind of scenario, option C is the best. If you can use template <bool specialCase>, it means specialCase must be known at compile time, so you can use if constexpr, as shown:

if constexpr (specialCase)
{
    acc.foo();
}
else
{
    acc.bar();
}

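Putting that together, here is a complete C++17 sketch of option C, including the runtime dispatch from the question. The MyClass body, n, and the dispatch helper are illustrative stand-ins, not from the original code.

```cpp
#include <cassert>

constexpr int n = 8;

struct MyClass {
    int foos = 0, bars = 0;
    void foo() { ++foos; }
    void bar() { ++bars; }
};

template <bool specialCase>
void templateIf(MyClass &acc) {
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            // C++17: the untaken branch is discarded at compile time,
            // so each instantiation contains exactly one call.
            if constexpr (specialCase) acc.foo();
            else                       acc.bar();
        }
}

// Runtime dispatch still needs one ordinary if, as in the question:
void dispatch(bool specialCase, MyClass &acc) {
    if (specialCase) templateIf<true>(acc);
    else             templateIf<false>(acc);
}
```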
Instead, if specialCase isn't known at compile time, I would choose option B, because the condition is evaluated only once.

Diplodocus answered 13/7, 2019 at 10:54 Comment(1)
inlining + constant-propagation makes option A equivalent to a template for most cases when the caller passes a constant as the bool arg. But yes option B is probably best if you don't have a compile-time constant arg.Ofris

An optimiser will likely treat any real code differently than this fake code and whatever foo() and bar() do is likely to dominate in any case.

"From a theoretical point of view" as you put it, the issue is that specialCase is loop invariant, so avoiding conditional evaluation and branching on that value will bring benefits. In practice however, it is possible that the compiler will spot that it is loop invariant and remove that issue for you, as such the differences between each solution may not be down to the loop-invariant evaluation.

The only realistic means of determining the fastest solution, or whether the difference is significant enough to justify uglier, harder-to-follow or harder-to-maintain code, is to profile it; an activity likely to take up more of your life than either solution will save. The compiler's optimiser will likely have a far larger impact, and your productivity is likely to increase by not worrying about such micro-optimisation; it is most likely a false economy.


An alternative option to consider: given a pointer-to-member-function member void (MyClass::*foobar)();, then:

void ifInLoopD( bool specialCase, MyClass& acc )
{
    // FIXME: use a local, not class member, for the pointer-to-member-function
    acc.foobar = specialCase ? &MyClass::foo : &MyClass::bar ;

    for( auto i = 0; i < n; ++i )
    {
        for( auto j = 0; j < n; ++j )
        {
            (acc.*acc.foobar)() ;
        }
    }
}

See C++ Call Pointer To Member Function for how to use a local variable holding a pointer-to-member-function. But keep in mind the benchmark data in this answer is from this version, which may have stopped some compilers from realizing that the function pointer didn't change between calls and thus could be inlined. (Until the compiler tries inlining the pointed-to member function, it won't realize that the function doesn't change the pointer-member of the class.)
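A sketch of that fix, with the selection done once into a local pointer-to-member-function instead of a class member. The counter fields in MyClass are stand-ins so the example is self-contained; the benchmark numbers below were NOT taken with this version.

```cpp
#include <cassert>

constexpr int n = 8;

struct MyClass {
    int foos = 0, bars = 0;
    void foo() { ++foos; }
    void bar() { ++bars; }
};

void ifInLoopD(bool specialCase, MyClass &acc) {
    // Select once into a *local* pointer-to-member-function. Unlike the
    // class-member version, the compiler can trivially see that the pointer
    // never changes inside the loop, which may let it devirtualize/inline.
    void (MyClass::*foobar)() = specialCase ? &MyClass::foo : &MyClass::bar;

    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            (acc.*foobar)();
}
```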


Editor's note: the benchmark numbers for version D probably aren't representative of using it with most loop bodies.

The benchmarks that show this pointer-to-member-function having similar performance to other methods are based on a function body that bottlenecks on the latency of incrementing a static volatile int.

In the resulting asm, that creates a loop-carried dependency chain that includes store-forwarding latency. First of all, that can hide a lot of loop overhead. On a modern out-of-order execution CPU like any x86, costs don't just add up. Things can overlap: lots of loop overhead can run in the shadow of that latency bottleneck.

Even worse, store-forwarding latency isn't constant and can get faster when there's more overhead, especially unrelated stores, between the store and reload. See Loop with function call faster than an empty loop and Adding a redundant assignment speeds up code when compiled without optimization (where debug builds keep their loop counter in memory to create this bottleneck). Using volatile forces asm like that even in optimized builds.

On Intel Sandybridge-family, volatile increment can get faster with more loop overhead. So this choice of loop body creates benchmark numbers that are extremely misleading if you try to generalize to other more typical cases. As I (Peter) said in my answer, microbenchmarking is hard. See discussion in comments for more details.

The benchmark numbers in this question are for this code, but you should expect other loop bodies to be qualitatively different.

Note that this answer is careful not to draw any conclusions about what might be faster in real code.

But I'll add that a non-inlined function call inside an inner loop is almost always going to be more expensive than an easily-predicted branch inside an inner loop. A non-inline function call forces the compiler to update all values in memory that were temporarily only in registers, so the state of memory matches the C++ abstract machine. At least for globals and static vars, and anything pointed-to / reachable via the function args (including this for member functions). It also clobbers all the call-clobbered registers.

So performance-wise, I'd expect a pointer-to-member-function initialized outside the loop to be similar to option A (if() inside), but almost always worse. Or equal, if both optimize away thanks to constant propagation.

End of editor's note


For each implementation A, B and mine which I shall call D, (I have omitted C because I cannot figure out how you intend to use that in a practical implementation), and given:

class MyClass
{
    public:
        void foo(){ volatile static int a = 0 ; a++ ; }
        void bar(){ volatile static int a = 0 ; a++ ; }
    // FIXME: don't put a tmp var inside the class object!
    // but keep in mind the benchmark results below *are* done with this
        void (MyClass::*foobar)() ;

} acc ;

static const int n = 10000 ;

I got the following results:

VC++ 2019 Default Debug: (note: don't time debug-mode, that's almost always useless.)

ifInLoopA( true, acc )  : 3.146 seconds
ifInLoopA( false, acc ) : 2.918 seconds
ifInLoopB( true, acc )  : 2.892 seconds
ifInLoopB( false, acc ) : 2.872 seconds
ifInLoopD( true, acc )  : 3.078 seconds
ifInLoopD( false, acc ) : 3.035 seconds

VC++ 2019 Default Release:

ifInLoopA( true, acc )  : 0.247 seconds
ifInLoopA( false, acc ) : 0.242 seconds
ifInLoopB( true, acc )  : 0.234 seconds
ifInLoopB( false, acc ) : 0.242 seconds
ifInLoopD( true, acc )  : 0.219 seconds
ifInLoopD( false, acc ) : 0.205 seconds

As you can see, while solution D is significantly slower in debug, in the optimised build it is significantly faster. Also, the choice of specialCase value has a marginal effect, although I am not entirely sure why.

I increased n to 30000 for the release build to get better resolution:

VC++ 2019 Default Release n=30000:

ifInLoopA( true, acc )  : 2.198 seconds
ifInLoopA( false, acc ) : 1.989 seconds
ifInLoopB( true, acc )  : 1.934 seconds
ifInLoopB( false, acc ) : 1.979 seconds
ifInLoopD( true, acc )  : 1.721 seconds
ifInLoopD( false, acc ) : 1.732 seconds

Clearly solution A is most sensitive to specialCase, and might be avoided if deterministic behaviour is required, but that difference may be swamped by differences in the real foo() and bar() implementations.

Your results may vary depending on your compiler, target, and compiler options used, and the differences are probably not so significant that you can draw any conclusion for all compilers.

For example using g++ 5.4.1 at https://www.onlinegdb.com/, the difference between un-optimised and optimised code is far less significant (possibly due to the far greater functionality in the VC++ debugger imposing significant overhead), and for the optimised code the differences between solutions is far less significant.

(editor's note: MSVC debug-mode includes indirection in function calls to allow incremental recompiling, so that could explain the huge amount of extra overhead in debug mode. Yet another reason not to time debug-mode.

It's not too surprising for the volatile increment to limit performance to about the same as debug-mode (which keeps loop counters in memory); two separate store-forwarding latency chains can overlap.)

https://www.onlinegdb.com/ C++14 default options, n = 30000

ifInLoopA( true, acc )  : 3.29026 seconds
ifInLoopA( false, acc ) : 3.08304 seconds
ifInLoopB( true, acc )  : 3.21342 seconds
ifInLoopB( false, acc ) : 3.26737 seconds
ifInLoopD( true, acc )  : 3.74404 seconds
ifInLoopD( false, acc ) : 3.72961 seconds

https://www.onlinegdb.com/ C++14 default -O3, n=30000

ifInLoopA( true, acc )  : 3.07913 seconds                                                                                                      
ifInLoopA( false, acc ) : 3.09762 seconds                                                                                                      
ifInLoopB( true, acc )  : 3.13735 seconds                                                                                                      
ifInLoopB( false, acc ) : 3.05647 seconds                                                                                                      
ifInLoopD( true, acc )  : 3.09078 seconds                                                                                                      
ifInLoopD( false, acc ) : 3.04051 seconds 

I think the only conclusion you can draw is that you have to test each solution to determine how well they work with your compiler and target implementation, and with your real code not a made-up loop body.

If all solutions meet your performance requirements, I suggest that you use the most readable/maintainable solution and look at optimisation only when performance becomes a problem when you will be able to identify exactly which part of the code overall will give you the biggest impact for the least amount of effort.


For completeness and to allow you to perform your own assessment, here is my test code:

class MyClass
{
    public:
        void foo(){ volatile static int a = 0 ; a++ ; }
        void bar(){ volatile static int a = 0 ; a++ ; }
        void (MyClass::*foobar)() ;

} acc ;

static const int n = 30000 ;

void ifInLoopA( bool specialCase, MyClass& acc ) {
    for( auto i = 0; i < n; ++i ) {
        for( auto j = 0; j < n; ++j ) {
            if( specialCase ) {
                acc.foo();
            }
            else {
                acc.bar();
            }
        }
    }
}

void ifInLoopB( bool specialCase, MyClass& acc ) {
    if( specialCase ) {
        for( auto i = 0; i < n; ++i ) {
            for( auto j = 0; j < n; ++j ) {
                acc.foo();
            }
        }
    }
    else {
        for( auto i = 0; i < n; ++i ) {
            for( auto j = 0; j < n; ++j ) {
                acc.bar();
            }
        }
    }
}

void ifInLoopD( bool specialCase, MyClass& acc )
{
    acc.foobar = specialCase ? &MyClass::foo : &MyClass::bar ;

    for( auto i = 0; i < n; ++i )
    {
        for( auto j = 0; j < n; ++j )
        {
            (acc.*acc.foobar)() ;
        }
    }
}


#include <ctime>
#include <iostream>

int main()
{
    std::clock_t start = std::clock() ;
    ifInLoopA( true, acc ) ;
    std::cout << "ifInLoopA( true, acc )  : " << static_cast<double>((clock() - start)) / CLOCKS_PER_SEC << " seconds\n" ;

    start = std::clock() ;
    ifInLoopA( false, acc ) ;
    std::cout << "ifInLoopA( false, acc ) : " << static_cast<double>((clock() - start)) / CLOCKS_PER_SEC << " seconds\n" ;

    start = std::clock() ;
    ifInLoopB( true, acc ) ;
    std::cout << "ifInLoopB( true, acc )  : " << static_cast<double>((clock() - start)) / CLOCKS_PER_SEC << " seconds\n" ;

    start = std::clock() ;
    ifInLoopB( false, acc ) ;
    std::cout << "ifInLoopB( false, acc ) : " << static_cast<double>((clock() - start)) / CLOCKS_PER_SEC << " seconds\n" ;

    start = std::clock() ;
    ifInLoopD( true, acc ) ;
    std::cout << "ifInLoopD( true, acc )  : " << static_cast<double>((clock() - start)) / CLOCKS_PER_SEC << " seconds\n" ;

    start = std::clock() ;
    ifInLoopD( false, acc ) ;
    std::cout << "ifInLoopD( false, acc ) : " << static_cast<double>((clock() - start)) / CLOCKS_PER_SEC << " seconds\n" ;
}
Kandykane answered 13/7, 2019 at 11:48 Comment(3)
Comments are not for extended discussion; this conversation has been moved to chat.Astrionics
Ok, added a section summarizing what I said in comments. Also pointing out that this answer is careful not to generalize from these results; yes on careful reading you were careful not to draw conclusions. But that really needs to be pointed out in bold face because people like to look at numbers. (Also I added some scattered commentary elsewhere. I should maybe have just changed that bit about the Visual C++ debugger; debuggers don't add overhead unless you're single-stepping, it's just a matter of code-gen and probably MSVC's indirection for incremental linking in debug builds.Ofris
Anyway, removed my downvote. IDK if I want to upvote; the numbers in this answer are a cautionary tale / example of why it's hard to microbenchmark! The lack of conclusions being drawn makes it not wrong, but of pretty low value. Anyway, thanks for letting me edit in my notes; there was some interesting stuff to think about / explain.Ofris
