Optimizing a branch for a known more-common path

Please consider the following piece of code:

void error_handling();
bool method_impl();

bool method()
{
    const bool res = method_impl();
    if (res == false) {
        error_handling();
        return false;
    }
    return true;
}

I know method_impl() will return true 99.999% of the time (yes, three decimal places), but my compiler doesn't. method() is partially time-critical.

  1. Should I rewrite method() (and make it less readable) to ensure a jump may only occur when method_impl() returns false? If yes, how?
  2. Should I let the compiler do the work for me?
  3. Should I let the branch prediction of my CPU do the work for me?
Shizue answered 11/3, 2016 at 10:48 Comment(10)
How about "Optimizing branch prediction" for the title, since it has nothing to do with error handling? I'm pretty sure the answer is something like use likely/unlikely macros, but that is compiler specific.Whiteeye
You should invert the logic so that the normal case is a fall-through and the abnormal case is the implied else. This won't destroy readability. The way you have it now, there is a branch 99.999% of the time.Involve
You could hint the compiler for your suggestion with __builtin_expect: stackoverflow.com/questions/1851299/…Partiality
The underlying hardware already performs this optimization. It will fail to predict the branch the first few times, but afterwards it will usually predict it correctly: en.wikipedia.org/wiki/Branch_predictorWert
@Wert do you think branch prediction can be as good as using the GCC extension __builtin_expect?Shizue
@Shizue you can try applying the GCC extension and check whether it is faster with it or not, but I think you will barely see any difference. Branch prediction is always applied; it is not something you enable.Wert
@Wert I think it deserves to be an answer ;)Shizue
Moving cold branches out-of-line saves precious L1i cache, which won't be measured in benchmarks.Tributary
@Tributary That's interesting. Where can I learn more about it? How can I experiment with it?Shizue
While the advice presented here is sound, rather than premature optimization, try using profile guided optimization. It should handle this optimization and many more that you may not even anticipate.Hexaemeron

Following other answers' suggestions, I benchmarked the solutions. If you consider upvoting this answer, please upvote the others too.

Benchmark code

#include <iostream>
#include <iomanip>
#include <string>

// solutions
#include <ctime>

// benchmark
#include <limits>
#include <random>
#include <chrono>
#include <algorithm>
#include <functional>

//
// Solutions
//
namespace
{
    volatile std::time_t near_futur = -1;
    void error_handling() { std::cerr << "error\n"; }
    bool method_impl() { return std::time(NULL) != near_futur; }

    bool method_no_builtin()
    {
        const bool res = method_impl();
        if (res == false) {
            error_handling();
            return false;
        }
        return true;
    }

    bool method_builtin()
    {
        const bool res = method_impl();
        if (__builtin_expect(res, 1) == false) {
            error_handling();
            return false;
        }
        return true;
    }

    bool method_builtin_incorrect()
    {
        const bool res = method_impl();
        if (__builtin_expect(res, 0) == false) {
            error_handling();
            return false;
        }
        return true;
    }

    bool method_rewritten()
    {
        const bool res = method_impl();
        if (res == true) {
            return true;
        } else {
            error_handling();
            return false;
        }
    }
}

//
// benchmark
//
constexpr std::size_t BENCHSIZE = 10'000'000;
class Clock
{
    std::chrono::time_point<std::chrono::steady_clock> _start;

public:
    static inline std::chrono::time_point<std::chrono::steady_clock> now() { return std::chrono::steady_clock::now(); }

    Clock() : _start(now())
    {
    }

    template<class DurationUnit>
    std::size_t end()
    {
        return std::chrono::duration_cast<DurationUnit>(now() - _start).count();
    }
};

//
// Entry point
//
int main()
{
    {
        Clock clock;
        bool result = true;
        for (std::size_t i = 0 ; i < BENCHSIZE ; ++i)
        {
            result &= method_no_builtin();
            result &= method_no_builtin();
            result &= method_no_builtin();
            result &= method_no_builtin();
            result &= method_no_builtin();
            result &= method_no_builtin();
            result &= method_no_builtin();
            result &= method_no_builtin();
            result &= method_no_builtin();
            result &= method_no_builtin();
        }
        const double unit_time = clock.end<std::chrono::nanoseconds>() / static_cast<double>(BENCHSIZE);
        std::cout << std::setw(40) << "method_no_builtin(): " << std::setprecision(3) << unit_time << " ns\n";
    }
    {
        Clock clock;
        bool result = true;
        for (std::size_t i = 0 ; i < BENCHSIZE ; ++i)
        {
            result &= method_builtin();
            result &= method_builtin();
            result &= method_builtin();
            result &= method_builtin();
            result &= method_builtin();
            result &= method_builtin();
            result &= method_builtin();
            result &= method_builtin();
            result &= method_builtin();
            result &= method_builtin();
        }
        const double unit_time = clock.end<std::chrono::nanoseconds>() / static_cast<double>(BENCHSIZE);
        std::cout << std::setw(40) << "method_builtin(): " << std::setprecision(3) << unit_time << " ns\n";
    }
    {
        Clock clock;
        bool result = true;
        for (std::size_t i = 0 ; i < BENCHSIZE ; ++i)
        {
            result &= method_builtin_incorrect();
            result &= method_builtin_incorrect();
            result &= method_builtin_incorrect();
            result &= method_builtin_incorrect();
            result &= method_builtin_incorrect();
            result &= method_builtin_incorrect();
            result &= method_builtin_incorrect();
            result &= method_builtin_incorrect();
            result &= method_builtin_incorrect();
            result &= method_builtin_incorrect();
        }
        const double unit_time = clock.end<std::chrono::nanoseconds>() / static_cast<double>(BENCHSIZE);
        std::cout << std::setw(40) << "method_builtin_incorrect(): " << std::setprecision(3) << unit_time << " ns\n";
    }
    {
        Clock clock;
        bool result = true;
        for (std::size_t i = 0 ; i < BENCHSIZE ; ++i)
        {
            result &= method_rewritten();
            result &= method_rewritten();
            result &= method_rewritten();
            result &= method_rewritten();
            result &= method_rewritten();
            result &= method_rewritten();
            result &= method_rewritten();
            result &= method_rewritten();
            result &= method_rewritten();
            result &= method_rewritten();
        }
        const double unit_time = clock.end<std::chrono::nanoseconds>() / static_cast<double>(BENCHSIZE);
        std::cout << std::setw(40) << "method_rewritten(): " << std::setprecision(3) << unit_time << " ns\n";
    }
}

Benchmark results

g++ -std=c++14 -O2 -Wall -Wextra -Werror main.cpp

               method_no_builtin(): 42.8 ns
                  method_builtin(): 44.4 ns
        method_builtin_incorrect(): 51.4 ns
                method_rewritten(): 39.3 ns

Demo

g++ -std=c++14 -O3 -Wall -Wextra -Werror main.cpp

               method_no_builtin(): 32.3 ns
                  method_builtin(): 31.1 ns
        method_builtin_incorrect(): 35.6 ns
                method_rewritten(): 30.5 ns

Demo

Conclusion

The differences between these variants are too small to support any conclusion other than this: if there is a performance gain to be found in optimizing a branch for a known more-common path, it is too small to be worth the trouble and the loss in readability.

Shizue answered 11/3, 2016 at 12:14 Comment(8)
You have 11 trials in the first test and 10 in the other two. When you take out the 11th entry in the first test it falls in line with the other tests: coliru.stacked-crooked.com/a/2540c1a1f5d2b837 . Trying with -O2, the first and last take exactly the same amount of time: coliru.stacked-crooked.com/a/7f37496b4992755eGiffie
@Giffie Oh, thank you! If only I could give you rep for that! I've edited my answer and its conclusion. Thanks again.Shizue
I would like to see an incorrect __builtin_expect for completeness.Hodge
@immibis Excellent idea: done. Hope you're a patient person.Shizue
@YSC: this is a poorly constructed test. You're running the same code repeatedly in a loop, which means that you're measuring how well the CPU predicts the performance of a perfectly predictable branch. The benefit from likely / unlikely comes from more compact code, which reduces TLB and instruction cache consumption. (Ideally GCC would pass the information along using static prediction hints, but I don't think it ever does that.) These effects are very measurable and significant in real world applications, but microbenchmarks like yours are designed to eliminate cache misses.Discommodity
@Discommodity If you can provide me with a better test, I'd be happy to update my answer or upvote yours. Note that I'm not interested in cache miss here.Shizue
I need to amend my comment by saying that apparently branch performance also matters and is possibly more significant than TLB/icache. I was under the impression that a correctly-predicted branch had a latency of 0-1 cycles, but this is only the case if the branch is correctly-predicted and not taken. If the branch is taken, even if correctly predicted, the latency is significantly longer. The worst case is predicted not taken but then taken. This blog post gives measurements on various x86 CPUs as well as the Apple M1: blog.cloudflare.com/branch-predictorDiscommodity
Branch prediction resources like the BHT and BPB are limited, and in a real application the processor is going to lose history periodically. If your code is organized such that the most likely path is the one statically predicted by the processor, then on average you will come out ahead. What's really interesting is that, since not-taken branches are fast, reordering your code such that branches are more likely to be not taken than taken improves performance regardless of the predictors. GCC will most definitely reorder code based on __builtin_expect() hints.Discommodity

You can hint to the compiler that method_impl() will most likely return true:

void error_handling();
bool method_impl();

bool method()
{
    const bool res = method_impl();
    if (__builtin_expect(res, 1) == false) {
        error_handling();
        return false;
    }
    return true;
}

This will work in GCC.

Partiality answered 11/3, 2016 at 10:57 Comment(3)
Will there be a benefit from just letting the branch predictor do its magic?Shizue
Yes, and is broadly explained here: stackoverflow.com/questions/7346929/…Partiality
The compiler rearranges the instructions to bias the "hinted" branchPartiality
W
5

The underlying hardware already performs this optimization. It will fail to predict the branch the first few times, but afterwards it will usually predict it correctly: en.wikipedia.org/wiki/Branch_predictor.

You can try applying the GCC extension and check whether it is faster with it or not, but I think you will barely see any difference. Branch prediction is always applied; it is not something you enable.

Wert answered 11/3, 2016 at 11:55 Comment(0)

Without knowing the implementation of std::time(), I wouldn't conclude much from this test. From your own results, it seems to dominate the time spent in the loop.

FWIW, I use likely()/unlikely() liberally myself when tuning code. I do not want to change the structure of the code, but when reading the assembly, I'd like to see the common path be a straight line of untaken branches. The key here (for me) is the readability of the assembly. The fact that this is also what will be fastest is secondary (the fastest possible branch is a correctly predicted untaken branch).

Whitehurst answered 10/2, 2018 at 23:5 Comment(3)
1/ This is not an answer, rather a comment on my answer.Shizue
2/ std::time dominating the time consumption in the loop is the point. I'll let you read the question again ;)Shizue
@T Thorn - The paragraph referring to std::time() should indeed be a comment on the previous post, and as we can see it distracts people from your actual, valid point concerning likely()/unlikely(). I suggest removing the first paragraph.Apportionment

© 2022 - 2024 — McMap. All rights reserved.