An optimiser will likely treat any real code differently than this fake code, and whatever foo() and bar() do is likely to dominate in any case.
"From a theoretical point of view", as you put it, the issue is that specialCase is loop invariant, so avoiding repeated evaluation and branching on that value will bring benefits. In practice, however, the compiler may well spot that it is loop invariant and hoist it for you, so the differences between the solutions may not be down to loop-invariant evaluation at all.
The only realistic way to determine the fastest solution, or whether the difference is significant enough to justify uglier, harder-to-follow and harder-to-maintain code, is to profile it. That activity is likely to take up more of your life than either solution will ever save; the compiler's optimiser will likely have a far larger impact, and your productivity is likely to increase by not worrying about such micro-optimisation. It is most likely a false economy.
An alternative option to consider: given a pointer-to-member-function member void (MyClass::*foobar)() ;, then:
void ifInLoopD( bool specialCase, MyClass& acc )
{
    // FIXME: use a local, not class member, for the pointer-to-member-function
    acc.foobar = specialCase ? &MyClass::foo : &MyClass::bar ;
    for( auto i = 0; i < n; ++i )
    {
        for( auto j = 0; j < n; ++j )
        {
            (acc.*acc.foobar)() ;
        }
    }
}
See C++ Call Pointer To Member Function for how to use a local variable holding a pointer-to-member-function. But keep in mind the benchmark data in this answer is from this version, which may have stopped some compilers from realizing that the function pointer didn't change between calls and thus could be inlined. (Until the compiler tries inlining the pointed-to member function, it won't realize that the function doesn't change the pointer-member of the class.)
Editor's note: the benchmark numbers for version D probably aren't representative of using it with most loop bodies.
The benchmarks that show this pointer-to-member-function having similar performance to the other methods are based on a function body that bottlenecks on the latency of incrementing a static volatile int.
In the resulting asm, that creates a loop-carried dependency chain that includes store-forwarding latency. First of all, that can hide a lot of loop overhead. On a modern out-of-order execution CPU like any x86, costs don't just add up. Things can overlap: lots of loop overhead can run in the shadow of that latency bottleneck.
Even worse, store-forwarding latency isn't constant and can get faster when there's more overhead, especially unrelated stores, between the store and reload. See Loop with function call faster than an empty loop and Adding a redundant assignment speeds up code when compiled without optimization (where debug builds keep their loop counter in memory to create this bottleneck). Using volatile forces asm like that even in optimized builds.
On Intel Sandybridge-family, a volatile increment can get faster with more loop overhead. So this choice of loop body creates benchmark numbers that are extremely misleading if you try to generalize to other, more typical cases. As I (Peter) said in my answer, microbenchmarking is hard. See discussion in comments for more details.
The benchmark numbers in this question are for this code, but you should expect other loop bodies to be qualitatively different.
Note that this answer is careful not to draw any conclusions about what might be faster in real code. But I'll add that a non-inlined function call inside an inner loop is almost always going to be more expensive than an easily-predicted branch inside an inner loop. A non-inline function call forces the compiler to update all values in memory that were temporarily only in registers, so the state of memory matches the C++ abstract machine; at least for globals and static vars, and anything pointed-to / reachable via the function args (including this for member functions). It also clobbers all the call-clobbered registers.
So performance-wise, I'd expect a pointer-to-member-function initialized outside the loop to be similar to option A (if() inside the loop), but almost always worse. Or equal, if both optimize away thanks to constant propagation.
End of editor's note
For each implementation A, B and mine, which I shall call D (I have omitted C because I cannot figure out how you intend to use it in a practical implementation), and given:
class MyClass
{
public:
    void foo(){ volatile static int a = 0 ; a++ ; }
    void bar(){ volatile static int a = 0 ; a++ ; }

    // FIXME: don't put a tmp var inside the class object!
    // but keep in mind the benchmark results below *are* done with this
    void (MyClass::*foobar)() ;
} acc ;
static const int n = 10000 ;
I got the following results:
VC++ 2019 Default Debug: (note: don't time debug-mode, that's almost always useless.)
ifInLoopA( true, acc ) : 3.146 seconds
ifInLoopA( false, acc ) : 2.918 seconds
ifInLoopB( true, acc ) : 2.892 seconds
ifInLoopB( false, acc ) : 2.872 seconds
ifInLoopD( true, acc ) : 3.078 seconds
ifInLoopD( false, acc ) : 3.035 seconds
VC++ 2019 Default Release:
ifInLoopA( true, acc ) : 0.247 seconds
ifInLoopA( false, acc ) : 0.242 seconds
ifInLoopB( true, acc ) : 0.234 seconds
ifInLoopB( false, acc ) : 0.242 seconds
ifInLoopD( true, acc ) : 0.219 seconds
ifInLoopD( false, acc ) : 0.205 seconds
As you can see, while solution D is significantly slower in the debug build, it is significantly faster in the optimised build. Also, the choice of specialCase value has a marginal effect, although I am not entirely sure why.
I increased n to 30000 for the release build to get better resolution:
VC++ 2019 Default Release n=30000:
ifInLoopA( true, acc ) : 2.198 seconds
ifInLoopA( false, acc ) : 1.989 seconds
ifInLoopB( true, acc ) : 1.934 seconds
ifInLoopB( false, acc ) : 1.979 seconds
ifInLoopD( true, acc ) : 1.721 seconds
ifInLoopD( false, acc ) : 1.732 seconds
Clearly solution A is the most sensitive to specialCase, and might be avoided if deterministic behaviour is required, but that difference may be swamped by differences in the real foo() and bar() implementations.
Your results may vary depending on your compiler, target and the compiler options used, and the differences are probably not so significant that you can draw any conclusion for all compilers.
For example using g++ 5.4.1 at https://www.onlinegdb.com/, the difference between un-optimised and optimised code is far less significant (possibly due to the far greater functionality in the VC++ debugger imposing significant overhead), and for the optimised code the differences between solutions is far less significant.
(Editor's note: MSVC debug mode adds indirection in function calls to allow incremental recompiling, which could explain the huge amount of extra overhead in debug mode; yet another reason not to time debug mode. It's not too surprising for the volatile increment to limit performance to about the same as debug mode (which keeps loop counters in memory); two separate store-forwarding latency chains can overlap.)
https://www.onlinegdb.com/ C++14 default options, n = 30000
ifInLoopA( true, acc ) : 3.29026 seconds
ifInLoopA( false, acc ) : 3.08304 seconds
ifInLoopB( true, acc ) : 3.21342 seconds
ifInLoopB( false, acc ) : 3.26737 seconds
ifInLoopD( true, acc ) : 3.74404 seconds
ifInLoopD( false, acc ) : 3.72961 seconds
https://www.onlinegdb.com/ C++14 default -O3, n=30000
ifInLoopA( true, acc ) : 3.07913 seconds
ifInLoopA( false, acc ) : 3.09762 seconds
ifInLoopB( true, acc ) : 3.13735 seconds
ifInLoopB( false, acc ) : 3.05647 seconds
ifInLoopD( true, acc ) : 3.09078 seconds
ifInLoopD( false, acc ) : 3.04051 seconds
I think the only conclusion you can draw is that you have to test each solution to determine how well it works with your compiler and target implementation, and with your real code, not a made-up loop body.
If all solutions meet your performance requirements, I suggest you use the most readable/maintainable one and look at optimisation only when performance becomes a problem, at which point you will be able to identify exactly which part of the code will give you the biggest impact for the least amount of effort.
For completeness and to allow you to perform your own assessment, here is my test code:
class MyClass
{
public:
    void foo(){ volatile static int a = 0 ; a++ ; }
    void bar(){ volatile static int a = 0 ; a++ ; }
    void (MyClass::*foobar)() ;
} acc ;

static const int n = 30000 ;
void ifInLoopA( bool specialCase, MyClass& acc ) {
    for( auto i = 0; i < n; ++i ) {
        for( auto j = 0; j < n; ++j ) {
            if( specialCase ) {
                acc.foo();
            }
            else {
                acc.bar();
            }
        }
    }
}
void ifInLoopB( bool specialCase, MyClass& acc ) {
    if( specialCase ) {
        for( auto i = 0; i < n; ++i ) {
            for( auto j = 0; j < n; ++j ) {
                acc.foo();
            }
        }
    }
    else {
        for( auto i = 0; i < n; ++i ) {
            for( auto j = 0; j < n; ++j ) {
                acc.bar();
            }
        }
    }
}
void ifInLoopD( bool specialCase, MyClass& acc )
{
    acc.foobar = specialCase ? &MyClass::foo : &MyClass::bar ;
    for( auto i = 0; i < n; ++i )
    {
        for( auto j = 0; j < n; ++j )
        {
            (acc.*acc.foobar)() ;
        }
    }
}
#include <ctime>
#include <iostream>

int main()
{
    std::clock_t start = std::clock() ;
    ifInLoopA( true, acc ) ;
    std::cout << "ifInLoopA( true, acc ) : " << static_cast<double>((clock() - start)) / CLOCKS_PER_SEC << " seconds\n" ;

    start = std::clock() ;
    ifInLoopA( false, acc ) ;
    std::cout << "ifInLoopA( false, acc ) : " << static_cast<double>((clock() - start)) / CLOCKS_PER_SEC << " seconds\n" ;

    start = std::clock() ;
    ifInLoopB( true, acc ) ;
    std::cout << "ifInLoopB( true, acc ) : " << static_cast<double>((clock() - start)) / CLOCKS_PER_SEC << " seconds\n" ;

    start = std::clock() ;
    ifInLoopB( false, acc ) ;
    std::cout << "ifInLoopB( false, acc ) : " << static_cast<double>((clock() - start)) / CLOCKS_PER_SEC << " seconds\n" ;

    start = std::clock() ;
    ifInLoopD( true, acc ) ;
    std::cout << "ifInLoopD( true, acc ) : " << static_cast<double>((clock() - start)) / CLOCKS_PER_SEC << " seconds\n" ;

    start = std::clock() ;
    ifInLoopD( false, acc ) ;
    std::cout << "ifInLoopD( false, acc ) : " << static_cast<double>((clock() - start)) / CLOCKS_PER_SEC << " seconds\n" ;
}
Comments (fragments as captured, truncations left as-is):

- "… -O3 (GCC / Clang) with respect to produced assembly and speed." What's stopping you from compiling the code with relevant flags, and looking at the differences in assembly? – Alejandraalejandrina
- … template<Fnc> apply(Fnc fnc) and use it like apply([&](){acc.foo();}) or apply([&](){acc.bar();}) – Mufinella
- … if or else bodies runs every iteration, but checking at runtime and branching misses out on those optimizations even if it predicts perfectly. – Ofris
- … specialCase values for a repeating pattern. – Ofris
- … flag" without nasty template syntax, etc. – Stove