This answer is intended to contribute, to the set of existing answers, what I believe to be a more meaningful benchmark for the runtime cost of std::function calls.
The std::function mechanism should be recognized for what it provides: Any callable entity can be converted to a std::function of appropriate signature. Suppose you have a library that fits a surface to a function defined by z = f(x,y), you can write it to accept a std::function<double(double,double)>
, and the user of the library can easily convert any callable entity to that; be it an ordinary function, a method of a class instance, or a lambda, or anything that is supported by std::bind.
Unlike template approaches, this works without having to recompile the library function for different cases; accordingly, little extra compiled code is needed for each additional case. It has always been possible to make this happen, but it used to require some awkward mechanisms, and the user of the library would likely need to construct an adapter around their function to make it work. std::function automatically constructs whatever adapter is needed to get a common runtime call interface for all the cases, which is a new and very powerful feature.
To my view, this is the most important use case for std::function as far as performance is concerned: I'm interested in the cost of calling a std::function many times after it has been constructed once, and it needs to be a situation where the compiler is unable to optimize the call by knowing the function actually being called (i.e. you need to hide the implementation in another source file to get a proper benchmark).
I made the test below, similar to the OP's; but the main changes are:
- Each case loops 1 billion times, but the std::function objects are constructed only once. I've found by looking at the output code that 'operator new' is called when constructing actual std::function calls (maybe not when they are optimized out).
- Test is split into two files to prevent undesired optimization
- My cases are: (a) function is inlined (b) function is passed by an ordinary function pointer (c) function is a compatible function wrapped as std::function (d) function is an incompatible function made compatible with a std::bind, wrapped as std::function
The results I get are:
Case (d) tends to be slightly slower, but the difference (about 0.05 nsec) is absorbed in the noise.
Conclusion is that the std::function is comparable overhead (at call time) to using a function pointer, even when there's simple 'bind' adaptation to the actual function. The inline is 2 ns faster than the others but that's an expected tradeoff since the inline is the only case which is 'hard-wired' at run time.
When I run johan-lundberg's code on the same machine, I'm seeing about 39 nsec per loop, but there's a lot more in the loop there, including the actual constructor and destructor of the std::function, which is probably fairly high since it involves a new and delete.
-O2 gcc 4.8.1, to x86_64 target (core i5).
Note, the code is broken up into two files, to prevent the compiler from expanding the functions where they are called (except in the one case where it's intended to).
----- first source file --------------
#include <functional>
// simple funct
float func_half( float x ) { return x * 0.5; }
// func we can bind
float mul_by( float x, float scale ) { return x * scale; }
//
// func to call another func a zillion times.
//
float test_stdfunc( std::function<float(float)> const & func, int nloops ) {
float x = 1.0;
float y = 0.0;
for(int i =0; i < nloops; i++ ){
y += x;
x = func(x);
}
return y;
}
// same thing with a function pointer
float test_funcptr( float (*func)(float), int nloops ) {
float x = 1.0;
float y = 0.0;
for(int i =0; i < nloops; i++ ){
y += x;
x = func(x);
}
return y;
}
// same thing with inline function
float test_inline( int nloops ) {
float x = 1.0;
float y = 0.0;
for(int i =0; i < nloops; i++ ){
y += x;
x = func_half(x);
}
return y;
}
----- second source file -------------
#include <iostream>
#include <functional>
#include <chrono>
extern float func_half( float x );
extern float mul_by( float x, float scale );
extern float test_inline( int nloops );
extern float test_stdfunc( std::function<float(float)> const & func, int nloops );
extern float test_funcptr( float (*func)(float), int nloops );
int main() {
using namespace std::chrono;
for(int icase = 0; icase < 4; icase ++ ){
const auto tp1 = system_clock::now();
float result;
switch( icase ){
case 0:
result = test_inline( 1e9);
break;
case 1:
result = test_funcptr( func_half, 1e9);
break;
case 2:
result = test_stdfunc( func_half, 1e9);
break;
case 3:
result = test_stdfunc( std::bind( mul_by, std::placeholders::_1, 0.5), 1e9);
break;
}
const auto tp2 = high_resolution_clock::now();
const auto d = duration_cast<milliseconds>(tp2 - tp1);
std::cout << d.count() << std::endl;
std::cout << result<< std::endl;
}
return 0;
}
For those interested, here's the adaptor the compiler built to make 'mul_by' look like a float(float) - this is 'called' when the function created as bind(mul_by,_1,0.5) is called:
movq (%rdi), %rax ; get the std::func data
movsd 8(%rax), %xmm1 ; get the bound value (0.5)
movq (%rax), %rdx ; get the function to call (mul_by)
cvtpd2ps %xmm1, %xmm1 ; convert 0.5 to 0.5f
jmp *%rdx ; jump to the func
(so it might have been a bit faster if I'd written 0.5f in the bind...)
Note that the 'x' parameter arrives in %xmm0 and just stays there.
Here's the code in the area where the function is constructed, prior to calling test_stdfunc - run through c++filt :
movl $16, %edi
movq $0, 32(%rsp)
call operator new(unsigned long) ; get 16 bytes for std::function
movsd .LC0(%rip), %xmm1 ; get 0.5
leaq 16(%rsp), %rdi ; (1st parm to test_stdfunc)
movq mul_by(float, float), (%rax) ; store &mul_by in std::function
movl $1000000000, %esi ; (2nd parm to test_stdfunc)
movsd %xmm1, 8(%rax) ; store 0.5 in std::function
movq %rax, 16(%rsp) ; save ptr to allocated mem
;; the next two ops store pointers to generated code related to the std::function.
;; the first one points to the adaptor I showed above.
movq std::_Function_handler<float (float), std::_Bind<float (*(std::_Placeholder<1>, double))(float, float)> >::_M_invoke(std::_Any_data const&, float), 40(%rsp)
movq std::_Function_base::_Base_manager<std::_Bind<float (*(std::_Placeholder<1>, double))(float, float)> >::_M_manager(std::_Any_data&, std::_Any_data const&, std::_Manager_operation), 32(%rsp)
call test_stdfunc(std::function<float (float)> const&, int)
std::function
if and only if you actually need a heterogeneous collection of callable objects (i.e no further discriminating information is available at runtime). – Phylestd::function
or templates". I think here the issue is simply wrapping a lambda instd::function
vs not wrapping a lambda instd::function
. At the moment your question is like asking "should I prefer an apple, or a bowl?" – Ferdinanaboost::function
, it just takes 100% longer than the template version (2 times) (with GCC. Clang took 0ns for the template version. it appears to have optimized it away). I suspect you should specify the implementation that you use in the benchmark. – Disarmamentstd::sort
. In such cases, I prefer templates (which are then mostly involved anyway). But in other cases, it really does not matter, – Drouinstd::function
when you need one. A bunch of disadvantages of plain template functor arguments doesn't magically makestd::function
(which has its own disadvantages, as you have seen yourself) the "de facto standard". – Moonystd::function
when you need one" - The problem is that this is nonsense, becausestd::function
is a class template. – Ferdinanastd::function
, ignoring that this usage of "template" was a bit imprecise. Of course astd::function
is a template, too, but in the end it is "more specialized" / "less templated" than a bare template argument. – Moonynow()
should be onhigh_resolution_clock
. This error has propagated to all the snippets below! – Bedmate