c++ virtual function vs member function pointer (performance comparison)

Virtual function calls can be slow because they require an extra indexed dereference through the v-table, which can cause a data cache miss as well as an instruction cache miss... Not good for performance-critical applications.

So I have been thinking of a way to overcome this performance issue of virtual functions while still keeping some of the functionality that virtual functions provide.

I am confident that this has been done before, but I devised a simple test that lets the base class store a member function pointer that can be set by any derived class. When I call Foo() on any derived class, it calls the appropriate member function without having to go through the v-table...

I am just wondering if this method is a viable replacement for the virtual-call paradigm, if so, why is it not more ubiquitous?

Thanks in advance for your time! :)

#include <cstdio> // for printf

class BaseClass
{
protected:

    // member function pointer
    typedef void(BaseClass::*FooMemFuncPtr)();
    FooMemFuncPtr m_memfn_ptr_Foo;

    void FooBaseClass() 
    {
        printf("FooBaseClass() \n");
    }

public:

    BaseClass()
    {
        m_memfn_ptr_Foo = &BaseClass::FooBaseClass;
    }

    void Foo()
    {
        ((*this).*m_memfn_ptr_Foo)();
    }
};

class DerivedClass : public BaseClass
{
protected:

    void FooDerivedClass()
    {
        printf("FooDerivedClass() \n");
    }

public:

    DerivedClass() : BaseClass()
    {
        m_memfn_ptr_Foo = (FooMemFuncPtr)&DerivedClass::FooDerivedClass;
    }
};

int main(int argc, char* argv[])
{
    DerivedClass derived_inst;
    derived_inst.Foo(); // "FooDerivedClass()"

    BaseClass base_inst;
    base_inst.Foo(); // "FooBaseClass()"

    BaseClass * derived_heap_inst = new DerivedClass;
    derived_heap_inst->Foo(); // "FooDerivedClass()"
    delete derived_heap_inst;

    return 0;
}
Vigor answered 27/6, 2013 at 13:39 Comment(7)
1. Please profile the code before you ask questions like these. What you're basically saying is "profile the code for me". 2. Look up compile-time polymorphism.Scrag
Old but might be interesting for you: codeproject.com/Articles/7150/…Tailband
Yes, I plan to profile the code, but I was curious whether there are any conceptual differences in performance.Vigor
"why is it not more ubiquitous?" For the same reason that assembly language and Brainf*** aren't ubiquitous ... oh, and it's slower.Idolater
care to explain why it is slower?Vigor
You're paying for the (potential) saving of one cache miss by storing one function pointer per object instead of one per class and the loss of immutability (i.e. predictability) of the function pointer. I'm sure that this tradeoff has been measured repeatedly in order to arrive at the common implementation of virtual calls. Alternatively, all C++ implementers are nitwits that just haven't thought of doing it that way, but I somehow doubt that.Sunda
The reason virtual functions exist is to allow a decision of what code to execute to be made as late as possible, based on object type. It's a modular alternative to a switch statement or if-ladder. If that's what your program needs, then use it. If not, don't.Erythromycin
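The compile-time polymorphism mentioned in the comments is usually done with the CRTP (curiously recurring template pattern). A minimal sketch (class names here are made up for illustration):

```cpp
#include <cassert>

// CRTP sketch: the base class knows the derived type at compile time, so
// Foo() resolves to a direct, inlinable call: no vtable, no stored pointer.
template <typename Derived>
struct Base {
    int Foo() { return static_cast<Derived*>(this)->FooImpl(); }
};

struct Loud  : Base<Loud>  { int FooImpl() { return 1; } };
struct Quiet : Base<Quiet> { int FooImpl() { return 2; } };
```

The trade-off is that the static type must be known at each call site, so this cannot replace runtime dispatch through a base-class pointer.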

I did a test, and the version using virtual function calls was faster on my system with optimization.

$ time ./main 1
Using member pointer

real    0m3.343s
user    0m3.340s
sys     0m0.002s

$ time ./main 2
Using virtual function call

real    0m2.227s
user    0m2.219s
sys     0m0.006s

Here is the code:

#include <cstdlib>
#include <cstring>
#include <iostream>
#include <stdio.h>

struct BaseClass
{
    typedef void(BaseClass::*FooMemFuncPtr)();
    FooMemFuncPtr m_memfn_ptr_Foo;

    void FooBaseClass() { }

    BaseClass()
    {
        m_memfn_ptr_Foo = &BaseClass::FooBaseClass;
    }

    void Foo()
    {
        ((*this).*m_memfn_ptr_Foo)();
    }
};

struct DerivedClass : public BaseClass
{
    void FooDerivedClass() { }

    DerivedClass() : BaseClass()
    {
        m_memfn_ptr_Foo = (FooMemFuncPtr)&DerivedClass::FooDerivedClass;
    }
};

struct VBaseClass {
  virtual void Foo() = 0;
};

struct VDerivedClass : VBaseClass {
  virtual void Foo() { }
};

static const size_t count = 1000000000;

static void f1(BaseClass* bp)
{
  for (size_t i=0; i!=count; ++i) {
    bp->Foo();
  }
}

static void f2(VBaseClass* bp)
{
  for (size_t i=0; i!=count; ++i) {
    bp->Foo();
  }
}

int main(int argc, char** argv)
{
    if (argc < 2) return 1; // expects a test number: 1 or 2
    int test = atoi(argv[1]);
    switch (test) {
        case 1:
        {
            std::cerr << "Using member pointer\n";
            DerivedClass d;
            f1(&d);
            break;
        }
        case 2:
        {
            std::cerr << "Using virtual function call\n";
            VDerivedClass d;
            f2(&d);
            break;
        }
    }

    return 0;
}

Compiled using:

g++ -O2    main.cpp   -o main

with g++ 4.7.2.

Appendant answered 27/6, 2013 at 14:2 Comment(2)
very interesting.. thanks very much for profiling it! i wonder how much this has to do with the vtable and instructions being fresh in cacheVigor
This could also be due to the fact that virtual tables have been in C++ for a long time and hence compiler writers have learned how and when to optimise them. Where as with your code the compiler can make fewer assumptions and has to do the less optimal thing.Armchair

Virtual function calls can be slow due to virtual calls having to traverse the v-table,

That's not quite correct. The vtable itself is laid out at compile time, with each slot holding the most specialized override in the hierarchy; on construction the object merely receives a pointer to that table. Calling a virtual function does not iterate over pointers; it executes something like *(vtbl_address + 8)(args);, which is computed in constant time.
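To make the constant-time nature concrete, here is a hand-rolled sketch of roughly what a compiler generates for a virtual call (all names are illustrative; real vtable layout is ABI-specific):

```cpp
#include <cassert>

// Illustrative stand-in for compiler-generated virtual dispatch.
struct Animal;
typedef int (*FooFn)(Animal*);

struct VTable { FooFn foo; };          // one slot per virtual function

struct Animal { const VTable* vtbl; }; // each object stores one pointer

int dog_foo(Animal*) { return 1; }
int cat_foo(Animal*) { return 2; }

const VTable dog_vtbl = { dog_foo };   // one shared table per class
const VTable cat_vtbl = { cat_foo };

// A "virtual call": load the vtbl pointer, index a fixed slot, call.
// There is no iteration; it is O(1) regardless of hierarchy depth.
int call_foo(Animal* a) { return a->vtbl->foo(a); }
```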

which can result in a data cache miss as well as an instruction cache miss... Not good for performance critical applications.

Your solution is not good for performance critical applications (in general) either, because it is generic.

As a rule, performance critical applications are optimized on a per-case basis (measure, pick code with worst performance problems within module and optimize).

With this per-case approach, you will probably never hit a case where your code is slow because a call goes through the vtable. If you do, the slowness will more likely come from calling functions through pointers at all instead of directly (i.e. the problem would be solved by inlining, not by adding an extra pointer to the base class).
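Inlining here usually means devirtualization: when the compiler can prove the dynamic type (for example via final), it can replace the indirect call with a direct, inlinable one. A sketch under that assumption (class names are made up):

```cpp
#include <cassert>

struct Shape {
    virtual ~Shape() {}
    virtual int area() const = 0;
};

// 'final' tells the compiler no further override can exist, so a call
// through a Square* or Square& can be devirtualized and inlined.
struct Square final : Shape {
    int side;
    explicit Square(int s) : side(s) {}
    int area() const override { return side * side; }
};

int direct_area(const Square& s) { return s.area(); } // direct, inlinable
int virtual_area(const Shape& s) { return s.area(); } // indirect call
```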

All this is academic anyway, until you have a concrete case to optimize (and you have measured that your worst offender is virtual function calls).

Edit:

I am just wondering if this method is a viable replacement for the virtual-call paradigm, if so, why is it not more ubiquitous?

Because it looks like a generic solution (applying it ubiquitously would decrease performance instead of improving it), solving a non-existent problem (your application is generally not slowed down due to virtual function calls).

Acciaccatura answered 27/6, 2013 at 13:58 Comment(0)

Virtual functions do not "traverse" the table; they just do a single fetch of a pointer from a known location and call that address. That is as if you had a manual implementation of a pointer-to-function and used it for the call instead of a direct one.

So your work is only good for obfuscation, and it sabotages the cases where the compiler can issue a non-virtual direct call.

Using a pointer-to-member-function is probably even worse than a plain pointer-to-function: it will likely use the same VMT structure for a similar offset-based access, just with a variable offset instead of a fixed one.
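A pointer-to-member-function is indeed a heavier object than a plain function pointer; on the Itanium C++ ABI (g++/clang) it is two words, to encode a possible virtual target plus a this-pointer adjustment. A quick check (exact sizes are compiler- and ABI-specific):

```cpp
#include <cassert>
#include <cstdio>

struct Widget {
    void f() {}
    virtual void g() {} // a member pointer might name a virtual function
};

typedef void (Widget::*MemFn)();
typedef void (*FreeFn)();

// On the Itanium C++ ABI a MemFn holds a code address OR a vtable offset,
// plus a this-pointer adjustment, so it is larger than a plain pointer.
void print_sizes() {
    std::printf("plain function pointer:  %zu bytes\n", sizeof(FreeFn));
    std::printf("member function pointer: %zu bytes\n", sizeof(MemFn));
}
```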

Empale answered 27/6, 2013 at 13:43 Comment(4)
They have to additionally fetch the vptr, which may or may not cause the same cacheline to be loaded, depending on what the function does.Tailband
True, but caches are large and code is small... a context switch really is a bigger threat than a v-table.Disciple
It's not very feasible to speculate about cache misses; better to measure. But the presented alternative looks to have at least the same number of indirections...Empale
@PlasmaHH: see actual measurements in another answer by VCEmpale

Mostly because it doesn't work. Most modern CPUs are better at branch prediction and speculative execution than you think. However, I have yet to see a CPU that does speculative execution beyond a non-static branch.

Furthermore, on a modern CPU you are more likely to have a cache miss because a context switch just before the call let another program take over the cache than because of a v-table lookup, and even that scenario is a very remote possibility.

Disciple answered 27/6, 2013 at 13:46 Comment(2)
Thank you very much for your answer, but can you explain what you mean by it "doesn't work", and how branch prediction comes into play in my specific example?Vigor
Basically it's a premature optimization that gets preempted by the fact that you're running the code on a modern non-cooperative multitasking operating system. Furthermore, the TLB is really good at predicting what is going to be used next, and because the functions in a vtable tend to be in the same code page, they will almost always be in cache.Disciple

Actually, some compilers may use thunks, which translate to ordinary function pointers themselves, so basically the compiler already does for you what you are trying to do manually (and your manual version would probably confuse the hell out of people).

Also, with a pointer to the virtual function table, the per-object space cost of virtual functions is O(1) (just the one pointer). If you store function pointers within the class instead, the cost is O(N): your class now contains as many pointers as there are "virtual" functions. If there are many functions, you pay a toll for that: when your object is prefetched, the cache line fills up with those pointers instead of a single pointer plus the first few data members you are likely to need. That sounds like a waste.
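The O(1)-versus-O(N) point is easy to demonstrate with sizeof (a sketch; exact sizes are implementation-defined, but the growth pattern is what matters):

```cpp
#include <cassert>

// One hidden vptr per object, no matter how many virtual functions exist.
struct Virt {
    virtual void a() {}
    virtual void b() {}
    virtual void c() {}
    virtual void d() {}
};

// One stored pointer per "virtual" function: the object grows with N.
struct Manual {
    void (*a)();
    void (*b)();
    void (*c)();
    void (*d)();
};
```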

The virtual function table, on the other hand, sits in one place for all objects of one type and is likely never pushed out of the cache while your code calls short virtual functions in a loop (which is presumably the scenario where virtual function cost could become the bottleneck).

As to branch prediction, in some cases a simple decision tree over the object type, with inlined functions for each particular type, gives good performance (you then store type information instead of a pointer). This is not applicable to all kinds of problems and would mostly be a premature optimization.
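The decision tree over object type amounts to tag dispatch: store an enum, switch on it, and call non-virtual functions the compiler can inline into each branch. A minimal sketch (names are made up):

```cpp
#include <cassert>

enum class Kind { Square, Triangle };

struct Shape {
    Kind kind;
    int  base;
    int  height; // unused for Square
};

// Non-virtual helpers the compiler can inline into each branch.
inline int square_area(const Shape& s)   { return s.base * s.base; }
inline int triangle_area(const Shape& s) { return s.base * s.height / 2; }

// Tag dispatch: a predictable switch instead of an indirect call.
int area(const Shape& s) {
    switch (s.kind) {
        case Kind::Square:   return square_area(s);
        case Kind::Triangle: return triangle_area(s);
    }
    return 0;
}
```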

As a rule of thumb, don't worry about language constructs just because they seem unfamiliar. Worry and optimize only after you have measured and identified where the bottleneck really is.

Twiddle answered 18/2, 2015 at 20:36 Comment(0)
