Performance hit of vtable lookup in C++

I'm evaluating a rewrite of a piece of real-time software from C/assembly to C++/assembly (for reasons not relevant to the question, parts of the code absolutely must be done in assembly).

An interrupt arrives at a 3 kHz frequency, and for each interrupt around 200 different things are to be done in sequence. The processor runs at 300 MHz, giving us 100,000 cycles to do the job. This has been solved in C with an array of function pointers:

// Each function does a different thing, all take one parameter being a pointer
// to a struct, each struct also being different.
void (*todolist[200])(void *parameters);

// Array of pointers to structs containing each function's parameters.
void *paramlist[200];

void realtime(void)
{
  int i;
  for (i = 0; i < 200; i++)
    (*todolist[i])(paramlist[i]);
}

Speed is important. The above 200 iterations are performed 3,000 times per second, so in practice we do 600,000 iterations per second. The above for loop compiles to five cycles per iteration, yielding a total cost of 3,000,000 cycles per second, i.e. 1% CPU load. Assembler optimization might bring that down to four instructions; however, I fear we might get some extra delay due to memory accesses close to each other, etc. In short, I believe those five cycles are pretty optimal.

Now to the C++ rewrite. Those 200 things we do are sort of related to each other. There is a subset of parameters that they all need, use, and keep in their respective structs. In a C++ implementation they could thus neatly be regarded as inheriting from a common base class:

class Base
{
public:
  virtual void Execute();
  int something_all_things_need;
};
class Derived1 : public Base
{
public:
  void Execute() { /* Do something */ }
  int own_parameter;
  // Other own parameters
};
class Derived2 : public Base { /* Etc. */ };

Base *todolist[200];

void realtime(void)
{
  for (int i = 0; i < 200; i++)
    todolist[i]->Execute(); // vtable look-up! 20+ cycles.
}

My problem is the vtable lookup. I cannot afford 600,000 lookups per second; this would account for more than 4% of wasted CPU load. Moreover, the todolist never changes during run-time; it is only set up once at start-up, so the effort of looking up which function to call is truly wasted. Upon asking myself the question "what is the most optimal end result possible?", I look at the assembler code given by the C solution, and rediscover an array of function pointers...

What is the clean and proper way to do this in C++? Making a nice base class, derived classes and so on feels pretty pointless when in the end one again picks out function pointers for performance reasons.

Update (including correction of where the loop starts):

The processor is an ADSP-214xx, and the compiler is VisualDSP++ 5.0. When #pragma optimize_for_speed is enabled, the C loop takes 9 cycles. Assembly-optimizing it in my mind yields 4 cycles; however, I didn't test it, so that's not guaranteed. The C++ loop takes 14 cycles. I'm aware the compiler could do a better job, but I did not want to dismiss this as a compiler issue - getting by without polymorphism is still preferable in an embedded context, and the design choice still interests me. For reference, here is the resulting assembly:

C:

i3=0xb27ba;
i5=0xb28e6;
r15=0xc8;

Here's the actual loop:

r4=dm(i5,m6);
i12=dm(i3,m6);
r2=i6;
i6=i7;
jump (m13,i12) (db);
dm(i7,m7)=r2;
dm(i7,m7)=0x1279de;
r15=r15-1;
if ne jump (pc, 0xfffffff2);

C++ :

i5=0xb279a;
r15=0xc8;

Here's the actual loop:

i5=modify(i5,m6);
i4=dm(m7,i5);
r2=i4;
i4=dm(m6,i4);
r1=dm(0x3,i4);
r4=r2+r1;
i12=dm(0x5,i4);
r2=i6;
i6=i7;
jump (m13,i12) (db);
dm(i7,m7)=r2;
dm(i7,m7)=0x1279e2;
r15=r15-1;
if ne jump (pc, 0xffffffe7);

In the meantime, I think I have found something of an answer. The lowest cycle count is achieved by doing the very least possible: I have to fetch a data pointer, fetch a function pointer, and call the function with the data pointer as parameter. When fetching a pointer, the index register is automatically modified by a constant, and one can just as well let this constant equal 1. So once again one finds oneself with an array of function pointers, and an array of data pointers.

Naturally, the limit is what can be done in assembly, and that has now been explored. With this in mind, I now understand that even though introducing a base class comes naturally, it was not really what fit the bill. So I guess the answer is that if one wants an array of function pointers, one should make oneself an array of function pointers...

Quirt answered 23/8, 2013 at 14:12 Comment(13)
I put this as a comment, but don't you think your 'interrupt handler' is overloaded with tasks? Could you implement something like an interrupt handler plus N worker threads with a dispatcher? As soon as the interrupt arrives, you just 'make a note' of the important parameters coming from the interrupt (timestamp, some hardware data, etc.) and notify the dispatcher that data has arrived, so your threads can start processing it in parallel, or in a certain sequence which you can implement as a state machine or whatever.Eyewitness
The idea is that the interrupt code should stay as light as possible, to be ready for new data coming in; the processing actually happens somewhere else.Eyewitness
Actually, the above is slightly simplified: the for loop isn't done in the interrupt; the interrupt merely triggers it to be done outside of the interrupt. The fact remains, only 100,000 cycles are available before the interrupt comes again. The CPU load is high regardless of the context in which the code runsQuirt
Have you a C++11 compiler?Gambrel
@Quirt Can you run those tasks in parallel?Eyewitness
Avoid vtables if you are worried about latency; use a template with a T functor in C++ and you avoid the additional lookupRodney
@Eyewitness As far as the CPU goes, all features alleviating the processor load have been used. The CPU supports instruction-level and data-level parallelism, and making efficient use of this is what the assembler code does. Some filters are HW-accelerated, as far as possible. The CPU is more or less saturated at 90%+ load and cannot be exploited much more. Even if a more powerful CPU is put in use, the above question remains the sameQuirt
@JanHerrmann the manual for the compiler shipped with the DSP in question refers to Bjarne Stroustrup's book from 1997, so I assume C++11 is not supported :S Is there any cool feature in that standard you had in mind?Quirt
There's such a thing as the abstraction penalty. If you can't afford it, use lower-level constructs, or a lower-level language.Bowls
@user2711077, I really find it hard to believe that a more powerful CPU would not solve the vtable overhead problem (if that's the problem).Foreshow
@Foreshow Well, of course it solves it for this specific case, but that's not the generic problem I'm addressing here. Assume a more powerful CPU at 90%+ load. It may be twice or ten times faster; assume then that that CPU is saturated. The question remains: what is the clean C++ way of iterating through an array of function pointers?Quirt
@user2711077, can you define 'clean'?Foreshow
@Quirt what kind of DSP are you using?Eyewitness

What makes you think vtable lookup overhead is 20 cycles? If that's really true, you need a better C++ compiler.

I tried this on an Intel box, not knowing anything about the processor you're using, and as expected the difference between the C despatch code and the C++ vtable despatch is one instruction, having to do with the fact that the vtable involves an extra indirection.

C code (based on OP):

void (*todolist[200])(void *parameters);                                  
void *paramlist[200];
void realtime(void)
{       
  int i;
  for (i = 0; i < 200; i++)                                               
    (*todolist[i])(paramlist[i]);                                         
}       

C++ code:

class Base {
  public:
    Base(void* unsafe_pointer) : unsafe_pointer_(unsafe_pointer) {}
    virtual void operator()() = 0;
  protected:
    void* unsafe_pointer_;
};

Base* todolist[200];
void realtime() {
  for (int i = 0; i < 200; ++i)
    (*todolist[i])();
}

Both compiled with gcc 4.8, -O3:

realtime:                             |_Z8realtimev:
.LFB0:                                |.LFB3:
        .cfi_startproc                |        .cfi_startproc
        pushq   %rbx                  |        pushq   %rbx
        .cfi_def_cfa_offset 16        |        .cfi_def_cfa_offset 16
        .cfi_offset 3, -16            |        .cfi_offset 3, -16
        xorl    %ebx, %ebx            |        movl    $todolist, %ebx
        .p2align 4,,10                |        .p2align 4,,10
        .p2align 3                    |        .p2align 3
.L3:                                  |.L3:
        movq    paramlist(%rbx), %rdi |        movq    (%rbx), %rdi
        call    *todolist(%rbx)       |        addq    $8, %rbx
        addq    $8, %rbx              |        movq    (%rdi), %rax
                                      |        call    *(%rax)
        cmpq    $1600, %rbx           |        cmpq    $todolist+1600, %rbx
        jne     .L3                   |        jne     .L3
        popq    %rbx                  |        popq    %rbx
        .cfi_def_cfa_offset 8         |        .cfi_def_cfa_offset 8
        ret                           |        ret

In the C++ code, the first movq gets the address of the vtable, and the call then indexes through that. So that's one instruction overhead.

According to OP, the DSP's C++ compiler produces the following code. I've inserted comments based on my understanding of what's going on (which might be wrong). Note that (IMO) the loop starts one location earlier than OP indicates; otherwise, it makes no sense (to me).

# Initialization.
# i3=todolist; i5=paramlist           | # i5=todolist
i3=0xb27ba;                           | # No paramlist in C++
i5=0xb28e6;                           | i5=0xb279a;
# r15=count
r15=0xc8;                             | r15=0xc8;

# Loop. We need to set up r4 (first parameter) and figure out the branch address.
# In C++ by convention, the first parameter is 'this'
# Note 1:
r4=dm(i5,m6); # r4 = *paramlist++;    | i5=modify(i5,m6); # i4 = *todolist++
                                      | i4=dm(m7,i5);     # ..
# Note 2:                            
                                      | r2=i4;            # r2 = obj
                                      | i4=dm(m6,i4);     # vtable = *(obj + 1)
                                      | r1=dm(0x3,i4);    # r1 = vtable[3]
                                      | r4=r2+r1;         # param = obj + r1

i12=dm(i3,m6); # i12 = *todolist++;   | i12=dm(0x5,i4);   # i12 = vtable[5]

# Boilerplate call. Set frame pointer, push return address and old frame pointer.
# The two (push) instructions after jump are actually executed before the jump.
r2=i6;                                | r2=i6;
i6=i7;                                | i6=i7;
jump (m13,i12) (db);                  | jump (m13,i12) (db);
dm(i7,m7)=r2;                         | dm(i7,m7)=r2;
dm(i7,m7)=0x1279de;                   | dm(i7,m7)=0x1279e2;

# if (count--) loop
r15=r15-1;                            | r15=r15-1;
if ne jump (pc, 0xfffffff2);          | if ne jump (pc, 0xffffffe7);

Notes:

  1. In the C++ version, it seems that the compiler has decided to do the post-increment in two steps, presumably because it wants the result in an i register rather than in r4. This is undoubtedly related to the issue below.

  2. The compiler has decided to compute the base address of the object's real class, using the object's vtable. This occupies three instructions, and presumably also requires the use of i4 as a temporary in step 1. The vtable lookup itself occupies one instruction.

So: the issue is not vtable lookup, which could have been done in a single extra instruction (but actually requires two). The problem is that the compiler feels the need to "find" the object. But why doesn't gcc/i86 need to do that?

The answer is: it used to, but it doesn't any more. In many cases (where there is no multiple inheritance, for example), the cast of a pointer to a derived class to a pointer of a base class does not require modifying the pointer. Consequently, when we call a method of the derived class, we can just give it the base class pointer as its this parameter. But in other cases, that doesn't work, and we have to adjust the pointer when we do the cast, and consequently adjust it back when we do the call.

There are (at least) two ways to perform the second adjustment. One is the way shown by the generated DSP code, where the adjustment is stored in the vtable -- even if it is 0 -- and then applied during the call. The other way, (called vtable-thunks) is to create a thunk -- a little bit of executable code -- which adjusts the this pointer and then jumps to the method's entry point, and put a pointer to this thunk into the vtable. (This can all be done at compile time.) The advantage of the thunk solution is that in the common case where no adjustment needs to be done, we can optimize away the thunk and there is no adjustment code left. (The disadvantage is that if we do need an adjustment, we've generated an extra branch.)

As I understand it, VisualDSP++ is based on gcc, and it might have the -fvtable-thunks and -fno-vtable-thunks options. So you might be able to compile with -fvtable-thunks. But if you do that, you would need to compile all the C++ libraries you use with that option, because you cannot mix the two calling styles. Also, there were (15 years ago) various bugs in gcc's vtable-thunks implementation, so if the version of gcc used by VisualDSP++ is old enough, you might run into those problems too (IIRC, they all involved multiple inheritance, so they might not apply to your use case.)


(Original test, before update):

I tried the following simple case (no multiple inheritance, which can slow things down):

class Base {
  public:
    Base(int val) : val_(val) {}
    virtual int binary(int a, int b) = 0;
    virtual int unary(int a) = 0;
    virtual int nullary() = 0;
  protected:
    int val_;
};

int binary(Base* begin, Base* end, int a, int b) {
  int accum = 0;
  for (; begin != end; ++begin) { accum += begin->binary(a, b); }
  return accum;
}

int unary(Base* begin, Base* end, int a) {
  int accum = 0;
  for (; begin != end; ++begin) { accum += begin->unary(a); }
  return accum;
}

int nullary(Base* begin, Base* end) {
  int accum = 0;
  for (; begin != end; ++begin) { accum += begin->nullary(); }
  return accum;
}

And compiled it with gcc (4.8) using -O3. As I expected, it produced exactly the same assembly code as your C despatch would have done. Here's the for loop in the case of the unary function, for example:

.L9:
        movq    (%rbx), %rax
        movq    %rbx, %rdi
        addq    $16, %rbx
        movl    %r13d, %esi
        call    *8(%rax)
        addl    %eax, %ebp
        cmpq    %rbx, %r12
        jne     .L9
Knee answered 23/8, 2013 at 15:38 Comment(12)
I'm on a different processor; actually it's a DSP optimized for 32-bit floating point. There's no out-of-order execution, and I think there's hardly even branch prediction, so there's not much dynamic help to expect from the CPU. Regarding the number of cycles, to be honest I exaggerated a bit; it is rather 10+ instructions. The fact remains, the vtable lookup is done for each iteration, causing a non-negligible penaltyQuirt
@user2711077: I assumed you were on a different processor, so I have no way of testing. But the i386 isn't taking much advantage of out-of-order execution here: in the loop above, the only real advantage is incrementing %rbx before the call. The difference between that and the C despatch is essentially one instruction, the first movq, because it needs to do one extra indirection. Can one indirection cost 10 cycles on your DSP?Knee
@user2711077: OK, changed the example to exactly your code. How do those two compile on your architecture, using appropriate optimization settings?Knee
@user2711077: Actually, given the ASM in this answer, and the fact you say your compiler isn't very good, you may consider manually unrolling your loop in addition to any other answer you find. Manually unrolling tight loops might make a slight difference.Secede
@Knee While I'd go more for the template approach, I upvoted your answer since it shows exactly what the compiler has done with the code. The access to the vtable is really fast with a decent compiler. Why I'd prefer the template approach is that I don't see any need, especially on embedded systems, to use polymorphism... (for other areas as well, but that is another discussion).Gothar
@MooingDuck: Good point. Compiling with -funroll-loops turned the loop into 8 repetitions of movq/movq/callq (C++), which is probably a win.Knee
Good analysis. You have a very valid point in that the compiler doesn't do the best job possible (see update above). I'm up-voting your answer; however, I was more looking for alternative ways of doing things. Just dismissing it as a compiler issue and carrying on didn't feel like fully exploring the problem and its various solutionsQuirt
@user2711077. Thanks. I took a look at your edit, and edited my post accordingly; I think the cost you're seeing is pointer adjustment and not "vtable lookup", although admittedly the difference is somewhat subtle. Still, it's always useful to be clear about what is going on.Knee
On in-order processors, the pipeline bubble from a branch mispredict absolutely can cause a 20 cycle delay. I measured it directly: assemblyrequired.crashworks.org/2009/01/19/… This is to be expected when the branch resolution stage of the pipeline is 20 stages ahead of the instruction fetch.Terrazzo
@Crashworks: yes, but: (1) we're not comparing virtual function call with direct function call. We're comparing it with an indirect function call, albeit with one level less indirection. There is still exactly the same misprediction penalty, because it is still the case that the same call site is branching to different, unpredictable locations. And (2): The DSP does not have branch prediction. It does, however, have two delay slots following the jump, and they are both being used by pre-call stack adjustment operations, so the branch delay is fully utilized.Knee
so for those who don't want to read the entire message - what is the conclusion? on the modern PC what is average vtbl lookup overhead in nanoseconds?Bridgeboard
@javapowered: tl;dr version: somewhere between nothing and almost nothing. You won't notice the difference.Knee

As has already been mentioned, you can use templates to do away with the dynamic dispatch. Here is an example that does this:

template <typename FirstCb, typename ... RestCb>
struct InterruptHandler {
    void execute() {
        // I construct temporary objects here since I could not figure out how you
        // construct your objects. You can change these signatures to allow for
        // passing arbitrary params to these handlers.
        FirstCb().execute();
        InterruptHandler<RestCb...>().execute();
    }
};

// Base case: a specialization for the last handler terminates the recursion.
// Without it, InterruptHandler<RestCb...> fails to instantiate once the pack
// is empty, since the primary template requires at least one type.
template <typename LastCb>
struct InterruptHandler<LastCb> {
    void execute() { LastCb().execute(); }
};

InterruptHandler</* Base, Derived1, and so on */> handler;

void realtime(void) {
    handler.execute();
}

This should completely eliminate the vtable lookups while providing more opportunities for compiler optimization since the code inside execute can be inlined.

Note however that you will need to change some parts depending on how you initialize your handlers. The basic framework should remain the same. Also, this requires that you have a C++11 compliant compiler.

Allomorphism answered 23/8, 2013 at 15:21 Comment(0)

I suggest using static methods in your derived classes and placing these functions into your array. This would eliminate the overhead of the vtable search, and it is closest to your C language implementation.

You may end up sacrificing polymorphism for speed.
Is the inheritance necessary?
Just because you switch to C++ doesn't mean you have to switch to object-oriented design.

Also, have you tried unrolling your loop in the ISR?
For example, perform 2 or more execution calls before returning back to the top of the loop.

Also, can you move any of the functionality out of the ISR? Can any part of the functionality be performed by the "background loop" instead of the ISR? This would reduce the time in your ISR.

Hamo answered 23/8, 2013 at 15:34 Comment(1)
Yay for the simple fast solution. Just because you have a C++ compiler doesn't mean you need to use all those shiny bells and whistles.Edson

You can hide the void* type erasure and type recovery inside templates. The result would (hopefully) be the same array of function pointers. This would help with the casting and stays compatible with your code:

#include <iostream>

template<class ParamType,class F>
void fun(void* param) {
  F f;
  f(*static_cast<ParamType*>(param));
}

struct my_function {
  void operator()(int& i) {
    std::cout << "got it " << i << std::endl;
  }
};


int main() {
  void (*func)(void*) = fun<int, my_function>;

  int j=4;
  func(&j);

  return 0;
}

In this case you can create new functions as function objects with more type safety. The "normal" OOP approach with virtual functions doesn't help here.

In a C++11 environment you could create the array with the help of variadic templates at compile time (but with a complicated syntax).

Gambrel answered 23/8, 2013 at 15:33 Comment(0)

Find out where your compiler puts the vtable, access it directly to get the function pointers, and store them for later use. That way you have pretty much the same approach as in C, with an array of function pointers.

Foreshow answered 23/8, 2013 at 15:10 Comment(9)
Dear God, please don't allow such things to happen on Your green earth. Thank you.Bowls
Very dangerous if compiler vendors or compiler versions change.Hamo
@Thomas, it's not dangerous at all. If you do it, you know what you have to take care of. Once the object is created, the vtable is filled with values and it doesn't change, just like its location doesn't change. If you shave yourself with a razor, yes, you can cut yourself if you are clumsy, but you can also use an electric shaver.Foreshow
"Balancing a car on this vertical pole is fine, as long as you don't screw up"Secede
@user1764961: Just because your executable file doesn't change doesn't mean the code path is the same. Libraries and DLLs are updated all the time. Also, if the compiler changes, or you change a compiler option, or it's compiled at a different time of day, the vtable might be different, making all your hard work crash.Secede
1. You don't change compiler 15 times per day. 2. Even when you do, you can see if the location has changed. 3. If you just upgraded the current compiler, the vtbl will most probably be in the same place. For example, MS didn't change the vtbl location for decades. If you were writing a compiler, would you implement vtbl moving in every build as you suggest? I don't think so. So, let's move from all theoretical possibilities to practice.Foreshow
Don't forget this is the embedded world. Code here changes rarely, and c/c++ compiler is taken more like an advanced assembler. It is a tradeoff you might need to make to work around a badly optimizing c++ compiler. I'm not saying this is the way to go, but it's definitely a candidate.Fifield
While this won't work if your LLD defines your compiler as all the things, for someone working on device driver optimisation for a well-defined platform I don't see a problem. __asm__ exists in most C++ compilers for a reason.Coronet
Added a +1 just to balance out the scales - not something I would ever try but I'm not an embedded developer.Staffan

This is unrelated to your question, but if you are that keen on performance, you could use templates to unroll the loop over the todolist:

#include <iostream>

void (*todo[3])(void *);
void *param[3];

void f1(void*) {std::cout << "1" << std::endl;}
void f2(void*) {std::cout << "2" << std::endl;}
void f3(void*) {std::cout << "3" << std::endl;}

// Obj<N>::apply() expands into N direct calls at compile time.
// Note that it calls todo[N-1] down to todo[0], i.e. in reverse order.
template<int N>
struct Obj {
    static void apply()
    {
        todo[N-1](param[N-1]);
        Obj<N-1>::apply();
    }
};

template<> struct Obj<0> { static void apply() {} };

int main() {
    todo[0] = f1;
    todo[1] = f2;
    todo[2] = f3;

    Obj<sizeof todo / sizeof *todo>::apply();
    return 0;
}
Assistance answered 27/8, 2013 at 7:46 Comment(0)
