Avoiding repeated C++ virtual table lookup
Asked Answered
T

2

6

I have C++ program that reads a config file when the binary is executed, creates a number of child class instances based on the config file, and then periodically iterates over these instances and calls their respective virtual functions.

Gprof is telling me that these function calls are taking up a lot of time (the aforementioned iteration happens very frequently), so I want to try to avoid the repeated virtual function calls somehow.

The code is similar to the following. Once the program populates vector v at the start of the program, this vector won't change anymore for the rest of the program, so it seems inefficient to repeatedly have to do a virtual table lookup every time I want to call f(). I would think there must be a way to cache or save the function pointers somehow, but I'm not sure how.

Would love any suggestions you have on speeding things up. Thank you!

Edit: Sorry, I forgot to mention that the function calls f() for the vector of Child instances has to be in order from 0 to v.size() - 1, so I can't group together the elements of v that have the same derived type.

Also, this was built with -O3 -std=c++14

class Parent {
public:
    virtual void f() { } 
};

class Child1 : public Parent {
public:
    void f() { /* do stuff for child1 */ }
};

//...

class Child9 : public Parent {
public:
    void f() { /* do stuff for child9 */ }
};

int main() {
    vector<Parent*> v;

    // read config file and add Child instances to v based on the file contents

    while (true) {
        // do other stuff
        for (size_t i = 0; i != v.size(); ++i) {
             v[i]->f(); // expensive to do the same virtual table lookups every loop!
        }
    }
};
Threadfin answered 21/3, 2020 at 0:16 Comment(11)
Mathieu Ropert has a nice blog post titled Polymorphic Ducks on how to do static polymorphism in C++.Malvern
One important thing that should be added to the question: how was this compiled and linked? (Was the optimizer enabled? Was LTO or WPO enabled?)Malvern
The repeated calls to v.size() are likely to have more of a performance hit than a vtable lookup.Heterosexuality
My apologies, I forgot to specify that the function calls to f() for the vector of children has to be in order (see the edit to my original post). Would static polymorphism still work given this constraint?Threadfin
Is it telling you that the lookup (Which has similar cost to calling through a pointer to a function pointer) is taking a long time, or calling the function itself?Winery
I'm not a gprof expert, but I see the function names on the right that take up a lot of time (not sure how to tell that the lookup specifically takes a lot of time), and the functions f() themselves are very simple, for example one of them just multiplies the values at two memory addresses and stores the product in a third addressThreadfin
For an experiment, mock up a non-virtual example to compare against. Maybe use a config file that requires only Child1, make f() non-virtual, and change v to a vector<Child1 *>? Then switch f() back to virtual to see what the impact of the vtable lookup is?Yseult
Your Parent should have a virtual ~Parent() {}!Doak
@Malvern Sorry but all that sugar in that article, while being nice, doesn't change the fact that at the end of the day they don't get around the virtual call, so what's the point, really?Doak
When the "do other stuff" doesn't do much and the f() functions are "pretty simple" then, yeah, the final loop that is doing v[i]->f() will dominate. That doesn't mean that the call of virtual functions is a concern - it means there is no little going on that the function calls become a significant part of what the code does at run time. If you're worried about hogging the CPU, then judiciuosly add a few sleeps or delays in to your loops (even a few milliseconds here and there will free up CPU cycles).Electrify
@Doak • I see your point. The objects are made with a vfntable, but (in the example) the drawable concept_t and derived model_t do have a vfntable, and are used to glue the behavior to the actual objects that have associated freestanding functions (but not member functions). The SO Q&A CRTP static polymorphism might be more on point. Too bad the CRTP syntax is mightily convoluted, and virtual method syntax is so easy.Malvern
B
3

Based on some of the questions and your answers in the comments, here are a couple of considerations.

1) Your problem (if there is one, your solution might already be close to optimal, depending on details you have not mentioned) is most likely somewhere else, not in the overhead of a virtual function call.

If you really run this in a tight loop, and there's not much going on in the implementations of f() that touches a lot of memory, your vtables probably remain in the L1 cache, and the virtual function call overhead will be absolutely minimal, if any, on modern hardware.

2) You say "the functions f() themselves are very simple, for example one of them just multiplies the values at two memory addresses and stores the product in a third address" - this might not be as innocent as you expect. For reference, going to L1 cache will cost you about 3 cycles, going to RAM may cost as much as 60-200, depedning on your hardware.

If you have enough of these objects (so that keeping all of the memory they reference in L1 cache is not possible), and the memory locations they reference are basically random (so that prefetching is ineffective), and/or you touch enough things in the rest of your program (so that all the relevant data gets vacated from cache between the loops over your vector), the cost of fetching and storing the values from and to memory/lower levels of cache will outweigh the cost of the virtual function calls by orders of magnitude in the worst case.

3) You iterate over a vector of pointers to objects - not the objects themselves.

Depending on how you allocate the objects and how big they are, this might not be an issue - prefetching will do wonders for you if you allocate them in a tight loop and your allocator packs them nicely. If, however, you allocate/free a lot of other things and mix in the allocations of these objects in between, they may end up located sparsely and in basically random locations in memory; then iterating over them in the order of creation will involve a lot random reads from memory, which will again be far slower than any virtual function overhead.

4) You say "calls to f() for the vector of children has to be in order" - do they?

If they do, then you are out of luck in some ways. If, however, you can re-architect your system so that they can be called ordered by type, then there is a lot of speed to be gained in various aspects - you could probably allocate an array of each type of object (nice, dense packing in memory), iterate over them in order (prefetcher friendly), and call your f()'s in groups for a single, well known type (inlining friendly, instruction cache friendly).

5) And finally - if none of the above applies and your problem is really in virtual function calls (unlikely), then, yes, you can try storing a pointer to the exact function you need to call for each object in some fashion - either manually or by using one of the type erasure / duck typing methods others have suggested.

My main point is this - there a lot of performance benefits to be had from changing the architecture of your system in some ways.

Remember: accessing things that are already in L1/L2 cache is good, having to go to L3/RAM for data is worse; accessing memory in a sequential order is good, jumping all over memory is bad; calling the same method in a tight loop, potentially inlining it, is good, calling a lot of different methods in a tight loop is worse.

If this is a part of your program the performance of which really matters, you should consider changing the architecture of your system to allow for some of the previously mentioned optimizations. I know this may seem daunting, but that is the game we are playing. Sometimes you need to sacrifice "clean" OOP and abstractions for performance, if the problem you are solving allows for it.

Bickford answered 21/3, 2020 at 2:20 Comment(1)
Thanks for the insight. I've done further testing and believe (2) is the culprit actually. I changed the multiplying of values at two memory addresses to simply the multiplying of two constants, and the runtime of the associated f() drops significantly in gprofThreadfin
D
2

Edit: For vector of arbitrary child types mixed in together, I recommend going with the virtual call.


If, depending on config, there were a vector of only one child type - or if you can separate the different types into separate containers, then this could be a case where compile time polymorphism might be an option instead of runtime one. For example:

template<class Child, class Range>
void f_for(Range& r) {
    for (Parent* p : r) {
        Child* c = static_cast<Child*>(p);
        c->Child::f(); // use static dispatch to avoid virtual lookup
    }
}

...

if (config)
    f_for<Child1>(v);
else
    f_for<Child2>(v);

Alternative to explicit static dispatch would be to mark the child class or the member function final.

You might even expand the static portion of the program so that you get to use vector<Child1> or vector<Child2> directly, avoiding the extra indirection. At this point the inheritance is not even necessary.

Desiccator answered 21/3, 2020 at 0:34 Comment(6)
Thanks, but I apologize, I forgot to mention that the calls to f() for the vector of children has to be in order (please see my edit to my original post). Does that invalidate the solution you suggested?Threadfin
@Threadfin So, what you're saying is that the vector has pointers to several different child types, in arbitrary order? In that case, the virtual lookup is probably the way to go.Desiccator
That's right. The order of those pointers in v remains the same throughout the program though. There's no way to somehow save or cache the function pointers retrieved from the virtual table lookups?Threadfin
@Threadfin Let's assume you could: Why would looking up the cache be faster than looking up the virtual table?Desiccator
Well I was hoping in theory that one could just have a member variable, say void* g, in the Parent class which can hold the correct version of f() for the object in question, and then each iteration instead of calling v[i]->f() one can just call v[i]->g(). I assume that'd involve less overhead than whatever the virtual table lookup entails, but I could be completely wrongThreadfin
@Threadfin Feel free to try it out. If you can make it faster than a virtual call, then some people may be interested in a blog post about it.Desiccator

© 2022 - 2024 — McMap. All rights reserved.