Can you cache a virtual function lookup in C++?

Say I have a virtual function call foo() on an abstract base class pointer, mypointer->foo(). When my app starts up, based on the contents of a file, it chooses to instantiate a particular concrete class and assigns mypointer to that instance. For the rest of the app's life, mypointer will always point to objects of that concrete type. I have no way to know what this concrete type is (it may be instantiated by a factory in a dynamically loaded library). I only know that the type will stay the same after the first time an instance of the concrete type is made. The pointer may not always point to the same object, but the object will always be of the same concrete type. Notice that the type is technically determined at 'runtime' because it's based on the contents of a file, but that after 'startup' (file is loaded) the type is fixed.

However, in C++ I pay the virtual function lookup cost every time foo is called for the entire duration of the app. The compiler can't optimize the lookup away because there's no way for it to know that the concrete type won't vary at runtime (even if it were the most amazing compiler ever, it can't speculate on the behavior of dynamically loaded libraries). In a JIT-compiled language like Java or .NET the JIT can detect that the same type is being used over and over and do inline caching. I'm basically looking for a way to manually do that for specific pointers in C++.

Is there any way in C++ to cache this lookup? I realize that solutions might be pretty hackish. I'm willing to accept ABI/compiler specific hacks if it's possible to write configure tests that discover the relevant aspects of the ABI/compiler so that it's "practically portable" even if not truly portable.

Update: To the naysayers: if this wasn't worth optimizing, then I doubt modern JITs would do it. Do you think Sun's and MS's engineers were wasting their time implementing inline caching, and didn't benchmark it to ensure there was an improvement?

Capillary answered 26/1, 2010 at 18:43 Comment(16)
it would be interesting to see if LLVM can do the JIT trick on this...Nippur
Is shaving one extra indirection worth all the hackishness this would entail? That sounds quite hardcore. I can think of two ways to do it: 1. Patch all calls to the virtual function with the resolved address, in the loaded-in object code. You may be able to hack the linker to do this for you. 2. Use trampolines. But I don't know if that would have the same overhead as function pointers, or even more. Try both, and measure and see. :-PBradney
Why do you believe that the virtual function lookup cost is even worth optimizing? Remember, "Premature optimization is the root of all evil".Parallel
I wonder how much of your processing time is being lost in the function dispatch... have you meassured?Balustrade
you are wasting your time, this isn't a performance concern on modern processorsScifi
There is no "lookup cost", it is just an indirect call.Japan
There is a lookup cost. You have to load the vtable pointer, add an offset to the loaded pointer, then call through the computed address. A normal indirect call is one load. Also, I've updated my answer for those of you who think it's a waste of time.Capillary
@Joseph: You misunderstand. On modern CPU's, there is really no cost. We use virtual functions in and around our code rendering engine, and it performs top-notch. Why would JIT do it then? Because it's helpful. 99% guarantee versus 100% guarantee. You're wasting your time until you've finished your code and then profiled it. Not guessed from the get-go.Intersidereal
@GMan: So long as we've established that it can be valuable (the 99% versus 100% guarantee) then my question is still the same: Can this be done in C++?Capillary
"So long as we've established that it can be valuable" For very low definitions of valuable. :) "Can this be done in C++" I don't know. Honest question: Have you profiled the usage of your virtual functions and found most of your time was spent there?Intersidereal
@GMan, on modern CPUs, a virtual call is still a problem. I agree that it's small, but you can't simply say there is no cost. That's definitely wrong.Richma
@minjang: I didn't say strictly no cost, I said "really no cost". :P I meant compared to the multitude of other things going on, this is the least of the worries.Intersidereal
@Joseph: For a language like Java, where all functions are virtual, it's worth optimizing and inlining them. For C++, a single virtual lookup will NOT be the bottleneck, unless it's preventing the code from otherwise being inline. I say this as a micro-optimizer myself.Broomrape
@Tom: If there were a general solution to this problem I wouldn't apply it only in one place. I've only described it as a single instance to distill the essence of the problem, which is that C++ has a good compile/run time distinction for efficiency but not a good compile/start/run distinction.Capillary
I put a simple call to a virtual operator[] member function in a tight loop (rather than a non-virtual one) and it blew the execution time out by almost a factor of 3 in release code (LLVM clang 3.1). Caching this virtual member function might suggest a big improvement, surely? Perhaps because the non-virtual function can be inlined? So I compared with inlining disabled, and it's still 1.5x slower.Tipster
Anyway, the chosen answer doesn't actually answer the question.Singe

There are two costs to a virtual function call: The vtable lookup and the function call.

The vtable lookup is already taken care of by the hardware. Modern CPUs (assuming you're not working on a very simple embedded CPU) will predict the address of the virtual function in their branch predictor and speculatively execute it in parallel with the vtable lookup. The fact that the vtable lookup happens in parallel with the speculative execution of the function means that, when executed in a loop in the situations you describe, virtual function calls have next to zero overhead compared to direct, non-inlined function calls.

I've actually tested this in the past, albeit in the D programming language, not C++. When inlining was disabled in the compiler settings and I called the same function in a loop several million times, the timings were within epsilon of each other whether the function was virtual or not.

The second and more important cost of virtual functions is that they prevent inlining of the function in most cases. This is even more important than it sounds because inlining is an optimization that can enable several other optimizations such as constant folding in some cases. There's no way to inline a function without recompiling the code. JITs get around this because they're constantly recompiling code during the execution of your application.

Monody answered 26/1, 2010 at 18:52 Comment(11)
Do we even have to worry about the vtable lookup in a loop situation? I'm thinking that in a loop, where the object pointer doesn't change, won't the compiler optimize the vtable lookup out of the loop? If the object pointer doesn't change, then the object (and its type and vtable) can't change, so the results of the vtable lookup can't change. This isn't an app-wide optimization, but if it's doing it, then each loop will only have to do the lookup once, which for most apps should be more than good enough.Mayan
@Michael Kohne: I can't speak for every compiler, but based on reading disassemblies from the Digital Mars D compiler, it doesn't look like this happens. In theory, one could put the function pointer in a register, etc. I actually hacked together some assembly language code to do this one time and it wasn't any faster, probably because with speculative execution you're effectively not paying for the array lookup anyhow.Monody
In all other cases, the vtable will stay cached and the cost of going into the cache is next to nothing. Instead of trying to optimize cache hits, you should focus on cache misses. Every single cache miss will block the CPU for hundreds of cycles.Ginnygino
Your benchmark was not realistic. How well does speculative execution fare when the branch predictor actually has other things to worry about? When we're several layers of conditionally indirect calls deep? Also, your answer technically doesn't address the question.Capillary
@Joseph Garvin: If you've got that much other logic between executions of a given call (e.g. if the call is in an outer loop, not an inner loop, or if the function you're calling calls several other functions) then that virtual function call is not going to be a practically important bottleneck even if it's mispredicted every time.Monody
My argument is that if you have virtual calls making virtual calls, then at some point the constant overhead has to sum to a significant amount. I can see this happening in a dependency injection framework, where one injected piece call a virtual method on another injected piece, ad nauseum. After some depth you have to exhaust the speculators ability to execute things in parallel.Capillary
@dsimcha: Why do modern VMs do this if it is never an optimization?Capillary
@Joseph Garvin: Probably because they also want to inline the function.Monody
@dsimcha, you said that you called the same function in a loop. In this case, even if the function is virtual, the CPU can perfectly predict the branch target, so you saw very little difference. However, if the target is hard to predict, then it is quite different.Richma
@minjang: Right, if the target is hard to predict, you get different results. However, unpredictable branches of any kind are a nightmare for micro-optimization, and usually represent essential complexity in the code. The OP's question implied that the branches were predictable.Monody
@dsimcha, even when the targets are easy to predict, modern CPUs may not handle indirect branches well. Direct branches are predicted quite well, but if indirect targets are correlated with some pattern, they are still hard to predict. This is still ongoing work for CPU architects.Richma

Why is a virtual call expensive? Because you simply don't know the branch target until the code executes at runtime. Even modern CPUs still don't handle virtual and indirect calls perfectly. One can't simply say it costs nothing just because we have faster CPUs. No, it is not free.

1. How can we make it fast?

You already have a pretty deep understanding of the problem. All I can say is that if the virtual function call is easy to predict, then you can perform software-level optimization. If it's not (i.e., you really have no idea what the target of the virtual function will be), then I don't think there is a good solution for now. Even for the CPU, such an extreme case is hard to predict.

Actually, compilers such as Visual C++ can do virtual call speculation as part of PGO (profile-guided optimization) (link). If the profiling results can enumerate hot virtual function targets, the compiler translates them into direct calls, which can then be inlined. This is also called devirtualization. The same technique can be found in some Java dynamic optimizers.

2. To those one who say it's not necessary

If you're using scripting languages or C# and care mainly about coding efficiency, then yes, this is worthless. However, for anyone eager to save a single cycle to obtain better performance, indirect branches are still an important problem. Even the latest CPUs don't handle virtual calls well. One good example is a virtual machine or interpreter, which usually has a very large switch-case; its performance depends heavily on correct prediction of that indirect branch. So you can't simply say this is too low-level or unnecessary. There are hundreds of people trying to improve performance at the bottom of the stack, and that's why you can afford to ignore such details :)

3. Some boring computer architectural facts related to virtual functions

dsimcha has written a good answer about how the CPU can handle virtual calls effectively, but it's not exactly correct. First, all modern CPUs have a branch predictor, which literally predicts the outcome of a branch to increase pipeline throughput (or, equivalently, instruction-level parallelism, ILP). I would even say that single-threaded CPU performance depends solely on how much ILP you can extract from a single thread, and branch prediction is the most critical factor in obtaining higher ILP.

In branch prediction, there are two predictions: (1) the direction (is the branch taken or not? a binary answer), and (2) the branch target (where will it go? not a binary answer). Based on the prediction, the CPU speculatively executes code. If the speculation turns out to be wrong, the CPU rolls back and restarts from the mispredicted branch. This is completely hidden from the programmer's view, so you don't really know what's going on inside the CPU unless you profile with a tool like VTune, which reports branch misprediction rates.

In general, branch direction prediction is highly accurate (95%+), but branch targets are still hard to predict, especially for virtual calls and switch-cases (i.e., jump tables). A virtual call is an indirect branch, which requires an extra memory load, and the CPU also needs branch target prediction. Modern CPUs like Intel's Nehalem and AMD's Phenom have a specialized indirect branch target table.

However, I don't think the vtable lookup itself incurs much overhead. Yes, it requires an extra memory load, which can miss in the cache, but once the vtable has been loaded into the cache, subsequent lookups are pretty much always hits. If you're concerned about that cost, you can insert prefetching code to load the vtable in advance. The real difficulty of a virtual function call is that the CPU can't do a great job of predicting the target, which can drain the pipeline frequently due to target mispredictions.

Richma answered 26/1, 2010 at 19:48 Comment(0)

So assuming that this is a fundamental issue you want to solve (to avoid premature optimization arguments), and ignoring platform and compiler specific hackery, you can do one of two things, at opposite ends of complexity:

  1. Provide a function as part of the .dll that internally simply calls the right member function directly. You pay the cost of an indirect jump, but at least you don't pay the cost of a vtable lookup. Your mileage may vary, but on certain platforms, you can optimize the indirect function call.
  2. Restructure your application such that instead of calling a member function per instance, you call a single function that takes a collection of instances. Mike Acton has a wonderful post (with a particular platform and application type bent) on why and how you should do this.
Innocence answered 26/1, 2010 at 18:52 Comment(0)

All the answers so far deal with the simplest scenario, where calling a virtual method requires only getting the address of the actual method to call. In the general case, when multiple and virtual inheritance come into play, calling a virtual method also requires shifting the this pointer.

The method dispatch mechanism can be implemented in more than one way, but it is common to find that the entry in the virtual table is not the actual method to call, but rather some intermediate 'trampoline' code inserted by the compiler that relocates the this pointer prior to calling the actual method.

When the dispatch is at its simplest, just an extra pointer indirection, trying to optimize it does not make sense. When the problem is more complex, any solution will be compiler-dependent and hackish. Moreover, you don't even know what scenario you're in: if the objects are loaded from DLLs, then you don't really know whether the actual instance returned belongs to a simple linear inheritance hierarchy or to a more complex scenario.

Overfly answered 26/1, 2010 at 19:28 Comment(0)

So, what you basically want to do is convert runtime polymorphism into compile time polymorphism. Now you still need to build your app so that it can handle multiple "cases", but once it's decided which case is applicable to a run, that's it for the duration.

Here's a model of the runtime polymorphism case:

struct Base {
  virtual void doit(int&)=0;
};

struct Foo : public Base {
  virtual void doit(int& n) {--n;}
};

struct Bar : public Base {
  virtual void doit(int& n) {++n;}
};

void work(Base* it,int& n) {
  for (unsigned int i=0;i<4000000000u;i++) it->doit(n);
}

int main(int argc,char**) {
  int n=0;

  if (argc>1)
    work(new Foo,n);
  else
    work(new Bar,n);

  return n;
}

This takes ~14s to execute on my Core2, compiled with gcc 4.3.2 (32 bit Debian), -O3 option.

Now suppose we replace the "work" version with a templated version (templated on the concrete type it's going to be working on):

template <typename T> void work(T* it,int& n) {
  for (unsigned int i=0;i<4000000000u;i++) it->T::doit(n);
}

main doesn't actually need to be updated, but note that the two calls to work now trigger instantiations of, and calls to, two different type-specific functions (cf. the one polymorphic function previously).

Hey presto: it runs in 0.001s. Not a bad speedup factor for a two-line change! However, note that the massive speedup is entirely due to the compiler: once the possibility of runtime polymorphism in the work function is eliminated, it just optimizes the loop away and compiles the result directly into the code. But that makes an important point: in my experience the main gains from this sort of trick come from the opportunities for improved inlining and optimisation it gives the compiler when a less-polymorphic, more specific function is generated, not from the mere removal of vtable indirection (which really is very cheap).

But I really don't recommend doing stuff like this unless profiling absolutely indicates runtime polymorphism is really hitting your performance. It'll also bite you as soon as someone subclasses Foo or Bar and tries to pass that into a function actually intended for its base.

You might find this related question interesting too.

Monarda answered 26/1, 2010 at 22:39 Comment(2)
I agree that better optimization from inlining can be very useful. However, for a fair analysis, you need to distinguish the benefit from avoiding indirect (virtual) function calls and that from optimization combined with inlining, because you don't always get both. You need to look at the assembly code to see what actually happened.Participation
I'm a huge fan of CRTP, but I'll be the first to admit I've wasted far too much time trying to avoid RT polymorphism. Re. profiling: I think a lot of the people asking aren't really concerned about grinding a specific program to perfection as much as they've fixated on a hidden cost they don't understand. Isolating and studying it is a great reaction; disappointment follows, but the time is well spent.Ias

I have seen situations where avoiding a virtual function call is beneficial. This does not look to me to be one of those cases because you really are using the function polymorphically. You are just chasing one extra address indirection, not a huge hit, and one that might be partially optimized away in some situations. If it really does matter, you may want to restructure your code so that type-dependent choices such as virtual function calls are made fewer times, pulled outside of loops.

If you really think it's worth giving it a shot, you can set a separate function pointer to a non-virtual function specific to the class. I might (but probably wouldn't) consider doing it this way.

class MyConcrete : public MyBase
{
public:
  // Non-virtual path: we "know" obj is really a MyConcrete, so we can
  // downcast statically and skip the vtable. (do_foo stands in for the
  // real implementation.)
  static void foo_nonvirtual(MyBase* obj)
  { static_cast<MyConcrete*>(obj)->do_foo(); }

  virtual void foo()
  { do_foo(); }

private:
  void do_foo() { /* the actual work */ }
};

void (*f_ptr)(MyBase* obj) = &MyConcrete::foo_nonvirtual;
// Call f_ptr(obj) instead of obj->foo() in your code.
// Still not as good a solution as restructuring the algorithm.

Other than making the algorithm itself a bit wiser, I suspect any attempt to manually optimize the virtual function call will cause more problems than it solves.

Mesics answered 26/1, 2010 at 19:5 Comment(2)
"This does not look to me to be one of those cases because you really are using the function polymorphically." <-- Sort of. It's polymorphic until startup is over, but monomorphic after that.Capillary
This is just writing the vtable manually. Looks like a C dispatch table: stackoverflow.com/questions/62027943 .Tallman

You can't use a method pointer because pointers to member functions aren't considered covariant return types. See the example below:

#include <iostream>

struct base;
struct der;

typedef void(base::*pt2base)();
typedef void(der::*pt2der)();

struct base {
    virtual pt2base method() = 0;
    virtual void testmethod() = 0;
    virtual ~base() {}
};

struct der : base {
    void testmethod() {
        std::cout << "Hello from der" << std::endl;
    }
    pt2der method() { // invalid: pt2der isn't covariant with pt2base
        return &der::testmethod;
    }
};

The other option would be to declare the method as pt2base method(), but then the return statement would be invalid because &der::testmethod is not of type pt2base.

Also, even if you had a method that received a pointer or reference to the base type, you would have to dynamically cast it to the derived type in that method to do anything particularly polymorphic, which adds back the cost we're trying to save.

Ido answered 26/1, 2010 at 20:34 Comment(1)
They can be providing they share a common base (doesn't need to be a polymorphic base), see stackoverflow.com/questions/37662100 . This would essentially be writing the vtable manually though.Tallman

I asked a very similar question recently, and got the answer that it's possible as a GCC extension, but not portably:

C++: Pointer to monomorphic version of virtual member function?

In particular, I also tried it with Clang and it doesn't support this extension (even though it supports many other GCC extensions).

Marmoreal answered 19/3, 2011 at 12:7 Comment(0)

Could you use a method pointer?

The objective here is that the compiler would load the pointer with the location of the resolved method or function. This would occur once. After the assignment, the code would access the method in a more direct fashion.

I know that taking a pointer to an object and accessing the method via that object pointer invokes run-time polymorphism. However, there should be a way to load a method pointer with a resolved method, avoiding the polymorphism and calling the function directly.

I've checked the community wiki to introduce more discussion.

Finespun answered 26/1, 2010 at 18:43 Comment(1)
This has the same problem that most other answers: the compiler does not only need to determine the actual method (block of code) to call, but also modify the this pointer accordingly. The scenario is not as simple as most people consider here.Balustrade

© 2022 - 2024 — McMap. All rights reserved.