As the first comment said, you have an XY problem here. Sorting / reordering is ok, and you have many objects, not a huge number of different classes, and no need to support types that your code doesn't know about at compile time. Polymorphism + virtual inheritance is the wrong choice.
Instead, use N different containers, one for each type of object, with no indirection. Letting the compiler inline B::Update() into a loop over all the B objects is much better. (For the trivial example below of incrementing one member int, my static performance analysis from looking at the asm puts it at about 24x faster on Skylake with the data hot in L1D cache. AVX2 auto-vectorization vs. call in a loop is really that huge.)
If there were some required order between the objects, including between different types of objects, then some kind of polymorphism or manual dispatching would be appropriate. (e.g. if it mattered what order you processed vecA in, keeping all the B objects separate from all the C objects wouldn't be equivalent.)
If you care about performance, you have to realize that making the source larger might simplify things for the compiler / in the asm output. Checking / dispatching based on the type of each object inside the inner loop is expensive. Using any kind of function pointer or enum to dispatch on a per-object basis can easily suffer from branch mispredicts when you have a mix of different objects.
Looping separately over multiple containers effectively hoists that type check out of the inner loop and lets the compiler devirtualize. (Or better, shrinks each object to not need a vtable pointer, enum, or function pointer in the first place, because its type is statically known.)
Writing out a separate loop for each container with a different type is sort of like fully unrolling a loop over different types after hoisting the type dispatching out of the inner loop. This is necessary for the compiler to inline the calls, which you want if there are a lot of objects of each type. Inlining lets it keep constants in registers across objects, enables SIMD auto-vectorization across multiple objects, and simply avoids the overhead of an actual function call. (Both the call itself and spill/reload of registers.)
You were right that if you did need per-object dispatch, C++ virtual functions are an expensive way to get it when you're using final overrides. You're paying the same runtime cost that would let your code support new derived classes of arbitrary size that it didn't know about at compile time, but not gaining any benefit from that.
Virtual dispatch only works with a level of indirection (e.g. a vector of pointers like you're using), which means you need to manage the pointed-to objects somehow, e.g. by allocating them from vector&lt;B&gt; poolB and vector&lt;C&gt; poolC. I'm not sure whether most implementations of vector&lt;&gt; use realloc() when they need to grow; the new/delete API doesn't have a realloc, so vector may actually copy every time it grows instead of trying to extend the existing allocation in place. Check what your C++ implementation does, since it might suck compared to what you can do with malloc/realloc.
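A minimal sketch of that pooling idea, assuming the question's A/B/C hierarchy (fill_pools is a hypothetical helper name). The reserve() calls matter: growing a vector reallocates, which would invalidate the pointers already stored in vecA.

#include <vector>
#include <cstddef>

std::vector<B> poolB;
std::vector<C> poolC;
std::vector<A*> vecA;

void fill_pools(std::size_t nB, std::size_t nC) {
    poolB.reserve(nB);   // reserve the full size up front so the pools
    poolC.reserve(nC);   // never reallocate and invalidate the pointers
    for (std::size_t i = 0; i < nB; i++) {
        poolB.emplace_back();
        vecA.push_back(&poolB.back());
    }
    for (std::size_t i = 0; i < nC; i++) {
        poolC.emplace_back();
        vecA.push_back(&poolC.back());
    }
}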
And BTW, it should be possible to do the new/delete with RAII with no extra overhead for allocation/deallocation, as long as all your classes are trivially destructible. (But note that unique_ptr may defeat other optimizations for using the vector of pointers.) std::unique_ptr warns that it's UB to destroy it via a pointer to the base class, so you might have to roll your own. Still, on gcc on x86-64, sizeof(unique_ptr&lt;class C&gt;) is only 8, so it only has a single pointer member. But whatever, separately allocating zillions of tiny objects sucks, so don't do it that way in the first place.
If you did need some kind of dispatch like the title asks for
If the objects are all similar sizes, then you really want to loop over objects, not pointers-to-objects. That would avoid the extra cache footprint of a vector of pointers, and it avoids the extra pointer-chasing latency that out-of-order execution has to hide to keep the execution units busy. But C++ virtual inheritance doesn't provide any standards-compliant way to get polymorphism for union upoly { B b; C c; } poly_array[1024];. You can hack this up yourself with reinterpret_cast&lt;&gt; in a way that probably works on x86-64 gcc, but you probably shouldn't. See @BeeOnRope's followup: Contiguous storage of polymorphic types. (Also an older Q&A: C++ polymorphism of a object in an array.)
If you need that, the highest-performance way would probably be to build it yourself with an enum to index a table of function pointers (or use a switch() if your functions can inline). If your functions don't inline, a switch() with a bunch of function-call cases doesn't usually optimize down to a table of function pointers, even if they all have the same args (or no args). You usually get a jump table to a block of call instructions, rather than actually doing an indirect call, so there's an extra jump in every dispatch.
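A hedged sketch of that enum-indexed table (Obj, Kind, update_B, and update_C are hypothetical names, and the function bodies are stand-ins for the real per-type work):

#include <cstdint>
#include <cstddef>

enum class Kind : std::uint8_t { B, C };  // 1 byte instead of an 8-byte vtable pointer

struct Obj { Kind kind; int val; };

static void update_B(Obj *o) { o->val += 1; }  // stand-in for B's Update()
static void update_C(Obj *o) { o->val += 2; }  // stand-in for C's Update()

// table indexed by Kind; keep the order in sync with the enum
static void (*const update_table[])(Obj*) = { update_B, update_C };

void update_all(Obj *objs, std::size_t n) {
    for (std::size_t i = 0; i < n; i++)
        update_table[static_cast<std::size_t>(objs[i].kind)](&objs[i]);
}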
C++17 std::visit with std::variant&lt;B, C&gt; (using non-virtual inheritance for B and C) seems to give you dispatch based on an internal enum. std::visit uses its own jump table to dispatch, even with only 2 possible types, instead of inlining them both and using a conditional branch. It also has to check for the "uninitialized" state all the time. You can get good code if you manually work around that with B *tmp = std::get_if&lt;B&gt;(&my_variant), and a __builtin_unreachable() to tell gcc that nullptr isn't a possibility. But at that point you might as well just roll your own struct polymorph { enum type; union { B b; C c; }; }; (with non-virtual functions) if you don't need an "uninitialized" state. Related: C++ polymorphism of a object in an array.
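A minimal sketch of that get_if + __builtin_unreachable() workaround, using my own non-virtual stand-ins B2 and C2 (no inheritance at all):

#include <variant>

struct B2 { int m_b = 0; void Update() { m_b++; } };
struct C2 { int m_c = 0; void Update() { m_c++; } };

void update_one(std::variant<B2, C2> &v) {
    if (B2 *b = std::get_if<B2>(&v)) {
        b->Update();
    } else if (C2 *c = std::get_if<C2>(&v)) {
        c->Update();
    } else {
        __builtin_unreachable();  // promise gcc/clang the variant is never valueless
    }
}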
In this case you only have one function, so you can put a function pointer inside each object as a member, like void (*m_update)(A* this_object). In the caller, pass a pointer to the object as a void* or A*, since it's a non-member function. The implementation of the function will reinterpret_cast&lt;C*&gt;(this_object). (Not dynamic_cast: we're doing our own dispatching, not using C++'s.)
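A rough sketch of that function-pointer member (hypothetical names; nothing is virtual anywhere, so there's no vtable pointer):

struct Base {
    void (*m_update)(Base *this_object);  // 8 bytes on x86-64
};
struct Cimpl : Base { int m_c = 0; };

static void update_C(Base *this_object) {
    Cimpl *self = reinterpret_cast<Cimpl*>(this_object);  // our own dispatch, not dynamic_cast
    self->m_c++;
}
// setup:  Cimpl c;  c.m_update = update_C;
// caller: for each Base *a in the container,  a->m_update(a);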
If you want to use B and C in other contexts where the function-pointer member would be taking up space for no benefit, you can keep the function pointers in a separate container instead of in the base class. So it would be for(i=0..n) funcptrs[i]( &objects[i] );. As long as your containers don't get out of sync, you're always passing a pointer to a function that knows what to do with it. Use that with union {B b; C c} objects[] (or a vector&lt;union&gt;).

You can use void* if you want, especially if you make a separate array of function pointers. Then the union members don't need to inherit from a common base.
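A sketch of those parallel containers with void* (hypothetical names; B3 and C3 share no base class):

#include <vector>
#include <cstddef>

struct B3 { int m_b = 0; };
struct C3 { int m_c = 0; };

union upoly3 {
    upoly3() {}  // explicit constructor so compilers don't choke (see the union hack below)
    B3 b;
    C3 c;
};

static void update_b(void *p) { static_cast<B3*>(p)->m_b++; }
static void update_c(void *p) { static_cast<C3*>(p)->m_c++; }

std::vector<upoly3> objects;
std::vector<void (*)(void*)> funcptrs;  // parallel to objects: must not get out of sync

void update_all_union() {
    for (std::size_t i = 0; i < objects.size(); i++)
        funcptrs[i](&objects[i]);
}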
You could use std::function&lt;&gt; to store pointers to instance member functions, but on x86-64 gcc that's a 32-byte object. It's better for your cache footprint to only use 8-byte regular function pointers and write code that knows to pass an explicit pointer equivalent to the this pointer.
Putting a function pointer in each object may take more space than an enum or uint8_t, depending on current size/alignment. A small integer index into a table of function pointers might reduce the size of each instance of your objects vs. a pointer member, especially for 64-bit targets. Smaller objects could easily be worth the couple of extra instructions to index an array of function pointers, and the possibly higher mispredict penalty from an extra pointer dereference. Memory / cache misses are often a bottleneck.
I'm assuming you do have some per-instance state, even though you don't show any. If not, then a vector of ordinary function pointers to non-member functions will be much cheaper!
Overhead comparison:
I had a look at the compiler-generated asm (gcc and clang targeting x86-64) for a few ways of doing this.
Source for multiple ways of doing this + asm from x86-64 clang 5.0 on the Godbolt compiler explorer. You can flip it over to gcc, or non-x86 architectures.
#include <vector>

class A {
public:
    virtual void Update() = 0; // A is so pure *¬*
};

struct C : public A {
    int m_c = 0;
public:
    void Update() override final
    { m_c++; }
};
int SC = sizeof(C);  // 16 bytes because of the vtable pointer
C global_c;          // to instantiate a definition for C::Update();

// not inheriting at all gives equivalent asm to making Update non-virtual
struct nonvirt_B //: public A
{
    int m_b = 0;
    void Update() //override final
    { m_b++; }
};
int SB = sizeof(nonvirt_B);  // only 4 bytes per object with no vtable pointer

void separate_containers(std::vector<nonvirt_B> &vecB, std::vector<C> &vecC)
{
    for(auto &b: vecB) b.Update();
    for(auto &c: vecC) c.Update();
}
clang and gcc auto-vectorize the loop over vecB with AVX2 to process 8 int elements in parallel, so if you don't bottleneck on memory bandwidth (i.e. hot in L1D cache), this loop can increment 8 elements per clock cycle. This loop runs as fast as a loop over a vector&lt;int&gt;; everything inlines and optimizes away, and it's just a pointer increment.
The loop over vecC can only do 1 element per clock cycle, because each object is 16 bytes (8-byte vtable pointer, 4-byte int m_c, and 4 bytes of padding to the next alignment boundary, because the pointer has an 8B alignment requirement). Without final, the compiler also checks the vtable pointer to see if it's actually a C before using the inlined C::Update(); otherwise, it dispatches. It's like what you'd get for a loop over struct { int a,b,c,d; } vecC[SIZE]; doing vecC[i].c++.
final allowed full devirtualization, but our data is mixed with vtable pointers, so compilers just do a scalar add [mem], 1, which can only run at 1 per clock (bottlenecked on 1-per-clock store throughput, regardless of the size of the store, if it's hot in L1D cache). This mostly defeats SIMD for this example. (With -march=skylake-avx512, gcc and clang do some ridiculous shuffling or gather/scatter that's even slower than scalar, instead of just loading/storing the whole object and adding with a vector that only changes the int member. That's allowed because the struct doesn't contain any volatile or atomic members, and would run at 2 per clock with AVX2, or 4 per clock with AVX512.) Having your objects up to 12 bytes larger is a major downside if they're small and you have a lot of them.

With multiple members per object, this doesn't necessarily defeat SIMD, but it still costs space in each object, just like an enum or function pointer would.
Since you mentioned the separating axis theorem, I hope you're not planning on storing float x,y pairs in each object. Array-of-structs basically sucks for SIMD, because it needs a lot of shuffling to use the x with the y for the same object. What you want is std::vector&lt;float&gt; x, y or similar, so your CPU can load 4 x values into a register and 4 y values into another register. (Or 8 at a time with AVX.)
See Slides: SIMD at Insomniac Games (GDC 2015) for an intro to how to structure your data for SIMD, and some more advanced stuff. See also the sse tag wiki for more guides. Also, the x86 tag wiki has lots of low-level x86 optimization material. Even if you don't manually vectorize anything, with separate arrays for x and y there's a good chance that the compiler can auto-vectorize for you. (Look at the asm output, or benchmark gcc -O3 -march=native vs. gcc -O3 -march=native -fno-tree-vectorize.) You may need -ffast-math for some kinds of FP vectorization.
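A minimal structure-of-arrays sketch (hypothetical names): separate contiguous arrays for x and y instead of an array of {x, y} structs, so a simple loop can auto-vectorize with no shuffling:

#include <vector>
#include <cstddef>

struct Positions {
    std::vector<float> x, y;  // SoA: all x values contiguous, all y values contiguous
};

// compilers can auto-vectorize this (e.g. 8 floats per AVX vector)
void translate(Positions &pos, float dx, float dy) {
    for (std::size_t i = 0; i < pos.x.size(); i++) {
        pos.x[i] += dx;
        pos.y[i] += dy;
    }
}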
C++ virtual dispatch:
Writing it the way you do in the question, with virtual inheritance and
std::vector<A*> vecA{};

void vec_virtual_pointers() {
    for(auto a: vecA)
        a->Update();
}
We get this loop from clang 5.0 -O3 -march=skylake:
  # rbx = &vecA[0]
.LBB2_1:                          # do{
    mov  rdi, qword ptr [rbx]     # load a pointer from the vector (will be the this pointer for Update())
    mov  rax, qword ptr [rdi]     # load the vtable pointer
    call qword ptr [rax]          # memory-indirect call using the first entry in the vtable
    add  rbx, 8                   # pointers are 8 bytes
    cmp  r14, rbx
    jne  .LBB2_1                  # }while(p != vecA.end())
So the final function pointer is at the end of a chain of three dependent loads. Out-of-order execution lets this overlap between iterations (if the branch predicts correctly), but that's a lot of overhead just in total instructions for the front-end, as well as in mispredict penalty. call [m] is 3 uops, so just the loop itself is 8 uops, and can only issue at one per 2 cycles on Skylake. Call/return has overhead too. If the callee is not totally trivial, we probably don't bottleneck on store-forwarding for pushing / popping the return address; see Loop with function call faster than an empty loop. (I'm not sure about the throughput of independent store/reload operations on the same address. That would normally require memory renaming, which Skylake doesn't do, to not bottleneck on that if the callee is tiny like here.)
Clang's definition for C::Update() is
C::Update():                  # @C::Update()
    inc dword ptr [rdi + 8]
    ret
If this needed to set up any constants before calculating something, it would be even more expensive to not have it inlined. So, with virtual dispatch, this probably runs at about one per 3 to 5 clocks, instead of about 1 member per clock, on Skylake. (Or 8 members per clock with AVX2 for the non-virtual class B, which doesn't waste space and makes auto-vectorization work well.) http://agner.org/optimize/ says Skylake has one per 3 clock call throughput, so let's say a 24x performance loss when the data is hot in L1D cache. Different microarchitectures will be different, of course. See the x86 tag wiki for more x86 perf info.
Union hack:
Probably you should never use this, but you can see from the asm that it will work on x86-64 with clang and gcc. I made an array of unions, and looped over it:
union upoly {
    upoly() {} // needs an explicit constructor for compilers not to choke
    B b;
    C c;
} poly_array[1024];

void union_polymorph() {
    upoly *p = &poly_array[0];
    upoly *endp = &poly_array[1024];
    for ( ; p != endp ; p++) {
        A *base = reinterpret_cast<A*>(p);
        base->Update();   // virtual dispatch
    }
}
A, B, and C all have their vtable at the start, so I think this will generally work. We get asm that's basically the same, with one less step of pointer-chasing. (I used a static array instead of a vector, since I was keeping things simple and C-like while sorting out what to cast.)
.LBB4_1:                                    # do{
    lea  rdi, [rbx + poly_array]            ; this pointer
    mov  rax, qword ptr [rbx + poly_array]  ; load it too, first "member" is the vtable pointer
    call qword ptr [rax]
    add  rbx, 16                            ; stride is 16 bytes per object
    cmp  rbx, 16384                         ; 16 * 1024
    jne  .LBB4_1                            # }while()
This is better, and touches less memory, but it's only slightly better for overhead.
std::function from #include <functional>

It can hold any kind of callable thing, but it has even more overhead than vtable dispatch, because it's allowed to be in an error-if-used state. So the inner loop has to check every instance for that, and trap if it is. Also, sizeof(std::function&lt;void()&gt;) is 32 bytes (on the x86-64 System V ABI).
#include <functional>

// pretty crappy: checks for being possibly unset to see if it should throw().
std::vector<std::function<void()>> vecF{};

void vec_functional() {
    for(auto f: vecF) f();
}
# do {
.LBB6_2: # =>This Inner Loop Header: Depth=1
mov qword ptr [rsp + 16], 0 # store a 0 to a local on the stack?
mov rax, qword ptr [rbx + 16]
test rax, rax
je .LBB6_5 # throw on pointer==0 (nullptr)
mov edx, 2 # third arg: 2
mov rdi, r14 # first arg: pointer to local stack memory (r14 = rsp outside the loop)
mov rsi, rbx # second arg: point to current object in the vector
call rax # otherwise call into it with 2 args
mov rax, qword ptr [rbx + 24] # another pointer from the std::function<>
mov qword ptr [rsp + 24], rax # store it to a local
mov rcx, qword ptr [rbx + 16] # load the first pointer again
mov qword ptr [rsp + 16], rcx
test rcx, rcx
je .LBB6_5 # check the first pointer for null again (and throw if null)
mov rdi, r14
call rax # call through the 2nd pointer
mov rax, qword ptr [rsp + 16]
test rax, rax
je .LBB6_12 # optionally skip a final call
mov edx, 3
mov rdi, r14
mov rsi, r14
call rax
.LBB6_12: # in Loop: Header=BB6_2 Depth=1
add rbx, 32
cmp r15, rbx
jne .LBB6_2
.LBB6_13: # return
add rsp, 32
pop rbx
pop r14
pop r15
ret
.LBB6_5:
call std::__throw_bad_function_call()
jmp .LBB6_16
mov rdi, rax
call __clang_call_terminate
So there are up to three call instructions unless the pointer is nullptr. This looks far worse than virtual dispatch.

It looks a bit different with clang -stdlib=libc++, instead of the default libstdc++ (https://libcxx.llvm.org/), but there are still three call instructions in the inner loop, with conditionals to skip them or throw.

Unless the code-gen is very different for different kinds of function&lt;T&gt;, it's probably not even worth looking at it for pointers to member functions if you can write a more efficient alternative.