Fastest implementation of a simple, virtual, observer-sort-of pattern in C++?

I'm working my arse off trying to implement an alternative to vtables using enums and a ton of macro magic that's really starting to mess with my brain. I'm starting to think I'm not walking the right path, since the code is getting uglier and uglier and will not be fit for production by any means.

How can the pattern of the following code be implemented with the least amount of redirection/operations?

It has to be done in standard C++, up to C++17.

class A{
    public:
    virtual void Update() = 0; // A is so pure *¬*
};

class B: public A
{
    void Update() override final
    {
        // DO B STUFF
    }
};

class C: public A
{
    void Update() override final
    {
        // DO C STUFF
    }
};

// class...

int main()
{
    std::vector<A*> vecA{};

    // Insert instances of B, C, ..., into vecA

    for(auto a: vecA) // This for will be inside a main loop
        a->Update(); // Ridiculous amount of calls per unit of time

    // Free memory
}

PS: If enum, switch and macros are indeed the best option, I guess I'll simply try to freshen up my caches and come up with a better design.

PPS: I know this is micro-optimization... Heck, I need to nano- or even pico-optimize this (figuratively speaking), so I will simply ignore any utilitarian responses that might come up.

Purehearted answered 5/10, 2017 at 7:4 Comment(16)
The most important question we need to ask first is: Why? Why do you want alternatives? What is the actual problem you are trying to solve by doing that? Have you actually measured that this is a problem? A bottleneck in your system? Because if you haven't, then don't do this! This has a smell of premature optimization, and that is generally very bad. And of course, please read about the XY problem.Edrei
What exactly is the bottleneck in your current code? Are you getting a lot of branch mispredicts from the indirect calls? (you did profile with performance counters, right?) Are you getting cache misses from reading vecA? (i.e. would it be useful to shrink each object by replacing a vtable function pointer with a small integer index?) Are you tuning this for modern big-core x86, or for a microcontroller with weak indirect-branch prediction?Lanthanum
You could simply use std::vector<std::function<void()>> and not have to worry about memory management or inheritance. If you worry about performance, it is worth looking at boost::variant or std::variant if your compiler supports it.Inning
So there are more than 2 possible function pointers you need to pick from? Is the code-size of DO B STUFF and DO C STUFF etc. similar? You could possibly reduce the mispredict penalty by computing the branch destination instead of loading it (if each block of machine code is a constant size, then base + idx*size / jmp reg). Hrm, but not really because you'd be computing based on something you loaded from the vector, so loading a pointer is going to have lower latency.Lanthanum
Can you spend some CPU time sorting vecA to somewhat cluster the objects by type, so the code of DO B STUFF stays hot in the uop cache and branch predictors? How big is DO B STUFF in uops, approximately? Or relative to your L1I cache if you're not tuning for an x86 CPU with a uop cache.Lanthanum
// Free memory -- no. Let RAII take care of that.Appointed
@Someprogrammerdude Did you read the PPS?Purehearted
@Someprogrammerdude The bottleneck is theoretical; this is a hypothetical problem. I'm actually studying this problem and can't find any material on the internet, and I have no books. Of course this is premature optimization, it's on purpose! This is not something so terrible to ask as to deserve a downvote, I believe.Purehearted
@PeterCordes Thank you very much for your questions, they were all very insightful. The tuning and profiling will be done after I know I'm doing this the fastest way this can be THOUGHT of in c++ standard. This is a hypothetical problem, I didn't imagine people would still downvote. I'm clearly involved with the matter and had done all previous research I could! There's a single clear question, idk. Maybe I'm wrong. Thank you very much for the help, It gave me a little bit more perspective on what would I do after I solve this to improve it even more.Purehearted
@PeterCordes Your second and third comments sound more like a solution! I've actually thought of making pools of objects of the same kind. It would add an overhead to the containing class, of which there are a ridiculous amount of instances. Maybe this can be done with an external container, maybe an external map? And yes, I can spend some time sorting vecA. DO B STUFF is as big as the separating axis theorem, or less, since I'll see if I can learn a better algorithm soon.Purehearted
@Appointed There are a ridiculous amount of instances in my container; that's not the point of the problem anyway. I shouldn't, but to answer your concerns, I don't want the overhead of a smart pointer, that's about it.Purehearted
@NicoBerrogorry: You say: profiling will be done after I know I'm doing this the fastest way this can be THOUGHT of in c++ Computer performance doesn't work that way. Context / problem-size can easily change what the fastest way is. Caches, branch predictor history, and what's already in flight in the out-of-order CPU pipeline can affect different ways differently. Although there's probably no way the implementation in the question is ever optimal, since it has two levels of indirection (vector of pointers, and then each object has a pointer to its vtable.)Lanthanum
@PeterCordes So there's a lot of engineering to be done to decide what's the best implementation, and there is no better way in general to do this. I can see now there's a whole lot about processors that I've got to study. I must also research about free/cheap profiling software. I believe the right way to approach this, now, is to implement this using virtual inheritance which is the "solution provided by the language", and meanwhile research other possible implementations for my personal toolbox, so eventually when the bottleneck appears I'll have a variety of approaches available. Ty Peter.Purehearted
@NicoBerrogorry: Actually I think because you don't need any specific order between objects of different classes, separate containers for separate classes, with non-virtual inheritance if you inherit at all, is the way to go. It might make the source larger, but at run-time what you're doing is letting the compiler hoist the type-determination out of the loop (and optimize it away completely). Inlining B::Update() into a loop over vector<B> is much better than any kind of indirect function call. It can even auto-vectorize with SSE / AVX: godbolt.org/g/tL95NXLanthanum
You're amazing Peter, you gave me months of stuff to study. The best solution to me is the comment above. You can put it in an answer so I can accept it, otherwise I can write a compilation of every bit of help here, explaining why juanchopanza's answer is not so good, etc. Tell me what you prefer.Purehearted
I may have got somewhat carried away writing it... :P Also, now I have skylake-avx512 auto-vectorization bugs to submit to gcc and clang. :/ Silly compilers.Lanthanum

As the first comment said, you have an XY problem here. Sorting / reordering is ok, you have many objects rather than a huge number of different classes, and there is no need to support types that your code doesn't know about at compile time. Polymorphism + virtual functions is the wrong choice.

Instead, use N different containers, one for each type of object, with no indirection. Letting the compiler inline B::Update() into a loop over all the B objects is much better. (For the trivial example below of incrementing one member int, my static performance analysis from looking at the asm puts it at about 24x faster on Skylake with the data hot in L1D cache. AVX2 auto-vectorization vs. call in a loop is really that huge.)

If there was some required order between the objects, including between different types of objects, then some kind of polymorphism or manual dispatching would be appropriate. (e.g. if it mattered what order you processed vecA in, keeping all the B objects separate from all the C objects wouldn't be equivalent.)


If you care about performance, you have to realize that making the source larger might simplify things for the compiler / in the asm output. Checking / dispatching based on the type of each object inside the inner loop is expensive. Using any kind of function pointer or enum to dispatch on a per-object basis can easily suffer from branch mispredicts when you have a mix of different objects.

Looping separately over multiple containers effectively hoists that type check out of the inner loop and lets the compiler devirtualize. (Or better, shrinks each object to not need a vtable pointer, enum, or function pointer in the first place, because its type is statically known.)

Writing out a separate loop for each container with a different type is sort of like fully unrolling a loop over different types after hoisting the type dispatching out of the inner loop. This is necessary for the compiler to inline the calls, which you want if there are a lot of objects of each type. Inlining lets it keep constants in registers across objects, enables SIMD auto-vectorization across multiple objects, and simply avoids the overhead of an actual function call. (Both the call itself and spill/reload of registers.)


You were right that if you did need per-object dispatch, C++ virtual functions are an expensive way to get it when you're using final overrides. You're paying the same runtime cost that would let your code support new derived classes of arbitrary size that it didn't know about at compile time, but not gaining any benefit from that.

Virtual dispatch only works with a level of indirection (e.g. a vector of pointers like you're using), which means you need to manage the pointed-to objects somehow, e.g. by allocating them from vector<B> poolB and vector<C> poolC. Although I'm not sure most implementations of vector<> use realloc() when they need to grow; the new/delete API doesn't have a realloc, so vector may actually copy every time it grows, instead of trying to extend the existing allocation in place. Check what your C++ implementation does, since it might suck compared to what you can do with malloc/realloc.

And BTW, it should be possible to do the new/delete with RAII with no extra overhead for allocation/deallocation, as long as all your classes are trivially destructible. (But note that unique_ptr may defeat other optimizations for using the vector of pointers). std::unique_ptr warns that it's UB to destroy it via a pointer to the base class, so you might have to roll your own. Still, on gcc on x86-64, sizeof(unique_ptr<class C>) is only 8, so it only has a single pointer member. But whatever, separately allocating zillions of tiny objects sucks so don't do it that way in the first place.
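
To make the pooling idea above concrete, here is a minimal sketch assuming the question's A, B, and C (the build_pools name and the pool sizes are invented for illustration). With reserve() up front the pools never reallocate, the pointers stay valid, and RAII cleanup of the pools is free:

#include <cstddef>
#include <vector>

// Sketch: the pools own the objects; vecA holds non-owning pointers into them,
// so there is no per-object heap allocation and nothing to delete by hand.
// reserve() up front matters: if a pool reallocates, every pointer into it dangles.
std::vector<B> poolB;
std::vector<C> poolC;
std::vector<A*> vecA;

void build_pools(std::size_t num_b, std::size_t num_c) {
    poolB.reserve(num_b);
    poolC.reserve(num_c);
    for (std::size_t i = 0; i < num_b; i++) vecA.push_back(&poolB.emplace_back());
    for (std::size_t i = 0; i < num_c; i++) vecA.push_back(&poolC.emplace_back());
}

(emplace_back() returning a reference is C++17; with C++14 you'd emplace_back and then take poolB.back()'s address.)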


If you did need some kind of dispatch like the title asks for

If the objects are all similar sizes, then you really want to loop over objects, not pointers-to-objects. That would avoid the extra cache footprint of a vector of pointers, and it avoids the extra pointer-chasing latency that out-of-order execution has to hide to keep the execution units busy. But C++ doesn't provide any standards-compliant way to get polymorphism for union upoly { B b; C c; } poly_array[1024]; You can hack this up yourself with reinterpret_cast<> in a way that probably works on x86-64 gcc, but you probably shouldn't. See @BeeOnRope's followup: Contiguous storage of polymorphic types. (Also an older Q&A: C++ polymorphism of a object in an array).

If you need that, the highest-performance way would probably be to build it yourself with an enum to index a table of function pointers (or use a switch() if your functions can inline). If your functions don't inline, switch() to a bunch of function-call cases doesn't usually optimize down to a table of function pointers even if they all have the same args (or no args). You usually get a jump table to a block of call instructions, rather than actually doing an indirect call. So there's an extra jump in every dispatch.

C++17 std::visit with std::variant<B, C> (using non-virtual inheritance for B and C) seems to give you dispatch based on an internal enum. std::visit uses its own jump table to dispatch, even with only 2 possible types, instead of inlining them both and using a conditional branch. It also has to check for the "uninitialized" state all the time. You can get good code if you manually work around that with B *tmp = std::get_if<B>(&my_variant), and a __builtin_unreachable() to tell gcc that nullptr isn't a possibility. But at that point you might as well just roll your own struct polymorph { enum type; union { B b; C c; }; }; (with non-virtual functions) if you don't need an "uninitialized" state. Related: C++ polymorphism of a object in an array.
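
Here is a minimal sketch of that last "roll your own" option (the polymorph::Tag values and update_all are invented names, and B / C are non-virtual stand-ins for the question's classes):

#include <cstddef>
#include <cstdint>
#include <new>

// Non-virtual stand-ins: no vtable pointer, so each object is just an int.
struct B { int m_b = 0; void Update() { m_b++; } };
struct C { int m_c = 0; void Update() { m_c++; } };

struct polymorph {
    enum class Tag : uint8_t { IsB, IsC } tag;
    union { B b; C c; };                    // both members are trivially destructible
    explicit polymorph(Tag t) : tag(t) {
        if (t == Tag::IsB) new (&b) B{}; else new (&c) C{};   // activate one member
    }
};

// Dispatching on a one-byte tag with a switch lets the compiler inline both
// Update() bodies, instead of indirecting through a table of out-of-line functions.
void update_all(polymorph *objs, std::size_t n) {
    for (std::size_t i = 0; i < n; i++) {
        switch (objs[i].tag) {
        case polymorph::Tag::IsB: objs[i].b.Update(); break;
        case polymorph::Tag::IsC: objs[i].c.Update(); break;
        }
    }
}

This only sketches the layout; anything with non-trivially-destructible members would need explicit destructor handling, which is exactly the bookkeeping std::variant does for you.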

In this case you only have one function, so you can put a function pointer inside each object as a member. Like void (*m_update)(A* this_object). In the caller, pass a pointer to the object as a void* or A*, since it's a non-member function. The implementation of the function will reinterpret_cast<C*>(this_object). (Not dynamic_cast: we're doing our own dispatching, not using C++'s).

If you want to use B and C in other contexts where the function-pointer member would be taking up space for no benefit, you can keep the function pointers in a separate container instead of in the base class. So it would be for(i=0..n) funcptrs[i]( &objects[i] );. As long as your containers don't get out of sync, you're always passing a pointer to a function that knows what to do with it. Use that with union { B b; C c; } objects[] (or a vector<union>).

You can use void* if you want, especially if you make a separate array of function pointers. Then the union members don't need to inherit from a common base.
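
A sketch of those parallel containers (update_b, update_c, Obj, and funcptrs are invented names): each stored function pointer matches the real type of the object at the same index, so the callee does the "dispatch" simply by knowing what it was given.

#include <cstddef>
#include <vector>

// Non-virtual stand-ins for the question's B and C.
struct B { int m_b = 0; void Update() { m_b++; } };
struct C { int m_c = 0; void Update() { m_c++; } };

// Union members don't need a common base class.
union Obj {
    B b;
    C c;
    Obj() : b() {}    // default to an active B; making a C active (e.g. with
                      // placement new) isn't shown in this sketch
};

void update_b(void *p) { static_cast<B*>(p)->Update(); }
void update_c(void *p) { static_cast<C*>(p)->Update(); }

// funcptrs[i] must always match the real type of objects[i].
void update_all(std::vector<Obj> &objects,
                const std::vector<void (*)(void*)> &funcptrs) {
    for (std::size_t i = 0; i < objects.size(); i++)
        funcptrs[i](&objects[i]);
}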

You could use std::function<> to store pointers to instance member functions, but on x86-64 gcc that's a 32-byte object. It's better for your cache footprint to only use 8-byte regular function pointers and write code that knows to pass an explicit pointer equivalent to the this pointer.

Putting a function pointer in each object may take more space than an enum or uint8_t, depending on current size/alignment. A small integer index into a table of function pointers might reduce the size of each instance of your objects vs. a pointer member, especially for 64-bit targets. Smaller objects could easily be worth the couple extra instructions to index an array of function pointers, and the possibly higher mispredict penalty from an extra pointer dereference. Memory / cache misses are often a bottleneck.

I'm assuming you do have some per-instance state, even though you don't show any. If not, then a vector of ordinary function pointers to non-member functions will be much cheaper!


Overhead comparison:

I had a look at the compiler-generated asm (gcc and clang targeting x86-64) for a few ways of doing this.

Source for multiple ways of doing this + asm from x86-64 clang 5.0 on the Godbolt compiler explorer. You can flip it over to gcc, or non-x86 architectures.

#include <vector>

class A{
    public:
    virtual void Update() = 0; // A is so pure *¬*
};

struct C : public A {
    int m_c = 0;
    public:
    void Update() override final
    {  m_c++;  }
};
int SC = sizeof(C);  // 16 bytes because of the vtable pointer

C global_c;  // to instantiate a definition for C::Update();

// not inheriting at all gives equivalent asm to making Update non-virtual
struct nonvirt_B //: public A
{
    int m_b = 0;
    void Update() //override final
    {  m_b++;  }
};
int SB = sizeof(nonvirt_B);  // only 4 bytes per object with no vtable pointer

void separate_containers(std::vector<nonvirt_B> &vecB, std::vector<C> &vecC)
{
    for(auto &b: vecB)        b.Update();
    for(auto &c: vecC)        c.Update();   
}

clang and gcc auto-vectorize the loop over vecB with AVX2 to process 8 int elements in parallel, so if you don't bottleneck on memory bandwidth (i.e. hot in L1D cache), this loop can increment 8 elements per clock cycle. This loop runs as fast as a loop over a vector<int>; everything inlines and optimizes away and it's just a pointer increment.

The loop over vecC can only do 1 element per clock cycle, because each object is 16 bytes (8 byte vtable pointer, 4 byte int m_c, 4 bytes of padding to the next alignment boundary because the pointer has an 8B alignment requirement.) Without final, the compiler also checks the vtable pointer to see if it's actually a C before using the inlined C::Update(), otherwise it dispatches. It's like what you'd get for a loop over struct { int a,b,c,d; } vecC[SIZE]; doing vecC[i].c++;

final allowed full devirtualization, but our data is mixed with vtable pointers, so compilers just do scalar add [mem], 1 which can only run at 1 per clock (bottlenecked on 1 per clock store throughput, regardless of the size of the store if it's hot in L1D cache). This mostly defeats SIMD for this example. (With -march=skylake-avx512, gcc and clang do some ridiculous shuffling or gather/scatter that's even slower than scalar, instead of just loading/storing the whole object and adding with a vector that only changes the int member. That's allowed because it doesn't contain any volatile or atomic members, and would run at 2 per clock with AVX2, or 4 per clock with AVX512.) Having your objects up to 12 bytes larger is a major downside if they're small and you have a lot of them.

With multiple members per object, this doesn't necessarily defeat SIMD, but it still costs space in each object, just like an enum or function pointer would.

Since you mentioned the separating axis theorem, I hope you're not planning on storing float x,y pairs in each object. Array-of-structs basically sucks for SIMD, because it needs a lot of shuffling to use the x with the y for the same object. What you want is std::vector<float> x, y or similar, so your CPU can load 4 x values into a register and 4 y values into another register. (Or 8 at a time with AVX).
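
A minimal structure-of-arrays sketch (the Positions / integrate names are just for illustration):

#include <cstddef>
#include <vector>

// All x values are contiguous and all y values are contiguous, so the CPU can
// load 4 (SSE) or 8 (AVX) of each with a single vector load.
struct Positions {
    std::vector<float> x, y;
};

// Consecutive iterations touch consecutive floats in each array, which makes
// this loop a good auto-vectorization candidate at -O3.
void integrate(Positions &pos, const Positions &vel, float dt) {
    for (std::size_t i = 0; i < pos.x.size(); i++) {
        pos.x[i] += vel.x[i] * dt;
        pos.y[i] += vel.y[i] * dt;
    }
}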

See Slides: SIMD at Insomniac Games (GDC 2015) for an intro to how to structure your data for SIMD, and some more advanced stuff. See also the tag wiki for more guides. Also, the x86 tag wiki has lots of low-level x86 optimization material. Even if you don't manually vectorize anything, with separate arrays for x and y there's a good chance that the compiler can auto-vectorize for you. (Look at the asm output, or benchmark gcc -O3 -march=native vs. gcc -O3 -march=native -fno-tree-vectorize). You may need -ffast-math for some kinds of FP vectorization.


C++ virtual dispatch:

Writing it the way you do in the question, with virtual functions and

std::vector<A*> vecA{};

void vec_virtual_pointers() {
    for(auto a: vecA)
        a->Update();
}

We get this loop from clang5.0 -O3 -march=skylake

   # rbx = &vecA[0]
.LBB2_1:                         # do{
    mov     rdi, qword ptr [rbx]   # load a pointer from the vector (will be the this pointer for Update())
    mov     rax, qword ptr [rdi]   # load the vtable pointer
    call    qword ptr [rax]        # memory-indirect call using the first entry in the vtable
    add     rbx, 8                 # pointers are 8 bytes
    cmp     r14, rbx
    jne     .LBB2_1              # }while(p != vecA.end())

So the final function pointer is at the end of a chain of three dependent loads. Out-of-order execution lets this overlap between iterations (if the branch predicts correctly), but that's a lot of overhead just in total instructions for the front-end, as well as in mispredict penalty. (call [m] is 3 uops, so just the loop itself is 8 uops, and can only issue at one iteration per 2 cycles on Skylake. Call/return has overhead too. If the callee is not totally trivial, we probably don't bottleneck on store-forwarding for pushing / popping the return address; see Loop with function call faster than an empty loop. I'm not sure about the throughput of independent store/reload operations on the same address; that would normally require memory renaming, which Skylake doesn't do, to avoid bottlenecking on that if the callee is tiny like here.)

Clang's definition for C::Update() is

C::Update():                         # @C::Update()
    inc     dword ptr [rdi + 8]
    ret

If this needed to set up any constants before calculating something, it would be even more expensive to not have it inlined. So, with virtual dispatch, this probably runs at about one per 3 to 5 clocks, instead of about 1 member per clock, on Skylake. (Or 8 members per clock with AVX2 for non-virtual class B which doesn't waste space, and makes auto-vectorization work well.) http://agner.org/optimize/ says Skylake has one per 3 clock call throughput, so let's say 24x performance loss when the data is hot in L1D cache. Different microarchitectures will be different, of course. See the tag wiki for more x86 perf info.


Union hack:

Probably you should never use this, but you can see from the asm that it will work on x86-64 with clang and gcc. I made an array of unions, and looped over it:

union upoly {
    upoly() {}   // needs an explicit constructor for compilers not to choke
     B b;
     C c;
} poly_array[1024];

void union_polymorph() {
    upoly *p = &poly_array[0];
    upoly *endp = &poly_array[1024];
    for ( ; p != endp ; p++) {
        A *base = reinterpret_cast<A*>(p);
        base->Update();           // virtual dispatch
    }
}

A, B, and C all have their vtable pointer at the start, so I think this will generally work. We get asm that's basically the same, with one less step of pointer-chasing. (I used a static array instead of a vector, since I was keeping things simple and C-like while sorting out what to cast.)

    lea     rdi, [rbx + poly_array]       ; this pointer
    mov     rax, qword ptr [rbx + poly_array]   ; load it too, first "member" is the vtable pointer
    call    qword ptr [rax]
    add     rbx, 16                       ; stride is 16 bytes per object
    cmp     rbx, 16384                    ; 16 * 1024
    jne     .LBB4_1

This is better, and touches less memory, but it's only slightly better for overhead.


std::function from #include <functional>

It can hold any kind of callable thing. But it has even more overhead than vtable dispatch, because it's allowed to be in an error-if-used state. So the inner loop has to check every instance for that, and throw if it is. Also, sizeof(std::function<void()>) is 32 bytes (on x86-64 System V ABI).

#include <functional>
// pretty crappy: checks for being possibly unset to see if it should throw().
std::vector<std::function<void()>> vecF{};
void vec_functional() {
    for(auto f: vecF)     f();
}

                                # do {
.LBB6_2:                                # =>This Inner Loop Header: Depth=1
    mov     qword ptr [rsp + 16], 0       # store a 0 to a local on the stack?
    mov     rax, qword ptr [rbx + 16]
    test    rax, rax
    je      .LBB6_5           # throw on pointer==0  (nullptr)
    mov     edx, 2            # third arg:  2
    mov     rdi, r14          # first arg: pointer to local stack memory (r14 = rsp outside the loop)
    mov     rsi, rbx          # second arg: point to current object in the vector
    call    rax               # otherwise call into it with 2 args
    mov     rax, qword ptr [rbx + 24]    # another pointer from the std::function<>
    mov     qword ptr [rsp + 24], rax    # store it to a local
    mov     rcx, qword ptr [rbx + 16]    # load the first pointer again
    mov     qword ptr [rsp + 16], rcx
    test    rcx, rcx
    je      .LBB6_5           # check the first pointer for null again (and throw if null)
    mov     rdi, r14
    call    rax               # call through the 2nd pointer
    mov     rax, qword ptr [rsp + 16]
    test    rax, rax
    je      .LBB6_12          # optionally skip a final call
    mov     edx, 3
    mov     rdi, r14
    mov     rsi, r14
    call    rax
.LBB6_12:                               #   in Loop: Header=BB6_2 Depth=1
    add     rbx, 32
    cmp     r15, rbx
    jne     .LBB6_2

.LBB6_13:                       # return
    add     rsp, 32
    pop     rbx
    pop     r14
    pop     r15
    ret

.LBB6_5:
    call    std::__throw_bad_function_call()
    jmp     .LBB6_16
    mov     rdi, rax
    call    __clang_call_terminate

So there are up to three call instructions unless the pointer is nullptr. This looks far worse than virtual dispatch.

It looks a bit different with clang -stdlib=libc++, instead of the default libstdc++. (https://libcxx.llvm.org/). But still three call instructions in the inner loop, with conditionals to skip them or throw.

Unless the code-gen is very different for different kinds of function<T>, it's probably not even worth looking at for pointers to member functions if you can write a more efficient alternative.

Lanthanum answered 6/10, 2017 at 11:19 Comment(5)
You mention that classes like std::unique_ptr don't necessarily introduce any overhead versus say using raw pointers in a container. Unfortunately, this usually isn't true because such types prevent the various type-traits based optimizations that containers use to omit certain operations entirely (e.g., destruction of trivially destructible types) and optimize others (i.e., turning an object-by-object copy into a memcpy).Biotope
@BeeOnRope: Thanks, fixed.Lanthanum
I am interested in the upoly and "union hack" type stuff you mentioned. In particular, I'd like to know of a good way to contiguously store polymorphic objects and made a question on the topic.Biotope
Peter, sorry for taking so long to accept the answer, I was very busy. I'm amazed and motivated by the complexity of the problem. I've read your answer twice already and there's still stuff I don't quite get about the assembly code, of which I have very general knowledge for the time being. I have to thank you for such a complete analysis and all the free love you've put into this. The separate containers solution is easily applicable to my problem, that's what I'll use. And I'll study all that's necessary to fully understand your answer. Cheers.Purehearted
I just ran across PolyContainer which implements some of the stuff you mentioned: in particular segregated storage per type. It has some cool stuff like what they call "type restitution" which means the caller provides the full type of the object at the call site, and then you can iterate over all objects of that type with the compiler knowing the derived type, which can greatly help optimization. Benchmarks too.Biotope

If you really need virtual dispatch, one method to speed up the dispatch for the same virtual method on a list of objects of varying derived types is to use what I'll call type-unswitching.

Somewhat analogously to loop unswitching, this transforms the single loop calling the method on every object in order into N loops (for N supported types) which each call the method on all objects of a specific type. This avoids the primary cost of unpredictable virtual dispatch: the branch mis-predictions implied by the indirect call of an unknown, unpredictable function in the vtable.

The generic implementation of this technique involves a first pass to partition the objects by type: information about this partition is used by the second pass, which has separate loops for each type1, calling the method. This generally doesn't involve any unpredictable branches at all, if implemented carefully.

In the case of two derived classes B and C you can simply use a bitmap to store the type information. Here's an example implementation, using the types A, B, C from the code in the question:

#include <cstddef>
#include <cstdint>
#include <vector>

// Note: for simplicity this assumes vec.size() is a multiple of 64.
void virtual_call_unswitch(std::vector<A*>& vec) {

    // first create a bitmap which specifies whether each element is B or C type
    std::vector<uint64_t> bitmap(vec.size() / 64);

    for (size_t block = 0; block < bitmap.size(); block++) {
        uint64_t blockmap = 0;
        for (size_t idx = block * 64; idx < block * 64 + 64; idx++) {
            blockmap >>= 1;    
            blockmap |= (uint64_t)vec[idx + 0]->typecode_ << 63;
        }
        bitmap[block] = blockmap;
    }

    // now loop over the bitmap handling all the B elements, and then again for all the C elements

    size_t blockidx;
    // B loop
    blockidx = 0;
    for (uint64_t block : bitmap) {
        block = ~block;
        while (block) {
            size_t idx = blockidx + __builtin_ctzl(block);
            B* obj = static_cast<B*>(vec[idx]);
            obj->Update();
            block &= (block - 1);
        }
        blockidx += 64;
    }

    // C loop
    blockidx = 0;
    for (uint64_t block : bitmap) {
        while (block) {
            size_t idx = blockidx + __builtin_ctzl(block);
            C* obj = static_cast<C*>(vec[idx]);
            obj->Update();
            block &= (block - 1);
        }
        blockidx += 64;
    }
}

Here, typecode_ is a common field in A which identifies the object type, 0 for B and 1 for C. Something similar is needed to make the categorization by type feasible (it can't be a virtual call, since making an unpredictable call is what we're trying to avoid in the first place).

A slightly optimized version of the above shows about a 3.5x speedup for the unswitched version over the plain virtually dispatched loop, with the virtual version clocking in about 19 cycles per dispatch, and the unswitched version at around 5.5. Full results:

-----------------------------------------------------------------------------
Benchmark                                      Time           CPU Iterations
-----------------------------------------------------------------------------
BenchWithFixture/VirtualDispatchTrue       30392 ns      30364 ns      23033   128.646M items/s
BenchWithFixture/VirtualDispatchFakeB       3564 ns       3560 ns     196712   1097.34M items/s
BenchWithFixture/StaticBPtr                 3496 ns       3495 ns     200506    1117.6M items/s
BenchWithFixture/UnswitchTypes              8573 ns       8571 ns      80437   455.744M items/s
BenchWithFixture/StaticB                    1981 ns       1981 ns     352397    1.9259G items/s

VirtualDispatchTrue is the simple loop calling Update() on a pointer of type A:

for (A *a : vecA) {
    a->Update();
}

VirtualDispatchFakeB casts the pointer to B* (regardless of what the underlying type is) before calling Update(). Since B::Update() is final, the compiler can fully de-virtualize and inline the call. Of course, this isn't doing the right thing at all: it's treating any C objects as B and so calling the wrong method (and is totally UB) - but it's here to estimate how fast you could call methods on a vector of pointers if every object was the same statically known type.

for (A *a : vecA) {
    ((B *)a)->Update();
}

StaticBPtr iterates over a std::vector<B*> rather than a std::vector<A*>. As expected the performance is the same as the "fake B" code above, since the target for Update() is statically known and fully inlinable. It's here as a sanity check.

UnswitchTypes is the type unswitching trick described above.

StaticB iterates over a std::vector<B>. That is, contiguously allocated B objects rather than a vector of pointers to B objects. This removes one level of indirection and shows something like the best case for this object layout2.

The full source is available and released into the public domain.

Limitations

Side-effects and Order

The key limitation with this technique is that the order of Update() calls shouldn't matter. While Update() is still called once on each object, the order has clearly changed. As long as the object doesn't update any mutable global or shared state, this should be easy to satisfy.

Supports Only Two Types

The code above supports only two types, based on the use of a bitmap to record type information.

This restriction is fairly easy to remove. First, the bitmap approach can be extended. E.g., to support 4 types, two similar bitmaps can be created for which the corresponding bits of each bitmap essentially form a 2-bit field encoding the type. The loops are similar, except that in the outer loop they & and ~ the bitmaps together in ways that cover all 4 types. E.g.:

// type 1 (code 11)
for (size_t i = 0; i < bitmap1.size(); i++) {
        block = bitmap1[i] & bitmap2[i];
        ...
}


// type 2 (code 01)
for (size_t i = 0; i < bitmap1.size(); i++) {
        block = ~bitmap1[i] & bitmap2[i];
        ...
}

...

Another approach is to not use bitmaps at all, but simply store an array of indexes per type. Each index in an array points to an object of that type in the master array. Essentially it's a 1-pass radix sort on the type code. This probably makes the type sorting part a bit slower, but potentially speeds up the loop iteration logic (the x & (x - 1) and ctz stuff disappears, at the cost of another indirection).
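
A sketch of that index-array variant (the function and bucket names are invented), reusing A, B, C and the typecode_ field from the code above:

// One pass partitions the indexes by type; the per-type loops then use
// static_cast so the (final) Update() calls devirtualize and can inline.
void virtual_call_unswitch_indexes(std::vector<A*>& vec) {
    std::vector<uint32_t> b_idx, c_idx;
    b_idx.reserve(vec.size());
    c_idx.reserve(vec.size());

    for (uint32_t i = 0; i < vec.size(); i++) {
        if (vec[i]->typecode_ == 0) b_idx.push_back(i);   // 0 == B, as above
        else                        c_idx.push_back(i);   // 1 == C
    }

    for (uint32_t i : b_idx) static_cast<B*>(vec[i])->Update();
    for (uint32_t i : c_idx) static_cast<C*>(vec[i])->Update();
}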

Fixed Number of Supported Types

The code above supports a fixed number of compile-time known types (namely, B and C). If a new type is introduced, the code above will either break or will certainly fail to call Update() on these new types.

However, it is straightforward to add support for unknown types. Simply group all unknown types, and then for those types only, do a full virtual dispatch within the loop (i.e., call Update() directly on the A*). You'll pay the full price, but only for types which you didn't explicitly support! In this way, the technique retains the generality of the virtual dispatch mechanism.

PolyCollection

You might be interested in Boost's PolyCollection. It is basically a vector-like container specialized for this case: holding objects of varying polymorphic types and iterating over them in an efficient way.

It supports virtual method based polymorphism, but also function-like object polymorphism and duck-typing based polymorphism. It implements the "unswitching" described above by keeping the various object types segregated in storage: so it doesn't keep the insertion order between objects of different types. If it meets your needs, it could be a ready-made solution.


1 Actually, you need only one loop per group of types that all share the same implementation of the virtual method, although this might be hard to implement in a generic manner since this information isn't readily available. For example if classes Y and Z both derive from X, but neither overrides the implementation of some virtual method from X, then all of X, Y and Z can be handled by the same loop.

2 By "object layout" I mean B objects that still have virtual methods and hence a vtable. If you remove all the virtual methods and get rid of the vtable, things go much faster since the compiler then vectorizes the addition to the compactly arranged fields. The vtable messes that up.

Biotope answered 9/10, 2017 at 1:51 Comment(23)
You suggest using arrays of indices (one per type) instead of a bitmap. Even better, remove a level of indirection by copying pointers out of the vector<A*> into separate containers of vector<B*> and vector<C*>. After the sort pass, you're doing StaticBPtr. (This may double the storage requirement, if 32-bit indices would have been ok but now you need 64-bit pointers.)Lanthanum
BTW, using uint32_t block may be better, because it shortens the loop-carried tzcnt / blsr dependency chain, and the shift/OR dep chain when sorting. This allows more OOOE overlap between blocks. (Also, it isn't horrible on 32-bit CPUs, in case anyone cares about ARM32 or something.)Lanthanum
I'm not following the comment about the tzcnt/blsr dependency chain? There is only one carried dependency chain, through blsr, the rest of the chains hang off of that one and are short as far as I can see. In any case, I don't get how using a smaller block size helps. The main reason to use a larger block size is to cut down the mispredicts on block exit. About the initial bitmap generation, it runs at > 4 IPC on my box so there isn't that much room for improvement there. @PeterCordesBiotope
Oops, tzcnt isn't loop-carried, I was picturing something stupid like incrementing a bit-position counter by tzcnt and using that to clear the bit, not blsr. Good point about fewer mispredicts with larger blocks, since it's not a hard-to-predict branch. You could do that branchlessly, with a load+cmov every time, but that would be far too much work that gets thrown away every time. For the partition bitmap loop, I guess it's typical that gcc's loops bottleneck on the front-end because it doesn't unroll at all by default. I didn't look at the asm, though.Lanthanum
@PeterCordes Right, well the inner loop exit is "hard" to predict (i.e., it will always be predicted wrong), but it happens only every ~32 iterations in this case. Yes, I've tried the cmov approach, and it can pay off if the number of inner loop iterations (i.e., the popcount of the bitmap) is small, like say 5. Here, it is around 32, so the mispredict pays for itself with simpler code.Biotope
One approach I've tried that can make the cmov approach work is to unroll the inner loop say 4 times. This amortizes the exit check and lets you use the cmov strategy rather than an inner loop. The problem, of course, is that the unrolling does the "wrong thing" when block becomes zero in the middle of the unrolled part. In some scenarios the wrong thing does no harm (other than a bit of waste), but here it would call Update() on blockidx + 64 which is definitely wrong. Sometimes you can "undo" the wrong thing in the part following the unrolled code, but I don't see a way here.Biotope
@PeterCordes - good point about just copying the pointers! It might increase the storage by a lot more than 2x though - you might have been able to use 8-bit or 16-bit indexes if they are deltas (with a fallback slowpath when a delta happens to be too large). In the 8-bit case, for example, you could load 8 indexes with a 64-bit read which "mostly" removes the indirection, at the cost of some shifting and masking to use the read 8 bytes at a time. I should try the pure pointer approach though.Biotope
Yeah, a branchy approach looks good here, with or without unrolling. (Unrolling spreads the branch pattern over multiple branches, unless you actually use popcnt to check that there are a multiple of 4 bits left or something.)Lanthanum
You could in principle still use smaller than 64-bit pointers: store the delta from the first or previous pointer. Of course it might fail if two pointers are far apart, but for "typical" allocators and application footprints it would work (and you could have a fallback path if it failed).Biotope
In the OP's case, the objects themselves might all be allocated from the same pool, so you could maybe use small indices into a Bpool and Cpool. But at that point you might as well just loop over Bpool and Cpool! Oh, but if you sometimes free a pointer, then yeah. Deltas from the previous pointer turns it into a linked-list traversal as far as data-dependent loads. Good locality, though.Lanthanum
@PeterCordes - about spreading the branches, I was actually talking about a different kind of unrolling: without checks in-between the unrolled instructions (i.e., the unrolled part just has a tzcnt, blsr, add [mem], 1 pretty much). Of course, it's not correct as above! You could unroll it properly with a check after each blsr, but it doesn't help much? I don't see spreading the branch pattern helping a lot here. If there's a pattern the single branch should work fine (and when global history is used the approaches are basically equivalent, but more branches waste BTB entries).Biotope
Deltas doesn't turn it into a linked list traversal. The read of the next delta is not dependent on the prior. Linked list traversal is bad (i.e., embeds the load latency) because the next pointer is stored at the location pointed to by the prior one. That's not the case here where in the vector<B*> all the pointers are side-by-side.Biotope
Oh right, silly me again. you're doing Bidx += Bdeltas[i], so there's still a new loop-carried dependency, but it's only on an add not a load. Makes it suck for SIMD gather, though.Lanthanum
For SIMD gather you could make the deltas be based on the element at position cur - N rather than cur - 1, where N is the vector-width (in pointer sized units). Then you can load a vector of deltas and add the prior pointers, and gather/scatter to your hearts content. Note also I said deltas from the first or previous pointer. SIMD only has an issue with "previous". If you just store all deltas from a single fixed ("first") pointer there is no issue (just a sub against a vector with the reference point broadcast to each element).Biotope
Re: branch-prediction patterns: Skylake correctly predicts loop-exits for loops with 22 or fewer iterations, but 24 or more is pretty much 1 mispredict per loop. So unrolling could actually help if there's a consistent pattern that's too long for a rolled-up loop to catch. Another advantage of unrolling with branches is that not-taken branches are better for the front-end than taken. But I think a popcnt / checking to set up for an unrolled loop with only one loop-branch is even better.Lanthanum
The only real advantage of the "previous" approach is that the deltas might tend to be smaller if there is some trend in the addresses (e.g., increasing addresses from elements allocated back-to-back). In that case, you might be able to fit your delta into smaller elements if you use "previous".Biotope
Yeah, delta-compressing the indices to save cache footprint sounded like an interesting idea worth considering. Good point that cur-N will still solve it.Lanthanum
@PeterCordes - well the loop exit condition is totally unpredictable here, but lets consider your claim in a case where it is predictable (exits after a fixed number of iterations). Now the behavior you mention might have been true for a single branch in a loop, but that doesn't necessarily mean you can extrapolate that to a loop with several branches. In particular, modern Intel seems to be using global history, so splitting up the branches doesn't actually change the history that each branch sees!Biotope
That is, if you have 4 branches in your loop instead of 1, each branch sees in its history the pattern of the other three branches, so the required history is "just as long" to predict the exit correctly. This is different than the case of a fixed 32-iteration loop. There you can break it into two separate 16-iteration loops and both may be predicted correctly, because each loop still only has one branch and the length of the required history has actually been lessened.Biotope
Ah, good point, I haven't tried with a not-taken branch inside the loop. So a popcnt setup for an unrolled loop might have an even bigger advantage. The trick is to avoid mispredicts before a bunch of useful work gets into the pipeline... Maybe for(setbits=popcnt(block); setbits >= 4; setbits -= 4){ } while(setbits--) {} so the last-3 cleanup happens after the mainloop, and the branch before the first unrolled iteration only mispredicts if there were fewer than 4 bits set.Lanthanum
Here's an extreme idea (not in any way pure C++ anymore): rather than using a bitmap or index/pointer lists, just generate the actual code to be executed as one big contiguous program. That is, in the "sorting" step generate the code for Update(), with the address hardcoded in. Pretty much just an add or sub in this case. Then just execute it! It's a case where JITing code that may only ever run once could still be faster than the alternatives. It mostly works if the code lengths are the same/similar, since otherwise the copy needs a branch.Biotope
That's only sane if there's something stopping you from just using separate containers for B and C objects (not pointers), so you can write optimized loops over a container of contiguous B. The overhead of doing that when the container changes is probably similar to managing separate containers. The main thing it gains is extremely good performance when there's a required order between B and C objects, so you can't just do all-the-Bs then all-the-Cs.Lanthanum
Yes for sure. It is "useful" only if the order matters, or because of some external constraint you cannot change you end up with an array of pointers.Biotope
