How to speed up dynamic dispatch by 20% using computed gotos in standard C++
Asked Answered
P

1

13

Before you down-vote or start saying that gotoing is evil and obsolete, please read the justification of why it is viable in this case. Before you mark it as duplicate, please read the full question.

I was reading about virtual machine interpreters, when I stumbled across computed gotos . Apparently they allow significant performance improvement of certain pieces of code. The most known example is the main VM interpreter loop.

Consider a (very) simple VM like this:

#include <iostream>

enum class Opcode
{
    HALT,
    INC,
    DEC,
    BIT_LEFT,
    BIT_RIGHT,
    RET
};

int main()
{
    Opcode program[] = { // an example program that returns 10
        Opcode::INC,
        Opcode::BIT_LEFT,
        Opcode::BIT_LEFT,
        Opcode::BIT_LEFT,
        Opcode::INC,
        Opcode::INC,
        Opcode::RET
    };
    
    int result = 0;

    for (Opcode instruction : program)
    {
        switch (instruction)
        {
        case Opcode::HALT:
            break;
        case Opcode::INC:
            ++result;
            break;
        case Opcode::DEC:
            --result;
            break;
        case Opcode::BIT_LEFT:
            result <<= 1;
            break;
        case Opcode::BIT_RIGHT:
            result >>= 1;
            break;
        case Opcode::RET:
            std::cout << result;
            return 0;
        }
    }
}

All this VM can do is a few simple operations on one number of type int and print it. In spite of its doubtable usefullness, it illustrates the subject nonetheless.

The critical part of the VM is obviously the switch statement in the for loop. Its performance is determined by many factors, of which the most inportant ones are most certainly branch prediction and the action of jumping to the appropriate point of execution (the case labels).

There is room for optimization here. In order to speed up the execution of this loop, one might use, so called, computed gotos.

Computed Gotos

Computed gotos are a construct well known to Fortran programmers and those using a certain (non-standard) GCC extension. I do not endorse the use of any non-standard, implementation-defined, and (obviously) undefined behavior. However to illustrate the concept in question, I will use the syntax of the mentioned GCC extension.

In standard C++ we are allowed to define labels that can later be jumped to by a goto statement:

goto some_label;

some_label:
    do_something();

Doing this isn't considered good code (and for a good reason!). Although there are good arguments against using goto (of which most are related to code maintainability) there is an application for this abominated feature. It is the improvement of performance.

Using a goto statement can be faster than a function invocation. This is because the amount of "paperwork", like setting up the stack and returning a value, that has to be done when invoking a function. Meanwhile a goto can sometimes be converted into a single jmp assembly instruction.

To exploit the full potential of goto an extension to the GCC compiler was made that allows goto to be more dynamic. That is, the label to jump to can be determined at run-time.

This extension allows one to obtain a label pointer, similar to a function pointer and gotoing to it:

    void* label_ptr = &&some_label;
    goto (*label_ptr);

some_label:
    do_something();

This is an interesting concept that allows us to further enhance our simple VM. Instead of using a switch statement we will use an array of label pointers (a so called jump table) and than goto to the appropriate one (the opcode will be used to index the array):

// [Courtesy of Eli Bendersky][4]
// This code is licensed with the [Unlicense][5]

int interp_cgoto(unsigned char* code, int initval) {
    /* The indices of labels in the dispatch_table are the relevant opcodes
    */
    static void* dispatch_table[] = {
        &&do_halt, &&do_inc, &&do_dec, &&do_mul2,
        &&do_div2, &&do_add7, &&do_neg};
    #define DISPATCH() goto *dispatch_table[code[pc++]]

    int pc = 0;
    int val = initval;

    DISPATCH();
    while (1) {
        do_halt:
            return val;
        do_inc:
            val++;
            DISPATCH();
        do_dec:
            val--;
            DISPATCH();
        do_mul2:
            val *= 2;
            DISPATCH();
        do_div2:
            val /= 2;
            DISPATCH();
        do_add7:
            val += 7;
            DISPATCH();
        do_neg:
            val = -val;
            DISPATCH();
    }
}

This version is about 25% faster than the one that uses a switch (the one on the linked blog post, not the one above). This is because there is only one jump performed after each operation, instead of two.

Control flow with switch: 2 jumps with switch For example, if we wanted to execute Opcode::FOO and then Opcode::SOMETHING, it would look like this: enter image description here As you can see, there are two jumps being performed after an instruction is executed. The first one is back to the switch code and the second is to the actual instruction.

In contrary, if we would go with an array of label pointers (as a reminder, they are non-standard), we would have only one jump: enter image description here

It is worthwhile to note that in addition to saving cycles by doing less operations, we also enhance the quality of branch prediction by eliminating the additional jump.

Now, we know that by using an array of label pointers instead of a switch we can improve the performance of our VM significantly (by about 20%). I figured that maybe this could have some other applications too.

I came to the conclusion that this technique could be used in any program that has a loop in which it sequentially indirectly dispatches some logic. A simple example of this (apart from the VM) could be invoking a virtual method on every element of a container of polymorphic objects:

std::vector<Base*> objects;
objects = get_objects();
for (auto object : objects)
{
    object->foo();
}

Now, this has much more applications.

There is one problem though: There is nothing such as label pointers in standard C++. As such, the question is: Is there a way to simulate the behaviour of computed gotos in standard C++ that can match them in performance?.

Edit 1:

There is yet another down side to using the switch. I was reminded of it by user1937198. It is bound checking. In short, it checks if the value of the variable inside of the switch matches any of the cases. It adds redundant branching (this check is mandated by the standard).

Edit 2:

In response to cmaster, I will clarify what is my idea on reducing overhead of virtual function calls. A dirty approach to this would be to have an id in each derived instance representing its type, that would be used to index the jump table (label pointer array). The problem is that:

  1. There are no jump tables is standard C++
  2. It would require as to modify all jump tables when a new derived class is added.

I would be thankful, if someone came up with some type of template magic (or a macro as a last resort), that would allow to write it to be more clean, extensible and automated, like this:

Putscher answered 8/11, 2019 at 21:47 Comment(12)
AFAIK computed goto's is the way to go. Being only usable with GCC isn't a show stopper for a code base.Thirion
@NathanOliver-ReinstateMonica Although GCC does indeed support many platforms, I fell uncomfortable using non-standard behavior as it can theoretically change at any time.Putscher
I remember someone telling me that switches are implemented in terms of gotos underneath, so it doesn't make sense to me that that would be the case. But I can't verify that. And that's the only productive thing I can give to this conversation.Parashah
@Chipster You are right. switches are essentially gotos. The thing is that the switch is actually two gotos, as you can see in the schematic.Putscher
Which compiler and optimization level are you testing? clang++ 9.0 compiles your switch example to a jump table with and additional check for if no branch is met, with no check if you add a default with builtin unreachable: gcc.godbolt.org/z/ywDqmKGilberto
@Gilberto The thing is, I am confined to using MSVC... Regardless, thanks for mentioning the bounds check - I wanted to mention it, but forgot.Putscher
I'm just waiting for a template wizard to come up with a solution to this ;-) Honestly, the main problem with computed goto is, that the input is usually not well-behaved: A VM that's defined for software emulation typically uses at most 256 different OP-codes, putting a strict bound on the dispatch table size. But general dispatch, like that done with v-tables in C++, does not provide this luxury. The v-tables (= class IDs) can be virtually anywhere in memory, so you cannot create a dispatch table for them. That said, a v-table is a form of computed goto (+ function call overhead).Bhutan
@cmaster The thing is that the function call overhead can be entirely omitted, if instead of calling the functions (and then returning from them) we would jump from one the next next one. I will clarify this in the question, thanks.Putscher
By the way in assembly this trick has a version without a table, by actually computing the address rather than looking it up (requires some padding).Hadlee
@harold It sounds really interesting. Could you post a link to this method?Putscher
@YanB. a special case version was used in this question, I couldn't track down a good reference but it's a "known technique in assembly folklore" I guess? Also you might like thisHadlee
@harold Big thanks for linking to that. I will surely look for more info on the subject, it looks interesting.Putscher
G
5

On a recent versions of MSVC, the key is to give the optimizer the hints it needs so that it can tell that just indexing into the jump table is a safe transform. There are two constraints on the original code that prevent this, and thus make optimising to the code generated by the computed label code an invalid transform.

Firstly in the original code, if the program counter overflows the program, then the loop exits. In the computed label code, undefined behavior (dereferencing an out of range index) is invoked. Thus the compiler has to insert a check for this, causing it to generate a basic block for the loop header rather than inlining that in each switch block.

Secondly in the original code, the default case is not handled. Whilst the switch covers all enum values, and thus it is undefined behavior for no branches to match, the msvc optimiser is not intelligent enough to exploit this, so generates a default case that does nothing. Checking this default case requires a conditional as it handles a large range of values. The computed goto code invokes undefined behavior in this case as well.

The solution to the first issue is simple. Don't use a c++ range for loop, use a while loop or a for loop with no condition. The solution for the second unfortunatly requires platform specific code to tell the optimizer the default is undefined behavior in the form of _assume(0), but something analogous is present in most compilers (__builtin_unreachable() in clang and gcc), and can be conditionally compiled to nothing when no equivalent is present without any correctness issues.

So the result of this is:

#include <iostream>

enum class Opcode
{
    HALT,
    INC,
    DEC,
    BIT_LEFT,
    BIT_RIGHT,
    RET
};

int run(Opcode* program) {
    int result = 0;
    for (int i = 0; true;i++)
    {
        auto instruction = program[i];
        switch (instruction)
        {
        case Opcode::HALT:
            break;
        case Opcode::INC:
            ++result;
            break;
        case Opcode::DEC:
            --result;
            break;
        case Opcode::BIT_LEFT:
            result <<= 1;
            break;
        case Opcode::BIT_RIGHT:
            result >>= 1;
            break;
        case Opcode::RET:
            std::cout << result;
            return 0;
        default:
            __assume(0);
        }
    }
}

The generated assembly can be verified on godbolt

Gilberto answered 8/11, 2019 at 22:36 Comment(8)
s/HALT/NOOP/ ;-)Oswin
@Oswin Just copying the naming in the original code, and that seems slightly orthogonal to the issue of jump tables.Gilberto
Thanks for your answer. After initial benchmarking it does seem to improve performance to a point with a margin of error from computed goto. I would still appreciate if you showed a solution to the generalized problem of invoking virtual functions in a loop (the switch is a specific case of this problem - each opcode could be an object with a virtual function inherited from a common base abstract class).Putscher
By the way, NOOP would have really been a better name for HALT. ;)Putscher
@YanB. The generalized case is much harder. The fact that virtual function calls are an open set as opposed to the closed set of a switch mean that the optimizer can't build a jump table.Gilberto
@Gilberto The optimizer can't do it by itself, but it might be possible to do something (maybe somehow mark the virtual functions with some type of a macro?) to allow it to construct such table.Putscher
Add LTO to the build with appropriate flags and the optimizer may be able to figure out all the virtuals on its own and further improve things.Gidgetgie
I knew it! I was just wondering if there is a more portable way to do this (porting some VM from C to C++), and this is indeed the more portable way - it works the same way on all major compilers. In fact, you can even use std::abort or any other [[noreturn]] (with new C++ standards) function to obtain the same result - although it would add one bound check at the start of the switch before the label dispatchWaddington

© 2022 - 2024 — McMap. All rights reserved.