Common stack-based VM bytecode optimizations?

Ok, I'm posting this fearing that it might be closed before anyone ever reads it - I'm quite used to that - but I'll give it a try... even pointing me in the right direction, or to an existing answer that contains something specific, would definitely do...


So, after this brief intro... I'm currently writing a bytecode interpreter in C (a stack-based VM) for a programming language I have designed.

If you want to have a look at the supported opcodes, feel free to check them out here: https://github.com/arturo-lang/arturo/blob/master/src/vm/opcodes.h

There is nothing really special about the stack machine. Values are being pushed and popped, and operators and functions work on them, pushing the evaluation result back to the stack. So far so good.
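To give an idea of the structure, the core is a plain fetch/dispatch loop over the bytecode. Here is a minimal sketch of such a loop, with hypothetical opcode values and a simplified integer-only Value type - not the actual implementation from opcodes.h:

#include <stdint.h>

/* Hypothetical opcode values and a simplified Value; the real VM
   uses the definitions in opcodes.h. */
typedef int64_t Value;
enum { OP_IPUSH1, OP_IPUSH2, OP_ADD, OP_END };

Value run(const uint8_t *code) {
    Value stack[256];
    int sp = 0;                              /* next free stack slot */
    for (const uint8_t *ip = code; ; ip++) {
        switch (*ip) {
            case OP_IPUSH1: stack[sp++] = 1; break;
            case OP_IPUSH2: stack[sp++] = 2; break;
            case OP_ADD: {                   /* pop two operands, push the sum */
                Value b = stack[--sp];
                Value a = stack[--sp];
                stack[sp++] = a + b;
                break;
            }
            case OP_END:
                return stack[sp - 1];        /* result is left on top */
        }
    }
}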

Now, I'm at the point where all the core functionality is in and I'm trying to give it an extra boost by doing further optimizations.

Here's an example (and hopefully a rather illustrative one).

Input:

fibo: $(x){
    if x<2 {
        return 1
    } {
        return [fibo x-1] + [fibo x-2]
    }
}

i: 0
loop i<34 {
    print "fibo(" + i + ") = " + [fibo i]
    i: i+1
}

Bytecode produced:

|== Data Segment /======================>
0   : [Func  ]= function <5,1>
1   : [Int   ]= 34
2   : [String]= fibo(
3   : [String]= ) = 
==/ Data Segment =======================|

|== Bytecode Listing /======================>
0   :0       JUMP [Dword] 31
1   :5       LLOAD0
2   :6       IPUSH2
3   :7       CMPLT
4   :8       JMPIFNOT [Dword] 20
5   :13      IPUSH1
6   :14      RET
7   :15      JUMP [Dword] 30
8   :20      LLOAD0
9   :21      IPUSH1
10  :22      SUB
11  :23      GCALL0
12  :24      LLOAD0
13  :25      IPUSH2
14  :26      SUB
15  :27      GCALL0
16  :28      ADD
17  :29      RET
18  :30      RET
19  :31      CPUSH0
20  :32      GSTORE0
21  :33      IPUSH0
22  :34      GSTORE1
23  :35      GLOAD1
24  :36      CPUSH1
25  :37      CMPLT
26  :38      JMPIFNOT [Dword] 61
27  :43      CPUSH2
28  :44      GLOAD1
29  :45      ADD
30  :46      CPUSH3
31  :47      ADD
32  :48      GLOAD1
33  :49      GCALL0
34  :50      ADD
35  :51      DO_PRINT
36  :52      GLOAD1
37  :53      IPUSH1
38  :54      ADD
39  :55      GSTORE1
40  :56      JUMP [Dword] 35
41  :61      END

==/ Bytecode Listing =======================|

For anyone who has worked with compilers, bytecode interpreters, or even the JVM, the code above should look familiar.


What I want:

Ideas - general or specific - on how to further optimize my bytecode.

For example, every *2 (that is, an IPUSH2 followed by a MUL instruction) is converted to IPUSH1, SHL, since a shift is a faster operation.
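As a rough illustration, such a peephole pass can run directly over the emitted bytecode. The opcode names below are taken from the listing above, but the byte values, the OP_ prefix and the op_length() width helper are assumptions, not the actual implementation:

#include <stddef.h>
#include <stdint.h>

/* Hypothetical opcode values; the real ones live in opcodes.h. */
enum { OP_IPUSH1, OP_IPUSH2, OP_MUL, OP_SHL, OP_JUMP, OP_JMPIFNOT };

/* Assumed instruction widths: JUMP/JMPIFNOT carry a 4-byte Dword operand
   (offset 0 -> 5 in the listing above), everything else is one byte. */
static size_t op_length(uint8_t op) {
    return (op == OP_JUMP || op == OP_JMPIFNOT) ? 5 : 1;
}

/* Rewrite IPUSH2,MUL in place as IPUSH1,SHL (x * 2  ->  x << 1).
   Both patterns have the same width, so no offsets need patching. */
void peephole_mul2(uint8_t *code, size_t len) {
    size_t i = 0;
    while (i < len) {
        size_t next = i + op_length(code[i]);
        if (next < len && code[i] == OP_IPUSH2 && code[next] == OP_MUL) {
            code[i]    = OP_IPUSH1;   /* shift amount: 1 */
            code[next] = OP_SHL;
        }
        i = next;
    }
}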

What else would you suggest? Is there a list of such optimizations somewhere? Could you suggest something concrete?

Thanks in advance! :)

Abatement answered 17/1, 2020 at 13:14 Comment(2)
Replacing *2 by <<1 is a peephole optimization entirely independent of your bytecode. If that's what you want, you'll find tons of examples of possible optimizations on the web. – Kelm
I would highly recommend writing a compiler (or better, a transpiler) that can emit LLVM IR. From your current output above, this seems like a trivial task. You could then use clang to compile the IR and take advantage of all of its optimization passes. – Ligamentous

The example you give is not a particularly good one, because the performance gain for an interpreter is very low when it performs a shift instead of a multiplication. The overhead of executing a single bytecode instruction at all outweighs the gain of this particular optimization by several orders of magnitude.

The highest performance gain for an interpreter comes from minimizing the number of instructions that need to be executed. For example, combine two successive additions or subtractions on the same register into a single operation whenever possible.
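In a stack machine the same idea amounts to folding constant pushes, e.g. turning x+1+1 (IPUSH1, ADD, IPUSH1, ADD) into a single IPUSH2, ADD. A minimal sketch of such a fold, assuming hypothetical opcode values, a NOP opcode for padding, and a scan restricted to a run of single-byte instructions inside one basic block (so no operand bytes or jump targets are crossed):

#include <stddef.h>
#include <stdint.h>

/* Hypothetical opcode values, including a NOP used as padding so that
   jump targets stay valid; a real pass would re-emit the block and
   patch the offsets instead. */
enum { OP_IPUSH1, OP_IPUSH2, OP_ADD, OP_NOP };

/* Fold IPUSH1,ADD,IPUSH1,ADD (x+1+1) into IPUSH2,ADD (x+2). */
void fold_adds(uint8_t *code, size_t len) {
    for (size_t i = 0; i + 3 < len; i++) {
        if (code[i]     == OP_IPUSH1 && code[i + 1] == OP_ADD &&
            code[i + 2] == OP_IPUSH1 && code[i + 3] == OP_ADD) {
            code[i]     = OP_IPUSH2;
            code[i + 1] = OP_ADD;
            code[i + 2] = OP_NOP;
            code[i + 3] = OP_NOP;
        }
    }
}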

To be able to make this kind of optimization, you should try to identify so-called basic blocks (blocks where either all or none of the instructions are executed, i.e. no jumps into or out of the block happen) and reduce the number of instructions in those blocks by replacing several instructions with a single one while preserving the same code semantics.
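A minimal sketch of finding those block boundaries in your bytecode format, assuming hypothetical opcode values, little-endian Dword jump targets, and the 1-byte/5-byte instruction widths visible in the listing:

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical opcode values and encoding (JUMP/JMPIFNOT = opcode byte
   plus a 4-byte Dword target); see opcodes.h for the real definitions. */
enum { OP_JUMP = 1, OP_JMPIFNOT, OP_RET, OP_END };

static size_t op_length(uint8_t op) {
    return (op == OP_JUMP || op == OP_JMPIFNOT) ? 5 : 1;
}

/* Mark basic-block leaders: the first instruction, every jump target,
   and every instruction following a jump or return. leaders[] must be
   len bytes long and is indexed by bytecode offset. */
void find_leaders(const uint8_t *code, size_t len, uint8_t *leaders) {
    memset(leaders, 0, len);
    leaders[0] = 1;
    for (size_t i = 0; i < len; i += op_length(code[i])) {
        uint8_t op = code[i];
        if (op == OP_JUMP || op == OP_JMPIFNOT) {
            uint32_t target;
            memcpy(&target, code + i + 1, 4);   /* Dword operand */
            if (target < len) leaders[target] = 1;
        }
        if (op == OP_JUMP || op == OP_JMPIFNOT || op == OP_RET)
            if (i + op_length(op) < len) leaders[i + op_length(op)] = 1;
    }
}

Once the leaders are known, each basic block runs from one leader up to (but not including) the next, and instruction-combining passes like the one above can be applied safely within each block.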

If you are really serious about it, you can also try to write a GCC backend for your language that compiles it to your bytecode; this way you can benefit from GCC's sophisticated optimization passes on its intermediate representation (RTL).

Sun answered 17/1, 2020 at 13:27 Comment(0)
