What are some obvious optimizations for a virtual machine implementing a functional language?

I'm working on an intermediate language and a virtual machine to run a functional language with a couple of "problematic" properties:

  • Lexical namespaces (closures)
  • Dynamically growing call stack
  • A slow integer type (bignums)

The intermediate language is stack based, with a simple hash-table for the current namespace. Just so you get an idea of what it looks like, here's the McCarthy91 function:

# McCarthy 91: M(n) = n - 10 if n > 100 else M(M(n + 11))
.sub M
    args
    sto n

    rcl n
    float 100
    gt
    .if
        .sub
            rcl n
            float 10
            sub
        .end
        .sub
            rcl n
            float 11
            add
            list 1
            rcl M
            call-fast
            list 1
            rcl M
            tail
        .end
    call-fast
.end

The "big loop" is straightforward:

  • fetch an instruction
  • increment the instruction pointer (or program counter)
  • evaluate the instruction

Along with sto, rcl and a whole lot more, there are three instructions for function calls:

  • call copies the namespace (deep copy) and pushes the instruction pointer onto the call stack
  • call-fast is the same, but only creates a shallow copy
  • tail is basically a 'goto'

The implementation is really straightforward. To give you a better idea, here's a random snippet from the middle of the "big loop" (updated, see below):

    } else if inst == 2 /* STO */ {
        // store the top of the stack in the local slot
        local[data] = stack[len(stack) - 1]
        if code[ip + 1][0] != 3 {
            // the next instruction is not RCL: pop the stored value
            stack = stack[:len(stack) - 1]
        } else {
            // peephole: the next instruction is RCL, so leave the
            // value on the stack and skip the RCL entirely
            ip++
        }
    } else if inst == 3 /* RCL */ {
        // push the value of a local slot
        stack = append(stack, local[data])
    } else if inst == 12 /* .END */ {
        // leave the block: drop its scope and return to the caller
        outer = outer[:len(outer) - 1]
        ip = calls[len(calls) - 1]
        calls = calls[:len(calls) - 1]
    } else if inst == 20 /* CALL */ {
        // push the return address and a shallow copy of the scope
        calls = append(calls, ip)
        cp := make(Local, len(local))
        copy(cp, local)
        outer = append(outer, &cp)
        // the jump target was pushed onto the stack by a preceding rcl
        x := stack[len(stack) - 1]
        stack = stack[:len(stack) - 1]
        ip = x.(int)
    } else if inst == 21 /* TAIL */ {
        // like CALL, but without a return address: effectively a goto
        x := stack[len(stack) - 1]
        stack = stack[:len(stack) - 1]
        ip = x.(int)

The problem is this: Calling McCarthy91 16 times with a value of -10000 takes, near as makes no difference, 3 seconds (after optimizing away the deep-copy, which adds nearly a second).

My question is: What are some common techniques for optimizing interpretation of this kind of language? Is there any low-hanging fruit?

I used slices for my lists (arguments, the various stacks, a slice of maps for the namespaces, ...), so I do this sort of thing all over the place: call_stack[:len(call_stack) - 1]. Right now I really don't have a clue which pieces of code make this program slow. Any tips will be appreciated, though I'm primarily looking for general optimization strategies.
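
For reference, here's that idiom wrapped up as a type; a minimal sketch with invented names (my real code just inlines the slice expressions everywhere):

    // Value stands in for whatever the stack holds; push is amortized
    // O(1) on a slice and pop is O(1).
    type Value interface{}

    type Stack []Value

    func (s *Stack) push(v Value) {
        *s = append(*s, v)
    }

    func (s *Stack) pop() Value {
        old := *s
        v := old[len(old)-1]
        *s = old[:len(old)-1]
        return v
    }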

Aside:

I can reduce execution time quite a bit by circumventing my calling conventions. The list <n> instruction fetches n arguments off the stack and pushes a list of them back onto it; the args instruction pops off that list and pushes each item back onto the stack. This is firstly to check that functions are called with the correct number of arguments, and secondly to be able to call functions with variable argument lists (i.e. (defun f x:xs)). Removing that, and also adding an instruction sto* <x>, which replaces sto <x>; rcl <x>, I can get it down to 2 seconds. Still not brilliant, and I have to have this list/args business anyway. :)
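
In the style of the snippet above, those two instructions look roughly like this (the opcode numbers and the Value element type are made up for illustration):

    } else if inst == 8 /* LIST */ {
        // list <n>: pop n values and push them back as a single list
        lst := make([]Value, data)
        copy(lst, stack[len(stack) - data:])
        stack = append(stack[:len(stack) - data], Value(lst))
    } else if inst == 9 /* ARGS */ {
        // args: pop the list and push each item back onto the stack;
        // this is also where the argument count gets checked
        lst := stack[len(stack) - 1].([]Value)
        stack = append(stack[:len(stack) - 1], lst...)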

Another aside (this is a long question I know, sorry):

Profiling the program with pprof told me very little (I'm new to Go in case that's not obvious) :-). These are the top 3 items as reported by pprof:

  16   6.1%   6.1%       16   6.1% sweep               pkg/runtime/mgc0.c:745
   9   3.4%   9.5%        9   3.4% fmt.(*fmt).fmt_qc   pkg/fmt/format.go:323
   4   1.5%  13.0%        4   1.5% fmt.(*fmt).integer  pkg/fmt/format.go:248

These are the changes I've made so far:

  • I removed the hash table. Instead I'm now passing just pointers to arrays, and I copy the local scope (efficiently) only when I have to.
  • I replaced the instruction names with integer opcodes. Before, I was wasting quite a bit of time comparing strings.
  • The call-fast instruction is gone (the speedup wasn't measurable anymore after the other changes).
  • Instead of having "int", "float" and "str" instructions, I just have one eval, and I evaluate the constants at compile time (compilation of the bytecode, that is). Then eval just pushes a reference to them (see the sketch after this list).
  • After changing the semantics of .if, I could get rid of these pseudo-functions. It's now .if, .else and .endif, with implicit gotos and block semantics similar to .sub.
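
To make the eval change concrete, here's roughly what the compile-time side looks like; a sketch with invented names, not the actual compiler code:

    import "math/big"

    // Compile-time constant pool: each literal in the source is parsed
    // exactly once; at run time, eval <i> just pushes consts[i].
    var consts []interface{}

    func compileLiteral(tok string) int {
        f, _, err := big.ParseFloat(tok, 10, 53, big.ToNearestEven)
        if err != nil {
            panic("bad literal: " + tok)
        }
        consts = append(consts, f)
        return len(consts) - 1 // becomes the operand of eval
    }

In the big loop, eval then reduces to stack = append(stack, consts[data]).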

After implementing the lexer, parser, and bytecode compiler, the speed went down a little, but not terribly so. Calculating MC(-10000) 16 times makes it evaluate 4.2 million bytecode instructions in 1.2 seconds. Here's a sample of the code it generates.

The whole thing is on GitHub.

Mariannemariano answered 12/6, 2012 at 8:17. Comments:
Please do not use hashtables or any other kind of name lookup for lexically scoped languages! It makes no sense at all. Your compiler can allocate registers statically. It is very easy to infer the captured environment set for each lambda abstraction. – Mounting
I don't really understand what you mean yet. Could you whip up a short example, say for "rcl" and "sto"? – Mariannemariano
Use numeric slots for the arguments, variables and closure variables. Introduce opcodes like 'ldarg N', 'starg N', 'ldloc N', 'stloc N', 'ldenv N', 'stenv N'. Resolve the variable names into numbers when compiling. Build the captured-variable lists by comparing the free and bound variable lists at each lambda abstraction point. Introduce a special instruction for building a closure instance (it should be similar to a call, with a list of variables to be captured). This is how most functional languages are implemented (both VM and native), and it can be very efficient. – Mounting
That makes a lot of sense; exactly the kind of thing I was hoping to get. I'll remember to update the question when I get a chance to implement this. Thanks so far! – Mariannemariano
Interesting question, though I suspect the answer would be much more like a book :-). In any case, this question has a long history; you might have more luck phrasing it in terms of optimizing abstract machines. See many of Simon Peyton Jones's (or other Haskellers') papers, and descriptions of things like the STG machine and its optimizations. – Belligerent
@Mounting I got rid of the hashtable, and it made a huge difference. The same code that took three seconds before now takes 0.8 (and, as a bonus, I have a much better idea of where all the variables are). – Mariannemariano
For comparison, that same code runs in 0.25 ms in Common Lisp. So I'm very happy :) – Mariannemariano

There are decades of research on things you can optimize.

Caniff answered 12/6, 2012 at 12:55. Comments:
@StefanoPalazzo you might also look at how to optimize CPS, as this is what many compilers use under the hood. – Belligerent

You should have efficient algorithmic representations for the various concepts of your interpreter. Doing deep copies on a hashtable looks like a terrible idea, but I see that you have already removed that.

(Yes, your stack-popping operations using array slices look suspect. You should make sure they really have the expected algorithmic complexity, or else use a dedicated data structure (... a stack). I'm generally wary of using all-purpose data structures such as Python lists or PHP hashtables for this, because they are not necessarily designed to handle this particular use case well, but it may be that slices do guarantee an O(1) pushing and popping cost under all circumstances.)

The best way to handle environments, as long as they don't need to be reified, is to use numeric indices instead of variable names (de Bruijn indices, where 0 is the variable bound last, or de Bruijn levels, where 0 is the variable bound first). That way you only need to keep a dynamically resized array for the environment, and accessing it is very fast. If you have first-class closures you will also need to capture the environment, which is more costly: you have to copy part of it into a dedicated structure, or use an immutable structure for the whole environment. Only experiment will tell, but my experience is that going for a fast mutable environment structure and paying a higher cost for closure construction is better than having an immutable structure with more bookkeeping all the time; of course you should do a usage analysis to capture only the necessary variables in your closures.
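
For illustration, here is a minimal sketch of the indexed-environment idea in Go; every name here is invented, and a real implementation would also have to decide whether captured slots are copied by value or shared through references:

    // Variables are resolved to integer slots at compile time; the
    // environment is then just a slice.
    type Value interface{}

    type Frame struct {
        slots []Value // arguments and locals, addressed by index
    }

    // Compiled code says "load slot 2" instead of "look up n".
    func (f *Frame) load(i int) Value     { return f.slots[i] }
    func (f *Frame) store(i int, v Value) { f.slots[i] = v }

    // A closure captures only the slots its body actually uses, as
    // computed by a free-variable analysis at compile time.
    type Closure struct {
        entry    int     // bytecode address of the function body
        captured []Value // copied values of the captured slots
    }

    func makeClosure(entry int, f *Frame, captureIdx []int) *Closure {
        c := &Closure{entry: entry, captured: make([]Value, len(captureIdx))}
        for k, i := range captureIdx {
            c.captured[k] = f.slots[i]
        }
        return c
    }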

Finally, once you have rooted out the inefficiency sources related to your algorithmic choices, the hot area will be:

  • garbage collection: definitely a hard topic; if you don't want to become a GC expert, you should seriously consider reusing an existing runtime. You may already be using the GC of your host language (heap allocations in your interpreted language are translated into heap allocations in your implementation language, with the same lifetime); it's not clear from the code snippet you've shown. If so, that strategy is an excellent way to get something reasonably efficient simply.

  • numerics implementation: there are all kinds of hacks for being efficient when the integers you manipulate are in fact small. Your best bet is to reuse the work of people who have invested tons of effort in this, so I strongly recommend reusing, for example, the GMP library. Then again, you may also reuse your host language's bignum support if it has any; in your case, Go's math/big package (a first sketch follows this list).

  • the low-level design of your interpreter loop. In a language with "simple bytecode" such as yours (where each bytecode instruction is translated into a small number of native instructions, as opposed to complex bytecodes with high-level semantics such as the Parrot bytecode), the actual "looping and dispatching on bytecodes" code can be a bottleneck. There has been quite a lot of research on the best way to write such bytecode dispatch loops: how to avoid the cascade of if/then/else (jump tables), how to benefit from the host processor's branch prediction, how to simplify the control flow, etc. This is called threaded code, and there are a lot of (rather simple) different techniques: direct threading, indirect threading... If you want to look into the research, there is for example work by Anton Ertl, The Structure and Performance of Efficient Interpreters in 2003, and later Context threading: A flexible and efficient dispatch technique for virtual machine interpreters. The benefits of those techniques tend to be fairly processor-sensitive, so you should test the various possibilities yourself (a second sketch follows this list).
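
On the numerics point, this is the kind of small-integer fast path I mean; a minimal sketch only, where the Num type and its layout are invented rather than taken from the question's code:

    import "math/big"

    // Num is an invented hybrid number: values stay in an unboxed int64
    // until a result overflows, and only then spill into a *big.Int.
    type Num struct {
        small int64
        big   *big.Int // nil while the value still fits in small
    }

    func (n Num) toBig() *big.Int {
        if n.big != nil {
            return n.big
        }
        return big.NewInt(n.small)
    }

    func Add(a, b Num) Num {
        if a.big == nil && b.big == nil {
            s := a.small + b.small
            // signed-overflow check: without overflow, s > a exactly
            // when b > 0
            if (s > a.small) == (b.small > 0) {
                return Num{small: s}
            }
        }
        return Num{big: new(big.Int).Add(a.toBig(), b.toBig())}
    }

The point is that the common case (small integers) never allocates; only results that actually overflow pay the big.Int cost.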
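
On the dispatch loop: Go offers neither computed goto nor guaranteed tail calls, so classic direct threading cannot be expressed directly, but you can replace the if/else cascade with a dense switch or with a table of handler functions indexed by opcode, which approximates indirect threading. Here is a sketch of the table variant, with invented opcodes and VM fields:

    type Value interface{}

    type Instr struct {
        op   int
        data int
    }

    const (
        opSto = iota
        opRcl
        numOps
    )

    type VM struct {
        ip    int
        code  []Instr
        stack []Value
        local []Value
    }

    // One handler per opcode, indexed directly by the opcode number:
    // the if/else cascade becomes a single indirect call per instruction.
    var handlers = [numOps]func(*VM, int){
        opSto: func(vm *VM, data int) {
            vm.local[data] = vm.stack[len(vm.stack)-1]
            vm.stack = vm.stack[:len(vm.stack)-1]
        },
        opRcl: func(vm *VM, data int) {
            vm.stack = append(vm.stack, vm.local[data])
        },
    }

    func (vm *VM) run() {
        for vm.ip < len(vm.code) {
            in := vm.code[vm.ip]
            vm.ip++
            handlers[in.op](vm, in.data)
        }
    }

Whether the table actually beats a plain switch depends on the processor's indirect-branch prediction, so measure both.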

While the STG work is interesting (and Simon Peyton Jones's book on programming language implementation is excellent), it is somewhat oriented towards lazy evaluation. Regarding the design of efficient bytecode for strict functional languages, my reference is Xavier Leroy's 1990 work on the ZINC machine, The ZINC experiment: An Economical Implementation of the ML Language, which was ground-breaking for the implementation of ML languages and is still in use in the implementation of OCaml: there are both a bytecode compiler and a native compiler, but the bytecode still runs on a glorified ZINC machine.

Grilse answered 16/6, 2012 at 17:33.
