Clean, self-contained VM implemented in C and under 100-200K compiled code size?
M

7

25

I'm looking for a VM with the following features:

  • Small compiled code footprint (under 200K).
  • No external dependencies.
  • Unicode (or raw) string support.
  • Clean code/well organized.
  • C(99) code, NOT C++.
  • C/Java-like syntax.
  • Operators/bitwise: AND/OR, etc.
  • Threading support.
  • Generic/portable bytecode. Bytecode should work on different machines even if it was compiled on a different architecture with different endianness etc.
  • Barebones, nothing fancy necessary. Only the basic language support.
  • Lexer/parser and compiler separate from VM. I will be embedding the VM in a program and then compile the bytecode independently.
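
The portability requirement in the list above (bytecode that loads on any machine regardless of the endianness of the machine that compiled it) usually comes down to fixing the byte order in the file format and assembling multi-byte values with shifts instead of casting pointers. A minimal sketch, assuming a made-up 8-byte header (magic `BVM1` followed by a 32-bit little-endian code length); the names `put_u32_le`, `get_u32_le`, `bc_check_header` and `BC_HEADER_SIZE` are hypothetical and not taken from any of the VMs discussed here:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical header layout: 4 magic bytes "BVM1" + u32 code length (LE). */
#define BC_HEADER_SIZE 8u

/* Store a 32-bit value as four bytes, least significant byte first.
 * The result is identical on big- and little-endian hosts. */
void put_u32_le(uint8_t *out, uint32_t v)
{
    out[0] = (uint8_t)(v & 0xFFu);
    out[1] = (uint8_t)((v >> 8) & 0xFFu);
    out[2] = (uint8_t)((v >> 16) & 0xFFu);
    out[3] = (uint8_t)((v >> 24) & 0xFFu);
}

/* Rebuild the value from bytes with shifts; no pointer casts, so host
 * byte order and alignment do not matter. */
uint32_t get_u32_le(const uint8_t *in)
{
    return (uint32_t)in[0]
         | ((uint32_t)in[1] << 8)
         | ((uint32_t)in[2] << 16)
         | ((uint32_t)in[3] << 24);
}

/* Validate the header before trusting anything else in the buffer.
 * Returns the code length on success, -1 if the buffer is too short,
 * the magic is wrong, or the declared length overruns the buffer. */
int64_t bc_check_header(const uint8_t *buf, size_t len)
{
    uint32_t code_len;

    assert(buf != NULL);
    if (len < BC_HEADER_SIZE)
        return -1;
    if (buf[0] != 'B' || buf[1] != 'V' || buf[2] != 'M' || buf[3] != '1')
        return -1;
    code_len = get_u32_le(buf + 4);
    if (code_len > len - BC_HEADER_SIZE)
        return -1;
    return (int64_t)code_len;
}
```

A loader built this way never depends on the host's word size or byte order, and rejecting a bad header up front addresses the kind of validation gap mentioned below for Pawn.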

So far I have reviewed Lua, Squirrel, Neko, Pawn, Io, AngelScript... The only one that comes somewhat close to the spec is Lua, but its syntax is horrible, it lacks bitwise operators, and the code style is generally poor. Squirrel and Io are mostly huge. Pawn is small but problematic: its bytecode is not cross-platform, and the implementation has some serious issues (e.g. bytecode is not validated at all, not even the headers, AFAIK).

I would love to find a suitable option out there.

Thanks!

Update: JavaScript interpreters are... interpreters. This question is about a bytecode-based VM, hence the requirement that the compiler and the bytecode VM be separate. JS is interpreted, and only seldom compiled by a JIT. I don't necessarily want a JIT. Also, all current ECMAScript parsers are anything but small.

Monoatomic answered 12/3, 2011 at 0:49 Comment(9)
"The syntax is horrible" isn't very helpful as to guessing what kind of language you do want.Moffatt
@larsmans: He states what kind of language he's looking for: "C/Java-like syntax"Chequered
As I said, C/Java-like syntax. I don't need class inheritance/complex OO support honestly, but Lua's syntax is a dealbreaker. I want to have something attractive for developers coming from a C or Java background. Similar to Pawn/Squirrel's syntax sans the aforementioned "extra candy".Monoatomic
Have a look at V8 (JavaScript) and tcc (ANSI C). Maybe you can VMize one of them ...Hydromel
I'm checking at the moment, but tcc does not qualify, even if it's a cool project, it compiles to native code. I'll look into it but my question was geared towards finding an existent project, not one I can 'adapt' with significant work (ex. to LUA vm).Monoatomic
TCC is nice but would definitely require a lot of work before it could produce bytecode for an existing VM :( Pawn, for that matter, would be an option, but the bytecode parser still looks bad (I wouldn't trust it as it is now).Monoatomic
Heh, I had a project a while back similar to this. I never did get the VM complete and never even got a parser/language support started however :(Elsey
Io's core is about 400 lines of code, including evaluator. Hardly huge. It's the libraries that consume much of the rest. However it's not a bytecode VM, it's a tree walker.Gender
Well, I can offer a look at my VM then. I made it to run C code as scripts instead of programs. It's still a work in progress, but maybe you can help me with direction -> github.com/assyrianic/C-Virtual-MachineFillagree
M
4

In the end, after all this time, none of the answers really fit. I ended up forking Lua. As of today, no self-contained VM meeting the above requirements exists... it's a pity ;(

Nonetheless, Pawn is fairly nice, if only the code weren't kind of problematic.

Monoatomic answered 28/4, 2011 at 5:17 Comment(5)
Is your Lua fork freely available? Or can you share the mods you made?Diastyle
Soze, I'd be interested in that tooMembranous
According to their respective sites: the Lua source contains around 20000 lines of C, and under Linux, the Lua interpreter built with all standard Lua libraries takes 182K and the Lua library takes 243K. The Squirrel compiler and virtual machine together fit in about 7k lines of C++ code and add only around 100kb-150kb to the executable size. So why did you say Squirrel is 'huge'?Cloakanddagger
At the moment of testing I believe that was not the case, and also the functionality was not in parallel to Lua, even if Squirrel's syntax is nicer, as-is.Monoatomic
As for the mod: it has never been published, but if I have the time I might try submitting patches upstream, or passing them on to someone who will. It is likely they won't be accepted, though. The Lua maintainers have never been receptive to people splitting the lexer and compiler apart from the interpreter. There is also the issue of bytecode portability across architectures: Lua does not have a big-endian/little-endian agnostic interpreter.Monoatomic
P
6

You say you've reviewed NekoVM, but don't mention why it's not suitable for you.

It's written in C, not C++, the VM is under 10kLOC with a compiled size of roughly 100kB, and the compiler is a separate executable producing portable bytecode. The language itself has C-like syntax, bitwise operators, and it's not thread-hostile.

Pain answered 12/3, 2011 at 2:2 Comment(5)
It depends on the Boehm GC which I want to avoid. ("No external dependencies.") During my review, the API in alloc.c seems non trivial to replace with something else. Furthermore, the tinygc version of the Boehm GC does not support the necessary API as far as I know :(Monoatomic
Since the embedded language of NekoVM looks a lot like Javascript, it would be wise to go for Google V8 instead of NekoVM. Think about documentation, support and existing VM's using V8 (node.js).Noam
@Noam V8 doesn't satisfy OP's other requirements. It's written in C++ and has a much larger (4MB) compiled code size.Pain
@Pain That depends completely on how you use V8 and which parts of it. Besides, the almost 4 MB difference is relatively small if you have to link the VMs with libraries. Both VMs are completely empty except for the built-in types. As for the requirements, both NekoVM and V8 fail almost every one of them. I just wanted to add, for anyone who saw your answer and was looking for something like it, that V8 would be the better choice (in my opinion). :)Noam
@Noam I like V8 too, but really, NekoVM is pure-C and compiles down to ~100KB, which was important to the original poster. V8 is nowhere close to that -- the 4MB binary size is counting just V8, not other libraries.Pain
K
4

JerryScript:

  • requires less than 64 KB of RAM
  • ~160 KB binary size
  • written in C99
  • VM based
  • has bytecode precompilation

IoT JavaScript glues JerryScript with libuv (nodejs style) - it may be easier to play with.

Threading is probably not there in a state you want. There are recent additions to ECMAScript around background workers on separate threads and shared, cross-thread buffers - not sure what's the story with it in JerryScript - probably not there yet, but who knows - they have a blueprint for how to do it, may not be far.

Kugler answered 12/4, 2017 at 15:43 Comment(0)
S
3

For something very "barebones" :

http://en.wikibooks.org/wiki/Creating_a_Virtual_Machine/Register_VM_in_C

More of a short introduction to the topic than anything else, granted.

Yet, it probably meets at least these few of the desired criteria :

  • Small compiled code footprint (under 200K) ... check, obviously;
  • No external dependencies ... check;
  • Clean code/well organized ... check;
  • C(99) code, NOT C++ ... check;
  • C/Java-like syntax ... check.
Sternlight answered 14/4, 2014 at 9:44 Comment(1)
Not in scope. C was the syntax wanted, not the implementation language.Monoatomic
M
1

Try EmbedVM.

http://www.clifford.at/embedvm/

http://svn.clifford.at/embedvm/trunk/

Here's an example of some code, a guessing game. The compiler is built in C with lex+yacc:

global points;

function main()
{
    local num, guess;
    points = 0;
    while (1)
    {
        // report points
        $uf4();

        // get next random number
        num = $uf0();
        do {
            // read next guess
            guess = $uf1();
            if (guess < num) {
                // hint to user: try larger numbers
                $uf2(+1);
                points = points - 1;
            }
            if (guess > num) {
                // hint to user: try smaller numbers
                $uf2(-1);
                points = points - 1;
            }
        } while (guess != num);

        // level up!
        points = points + 10;
        $uf3();
    }
}

There isn't any threading support. But there's no global state in the VM, so it's easy to run multiple copies in the same process.

The API is simple. VM RAM is accessed via callbacks. Your main loop calls embedvm_exec(vmdata) repeatedly; each call executes a single operation and returns.
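
To illustrate that call pattern (memory reached through callbacks, one operation per call, all state in a struct rather than in globals), here is a rough sketch. It is not EmbedVM's actual API or instruction set: `toy_vm`, `toy_vm_step`, `toy_vm_run`, the `OP_*` opcodes and the `flat_ram` helpers are all hypothetical names invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy VM, NOT EmbedVM's real API. All state lives in this
 * struct, so several instances can coexist in one process. */
struct toy_vm {
    uint16_t ip;      /* instruction pointer */
    uint8_t acc;      /* single accumulator register */
    int halted;       /* set by OP_HALT or by an unknown opcode */
    void *user_ctx;   /* handed back to the callbacks untouched */
    uint8_t (*mem_read)(uint16_t addr, void *ctx);
    void (*mem_write)(uint16_t addr, uint8_t value, void *ctx);
};

enum { OP_HALT = 0, OP_LOAD_IMM = 1, OP_ADD_IMM = 2, OP_STORE = 3 };

/* Execute exactly one instruction, then return control to the host. */
void toy_vm_step(struct toy_vm *vm)
{
    uint8_t op, arg;

    assert(vm != NULL && vm->mem_read != NULL && vm->mem_write != NULL);
    if (vm->halted)
        return;
    op = vm->mem_read(vm->ip++, vm->user_ctx);
    switch (op) {
    case OP_LOAD_IMM:                 /* acc = immediate */
        vm->acc = vm->mem_read(vm->ip++, vm->user_ctx);
        break;
    case OP_ADD_IMM:                  /* acc += immediate (wraps at 8 bits) */
        arg = vm->mem_read(vm->ip++, vm->user_ctx);
        vm->acc = (uint8_t)(vm->acc + arg);
        break;
    case OP_STORE:                    /* memory[operand] = acc */
        arg = vm->mem_read(vm->ip++, vm->user_ctx);
        vm->mem_write(arg, vm->acc, vm->user_ctx);
        break;
    default:                          /* OP_HALT and anything unknown */
        vm->halted = 1;
        break;
    }
}

/* Host main loop: step until the VM halts or the step budget runs out.
 * Returns the number of steps executed. */
int toy_vm_run(struct toy_vm *vm, int max_steps)
{
    int steps = 0;
    while (!vm->halted && steps < max_steps) {
        toy_vm_step(vm);
        steps++;
    }
    return steps;
}

/* One possible backing store: 256 bytes of flat RAM owned by the host. */
struct flat_ram { uint8_t bytes[256]; };

uint8_t flat_read(uint16_t addr, void *ctx)
{
    return ((struct flat_ram *)ctx)->bytes[addr & 0xFFu];
}

void flat_write(uint16_t addr, uint8_t value, void *ctx)
{
    ((struct flat_ram *)ctx)->bytes[addr & 0xFFu] = value;
}
```

Because the host owns the loop and the memory, it can interleave several `toy_vm` instances, cap the number of steps per tick, or back the callbacks with something other than a flat array.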

The VM has a tiny footprint and has been used on 8-bit microcontrollers.

Mycostatin answered 10/4, 2012 at 19:16 Comment(1)
Unfortunately no UTF-8 support, among other things. The VM itself is quite nice, though.Monoatomic
T
0

Try embedding a JavaScript interpreter in your code.

http://www.mozilla.org/js/spidermonkey/

Torn answered 12/3, 2011 at 1:18 Comment(6)
Spidermonkey does not qualify with the criteria I provided.Monoatomic
@soze which ones does it miss?Torn
If you read the Spidermonkey feature set and its requirements you will realize it is not a bytecode VM, certainly well over 200K in compiled code size, no separation whatsoever between bytecode and compiler (because there's none... it's a code parser/lexer implementation) and it is not barebones, and the use case is completely different.Monoatomic
@soze Ah, I was assuming interpreters were ok because you mentioned Lua as a candidate. I wasn't aware that you could precompile lua scripts to byte code.Torn
Lua is always compiled to bytecode before being run. Like many other "interpreted" languages (Perl and Python come to mind as other languages that always run from bytecode), this is done automatically by the runtime when given text as an input program. You'll find that very few interpreters work directly off of text or parse trees, and as such, even SpiderMonkey has an internal bytecode representation, but it doesn't appear to be as well defined.Pain
Well, Javascript and some other languages are translated to native code (call it 'metal level bytecode' ;) ) in JIT-capable engines. Some can do that with bytecode too, think about .NET or Java. I just found out Lua does not support cross platform bytecode. If word size changes you are out of luck. :( So for that matter it's still worse than Pawn.Monoatomic
K
0

One option is to use something minimal and extend it. mini-vm is under 200 lines of code, including comments; it has a liberal license (MIT) and is written in C. Out of the box it supports 0 operations, but it is very easy to extend. The included example compiler is only a simple calculator, but one could easily imagine adding comparisons, branches, memory access, and supervisor calls to take it where you want to go. A VM that is easy to extend is especially useful for developing domain-specific languages, and having multiple languages target your flavor of mini-vm would be straightforward, other than having to implement multiple compilers (or port them; the QuakeC compiler is just lcc, and very easy to retarget).

Threading support would have to be an extension, and the core VM would not play nicely in a multiprocessor pthread scenario (heavyweight threading). Weirdly mini-vm can have a pc (program counter) per heavyweight thread, but would share registers among all threads using the same context. Running separate contexts would be thread-safe though.

I'm skipping the language requirements because the question starts off asking for a barebones VM but at the same time demands C/Java-like syntax. I'm not sure how to resolve that conflict other than by pointing it out.

Klagenfurt answered 4/3, 2017 at 1:54 Comment(0)
