Compiling functional languages to C

Asked 8/8, 2011 at 8:43 Answered 8/8, 2011 at 9:17

Solved c compiler-construction functional-programming garbage-collection language-implementation

Suppose you're compiling a functional language to portable C, and suppose also that for various reasons you want precise rather than conservative garbage collection. There is no portable way (perhaps no way at all in the general case) for the garbage collector to figure out what is and isn't a pointer on the C stack. It seems to me there are two solutions to this problem:

Shadow stack. Make each C function maintain bookkeeping information about what is and isn't a pointer. This is the approach recommended by e.g. LLVM.
Take advantage of the fact that you are compiling a functional language, which means mainline code has no side effects. When the allocator detects out of memory, instead of calling the garbage collector itself, it aborts the current operation with a longjmp back to the main loop, which calls the garbage collector (in a context where the set of variables that may contain pointers is known in advance) then restarts the operation.

It seems to me that, if you are dealing with a pure functional language where the second approach is applicable, it must be more efficient than the first approach, as well as easier to mix with handwritten C.

Are there any problems I'm overlooking? Any references to existing discussion or implementations of this technique?

Trifurcate answered 8/8, 2011 at 8:43 Comment(9)

Possibly not helpful, but I tried the first whilst writing mark-sweep for my scheme interpreter. The performance sucked, so I ended up with a purely virtual stack outside the stack of the C runtime, mainly as cross-runtime stack introspection is virtually impossible. The performance also sucked but it was easier to debug without gdb/ddd. I decided to make do as this was the interpreter and tackle it when I got to the compiler stage of implementation (which never got finished typically). – Dasie 8/8, 2011 at 8:57

How do you plan to restart the current operation? Save checkpoints from time to time, then restore the last good one (how?) – Lashondra 8/8, 2011 at 9:7

@n.m.: the important part of the question in that respect is "code has no side effects". The questioner is assuming a pure functional language, so no state is ever modified. There's no need to "take" a checkpoint, and when you jump to a previous state you don't need to "undo" any changes because the language is not capable of making changes. In principle, your position in the code tells you everything you need to know about the state of the program. – Cheery 8/8, 2011 at 9:14

@n.m. That's a good question. It's easy to imagine for a byte-code interpreter, just longjmp back to eval(). But for compiled, I'm not sure. Surely you don't want to put setjmps around every allocation! – Moist 8/8, 2011 at 9:17

@luser droog: Or you could in effect put a setjmp after each return from a function in the functional language. That's when variables go out of scope, so anything collectable now was collectable at the last such point. The questioner seems to suggest only setjmp in the main interpret loop, I presume because that's at the top of the stack and he has in mind that therefore he doesn't need to worry about accurate vs conservative marking on the stack. – Cheery 8/8, 2011 at 9:26

I don't want to be constantly issuing setjmp calls, that could easily eat up much of the performance gain. I'm hoping to do something along the lines of having just one setjmp before the main loop, and then each time around the loop, only having to save the current program counter in a global variable. – Trifurcate 8/8, 2011 at 9:35

@Steve Jessop: the stack contains some information that is lost in longjmp. You need to save and restore it somehow, or else why have it in the first place? – Lashondra 8/8, 2011 at 9:49

My thinking is that you don't need to save it, because you can recreate it from scratch, just restart the entire operation. Let's say that involves redoing a millisecond of work, but garbage collection only runs once per second, the time spent on redoing work will still be negligible. – Trifurcate 8/8, 2011 at 10:1

What if your entire program is one function that runs for 10 seconds and hits OOM in the middle? Will you restart your entire program endlessly? – Gaiser 9/4, 2018 at 14:48

I think the most obvious thing you haven't described is how to handle persistent out-of-memory. As you've described it, suppose I have an operation that uses more memory than is available. Eventually I reach:

Start whatever stage of the operation takes us over the limit.
Run for a while.
Run out of memory.
Free all the memory allocated by this stage (there is nothing else to free).
Go to 1.

That is, an infinite loop. So at the least you want some sort of generational garbage collection that will allow you to detect the loop and exit.

Cheery answered 8/8, 2011 at 9:17 Comment(0)

It is possible to design a pure FP language using a single data structure:

typedef enum record_type { RT_SYMBOL, RT_NUMBER, RT_PAIR };

struct record
{
  record_type type;
  void *value;  
};

Programs and data can be represented using pairs of records:

struct pair
{
  record *car;
  record *cdr;
};

Here is how a simple expression - 2 * 3 - could be represented using records:

record r1;
r1.type = RT_NUMBER;
r1.value = &two; 

record r2;
r1.type = RT_NUMBER;
r1.value = &three; 

record opr1;
opr1.type = RT_NUMBER;
opr1.value = &OP_MULT; /* A machine op-code for multiplication. */

pair p_oprnds;
p_oprnds.car = &r1;
p_oprnds.cdr = &r2;

pair p;
p.car = opr1;
p.cdr = p_oprnds;

This is the same as the Lisp expression: (* 2 3). Now you can define a machine that operates on pairs, treating the car as an operator and the cdr as operands. As we deal with only one data structure, precise GC is possible. See Lispkit Lisp for the architecture of such a VM.

Also read Lisp in Small Pieces before starting off with a serious attempt on writing an FP -> C compiler.

Grieg answered 8/8, 2011 at 9:11 Comment(0)

Start whatever stage of the operation takes us over the limit.
Run for a while.
Run out of memory.
Free all the memory allocated by this stage (there is nothing else to free).
Go to 1.

That is, an infinite loop. So at the least you want some sort of generational garbage collection that will allow you to detect the loop and exit.

Cheery answered 8/8, 2011 at 9:17 Comment(0)

Recommended topics

Hot tags