Garbage collection implementation in compiled languages

Asked 17/8, 2011 at 23:35 Answered 11/9, 2011 at 15:36

Solved compiler-construction garbage-collection language-implementation

When implementing precise garbage collection, there is always the issue of figuring out which words on the stack are pointers and which are other kinds of data such as integers or floating point numbers. Interpreted languages typically solve this problem by making everything a pointer; compilers for some languages such as Lisp typically solve it by using tag bits to distinguish between pointers and integers.

But what about JIT compilers for languages such as Java and C# that support full unboxed machine word integers and floating-point numbers? How do they tell which of the contents of the stack and CPU registers are pointers?

Machuca answered 17/8, 2011 at 23:35 Comment(2)

Just wonder why the close votes? It's a genuine question. As for a possible implementation - type information is available statically, so all the pointers could go into a separate stack, for example. – Queeniequeenly 18/8, 2011 at 10:30

And a relevant link: mono-project.com/Generational_GC#Precise_Stack_Marking – Queeniequeenly 18/8, 2011 at 10:32

The bytecode for such languages always contains full type information. It is stored either in meta-data (e.g., for argument types) or implicit in the opcode (e.g., there may be different opcodes for adding an integer or a floating point number).

When optimising code the compiler can access this information and use it to improve optimisations. It also uses the information to generate meta data for the compiled code at specific GC safe points.

A GC safe point is a place in the code, where it is safe to interrupt a thread to schedule another thread or perform garbage collection. At GC safe points we have the necessary meta data available to find out which registers contain pointers and which don't. In the Hotspot JVM, for example, a loop always contains a read from a special location in memory. The result of that read is unused, but if the address that the instruction reads from is read-protected, a page fault occurs. This can be used to interrupt a thread at arbitrary points in time by simply setting that page to read-only. Once the thread is interrupted we look at the program counter and look up the meta data in, say, a hash table.

Other places that need to be GC safe points are allocation sites: an allocation may fail and cause GC to occur. You can reduce the number of safe points by allocating memory for multiple objects at once.

Edit: Note that using GC-safe points is only one of many options. Another option, as SK-logic mentioned, is to use separate stacks for pointers and non-pointers. It is then clear that all elements of one stack need to be traversed during GC but none of the others. You still have to be careful with pointers in registers, though. For example, whenever there is a live pointer in a register, the same pointer must also exist on the stack.

A third option is to use a shadow stack that contains a linked list of pointers to stack roots living on the real stack. For details see the paper "Accurate Garbage Collection in an Uncooperative Environment" by Fergus Henderson (PDF).

Cackle answered 11/9, 2011 at 15:36 Comment(0)

Languages like Java and C# are specified in such a way that they do not require precise collection. An implementation might use a conservative collector, where patterns of bits that appear to look like a pointer are treated like a pointer (but might really be an integer or float). For example, the Boehm collector is a conservative collector that could be used for JIT-ed languages.

Billups answered 17/8, 2011 at 23:41 Comment(1)

Sure, Mono used to use Boehm. But typical implementations (Sun Java, Microsoft .Net, latest version of Mono etc) do use precise garbage collection. I'm interested in how. – Machuca 17/8, 2011 at 23:44

Recommended topics

Hot tags