Converting SSA to stack machine

It is well known how to convert code from SSA representation to a register machine. (Basically, graph coloring register allocation is the core of such conversion.)

But what's the general method for converting from SSA to a stack machine? (CIL byte code, in the case I'm looking at.) I would expect it to be simpler, given the lack of need for register allocation?

Ruttish answered 14/7, 2018 at 14:33 Comment(1)
Perhaps this might be better suited for CS.se. – Hierarchy

It's been over 15 years since I was involved in compiler construction, so I might not remember all the details.

Basically, when going out of SSA, you need to generate copies into virtual registers at the end of every block that leads into a phi-node in a successor block. This produces a number of virtual registers, usually more than the actual machine has available, so you then apply register allocation to map them onto real registers, spilling to the stack those values that don't fit.
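A minimal sketch of that copy insertion in Python, using a toy IR invented for this example (blocks are lists of instruction tuples ending in a terminator; real out-of-SSA translation also has to handle the swap and lost-copy problems, which this ignores):

def eliminate_phis(blocks, phis, next_temp):
    """Replace each phi with copies at the end of its predecessor blocks.

    blocks: {name: [instruction tuples, ending in a terminator]}
    phis:   {block name: [(phi destination, {predecessor: incoming value})]}
    Returns a map from each phi destination to the virtual register
    (a future local variable) that now holds its value; callers must
    rewrite uses of the phi to read that register instead.
    """
    replacements = {}
    for block_phis in phis.values():
        for dest, incoming in block_phis:
            temp = f"v{next_temp()}"
            for pred, value in incoming.items():
                # Copy the incoming value just before the predecessor's branch.
                blocks[pred].insert(-1, ("copy", temp, value))
            replacements[dest] = temp
    return replacements

# Example: x3 = phi(x1 from B1, x2 from B2) becomes a copy in B1 and in B2.
counter = iter(range(100))
blocks = {"B1": [("const", "x1", 1), ("jump", "B3")],
          "B2": [("const", "x2", 2), ("jump", "B3")],
          "B3": [("ret", "x3")]}
phis = {"B3": [("x3", {"B1": "x1", "B2": "x2"})]}
print(eliminate_phis(blocks, phis, lambda: next(counter)))  # {'x3': 'v0'}
print(blocks["B1"])  # [('const', 'x1', 1), ('copy', 'v0', 'x1'), ('jump', 'B3')]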

For a stack-based machine, just don't do the last step. You end up with roughly the same number of virtual registers as phi-nodes in the compiled function (the algorithm is actually not trivial; a good starting point is the paper Efficiently Computing Static Single Assignment Form and the Control Dependence Graph by Ron Cytron, Jeanne Ferrante, et al.). These virtual registers will be your local variables.

When reading a value from a virtual register (local variable) to be used by an operation, first use an instruction to push it onto the stack. The Java VM iload index instruction is an example: it loads the local variable at index and pushes its value onto the stack (see https://docs.oracle.com/javase/specs/jvms/se7/html/jvms-6.html#jvms-6.5.iload). When writing a value into a local variable, you pop it from the stack; see the Java VM istore index instruction (https://docs.oracle.com/javase/specs/jvms/se7/html/jvms-6.html#jvms-6.5.istore).

For example, if after going out of SSA you need to encode

local[5] = MUL local[2], local[4]

then you need to generate something like this:

ILOAD 2
ILOAD 4
MUL
ISTORE 5
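To make that lowering concrete, here is a small Python sketch; the tuple-based instruction encoding and the name lower_to_stack are invented for illustration:

def lower_to_stack(instr):
    """Lower one three-address instruction (op, dest, lhs, rhs), where the
    operand fields are local-variable slots, into stack-machine code."""
    op, dest, lhs, rhs = instr
    return [
        ("iload", lhs),    # push the first operand
        ("iload", rhs),    # push the second operand
        (op,),             # pops both operands, pushes the result
        ("istore", dest),  # pop the result into the destination slot
    ]

print(lower_to_stack(("mul", 5, 2, 4)))
# [('iload', 2), ('iload', 4), ('mul',), ('istore', 5)]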

For CIL bytecode, the equivalent operations for local variables are ldloc and stloc (ldarg and starg are the corresponding instructions for arguments, not locals).

Of course, there is a lot of room for optimization to avoid redundant loads and stores.
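One classic peephole, sketched below in Python with the same invented tuple encoding as above: a store that is immediately reloaded can keep its value on the stack instead. (The required liveness check, namely that the slot is never read again later, is deliberately omitted here.)

def remove_store_load_pairs(code):
    """Drop istore n / iload n pairs, leaving the value on the stack.
    Only safe when slot n is dead afterwards; that check is omitted."""
    out, i = [], 0
    while i < len(code):
        if (i + 1 < len(code)
                and code[i][0] == "istore"
                and code[i + 1] == ("iload", code[i][1])):
            i += 2  # the value simply stays on the stack
        else:
            out.append(code[i])
            i += 1
    return out

print(remove_store_load_pairs([("istore", 5), ("iload", 5), ("mul",)]))
# [('mul',)]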

Shrove answered 23/7, 2018 at 12:49 Comment(0)

SSA is basically a set of "logic" gates, each with multiple inputs and typically one output.

So essentially you need to treat each gate as a set of stack pushes for the inputs, followed by a zero-operand operator that combines the stack values into the result for that gate. For instance, a + b * c as SSA with a multiply-and-accumulate operator has 3 pushes for a, b, c followed by a MAC_TOS operator.

If you have a chain of such gates, you can take the output of an earlier gate, which is already on the stack, and simply act as if it had just been pushed.

So an SSA computation looks like an n-ary tree of gates, with the output coming out at the root.

You can walk the tree depth-first, pushing operands that have not already been pushed and emitting a gate's operator once all of its operands have been computed, i.e., in postfix order.

So the SSA graph (tree):

a   b   c   d
 \ /     \ /
  *       -
   \     /
    \   /
     \ /
      +

can be used to produce

push a
push b
times
push c
push d
subtract
add
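In Python, that walk is a short post-order recursion; the tuple encoding of the tree is invented for this sketch:

def emit(node, code):
    """Emit stack code for an expression tree: leaves are operand names,
    interior nodes are (operator, left, right) tuples."""
    if isinstance(node, str):
        code.append(f"push {node}")  # a leaf operand such as "a"
    else:
        op, left, right = node
        emit(left, code)   # operands first...
        emit(right, code)
        code.append(op)    # ...then the zero-operand operator
    return code

tree = ("add", ("times", "a", "b"), ("subtract", "c", "d"))
print("\n".join(emit(tree, [])))
# push a, push b, times, push c, push d, subtract, add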
Violent answered 17/7, 2018 at 3:48 Comment(7)
Right, if one only needed to translate expressions, that would suffice. But in general SSA, a, b, etc. may be defined in arbitrary order, and may be function calls with side effects, so order needs to be preserved; you need to do a forward pass over each basic block to do things in the given order. How do you reconcile that with the above scenario of backward chaining in postfix order? – Ruttish
Your question was phrased as basic insight, so my answer was at the same level: SSA is basically only expressions; it is the functional equivalent of chunks of your code. Real code generators lead more complex lives. If you have multiple expressions with shared subexpressions (e.g., a DAG with multiple results), you want to evaluate the shared subexpressions first (recursively) so that you can maintain a stack-compatible ordering of computations; you may need a memory location to hold the shared result. – Violent
You'll note that phi nodes add a complication because their value may come from very different computations. They too might need a temporary location. Good news: you can assign temporaries using register coloring :-}. If you have side effects of an operator, then you will have to add additional sequencing arcs to the SSA nodes that model where ordering is necessary, and honor them in the code generation process. – Violent
This is an extremely lucid way of explaining it, good work. – Sukhum
@IraBaxter Thanks, illuminating! What if there's no (heap) memory location to spill the results of shared subexpressions, but there are appropriate stack operators like dup, drop, swap, rot in FORTH? As the set of supported stack operators varies across implementations, maybe the solution is similar to optimizing compilation or query optimization? – Alleen
Assume you end up with N temps needing to spill. Then you get to treat the top N locations of the stack as your N temps. There are N! ways to arrange the temps on top of the stack; you need to find out which one of those causes you to issue the fewest stack manipulation operations to bring the temps you want into the top place on the stack when needed. I don't know how to accomplish that optimization. – Violent
... well, hmm. Given that the number of shared subexpressions is probably small, you could probably do a brute-force search of (abstracted) code sequences and determine the code with the smallest number of stack-top operations. – Violent
