How can I build the post-dominator tree of a function with an endless loop?
Asked Answered
R

1

18

One of my side projects is a decompiler that turns native code into LLVM IR, simplifies it and outputs pseudo-C. An essential phase of the program's process is pattern-independent control flow structuring, which finds regions in a program and turns them into structured control flow statements (i.e. not gotos).

I had to roll out my own code for finding regions because LLVM's regions don't look exactly how the paper expects them. However, finding regions requires you to have a post-dominator tree, and it turns out that LLVM's post-dominator tree building algorithm can't make one for functions that have no end block, like functions that "end" with an endless loop.

This appears to be because the tree-building algorithm needs a starting point. Normally, the starting point is the function's returning block, since it post-dominates every other block; but there isn't any in functions that spin into an endless loop, since they don't have any return or unreachable terminator. (At this point, it might be worth noting that LLVM's region code also relies on a post-dominator tree and is also useless for functions for which it can't be built.)

It seems to me that even though this algorithm fails, the fact that a function doesn't return doesn't mean that you can't make a post-dominator tree for it.1 In fact, if that endless loop has a single back-edge (which is something that I can ensure), the node with that back edge necessarily post-dominates every other node in the graph, so it should be possible to make a post-dominator tree.

If I could find that node, I could probably feed it to LLVM's post-dom infrastructure and get a reasonable post-dominator tree out of it. Unfortunately, I'm not very imaginative and the only straightforward way that I can think of to identify that crucial node is that "it's the one that post-dominates everything", which certainly won't help me bootstrap a post-dominator tree.

Finding back edges isn't particularly hard. As Doug Currie says, you can do it with a simple DFS, and in fact, another part of my project does exactly that. However, in the case of a function with an endless loop and nested, terminating loops, I don't know how I would tell the inner back edge from the outer back edge without domination information. (If it can help, at this stage of the process, every loop is guaranteed to have a single entry node and at most one exit node.)

So how can I build the post-dominator tree of a function that doesn't have any returning basic block?

1. My compiler and graph theory background is entirely self-taught. This might not be accurate.

Rosaleerosaleen answered 14/2, 2016 at 23:23 Comment(5)
As a hack, to make your analysis succeed, could you perhaps insert a fake end-condition in the loop, which never actually becomes true, e.g. in pseudo-code: if (5 == 7) break;?Conker
@500-InternalServerError, I still need to find out where to put it exactly, and that would be at the node with the back-edge, which I don't know how to find.Rosaleerosaleen
You can find back edges with DFS; see, e.g., cs.yale.edu/homes/aspnes/pinewiki/DepthFirstSearch.htmlEpithelioma
@DougCurrie, there's no guarantee that a function with an endless loop has just one loop. If the endless loop has an inner (terminating) loop, I don't know how to determine which back edge belongs to the endless loop. I should include that in the question.Rosaleerosaleen
@zneak, see "Identifying Loops In Almost Linear Time" by G RAMALINGAM, pages.cs.wisc.edu/~ramali/Papers/toplas99.ps which describes near linear time algorithms for finding loops and loop nesting.Epithelioma
M
8

Strictly speaking, when there's an infinite loop (not just an unbounded loop -- a strictly infinite one with no exit), there is no post-dominator, as for every node in the loop, the next node in the loop will execute after it, so there's no "last" node to the loop.

One way of dealing with it, is to do the normal post-dominator finding, except after doing the initial depth-first traversal from the exit, you check for unvisited nodes. If there are any, the exit is unreachable from them (no path from the nodes to the exit), so you pick one at random and add it as a psuedo-exit (as if the node contained a conditional branch to the exit, just the condition is always false) and restart. This gives you a post-dominator tree, but not necessarily a unique one (since you might choose different nodes at random to add the exit branch), but generally that doesn't matter.

Moyer answered 15/2, 2016 at 2:5 Comment(7)
This is actually what I was trying to get at with my comment, though I wasn't able to word is as well as you have.Conker
Does that work even when not every unreachable node is part of an endless cycle?Rosaleerosaleen
By definition, all unreachable nodes must be part of an endless cycle, since every node must have at least one successor (if only itself).Moyer
Seems to me that an unvisited node could unconditionally lead to an endless cycle without being part of it, no?Rosaleerosaleen
@zneak: Yes, but such a node must lead unconditionally to an endless cycle -- you can't have unvisited nodes without endless cycles. So once you've added never-taken branches that result in all nodes being reachble by the reverse-edge traversal from the exit, you have taken care of all the endless cycles. Its just that there are multiple ways you might do that (that may result in more or less efficient code), but all such ways are "correct" in the sense that they preserve the semantics of the original program.Moyer
I settled on this: (1) ask LLVM to make its post-dom tree; (2) find every back edge in the function; (3) check if every back edge destination has a tree node. If so, use LLVM's post-dominator tree. Otherwise, take the tree's root and add every back edge destination that didn't have a tree node as a root, and calculate a new post-dominator tree. It appears to work.Rosaleerosaleen
Some additional context for people who may stumble upon this: it's entirely true that you can add an exiting edge to any node in the loop and get correct results, but best results are achieved when you add them to a node that has a back-edge. This allows that node to post-dominate the header.Rosaleerosaleen

© 2022 - 2024 — McMap. All rights reserved.