Get Control flow graph from Abstract Syntax Tree
Asked Answered
D

6

16

I have an AST derived from the ANTLR Parser Generator for Java. What I want to do is somehow construct a control flow graph of the source code, where each statement or expression is a unique Node. I understand there must be some recursiveness to this identification, I was wondering what you would suggest as the best option and if ANTLR has a toolset I can use for this job. Cheers, Chris


EDIT - My main concern is to get a control flow graph(CFG) from the AST. This way I can get a tree representation of the source. To clarify, both the source code and the implementation language is Java.

Dosi answered 18/9, 2008 at 13:30 Comment(6)
You should put clarifications of your question in the question, so that the answers can reflect your question, and not be in the comments.Delenadeleon
"CFG .... get a tree representation from the source..."?? If you parse the source code, you get a tree representation. A CFG would produce a graph that connected the AST nodes together.Jamarjamb
@irabaxter CFG is not the same things as a CFG. CFG pertains to how execution progresses. AST is simply a tree representation of the written code - in other words it is not a representation of the execution flow.Scratches
@ekeyeser: "CFG is not same thing as CFG" is a contradiction. Yes, I know a CFG is about ordering computations; my point to OP is that the computations being ordered are defined by the AST. Check my bio.Jamarjamb
@irabaxter No need to check your bio. I'm confident you know what you're talking about. I was commenting on the "??" which might strike people as arrogant. If someone is asking a question it's best not to use their own words against them which it seems you do a lot quite frankly.Scratches
SO is about getting good answers. If you don't want somebody to point out an inaccuracy, don't say inaccurate things. I put "??" to point out the inaccuracy. Sorry if you are feeling singed, but any reading of arrogance into that isn't mine.Jamarjamb
D
13

Usually CFGs are computed on a lower-level representation (e.g. JVM bytecode). Someone did a thesis on such things a few years ago. There might be a helpful way described in there for how to get at that representation.

Since your source and target languages are the same, there's no code generation step -- you're already done! However, now you get to walk the AST. At each node of the AST, you have to ask yourself: is this a "jumping" instruction or not? Method calls and if statements are examples of jumping instructions. So are loop constructs (such as for and while). Instructions such as addition and multiplication are non-jumping.

First associate with each java statement a node in the CFG, along with an entry and exit node. As a first approximation, walk the tree and:

  1. if the current statement is a method call, figure out where the entry node is for the corresponding body of that method call, and make an edge pointing from the current statement to that entry node. if the statement is a method return, enumerate the places that could have called it and add an edge to those.
  2. for each non-jumping statement, make an edge between it and the next statement.

This will give you some kind of CFG. The procedure is slightly hairy in step 2 because the method called may be declared in a library, and not elsewhere in the AST -- if so, either don't make an edge or make an edge to a special node representing the entry to that library method.

Does this make sense?

Delenadeleon answered 18/9, 2008 at 15:30 Comment(3)
The thesis you link to is about visualizing CFG:s not generating them.Skiing
This doesn't address the control flow induced by the "x?y:z" operator, nor does it address exception handling links.Jamarjamb
I've added some more examples.Delenadeleon
J
6

Producing a full control flow graph that really takes into account all the language issues is harder than it looks. Not only do you have to identify what appears to be the "basic blocks", but you have to identify function calls (sort of easy, but identifying the target might be harder), where behind-the-scenes operations such as class initializers can happen. and to worry about the points where exceptions can occur and where control goes if an exception does occur.

If you examine most languages carefully, they will also be clear about ordering of evaluation of computations in expressions, and this matters if you have two side-effects in an expression; the control flow should reflect the order (or the non-order, if it isn't defined).

Maybe you only want an abstraction of the control flow having the basic blocks and the conditionals. That's obviously a bit easier.

In either case (simple CFG or full CFG), you need to walk the AST, at each point having a reference to possible control flow targets (e.g., for most cases, such as IF statements, there are two flow targets: the THEN and ELSE clauses). At each node, link that node to the appropriate control flow target, possibly replacing the flow targets (e.g., when you encounter an IF).

To do this for the full language semantics of Java (or C) is quite a lot of work. You may want to simply use a tool that computes this off-the-shelf. See http://www.semanticdesigns.com/Products/DMS/FlowAnalysis.html for what this really looks like, coming out of our tools.

Jamarjamb answered 17/6, 2009 at 8:46 Comment(0)
D
1

Based on some comments, it sounds like the OP really wants to do code generation -- to convert the AST into a lower-level sequence of instructions based on basic blocks and jump points.

Code generation is very language-specific, and a lot of work has been put into this topic. Before you do code generation you need to know your target language -- whether it be assembler or simply some other high-level language. Once you have identified this, you simply need to walk the AST and generate a sequence of instructions that implements the code in the AST. (I say this is simple, but it can be hard -- it's hard to generalise because the considerations here are pretty language-specific.)

The representation you choose for code generation will contain the control-flow graph, implicitly or explicitly. If your target language is fairly low-level (close to assembler), then the control-flow graph should be relatively easy to extract.

(Please comment if you'd like more clarification.)

Delenadeleon answered 18/9, 2008 at 14:11 Comment(4)
I agree that knowledge of the target language (Java) is imperative. I'm looking for some insight as to how to approach the AST walk into a form that implicitly holds the control flow graph. Any suggestions?Dosi
If you know how to generate the Java, then to make a CFG from the java: make a node for each statement that is not a method call into your program. For method calls, draw an edge to the entry of the body for that method.Delenadeleon
In general this is a hard task, even if I knew your source language, which I don't. You just have to ... come up with a mapping of your source language constructs into Java.Delenadeleon
This isnt "code generation". OP wants to build a CFG. He does that by walking the AST, identifying control flow branching points, and linking them together.Jamarjamb
W
-1

Did you ever tryed ANTLR Studio? It does not generate the hole AST graph, but for review, its already quite helpful.

Winzler answered 18/9, 2008 at 13:35 Comment(2)
ANTLR Studio is basically a language editor for the ANTLR's automatically generated parsers. I have the parsers and lexers. What I need is a way to manipulate the AST. Any thoughts?Dosi
@user5915: why don't you make another question that asks about how to do what you want for manipulation? This question is about CFGs. (Yes, see my bio if you want a way to manipulate ASTs. Not with ANTLR).Jamarjamb
D
-1

When I have done this in the past, I used graphviz, in particular the dot tool, to generate the graph. I created the dot input file by actually traversing the control-flow graph at compile time.

Graph layout is a hard problem, and graphviz does an excellent job. It can output to ps, pdf, and various image formats, and the layout is usually pretty intuitive to look at. I highly recommend it.

Delenadeleon answered 18/9, 2008 at 13:51 Comment(3)
I would be more interested as to how you traversed the control flow graph at compile time, rather than the actual visualization of the graph once it's constructed. CheersDosi
Usually at this point you have generated fairly low-level code that consist of non-jumping instructions and jumping instructions. The former correspond to CFG nodes, and the latter contains implicit edges (the places to jump to). See alsohttp://en.wikipedia.org/wiki/Control_flow_graph.Delenadeleon
You may want to read up on "code generation": en.wikipedia.org/wiki/Code_generation_(compiler) -- this is the process of going from your AST to some lower-level representation, and this usually precedes the construction of the CFG.Delenadeleon
S
-1

I don't think I'll be able to answer your question in a way that you are maybe looking for since I don't know of any way in ANTLR to produce a CFG with or without an AST. But, in short you would use what ANTLR produces to generate a separate Java program to produce a CFG. You would utilize the ANTLR generated syntax tree as input to generate your CFG in a separate Java program of your own creation. At this point you are, in essence, building a compiler. The difference between your "compiler" and a JVM is that your output is a visual representation (CFG) of how a program branches its various execution paths and a JVM/Java compiler produces code for execution on a real machine (CPU).

An analogy is if someone sits down to write a book (in English for example), the individual words used in sentences are the TOKENS of a computer language, sentences are formed in a similar manner that context free grammars express valid computer code, and paragraphs & whole novels tell a story in a similar manner that semantic analysis/compilers/CFGs might produce/represent logically valid programs that actually do something useful and are more or less free of logic bugs. In other words, once you get past the issue of valid syntax (correct sentence structure), anyone can write a bunch of sentences on a page but only certain combinations of sentences produce text that actually does something (tell a story).

What you're asking about is that last piece - how to go about taking a syntax tree and transforming or interpreting what the AST actually does logically. And of course you would need to build a "compiler" for each language you want to do this for. Having a correct grammar doesn't tell you what a program does - just that a program is correct from a grammar perspective.

Linters and syntax highlighters and IDEs are all built around the idea of trying to make this last piece of the puzzle an easier and a more efficient task for humans.

Scratches answered 31/8, 2018 at 3:15 Comment(3)
ANTLR doesn't offer anything to support building the CFG. No, this isn't compiling; its a part of the front end of a compiler. OP may want to build a compiler, but he didn't say one way or another in his question.Jamarjamb
Sure. But I never said ANTLR does support building CFGs. The tree ANTLR creates can be used to generate a CFG though. I'm using the term compiling here loosely only to illustrate a point if you consider a compiler generically meaning a program that translates a set of instructions from one language to another. For what it's worth, the OP might not have a full grasp of what he/she needs/wants which is the point of asking a question.Scratches
Whatever you said above, didn't materially answer OP's question. This is not a useful answer.Jamarjamb

© 2022 - 2025 — McMap. All rights reserved.