Create control flow graph from assembly file
Asked Answered
B

4

8

I would like to create control flow graph(CFG) from assembly file using C language. I have been thinking about it and these are my ideas: 1. create blocks - process assembly file line by line - find important instructions like function name, block name, jump instructions, call instructions or leave/ret and maybe some others - find them with regular expressions maybe? But I have not found regex implementation for C on Windows yet. - after match of instructions above, save instructions before match to some structure and this is my block 2. create CFG - somhow from blocks create CFG but yet I have no idea

Can anyone give me some advice how to do it? Also if there is better language to do it I would appriciate if you tell me.
Thanks for your time and help.

Bullis answered 20/2, 2016 at 20:57 Comment(6)
It rather depends on how good of a job you need to do. Ad-hoc regular expressions in whatever language you are most comfortable with is a reasonable first approximation. You'll get into trouble with macros, external linkage, computed branches and the like but then a perfect solution is impossible to begin with so it all depends on how good of a heuristic you require.Professorship
If you do ad hoc parsing, you'll get at best ad hoc control flow graphs. Why you would want a bad CFG is beyond me; what would you do with it that is useful?Thomsen
What do you need this for? Maybe there's a different direction to solve the real problem behind this.Thomsen
I want to implement method called "Enhanced version of Control Flow checking using Assertions" to assembly source code.Bullis
C doesn't have any special advantage for this task. There are high-level languages that make text-parsing generally easier. I'd use perl, because that's what I know, but there are lots of good choices. Python is popular.Distillate
I have been thinking about using Java but stiil I am not sure how to do it.Bullis
T
4

What OP needs is a disciplined approach to accomplishing this task.

He needs a good parser for the assembler source, so he knows he has an accurate representation of it. Beyond the pure parse part, he'll have to emulate the assembler pretty completely, including all the complications, such as macros, conditional blocks, multiple location counters, absolute/relative/external symbols, etc. (Build a good parser by relying solely on regexes isn't going to work.)

He'll then need to compute first estimate of the control flow graph by inspecting the sequences of machine instructions and branches. This may be harder to do than it looks; in big, complicated assembly codes people abuse entry points into procedures so that it is sometimes hard to tell what's an instruction, and what is data.

(Here's a trick I use in a big x86 application. I want to add sanity checking to my code in many places. The sanity tests look like this:

  <test for some sane condition>
  jf  location+3       ; this branchs to a breakpoint in the middle of the next instruction
  cmp  al, 0xCC        ; the immediate value is a BREAKPOINT opcode

They're compact, and a breakpoint occurs when some bad happens. But when analyzing this program for control flow, the "jmp false" sometimes branches into what appears to be the middle of an instruction. How will OP model that?)

The next complication are pointers to code. Assembler code often generates many pointers to other other instructions, and then hides those pointers in various places (the call instruction pushes them onto the data stack for the x86), retrieves them, and then does "jmp indirect". If you want to know where that jmp might go, you need to track the possible values a memory location might contain, which means you need to do a data flow analysis (how do values get there, and from where) and combine with call graph construction (can't get to that function? OK, then where it goes won't affect this code) to compute an answer that is reasonable.

Doing all this by ad hoc methods will end up producing inaccurate (useless) answers. OP needs to find a framework in which to build his parser, and implement good quality points-to analysis algorithms if he hopes to get a good result.

C isn't specifically designed to support this task. It can do it with enough additional sweat, but that's true for any programming language.

(Check my bio for such a framework. OP can use any framework that works for him).

Thomsen answered 20/2, 2016 at 21:42 Comment(6)
Seems like lots of work to do. I want to do this for smaller programs which use easy algorithm just like quicksort or hanoi tower or maybe some program where to open file read from it and write to it or something like that. I get assembly source from C soure using GCC and I will use AT&T syntax. Would it be less complicated than it seems to me after I read your post(btw very interesting - I was not aware of all aspects of the task I am about to do)Bullis
There are no "easy" algorithms in assembler code. Quicksort uses recursion (code pointers in the stack) and partitioning of values to be sorted (often implemented with pointers rather array indexes in assembler).Thomsen
If you want more background, read my essay on "Life After Parsing"; search on web or see my bio.Thomsen
I kind of don't get it. When I have assembly file (instruction after instruction) where could by problem go instruction by instruction and look for instruction that tells me that there is end of block. I save all instruction befor this instruction to some structure. After I parse the file I can somhow create CFG going block after block and find successor(s) and predeccessor.Bullis
If assembly code was all about direct branches from one chunk of code to another, you construct a CFG in a relatively straighforward way. It isn't; what makes assembly powerful is unrestricted use of pointers to data and to code, with crazy indirections supporting efficiencies. In particular, a jmp indirect register can technically go to any memory location in your program's address space. You can't model that sensibly with a flowchart. So you have to figure out what can be in the register to limit the number of targets, ("points-to" analysis) and that's hard.Thomsen
... that's the easy case. Then then are jump tables with ill-defined boundaries, instructions in data space and vice versa. and then really crazy stuff where a pointer is modified in some very unusual way. (I write a complex runtime system in assembler for a parallel programming langauge; it is useful to mask code pointers to remove the bottom N bits where N has a lot to do with cache alignments. This would make a general points-to analyzer fail; it has to have knowledge of how the pointers are being manipulated and why. So, hard, plus special knowledge).Thomsen
P
3

A simpler approach would be to assembly the assembly file, then disassemble it. Most of the parsing problems are eliminated. A disassembly will have fixed columns for the label, op code, and operands, so very little parsing is required.

Using the disassembly, perform two passes. The first pass is to create a data structures to represent each instruction and to collect all of the jump targets.

The second pass is to create structures that represent basic blocks (a block of code with a single entry point and exit). Link each basic block to its successor(s). A basic block can have zero, one, or two successors (or N successors in the case of a jump table). A basic block that ends with RET has zero successors. A basic block ending in an unconditional jump has one successor. And a basic block with a conditional jump has two successors - either fall through or target of jump. Basic blocks without predecessors are either subroutine entry points or dead (or invalid) code.

Jump targets are the beginning of a basic block, as is an instruction following an unconditional jump (which should be a jump target or else a subroutine entry point).

Short of running the program (via real hardware or an emulator) knowing the targets of indirect jumps with certainty is infeasible. I suggest supporting a couple of simple cases: 1) a jump table where the table is within the program 2) a jump through a global memory location that used to link to another executable (and the disassembler lists what the target is). In the first case the basic block can have arbitrarily many successors. In the second case the basic block has a single successor.

Note that I'm deliberately leaving out CALLs as part of the CFG. When I implemented a CFG grapher that's what I did. My grapher showed just a single function at a time. If you double clicked a CALL then the subroutines's CFG was displayed.

However, if you want to include the whole subroutine tree in a single CFG, then CALLs would be the end of a basic block, and the instruction following a CALL would be the beginning of a basic block. Note that for anything but the simplest programs it will be hard to view the entire CFG for a program.

I'm leaving out INTs and IRETs because I'm assuming you are dealing with user mode applications. If not, then treat INTs like calls and IRETs like a RET. A hardware Interrupt Serivce Routine (ISR) could be invoked from anywhere where interrupts are enabled, so there won't (usually) be any direct invocations of an ISR - it will just kind of sit over to the side. In more general terms if you are dealing with kernel mode software, you'll have a variety of other considerations.

Plovdiv answered 21/2, 2016 at 2:47 Comment(4)
Disassembling won't resolve the jmps-to-middle of instruction. It may not even recognize when a memory region is actually code, or may misinterpret a data region incorrectly as containing code. This approach adds noise, and you don't avoid the hard problem of points-to analysis. What you;ve done is avoid the easiest part: parsing the assembly source code.Thomsen
Agner Fog's objconv disassembler (agner.org/optimize) already produces output with the branch targets marked. IDK what it does with jumps to the middle of an instruction (which you might find in obfuscated machine-language code). It's open source (I forget the license), so you could maybe even use that as a starting point.Distillate
@IraBaxter I suspect the OP is mostly concerned with non-assembler code (C code in particular). Clearly it's possible to code stuff in assembly that makes it impossible to generate an accurate CFG. However that doesn't mean CFGs can't still be usefully generated. IDA is a good example of a product that usefully generates CFGs.Dalliance
You mean, he is interested in disassembling vanilla compiled C code? That would generally make the behaviour of the object code probably a lot cleaner under normal circumstances, I agree. A C compiler designed to generate difficult to decode object code would make it a lot worse. (I've implemented an x86 compiler for parallel machine code with very small grain sizes; its code would be difficult to disassemble because it does a few odd tricks with data placement in instruction streams). YMMV, or more likely, HisMMV.Thomsen
M
3

I was facing the same issue as you and didn't find any ready made solutions so I wrote simple one on my own with Python: https://github.com/Kazhuu/asm2cfg.

Note that I have only tested it with function disassembly dumps from GDB. This could be extended to use with objdump I think.

Marji answered 11/2, 2021 at 13:58 Comment(0)
A
3

For a few years now Compiler-Explorer has this implemented:

enter image description here

Would give something like this: enter image description here

(see here).

In case this doesn't suffice and someone actually wishes to implement this themselves, here are some pointers for tracing the internal godbolt logic:

  1. Most of the logic is in the files under https://github.com/compiler-explorer/compiler-explorer/tree/main/lib/cfg.
  2. The split to basic blocks is done, well, in splitToBasicBlocks. isBasicBlockEnd and isJmpInstruction are implemented per-architecture.
  3. generateFunctionCfg creates the nodes+edges graph object.
  4. The visual layout engine is actually in the front-end, and the current implementation borrows heavily from https://cutter.re/docs/api/widgets/classGraphGridLayout.html - although we have some thoughts on the matter.
Anisometric answered 16/12, 2023 at 14:57 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.