Identify loops in java byte code
Asked Answered
T

4

18

I am trying to instrument java byte code.

I want to recognize the entry and exit of a java loop, but I have found the identification of loops to be quite challenging. I have spent a good few hours looking at ASM and open source de-compilers (whom I thought must solve this problem all the time), however, I came up short.

The tool I am augmenting / extending, is using ASM, so ideally I would like to know how to instrument the entry and exit of the different loop constructs in java via ASM. However, I would also welcome a recommendation for a good open source de-compiler, as clearly they would have solved the same problem.

Tips answered 22/7, 2011 at 15:28 Comment(0)
P
21

EDIT 4: A bit of background/preamble.

  • "The only way to jump backward in the code is via a loop." in Peter's answer isn't strictly true. You could jump back and forth without it meaning it's a loop. A simplified case would be something like this:

    0: goto 2
    1: goto 3
    2: goto 1
    

    Of course, this particular example is very artificial and a bit silly. However, making assumptions as to how the source-to-bytecode compiler is going to behave could lead to surprises. As Peter and I have shown in our respective answers, two popular compilers can produce a rather different output (even without obfuscation). It rarely matters, because all of this tends to be optimised rather well by the JIT compiler when you execute the code. This being said, in the vast majority of cases, jumping backwards will be a reasonable indication as to where a loop starts. Compared with the rest, finding out the entry point of a loop is the "easy" part.

  • Before considering any loop start/exit instrumentation, you should look into the definitions of what entry, exit and successors are. Although a loop will only have one entry point, it may have multiple exit points and/or multiple successors, typically caused by break statements (sometimes with labels), return statements and/or exceptions (explicitly caught or not). While you haven't given details regarding the kind of instrumentations you're investigating, it's certainly worth considering where you want to insert code (if that's what you want to do). Typically, some instrumentation may have to be done before each exit statement or instead of each successor statement (in which case you'll have to move the original statement).


Soot is a good framework to do this. It has a number of intermediate representations that make bytecode analysis more convenient (e.g. Jimple).

You can build a BlockGraph based on your method body, for example an ExceptionalBlockGraph. Once you've decomposed the control flow graph into such a block graph, from the nodes, you should be able to identity the dominators (i.e. blocks that have an arrow coming back to them). This will give you the start of the loop.

You may find something similar done in sections 4.3 to 4.7 of this dissertation.

EDIT:

Following the discussion with @Peter in comments to his answer. Talking the same example:

public int foo(int i, int j) {
    while (true) {
        try {
            while (i < j)
                i = j++ / i;
        } catch (RuntimeException re) {
            i = 10;
            continue;
        }
        break;
    }
    return j;
}

This time, compiled with the Eclipse compiler (no specific option: simply autocompilation from within the IDE). This code hasn't been obfuscated (apart from being bad code, but that's a different matter). Here is the result (from javap -c):

public int foo(int, int);
  Code:
   0:   goto    10
   3:   iload_2
   4:   iinc    2, 1
   7:   iload_1
   8:   idiv
   9:   istore_1
   10:  iload_1
   11:  iload_2
   12:  if_icmplt   3
   15:  goto    25
   18:  astore_3
   19:  bipush  10
   21:  istore_1
   22:  goto    10
   25:  iload_2
   26:  ireturn
  Exception table:
   from   to  target type
     0    15    18   Class java/lang/RuntimeException

There is a loop between 3 and 12 (jumped in starting a 10) and another loop, due to the exception occurring from the division by zero at 8 to 22. Unlike the javac compiler result, where one could make as guess that there was an outer loop between 0 and 22 and an inner loop between 0 and 12, the nesting is less obvious here.

EDIT 2:

To illustrate the kind of problems you may get with a less awkward example. Here is a relatively simple loop:

public void foo2() {
    for (int i = 0; i < 5; i++) {
        System.out.println(i);
    }
}

After (normal) compilation within Eclipse, javap -c gives this:

public void foo2();
  Code:
   0:   iconst_0
   1:   istore_1
   2:   goto    15
   5:   getstatic   #25; //Field java/lang/System.out:Ljava/io/PrintStream;
   8:   iload_1
   9:   invokevirtual   #31; //Method java/io/PrintStream.println:(I)V
   12:  iinc    1, 1
   15:  iload_1
   16:  iconst_5
   17:  if_icmplt   5
   20:  return

Before doing anything within the loop, you jump straight from 2 to 15. Block 15 to 17 is the header of the loop (the "entry point"). Sometimes, the header block could contain far more instructions, especially if the exit condition involves more evaluation, or if it's a do {} while() loop. The concept of "entry" and "exit" of a loop may not always reflect what you'd write sensibly as Java source code (including the fact that you can rewrite for loops as while loops, for example). Using break can also lead to multiple exit points.

By the way, by "block", I mean a sequence of bytecode into which you can't jump and out of which you can't jump in the middle: they're only entered from the first line (not necessarily from the previous line, possibly from a jump from somewhere else) and exited from the last (not necessarily to the following line, it can jump somewhere else too).

EDIT 3:

It seems that new classes/methods to analyse loops have been added since last time I had looked at Soot, which make it a bit more convenient.

Here is a complete example.

The class/method to analyse (TestLoop.foo())

public class TestLoop {
    public void foo() {
        for (int j = 0; j < 2; j++) {
            for (int i = 0; i < 5; i++) {
                System.out.println(i);
            }
        }
    }
}

When compiled by the Eclipse compiler, this produces this bytecode (javap -c):

public void foo();
  Code:
   0:   iconst_0
   1:   istore_1
   2:   goto    28
   5:   iconst_0
   6:   istore_2
   7:   goto    20
   10:  getstatic   #25; //Field java/lang/System.out:Ljava/io/PrintStream;
   13:  iload_2
   14:  invokevirtual   #31; //Method java/io/PrintStream.println:(I)V
   17:  iinc    2, 1
   20:  iload_2
   21:  iconst_5
   22:  if_icmplt   10
   25:  iinc    1, 1
   28:  iload_1
   29:  iconst_2
   30:  if_icmplt   5
   33:  return

Here is a program that loads the class (assuming it's on the classpath here) using Soot and displays its blocks and loops:

import soot.Body;
import soot.Scene;
import soot.SootClass;
import soot.SootMethod;
import soot.jimple.toolkits.annotation.logic.Loop;
import soot.toolkits.graph.Block;
import soot.toolkits.graph.BlockGraph;
import soot.toolkits.graph.ExceptionalBlockGraph;
import soot.toolkits.graph.LoopNestTree;

public class DisplayLoops {
    public static void main(String[] args) throws Exception {
        SootClass sootClass = Scene.v().loadClassAndSupport("TestLoop");
        sootClass.setApplicationClass();

        Body body = null;
        for (SootMethod method : sootClass.getMethods()) {
            if (method.getName().equals("foo")) {
                if (method.isConcrete()) {
                    body = method.retrieveActiveBody();
                    break;
                }
            }
        }

        System.out.println("**** Body ****");
        System.out.println(body);
        System.out.println();

        System.out.println("**** Blocks ****");
        BlockGraph blockGraph = new ExceptionalBlockGraph(body);
        for (Block block : blockGraph.getBlocks()) {
            System.out.println(block);
        }
        System.out.println();

        System.out.println("**** Loops ****");
        LoopNestTree loopNestTree = new LoopNestTree(body);
        for (Loop loop : loopNestTree) {
            System.out.println("Found a loop with head: " + loop.getHead());
        }
    }
}

Check the Soot documentation for more details on how to load classes. The Body is a model for the body of the loop, i.e. all the statements made from the bytecode. This uses the intermediate Jimple representation, which is equivalent to the bytecode, but easier to analyse and process.

Here is the output of this program:

Body:

    public void foo()
    {
        TestLoop r0;
        int i0, i1;
        java.io.PrintStream $r1;

        r0 := @this: TestLoop;
        i0 = 0;
        goto label3;

     label0:
        i1 = 0;
        goto label2;

     label1:
        $r1 = <java.lang.System: java.io.PrintStream out>;
        virtualinvoke $r1.<java.io.PrintStream: void println(int)>(i1);
        i1 = i1 + 1;

     label2:
        if i1 < 5 goto label1;

        i0 = i0 + 1;

     label3:
        if i0 < 2 goto label0;

        return;
    }

Blocks:

Block 0:
[preds: ] [succs: 5 ]
r0 := @this: TestLoop;
i0 = 0;
goto [?= (branch)];

Block 1:
[preds: 5 ] [succs: 3 ]
i1 = 0;
goto [?= (branch)];

Block 2:
[preds: 3 ] [succs: 3 ]
$r1 = <java.lang.System: java.io.PrintStream out>;
virtualinvoke $r1.<java.io.PrintStream: void println(int)>(i1);
i1 = i1 + 1;

Block 3:
[preds: 1 2 ] [succs: 4 2 ]
if i1 < 5 goto $r1 = <java.lang.System: java.io.PrintStream out>;

Block 4:
[preds: 3 ] [succs: 5 ]
i0 = i0 + 1;

Block 5:
[preds: 0 4 ] [succs: 6 1 ]
if i0 < 2 goto i1 = 0;

Block 6:
[preds: 5 ] [succs: ]
return;

Loops:

Found a loop with head: if i1 < 5 goto $r1 = <java.lang.System: java.io.PrintStream out>
Found a loop with head: if i0 < 2 goto i1 = 0

LoopNestTree uses LoopFinder, which uses an ExceptionalBlockGraph to build the list of blocks. The Loop class will give you the entry statement and the exit statements. You should then be able to add extra statements if you wish. Jimple is quite convenient for this (it's close enough to the bytecode, but has a slightly higher level so as not to deal with everything manually). You can then output your modified .class file if needed. (See the Soot documentation for this.)

Petro answered 22/7, 2011 at 15:47 Comment(9)
that does make it seems quite a challenge, however decompilers must already solve this, at least to some degree, no?Tips
Well, Soot is effectively a decompiler, or at least a framework to do this (it even has a decompiler front-end, Dava). There are tutorials for Soot here: sable.mcgill.ca/soot/tutorial/index.html. It also depends on what kind of instrumentation you're after. Java Decompilation cannot always be perfect. (Bytecode analysis is generally not necessarily easy anyway.)Petro
Is instrumenting the successor a feasible option when an exception is thrown from a loop?Gaskell
@biziclop: yes, it's possible, for example by writing a new block of instructions that catches the exception, does the instrumentation and then throws the exception again (in the same way a try/finally would work). This may lead to duplicate code (for the normal and the exceptional case): there are effectively two successors in this case.Petro
Fair point, although it does sound very complicated to implement correctly. (Considering there might already be a finally block which can end abruptly, and exceptions which are caught on the spot while others that are allowed to propagate further up.)Gaskell
@biziclop: when I did that work with AspectJ, more precisely the AspectBench Compiler (perhaps I should have made it clearer I wrote the dissertation I'm referring to), this was done by the rest of the weaver. You could choose an after, after throwing or after returning pointcut (AspectJ syntax) to choose when to use the aspect. That's something systematically done by the weaver anyway (including for more traditional pointcuts, e.g. call). Reasoning in terms of blocks (and via Jimple) also makes it a bit easier to implement.Petro
Keeping track of where the weaving points were (i.e. not inserting code within the loop) was one of the tricky parts (because as soon as you modify the bytecode, you can effectively modify the structure of the loop), even more so for building closures for around advice (if I remember well, I left the field a while back...). In general, instrumenting (or "weaving") areas/blocks of bytecode, rather than a single instruction is more difficult than spotting a single instruction a putting something quickly before and/or after.Petro
More specifically on the exceptions, the order in the exception table matters, so it's important to insert the entry at the right place when instrumenting. That's one of the reasons some of the blocks of bytecode may not always end up being organised as linearly as one may think.Petro
Just to note, this answer has been extremely helpful in providing some great references for Soot and analyzing the block graph. Thanks for such an in-depth, well-written answer.Kate
P
6

The only way to jump backward in the code is via a loop. So you are looking for a goto,if_icmplt etc which goes to a previous byte code instruction. Once you have found the end of the loop and where it jumps back to is the start of the loop.


Here is a complex example, from the document Bruno suggested.

public int foo(int i, int j) {
    while (true) {
        try {
            while (i < j)
                i = j++ / i;
        } catch (RuntimeException re) {
            i = 10;
            continue;
        }
        break;
    }
    return j;
}

The byte-code for this appears in javap -c as

public int foo(int, int);
  Code:
   0:   iload_1
   1:   iload_2
   2:   if_icmpge       15
   5:   iload_2
   6:   iinc    2, 1
   9:   iload_1
   10:  idiv
   11:  istore_1
   12:  goto    0
   15:  goto    25
   18:  astore_3
   19:  bipush  10
   21:  istore_1
   22:  goto    0
   25:  iload_2
   26:  ireturn
  Exception table:
   from   to  target type
     0    15    18   Class java/lang/RuntimeException

You can see there is an inner loop between 0 and 12, a try/catch block between 0 and 15 and an outer loop between 0 and 22.

Physics answered 22/7, 2011 at 15:38 Comment(12)
Agreed + note that there is sometimes a preamble of bytecodes setting up the loop vars.Periotic
Blocks of bytecode need not be organised "linearly", so it's not always obvious to know when you're going "backwards". You need to build a control-flow graph and do an analysis of the blocks.Petro
what about recursion? out if interest is this in your definition also a loop?Tips
@Bruno, the compiler does very few optimisations. This is does by the JIT. This makes translating the byte code to java code simpler than you might expect.Physics
IMHO, recursion != loops and they don't look anything like each other in the byte code.Physics
@Peter, it's not about the optimisations of the compiler or the JIT: it's about generalising this to any byecode. There can be valid bytecode with (possibly very unoptimised) chunks of bytecode that make it jump back and forth across the body. It also gets more complex with nested loops and blocks of code that can handle exceptions. Going "backwards" isn't as easy to detect as it seems if you're just given some valid bytecode.Petro
@Bruno, You can have valid byte code which is complete non-sense and cannot be decoded into Java. This is the sort of thing obfuscators do deliberately. In the most general terms its impossible to understand all possible valid byte code. I think it is reasonable to assume you want to decode the sort of byte code javac produces.Physics
@Peter, the question doesn't say anything about such assumption. You'll also have odd possible cases (depending on what the purpose of the instrumentation should be): there are simple, unusual but non-obfuscated examples in Listings 4.5 and 4.6 here for example.Petro
I can confirm that it will be targeting non-obfuscated examples. The goal is really to generate runtime visualizations from our own code, to help see what is this api or library or framework really doing? Static analysis cannot achieve this as you cannot be sure what is exercised at runtime, so we are striving to get a runtime picture. However, of course there is too much noise in the resulting diagrams caused by all sorts, from private method calls, to loops. So we are investigating ways to summarize the resulting analysis / visualizations. Thanks for your inputs.Tips
@Bruno, I tried to find the most complex example from that section and I have added it to my answer.Physics
@Peter, you're missing my point: the OP may end up with a compiler (possibly different from javac) that may want to do weird and wonderful things without any specific reason. I can't remember any specific example, but some cases happen without handcrafting the bytecode or any voluntary obfuscation. Try with a few nested loops and inner/outer break/continues from various conditions. Making assumptions like this when instrumenting bytecode is likely to make you miss certain loops. Usually, you instrument bytecode precisely to make sure you're not missing anything.Petro
@Peter, see my edited answer, with the output from Eclipse. Loop detection isn't as easy as it may seem.Petro
T
0

Are you actually building your class byte by byte? Thats pretty wild! The front page of ASM links to the Bytecode Outline plugin for Eclipse, which I assume you are using. If you click on the first image on there you will notice the code has a while loop, and you can see at least some of the byte code used to implement that loop. For reference here is that screenshot:

Bytecode Outline Screenshot

Direct link

Looks like loops are simply implemented as GOTO's with a boundary check. I'm talking about this line:

L2 (173)
  GOTO L3

I'm sure the L3 marker has code for checking the index bound and decided wether to JMP. I think this is going to be quite hard for you if you want to instrument loops one byte code at a time. ASM does have the option of using a template class as the basis for you instrumentation, have you tried using it?

Tonytonya answered 22/7, 2011 at 15:44 Comment(2)
I am not building byte by byte, rather I am instrumenting existing code, but want to add instrumentation to notify a profiler / visualizer that a loop is entering and exiting (with support for nesting / depth). I have looked at the byte code and recognize that I am looking for a goto with some guard condition. However, I was wondering if there was an existing pattern in ASM or existing de-compiler source code. I assumed it was something that had been done before and don't want to re-implement that part of the byte code parsing myself ( for identifying start of loop X, end of loop X ) etc.Tips
@Tips - ah ok that makes alot more sense. I don't see anything in the ASM docs that supports that exact scenario, you might have better luck with the open source decompilers.Tonytonya
B
0

I know this is an old question - however, there was specific interest in how this would be achievable with the ASM library, and this may be of use to future visitors. Bearing in mind the caveats other answers give warning against generalized assumptions related to the "goto" statement, there is a way to do that. (This assumes that any grouping of code within a given method that can "loop" should be detected - usually this is an actual loop construct, but other (rare, but present) examples have been provided of how this can occur.)

The main thing you'd need to do is keep track of the "labels" (locations in the byte code) that ASM visits prior to what it terms a "jump instruction" - if the label it jumps to has already been encountered in the context of the same method, then you have a potential for looping code.

A notable difference I saw between the answers here and how ASM behaved is that it read the "looping" jump commands for a simple file as opcodes other than "goto" - this may be just changes in Java compilation in the time since this was asked, but seemed worth noting.

The basic example code for ASM is this (this is entered via the checkForLoops method):

import org.objectweb.asm.ClassReader;
import org.objectweb.asm.ClassVisitor;
import org.objectweb.asm.Label;
import org.objectweb.asm.MethodVisitor;
import org.objectweb.asm.Opcodes;

public void checkForLoops(Path classFile) {
    LoopClassVisitor classVisitor = new LoopClassVisitor();

    try (InputStream inputStream = Files.newInputStream(classFile)) {
        ClassReader cr = new ClassReader(inputStream);

        cr.accept(classVisitor, 0);
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
}

public class LoopClassVisitor extends ClassVisitor {

    public LoopClassVisitor() {
        super(Opcodes.ASM7);
    }

    @Override
    public MethodVisitor visitMethod(int access, String name, String descriptor, String signature,
            String[] exceptions) {
        return new LoopMethodVisitor();
    }

}

public class LoopMethodVisitor extends MethodVisitor {

    private List<Label> visitedLabels;

    public LoopMethodVisitor() {
        super(Opcodes.ASM7);

        visitedLabels = new ArrayList<>();
    }

    @Override
    public void visitLineNumber(final int line, final Label start) {
        System.out.println("lnLabel: " + start.toString());

        visitedLabels.add(start);
    }

    @Override
    public void visitLabel(final Label label) {
        System.out.println("vLabel: " + label.toString());

        visitedLabels.add(label);
    }

    @Override
    public void visitJumpInsn(final int opcode, final Label label) {
        System.out.println("Label: " + label.toString());

        if (visitedLabels.contains(label)) {
            System.out.println("Op: " + opcode + ", GOTO to previous command - possible looped execution");
        }
    }

}

You could additionally attach line number information when available to the labels, and track that within the method visitor, to output where the detect loops start and end within source.

Banderole answered 21/11, 2019 at 7:30 Comment(1)
Something to note: if you are trying to detect loops to find code iterated over specifically, you'll also need to find places Java stream methods such as map and filter are used - for this, you'd want to look into the "invoke dynamic" method in method visitor, and the possible values of the "bootstrap method arguments"Banderole

© 2022 - 2024 — McMap. All rights reserved.