Limits of Klee (the LLVM program analysis tool)

Asked 21/4, 2011 at 10:11 Answered 27/6, 2023 at 19:38

Solved code-analysis llvm verification klee

http://klee.llvm.org/ is a program analysis tool that works by symbolic execution and constraint solving, finding possible inputs that will cause a program to crash, and outputting these as test cases. It's an extremely impressive piece of engineering that has produced some good results so far, including finding a number of bugs in a collection of open-source implementations of Unix utilities that had been considered among some of the most thoroughly tested software ever written.

My question is: what does it not do?

Of course, any such tool has the inherent limit that it can't read the user's mind and guess what the output was supposed to be. But leaving aside the in principle impossible, most projects don't yet seem to be using Klee; what are the limitations of the current version, what sort of bugs and workloads can it as yet not handle?

Zuniga answered 21/4, 2011 at 10:11 Comment(0)

As I can say after reading a http://llvm.org/pubs/2008-12-OSDI-KLEE.html

It can't check all possible paths of big program. It failed even on sort utility. The problem is a halting problem (Undecidable problem), and KLEE is a heuristic, so it is able to check only some of paths in limited time.

It can't work fast, according to paper, it needed 89-hours to generate tests for 141000 lines of code in COREUTILS (with libc code used in them). And the largest single program has only ~10000 lines.

It knows nothing about floating point, longjmp/setjmp, threads, asm; memory objects of variable size.

Update: /from llvm blog/ http://blog.llvm.org/2011/05/what-every-c-programmer-should-know_14.html

5 . The LLVM "Klee" Subproject uses symbolic analysis to "try every possible path" through a piece of code to find bugs in the code and it produces a testcase. It is a great little project that is mostly limited by not being practical to run on large-scale applications.

Update2: KLEE requieres program to be modified. http://t1.minormatter.com/~ddunbar/klee-doxygen/overview.html

. Symbolic memory is defined by inserting special calls to KLEE (namely klee_make_symbolic) During execution, KLEE tracks all uses of symbolic memory.

Cyan answered 21/4, 2011 at 14:28 Comment(4)

Dunno what your definition of "fast" is. I sure can't code tests for 141,000 lines of code in 89 hours. (I sure can turn on a machine Monday at 9 am and ignore it until Friday at 9 am if that will get the tests I can't code :) – Myca 21/4, 2011 at 22:48

I mean, that KLEE can require up to several month and greater to achieve good coverage for big project. – Cyan 22/4, 2011 at 0:35

So? Cycles and electricity are cheap. My personal time is not, and my skills at doing this reliably are not good. "Cycles are an engineers' best friend". – Myca 22/4, 2011 at 1:13

Klee does not strictly require programs to be modified in all cases. It can treat command line arguments, files and stdin as symbolic for its execution. – Calchas 29/3, 2012 at 18:9

Overall, KLEE should be a pretty good symbolic execution engine for academic research. For being used in practice, it could be limited by following aspects:

[1] The memory model used by the LLVM IR interpreter in KLEE is memory-consuming and time-consuming. For each execution path, KLEE maintains a private "address space". The address space maintains a "stack" for local variables and a "heap" for global variables and dynamically allocated variables. However, each variable (local or global) is wrapped in a MemoryObject object (MemoryObject maintain metadata related to this variable, such as the starting address, size, and allocation information). The size of memory used for each variable would be the size of the original variable plus the size of MemoryObject object. When an variable is to be accessed, KLEE firstly search the "address space" to locate the corresponding MemoryObject. KLEE would check the MemoryObject and see if the access is legitimate. If so, the access will be completed and state of the MemoryObject will be updated. Such memory access is obviously slower than RAM. Such a design can easily support the propagation of symbolic values. However, this model could be simplified via learning from taint analysis (labeling the symbolic status of variables).

[2] KLEE lacks models for system environments. The only system component modeled in KLEE is a simple file system. Others, such as sockets or multi-threading, are not supported. When a program invoke system calls to these non-modeled components, KLEE either (1) terminate the execution and raises an alert or (2) redirect the call to the underlying OS (Problems: symbolic values need to be concretized and some paths would be missed; system calls from different execution paths would interfere with each other). I suppose this is the reason for "it knowing nothing threads" as mentioned above.

[3] KLEE cannot directly work on binaries. KLEE requires LLVM IR of a to-be-tested program. While other Symbolic Execution tools, such as S2E and VINE from the BitBlaze project can work on binaries. An interesting thing is the S2E project relies on KLEE as its symbolic execution engine.

Regarding the above answer, I personally have different opinions. First, it's true that KLEE cannot perfectly work with large-scope programs, but which symbolic execution tool can? Path explosion is more a theoretical open problem instead of an engineering problem. Second, as I mentioned, KLEE might run slowly due to its memory model. However, KLEE does not slow down the execution for nothing. It conservatively checks memory corruptions (such as buffer overflow) and will log a set of useful information for each executed path (such as constraints on the inputs to follow a path). In addition, I did not know other symbolic execution tools that can run super fast. Third, "floating point, longjmp/setjmp, threads, asm; memory objects of variable size" are just engineering works. For example, the author of KLEE actually did something to enable KLEE to support floating point (http://srg.doc.ic.ac.uk/files/papers/kleefp-eurosys-11.pdf). Third, KLEE does not necessarily require instrumentation over the program to label symbolic variables. As mentioned above, symbolic values can be feed into the program via command lines. In fact, users can also specify files to be symbolic. If required, users can simply instrument the library functions to make inputs symbolic (once for all).

Tinned answered 30/9, 2015 at 15:40 Comment(0)

I'm disappointed by Klee that it doesn't have any builtin data data structure support. Klee itself doesn't understand what a linked list is. You cannot use it to verify any graph algorithm's implementation. Also, it cannot help catch interger overflow errors which is the most significant source of security patches.

Khedive answered 27/6, 2023 at 19:38 Comment(1)

I think it should be possible to check for signed or unsigned integer overflow by giving -fsanitize=[un]signed-integer-overflow to clang before running klee. – Caricaria 15/11, 2023 at 1:46

Recommended topics

Hot tags