This is a very interesting and deep question -- though perhaps one not well suited to this site.
The question, if I understand it, is what impact the choice of language to analyze has when doing static analysis in pursuit of defects: should an analyzer look at the IL, or should it look at the source code? Note that I've broadened this question from the original narrow focus on divide-by-zero defects.
The answer is, of course: it depends. Both techniques are commonly used in the static analysis industry, and there are pros and cons of each. It depends on what defects you're looking for, what techniques you are using to prune false paths, suppress false positives and deduce defects, and how you intend to surface discovered defects to developers.
Analyzing bytecode has some clear benefits over analyzing source code. The chief one is: if you have an analyzer for Java bytecode, you can run Scala through it without ever writing a Scala analyzer. If you have an MSIL analyzer, you can run C# or VB or F# through it without writing analyzers for each language.
There are technical benefits as well. Analyzing control flow is easy when you have bytecode because you can quickly organize chunks of bytecode into "basic blocks": a basic block is a region of code that no instruction branches into except at its top, and whose every normal exit is at its bottom. (Exceptions, of course, can happen anywhere.) By breaking bytecode up into basic blocks we can compute a graph of blocks that branch to one another, and then summarize each block in terms of its effect on local and global state. Bytecode is useful because it is an abstraction over the code that shows at a lower level what is really happening.
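As a concrete illustration, here is a minimal sketch of that "leader-finding" step over a toy instruction model in C#; the Insn record and its index-based branch targets are invented for illustration and are much simpler than real JVM or MSIL opcodes:

using System;
using System.Collections.Generic;
using System.Linq;

// Toy instruction model: Target is the index of a branch destination,
// or null for a non-branch instruction. Returns and conditional vs.
// unconditional branches are deliberately glossed over.
record Insn(string Op, int? Target = null);

static class Cfg
{
    // "Leaders" begin basic blocks: the entry point, every branch
    // target, and every instruction that follows a branch.
    public static List<List<Insn>> BasicBlocks(IReadOnlyList<Insn> code)
    {
        var leaders = new SortedSet<int> { 0 };
        for (int i = 0; i < code.Count; i++)
        {
            if (code[i].Target is int t)
            {
                leaders.Add(t);
                if (i + 1 < code.Count) leaders.Add(i + 1);
            }
        }
        // Each block runs from one leader up to, but not including, the next.
        var starts = leaders.ToList();
        var blocks = new List<List<Insn>>();
        for (int b = 0; b < starts.Count; b++)
        {
            int end = b + 1 < starts.Count ? starts[b + 1] : code.Count;
            blocks.Add(code.Skip(starts[b]).Take(end - starts[b]).ToList());
        }
        return blocks;
    }
}

Slicing at the leaders gives the blocks; adding branch and fall-through edges between them yields the control-flow graph that the per-block summaries are computed over.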
That abstraction, of course, is also bytecode's major shortcoming: bytecode loses information about the intentions of the developer. Any defect checker which requires information from source code in order to detect the defect or prevent a false positive is going to give poor results when run on bytecode. Consider for example a C program:
#define DOBAR if(foo)bar();
...
if (blah)
    DOBAR
else
    baz();
If this horrid code were lowered to machine code or bytecode then all we would see is a bunch of branch instructions; we'd have no idea that we ought to be reporting a defect here, namely that the else binds to the if(foo) and not to the if(blah) as the developer intended.
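For concreteness, here is what the compiler sees once the preprocessor has expanded DOBAR; the indentation shows where the else actually binds, per C's dangling-else rule:

if (blah)
    if (foo)
        bar();
    else        /* binds to the nearest if, i.e. if(foo) */
        baz();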
The dangers of the C preprocessor are well known. But great difficulties also arise when analyzing complex lowered code at the bytecode level. For example, consider this C#:
async Task Foo(Something x)
{
    if (x == null) return;
    await x.Bar();
    await x.Blah();
}
Plainly x cannot be null at the point where it is dereferenced here. But C# is going to lower this to some absolutely crazy code; part of that code is going to look something like this:
int state = 0;
Action doit = () => {
    switch(state) {
        case 0:
            if (x == null) {
                state = -1;
                return;
            }
            state = 1;
            goto case 1;
        case 1:
            Task bar = x.Bar();
            state = 2;
            if (<bar is a completed task>) {
                goto case 2;
            } else {
                <assign doit as the completion of bar>
                return;
            }
        case 2:
And so on. (Except that it is much, much more complicated than that.) This will then be lowered further into bytecode; imagine trying to understand this code at the level where the switch has become gotos and the lambda has become a closure class.
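To get a feel for it, here is one plausible, hand-written sketch of just the state dispatch after further lowering; this is invented for illustration, not actual compiler output, and a real compiler would typically emit a bytecode jump table here:

// Hypothetical further lowering of the dispatch; a sketch, not compiler output.
if (state == 0) goto STATE0;
if (state == 1) goto STATE1;
if (state == 2) goto STATE2;
STATE0:
    if (x == null) { state = -1; return; }
    state = 1;
    // falls through, as "goto case 1" did above
STATE1:
    Task bar = x.Bar();
    state = 2;
    // and so on, exactly as in the switch above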
A static analyzer analyzing the equivalent bytecode would be perfectly within its rights to say "plainly x can be null because we check for it on one branch of the switch; this is evidence that x must be checked for nullity on the other branches, and it is not; therefore I will report a null dereference defect on those branches".
But that would be a false positive. We know something that the static analyzer might not, namely, that the zero state must execute before every other state, and that when the coroutine is resumed x will always have been checked for null already. That's apparent from the original source code but would be very difficult to tease out from the bytecode.
What then do you do if you wish to get the benefits of bytecode analysis without the drawbacks? There are a variety of techniques; for example, you could write your own intermediate language that is higher-level than bytecode -- one that has high-level constructs like "yield", "await", or "for loop" -- write an analyzer that analyzes that intermediate language, and then write compilers that compile each target language -- C#, Java, whatever -- into your intermediate language. That means writing a lot of compilers, but only one analyzer, and maybe writing the analyzer is the hard part.
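As a minimal sketch of what such an intermediate language might look like as data, here is one possible shape in C#; every type name here is invented for illustration, and a real IR would need far more. The point is that a construct like await survives as a first-class node instead of being smeared across a lowered state machine:

using System.Collections.Generic;

// Hypothetical high-level IR nodes; the names are invented for illustration.
abstract record Node;
record Block(IReadOnlyList<Node> Body) : Node;
record If(Node Condition, Node Then, Node? Else = null) : Node;
record Await(Node Operand) : Node;   // still an await, not a switch/goto machine
record ForLoop(Node Init, Node Cond, Node Step, Node Body) : Node;
record Call(Node Receiver, string Method) : Node;
record IsNull(Node Operand) : Node;
record Local(string Name) : Node;
record Return() : Node;

static class Example
{
    // The Foo method from above, encoded in the IR: the null check still
    // visibly dominates both awaits, so no state-machine reasoning is needed.
    public static readonly Node Foo = new Block(new Node[] {
        new If(new IsNull(new Local("x")), new Return()),
        new Await(new Call(new Local("x"), "Bar")),
        new Await(new Call(new Local("x"), "Blah")),
    });
}

Each front-end compiler maps its surface language onto nodes like these, and the single analyzer walks the resulting trees, so the false positive described above never arises.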
That was a very brief discussion, I know. It's a complex subject.
If the design of static analyzers on bytecode interests you, consider looking into the design of Infer, an open-source static analyzer for Java and other languages; it turns Java bytecode into an even lower-level intermediate language suitable for analyzing heap properties. Read up on separation logic first, since Infer uses it to infer those heap properties. https://github.com/facebook/infer