DivideByZeroException compiler check complexity: easier or harder in MSIL vs C# or no difference?

This is a question related to this fascinating question about detecting divide by zero exceptions at compile time.

From Eric Lippert's answer, this is non-trivial to achieve properly (which I suppose is why it's not provided already).

My question is:

Is the level of difficulty of doing these types of checks the same regardless of the "level" of the language e.g. higher level vs lower level?

Specifically, the C# compiler converts C# to MSIL. Would these types of checks be easier or harder at the MSIL level as part of some kind of second pass check?

Or, does the language itself make very little difference at all?

Reading the gotchas listed in Eric's answer, I would assume the checks would have to be the same in any language? For example, you can have jumps in lots of languages and would therefore need to implement the flow checking Eric describes...?

Just to keep this question specific, would this kind of check be easier or harder in MSIL than it is in C#?

Determinant answered 9/1, 2018 at 15:53 Comment(5)
Generally speaking, programming in low-level languages is harder than in high-level languages. I'm guessing this case is no exception.Justis
I gather programming is but I was wondering if analysing follows the same rule? Lower level languages have fewer constructs and fewer abstractions. Does that make analysing the possible outcomes easier, harder, or does it really make no difference? I was trying to visualise it. Divide ends up as two pushes and an op code, but what you are pushing could be anything. Does the problem become less complex with fewer possible commands, or again does it make no difference because the issue is actually about flow detection? I can't decide...Determinant
Analyzing higher concepts is easier than low level. Consider trying to determine the allergy risk of a cake given the recipe, as opposed to being given a finished cake without a recipe.Salyer
I appreciate the reply and get the analogy but am struggling to understand. In this scenario, you have two recipes (C# and MSIL). One has fewer instructions but the instructions are chunkier (excuse the non-technical term). The other has a larger number of more granular instructions. How does either case affect the difficulty of detecting whether two operands of one bit of the recipe are compatible?Determinant
@RaymondChen: Though I take your point, we can run your analogy to pump intuition in the other direction as well; determining the calorie content of a cake gives more accurate results if you put the finished cake into the calorimeter than if you put the cookbook into the calorimeter. :-) I've worked on static analyzers that do very well with bytecode languages because the abstractions of the programming language impede rather than enhance the ability to see the defect.Jacinda

This is a very interesting and deep question -- though perhaps one not well suited to this site.

The question, if I understand it, is what the impact is on the choice of language to analyze when doing static analysis in pursuit of defects; should an analyzer look at IL, or should it look at the source code? Note that I've broadened this question from the original narrow focus on divide-by-zero defects.

The answer is, of course: it depends. Both techniques are commonly used in the static analysis industry, and there are pros and cons of each. It depends on what defects you're looking for, what techniques you are using to prune false paths, suppress false positives and deduce defects, and how you intend to surface discovered defects to developers.

Analyzing bytecode has some clear benefits over source code. The chief one is: if you have a bytecode analyzer for Java bytecode, you can run Scala through it without ever writing a Scala analyzer. If you have an MSIL analyzer, you can run C# or VB or F# through it without writing analyzers for each language.

There are also benefits to analyzing code at the bytecode level. Analyzing control flow is very easy when you have bytecode because you can very quickly organize chunks of bytecode into "basic blocks"; a basic block is a region of code where there is no instruction which branches into its middle, and every normal exit from the block is at its bottom. (Exceptions can of course happen anywhere.) By breaking up bytecode into basic blocks we can compute a graph of blocks that branch to each other, and then summarize each block in terms of its action on local and global state. Bytecode is useful because it is an abstraction over code that shows at a lower level what is really happening.
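The basic-block partition described above can be sketched in a few lines. This is a minimal illustration over a hypothetical toy bytecode (opcode names loosely modeled on MSIL's `br`/`brtrue`, not a real MSIL decoder): a block "leader" is the first instruction, any branch target, or the instruction following a branch.

```python
def basic_blocks(code):
    """Split a list of (op, arg) instructions into basic blocks.

    Branch ops carry the index of their target as the arg; other
    ops carry an operand or None. Hypothetical toy encoding.
    """
    BRANCHES = {"br", "brtrue", "brfalse", "ret"}
    leaders = {0}                         # first instruction starts a block
    for i, (op, arg) in enumerate(code):
        if op in BRANCHES:
            if arg is not None:
                leaders.add(arg)          # a branch target starts a block
            if i + 1 < len(code):
                leaders.add(i + 1)        # the fall-through after a branch too
    starts = sorted(leaders)
    return [code[s:starts[j + 1]] if j + 1 < len(starts) else code[s:]
            for j, s in enumerate(starts)]

# Toy program: if (x) goto 4; push 1; div; goto 5; push 0; ret
program = [
    ("brtrue", 4),   # 0: conditional branch to index 4
    ("push", 1),     # 1
    ("div", None),   # 2
    ("br", 5),       # 3: unconditional jump over the other arm
    ("push", 0),     # 4: branch target, so a new block starts here
    ("ret", None),   # 5
]

blocks = basic_blocks(program)
print(len(blocks))  # → 4 (leaders at indices 0, 1, 4, 5)
```

Once the blocks and the edges between them are in hand, each block can be summarized independently, which is exactly what makes flow analysis over bytecode cheap.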

That of course is also its major shortcoming; bytecode loses information about the intentions of the developer. Any defect checker which requires information from source code in order to detect the defect or prevent a false positive is going to give poor results when run on bytecode. Consider for example a C program:

#define DOBAR if(foo)bar();
...
if (blah)
  DOBAR
else
  baz();

If this horrid code were lowered to machine code or bytecode then all we would see is a bunch of branch instructions, and we'd have no idea that we ought to be reporting a defect here: the else binds to the if(foo), not to the if(blah) as the developer intended.

The dangers of the C preprocessor are well known. But there are also great difficulties imposed when doing analysis of complex lowered code at the bytecode level. For example, consider something like C#:

async Task Foo(Something x) 
{
  if (x == null) return;
  await x.Bar();
  await x.Blah();
}

Plainly x cannot be null when it is dereferenced here. But C# is going to lower this to some absolutely crazy code; part of that code is going to look something like this:

int state = 0;
Action doit = () => {
  switch(state) {
    case 0: 
      if (x == null) {
        state = -1;
        return;
      }
      state = 1;
      goto case 1;
    case 1:
      Task bar = x.Bar();
      state = 2;
      if (<bar is a completed task>) {
        goto case 2;
      } else {
        <assign doit as the completion of bar>
        return;
      }
    case 2:

And so on. (Except that it is much, much more complicated than that.) This will then be lowered into even more abstract bytecode; imagine trying to understand this code at the level of switches being lowered to gotos and delegates lowered into closures.

A static analyzer analyzing the equivalent bytecode would be perfectly within its rights to say "plainly x can be null because we check for it on one branch of the switch; this is evidence that x must be checked for nullity on other branches, and it is not, therefore I will give a null dereference defect on the other branches".

But that would be a false positive. We know something that the static analyzer might not, namely, that the zero state must execute before every other state, and that when the coroutine is resumed x will always have been checked for null already. That's apparent from the original source code but would be very difficult to tease out from the bytecode.

What then do you do, if you wish to get the benefits of bytecode analysis without the drawbacks? There are a variety of techniques; for example, you could write your own intermediate language that was higher level than bytecode -- that has high-level constructs like "yield" or "await", or "for loop" -- write an analyzer that analyzes that intermediate language, and then write compilers that compile each target language -- C#, Java, whatever -- into your intermediate language. That means writing a lot of compilers, but only one analyzer, and maybe writing the analyzer is the hard part.
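As a toy illustration of that last idea, here is a hypothetical mini-IR in Python that keeps "if" and "await" as first-class nodes (every name here is invented for illustration; this is not Infer's or any real analyzer's IR). Because the early-out guard survives in the IR, a simple null-dereference walk sees what the lowered state machine obscures:

```python
from dataclasses import dataclass, field

# High-level IR nodes: the lowering to a state machine never happens here.
@dataclass
class If:
    cond: str
    then: list = field(default_factory=list)

@dataclass
class Await:
    expr: str

@dataclass
class Return:
    pass

# IR for: if (x == null) return; await x.Bar(); await x.Blah();
foo = [
    If("x == null", [Return()]),
    Await("x.Bar()"),
    Await("x.Blah()"),
]

def null_derefs(body):
    """Flag dereferences of x that are not dominated by an early-out null check."""
    checked = False
    defects = []
    for node in body:
        if (isinstance(node, If) and node.cond == "x == null"
                and any(isinstance(n, Return) for n in node.then)):
            checked = True            # guard: x is known non-null afterwards
        elif isinstance(node, Await) and node.expr.startswith("x.") and not checked:
            defects.append(node.expr)
    return defects

print(null_derefs(foo))  # → [] : the guard is visible, so no false positive
```

A bytecode-level analyzer looking at the lowered switch would have to reconstruct the fact that state 0 always runs first; here the source-order dominance is free, which is the trade the shared-IR approach is buying.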

That was a very brief discussion, I know. It's a complex subject.

If the design of static analyzers on bytecode interests you, consider looking into the design of Infer, an open-source static analyzer for Java and other languages which turns Java bytecode into an even lower-level bytecode suitable for analysis of heap properties; read up on separation logic for inference of heap properties first. https://github.com/facebook/infer

Jacinda answered 9/1, 2018 at 18:48 Comment(1)
Thank you for taking the time to reply. I'm reading various books on C# internals and I've been looking at the MSIL generated from my own code. The obvious question when learning new concepts is how I can use what I've learned to bring benefit to my team. Defect detection has come up often. We use the JetBrains tools but haven't touched Roslyn yet. It occurred to me that we have always had the C# and the MSIL, so what power do we have that we are not using? You have given me a lot to think about. The link is especially appreciated as a research path. Much more reading to do.Determinant