Is finding pointers in C/C++ code statically equivalent to the Halting Ρroblem?
Asked Answered
M

4

7

I'm not too deeply rooted in the very formal side of static code analysis, hence this question.

A couple of years ago I read that distinguishing code from data using static code analysis is equivalent to the Halting Problem. (Citation needed, but I don't have it anymore. Stackoverflow has threads on this here or here.) At least for common computer architectures based on the Von Neumann architecture where code and data share the same memory this seemed to make sense.

Now I'm looking at the static analysis of C/C++ code and pointer analysis; the program does not execute. Somehow I have a feeling that tracking all creations and uses of pointer values statically is similar to the Halting Problem because I can not determine if a given value in memory is a pointer value, i.e. I can not track the value-flow of pointer values through memory. Alias analysis may narrow down the problem, but it seems to become less useful in the face of multi-threaded code.

(One might even consider tracking arbitrary values, not just pointers: constructing a complete value-flow for any given "interesting" value seems equivalent to the Halting Problem.)

As this is just a hunch, my question is: are the more formal findings on this that I can refer to? Am I mistaken?

Marolda answered 30/12, 2013 at 22:53 Comment(6)
@Dukeling: Well, static analysis seems to imply that the program is not running. I think that the OP is trying to analyze the program looking at the C/C++ source code. Or maybe the generated assembler?Depth
How to ask about the 'halting problem' on StackOverflow, regarding the title filter - no useful answers there though :(Freida
The premise of this question is trumped by the intent. You use static analysis to discover problems in code, undefined behavior in particular. So your starting point is assuming that you have a mal-formed program and you cannot make any assumptions that language rules apply. You cannot formulate a theory on a foundation without rules.Premonish
@HansPassant: not sure I grok what you mean?Prudence
@HansPassant: I think that's irrelevant. Even if I assume a correct program, I still seem to be unable to track all pointer values conclusively.Marolda
@HansPassant Your remark containing the word “assuming” is easily circumvented by limiting oneself to static analyzers that discover the first Undefined Behavior in an execution of the C program on the “abstract machine” that is supposed to be defined by the C standard. You are right that predicting anything after the first UB is unreasonable. There are two articles that cover this in some detail in the proceedings of compare2012.verifythis.org . You should recognize them easily if interested (or just look for the article of which I am a co-author).Wallflower
P
3

You can always code up this:

extern bool some_program_halts();
extern int* invalid_pointer();

#include <iostream>
int main()
{
    using namespace std;
    if( some_program_halts() ) { cout << *invalid_pointer() << endl; }
}

Checking whether this program dereferences the invalid pointer is equivalent to finding out whether the call to some_program_halts(), uh, halts.

Prudence answered 30/12, 2013 at 23:11 Comment(5)
Oh, by the way, just because!, a link to my "Ironclad proof that you are smarter than Roger Penrose". He he. :)Prudence
Agreed. Now let's assume a closed world (where we do have access to all code, and no code is added at runtime. :)Marolda
@Jens: I'm sorry, I don't understand.Prudence
The above example calls out to two extern functions, i.e. functions whose behavior can not be examined because their definition is missing. That's an open world scenario. In a closed world scenario, all possible (i.e. called/referenced) functions are defined and available to the analysis.Marolda
Oh I didn't mean to imply any missing definitions. From a C++ point of view the keyword extern is technically redundant in the above code. You can just remove it if it seems to indicate "missing". ;-)Prudence
U
1

It's almost certainly equivalent, modulo the fact that C is not a turing-equivalent language (a given C implementation is a gigantic finite state machine rather than a turing machine, due to the Representation of Types). Pointers need not be kept in their original representations in objects whose effective type is pointer type; you can examine the representation and perform arbitrary operations on it, for example, encrypting pointers and decrypting them later. Determining whether an arbitrary computation is reversible, or whether two computations are inverses of one another, is (offhand) probably equivalent to determining halting.

Urana answered 30/12, 2013 at 23:1 Comment(15)
re " C is not a turing-equivalent language", if you mean "turing complete" then that's wrong. the core language does not define any i/o facilities or means to interface with hardware or anything, so as a statement about the core language it's literally true, but totally irrelevant. once you add i/o, however, you get infinite memory. and that observation was made long before Turing's time, even by George Boole.Prudence
C does not define IO facilities that allow addressing unlimited storage; they may exist as an implementation extension, but not something usable by a strictly conforming program.Urana
@R: Oh, the old overwhelm-with-assertions thing. but ok. the first part of the sentence, "C does not define IO facilities that allow addressing unlimited storage" is trivially wrong. Anyone can write a program that accesses unlimited storage. Given such an infinity of counter-examples it's hard to imagine how you thought anyone would believe the assertion. Re "they may exist as an implementation extension", certainly, that also, but the standard library already covers it. "but not something usable by a strictly conforming program" is incorrect. in short, wrong, irrelevant, wrong.Prudence
POSIX claims not to conflict with ISO C, and requires fseek to fail with EOVERFLOW if the resulting offset is not representable in type long. Assuming this behavior conforms with ISO C (you may dispute that), that destroys one of your best chances of achieving unlimited storage, though perhaps you could argue that it's recoverable by using rewind and repeated getc to position yourself. However even then, the number of times you call getc is limited by the largest number you can otherwise represent; you could apply the principle of using files to store "unlimited size" data...Urana
...recursively, but eventually you hit the fact that the number of recursion levels you can have is limited by CHAR_BIT*sizeof(void *) since each object has a unique address and addresses have representations per Representation of Types. The size of the state machine is probably a lot larger than the naive 2^BITS_OF_RAM guess, but I claim it's still ultimately finite, and the finitude derives fully from Representation of Types.Urana
@R: the old "proof that doors can not open by example of one's inability to imagine how to open a door". well. i don't know what to say, it's so ludicrous. maybe ... you're not aware that the standard library provides inifnite streams? in C they're called stdin, stdout and stderr.Prudence
@Cheersandhth.-Alf: They are not infinite streams. They're arbitrary stdio FILE streams that, depending on the implementation, may be attached to the same sort of seekable files you obtain with fopen, interactive devices, or something else. In the cases where they're associated with a terminal or pipe, they are entirely one-way and provide no storage at all since the program can never read back what it wrote.Urana
Your assertions do not stop any novice from writing a standard C program that accesses infinite storage. In your previous response you committed the fallacy of "example of doing something else than X, hence X is impossible" (not sure what it's called or if it was used enough in the old days to have a name). Now you have repeated that fallacy. Can't you turn to new fallacies. Please.Prudence
@Cheersandhth.-Alf Infinite streams stdin and stdout make a terrible offhand example of an infinite ribbon, since a Turing Machine infinite ribbon is something that one can write to and read back the same thing from. One can, however, interface a C program with an infinite ribbon. Such a composite C program + ribbon would be an appropriate Turing Machine. This is not a very insightful statement to make, however, as a finite automaton interfaced with an infinite ribbon is a Turing Machine too. This does not tell us that a C program is more than a finite automaton.Wallflower
@PascalCuoq: assessment of turing completeness always assumes "can have" about the infinite part. and so your insight that a turing machine is a finite automaton with infinite memory, while a C program is a finite automation which can have infinite memory, is, perhaps unintended, very much void of meaning, nonsense. did you mean to write something that you forgot to write?Prudence
@PascalCuoq: sorry, your statement "a finite automaton interfaced with an infinite ribbon is a Turing Machine too" is incorrect if it is assumed to be meaningful (talking about turing completeness). only a small subset of possible finite automatons are turing complete. the smallest known such is very small indeed, with just a few states, but as I recall out of the 256 machines that could be generated with that particular pattern, only 2 were turing completePrudence
@Cheersandhth.-Alf My statement that "a finite automaton interfaced with an infinite ribbon is a Turing Machine too" was an informal allusion to the definition of a Turing Machine as a finite action table together with an infinite ribbon. Any action table can be part of a Turing Machine. I am sorry if it hurts your intuition that not all thus defined Turing Machines are Turing-Complete, but I did not choose the definitions.Wallflower
@PascalCuoq: you are misleading readers with your statement "I am sorry if it hurts your intuition that not all thus defined Turing Machines are Turing-Complete". Contrary to what you indicate, I have stated that "only a small subset of possible finite automatons are turing complete", in response to your indication that it was not. In addition to misleading readers you are effectively drowning things in noise. Following on the heels of a long sequence of fallacies and evasions by the "R" nick, what am I to conclude from your misleading no sense statements?Prudence
-1 For trivially incorrect statement that "C is not a turing-equivalent language", and severe unwillingness to correct, including a stream of evasions and fallaciesPrudence
@Cheersandhth.-Alf: I did not downvote your answer; I didn't even vote on it at all, as it's rather orthogonal to the discussion we've had here.Urana
M
0

If I understood you correctly: yes, checking whether a C or C++ program accesses an invalid pointer is equivalent to the halting problem (of a C or C++ program, in any case).

Suppose you had a tool that told you whether a program accessed an invalid pointer, and a program you wanted to check for halting. By adding extra information to each pointer you can make it checkable (at runtime) whether the pointer is valid or not; add such checks, with an infinite loop on failure. You now have a program with no invalid pointer accesses. By replacing all places the program can terminate with an invalid pointer access you get a program which has an invalid pointer access if and only if the original program terminates.

Minter answered 30/12, 2013 at 23:5 Comment(4)
I don't want to check if a program accesses an invalid pointer, I want to track the flow of a pointer value. It seems to me that tracking that flow from where a pointer value is created to all of its possible uses is equivalent to the Halting Problem because I can not track values statically through memory.Marolda
@Jens: Would this include finding all values? If so, then a similar argument works: assuming you can find all uses of each pointer in your program, add a new pointer, and make it used at every exit. If your detector finds the pointer has been used, your program terminates, otherwise it doesn't.Minter
Not all values, just select ones. But for those select ones I would want to find all uses. And that is hard considering that the value-flow of values in C/C++ is statically not completely trackable; I can only ever approximate.Marolda
Okay, well, the algorithm I described above would work for checking for halting, so yes, your problem is equivalent to the halting problem.Minter
C
0

Static analysis is almost always an approximation, often provable by reduction to the halting problem with programs like the one in Alf's answer. However, the approximation can err on the side of either false positives or false negatives.

  • A "conservative" static check will only have false negatives. It will never accept a "bad" program, but it will inevitably reject some "sufficiently complicated" good programs.
  • A "liberal" static check will have false positives. Sometimes it accepts a bad program by mistake but (generally) it will also accept all good programs.

Some examples:

  • Java's type system is conservative: a variable with a type T will always contain an instance of type T (or a subtype of T or null) at runtime no matter what.
  • GCC's option to warn about uninitialized variables is liberal: it doesn't find all potential uses of an uninitialized variable. Here's an example of a false positive program.
  • In contrast, Java does a conservative uninitialized variable check for local variables. It refuses to compile the program if it sees any potential execution path using a potentially uninitialized variable.

Liberal checks are often used by compilers to emit warnings and by external static analysis tools. Things like type systems and compiler optimizations tend to rely on conservative checks to be correct.

Many tasks have several reasonable conservative and liberal algorithms of varying accuracy. Alias analysis is certainly one of these.

For more information, see any good compiler textbook, such as the dragon book.

Captious answered 31/12, 2013 at 22:47 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.