"Finding all the code in a given binary is equivalent to the Halting problem." Really?
I was just reading the highly voted question regarding emulators, and this statement

It's been proven that finding all the code in a given binary is equivalent to the Halting problem.

really stuck out at me.

Surely that can't be true? Isn't it just a large dependency graph?

I would be really grateful for some further insight into this statement.

Moonier answered 14/3, 2011 at 13:59 Comment(4)
What do you mean by "finding the code"? Reverse-engineering, or something else? – Wynellwynn
My understanding is that they mean finding the entire chain of code, including dependencies. Look for the line with that text in the selected answer to see the context. – Moonier
Should you ask this at Theoretical CS? – Raab
See my answer. Finding all the code in a program is trivial as long as all branches have fixed target addresses. Function pointers, computed gotos, and self-modifying code are the possible complications. – Goglet

I believe what is meant is "finding all code that is ever executed", i.e. finding coverage, possibly in combination with dynamically generated code. That can indeed be reduced to the halting problem.

Say that you have a perfect coverage tool that will find every piece of code in a program that may ever be executed (so the rest is dead code). Given a program P, this tool would also be able to decide whether the extended program (P ; halt) ever executes the halt instruction, or whether the halt part is dead code. So, it would solve the halting problem, which we know is undecidable.
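The shape of this reduction can be sketched in code. The sketch below is a toy, not a real construction: `covers` is a stand-in "perfect coverage oracle" that only works for trivial bounded programs invented here (lists of `('op', target)` tuples); no such oracle can exist for arbitrary programs, which is exactly the point.

```python
def covers(program, line):
    """Toy 'perfect coverage' oracle: True iff `line` is ever executed.
    Only decidable here because these toy programs are step-bounded."""
    pc, visited, steps = 0, set(), 0
    while 0 <= pc < len(program) and steps < 1000:
        visited.add(pc)
        op, target = program[pc]
        pc = target if op == 'jmp' else pc + 1
        steps += 1
    return line in visited

def halts(program):
    """The reduction: append a halt instruction (P ; halt) and ask the
    oracle whether that appended line is ever reached."""
    extended = program + [('halt', None)]
    return covers(extended, len(program))

straight = [('nop', None), ('nop', None)]   # falls through to the halt
looping  = [('jmp', 0)]                     # spins forever at line 0
print(halts(straight))  # True
print(halts(looping))   # False
```

A genuine `covers` for unbounded programs would decide halting this way, which is why it cannot exist.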

Freyah answered 14/3, 2011 at 14:39 Comment(1)
Having spent some time thinking about your argument, I am not sure I am convinced. As suggested in the answer below, we are not trying to decide whether the program will halt (although I see the parallels in this problem). We are trying to find all dependencies for a given program. More fundamentally, aren't all dependencies encoded inside the application as metadata? (I guess not, because you can load them at runtime – but then the dependency would be stored in a constant or variable at some point.) Hmm. I should probably work out how to turn this into a wiki. – Moonier

I disagree with larsman.

The halting problem says that no program P exists that can take any program and decide whether that program executes the halt instruction. Let me quote wikipedia:

Alan Turing proved in 1936 that a general algorithm to solve the halting problem for all possible program-input pairs cannot exist. We say that the halting problem is undecidable over Turing machines.

On the other hand, we're not trying to make such a program/algorithm; we're trying to find all the code in this specific program (or these specific programs). If we reverse-engineer the program and see that it immediately calls exit() (a very optimistic example situation), we have proven that it will halt – even though that was supposedly impossible?!

If we were trying to build an emulator that can run any program, we would fail, since that can (easily) be reduced to the Halting problem. But usually you are building an emulator for something like a Game Boy, which supports a finite number of game cartridges (programs), and thus it is possible.

Wynellwynn answered 14/3, 2011 at 15:55 Comment(4)
Is that meant as "WOOOW" or "did I waste my time on this guy?!"? – Wynellwynn
No, I meant it as a compliment, and a realization of how stupid I am. :-( lol – Moonier
"If we reverse-engineer the program and see that it immediately calls exit() (very optimistic example situation) we have proven that it will call halt, while it was impossible?!" The halting problem does not imply that this is impossible for every case, only that there are some cases for which it is impossible. – Freyah
As for the "finite amount of memory" argument: you're right, since a Game Boy (or even a PC) is formally a finite-state machine. The number of states, however, is two to the power of [the amount of RAM in the machine]. So it is still intractable, as is common with the finite cases of undecidable problems. – Freyah

The halting problem for finite state machines is solvable (although it might take many lifetimes.....of the universe), and any machine you might be emulating is a finite state machine. Just run the program, and the number of steps is bounded by the number of possible states; if that number is exceeded without halting then the program will never halt, since it must be in a loop.
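The bounded-run argument above can be sketched directly. For a deterministic finite-state machine, revisiting any state means the machine is in a loop and will never halt, so halting is decidable by simulation. The toy below (my own illustration, not from the answer) models the machine as a `step` function that returns the next state, or `None` to halt.

```python
def fsm_halts(step, start):
    """Decide halting for a deterministic finite-state machine by
    running it and recording every state seen. A repeated state means
    the machine is in a cycle it can never leave."""
    seen, state = set(), start
    while state is not None:
        if state in seen:
            return False        # revisited a state: infinite loop
        seen.add(state)
        state = step(state)
    return True                 # step returned None: machine halted

# A countdown halts; a mod-3 counter revisits state 0 and loops forever.
print(fsm_halts(lambda s: s - 1 if s > 0 else None, 5))  # True
print(fsm_halts(lambda s: (s + 1) % 3, 0))               # False
```

The `seen` set is what makes this intractable in practice: for a real machine it would need up to 2^(bits of RAM) entries.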

Practically speaking, finding all code is a much easier problem unless the code can use computed gotos. Rather than running the code, you simply take all branches exactly once at each possible branch point.
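"Take all branches exactly once" is essentially recursive-descent disassembly, and it can be sketched with a worklist. The toy instruction set below is invented for illustration: `('op',)` falls through, `('jmp', t)` and `('br', t)` have fixed targets, `('halt',)` stops. Precisely because every target is fixed, the traversal finds all reachable code.

```python
def find_code(program, entry=0):
    """Recursive-descent sketch: follow every fixed branch target once.
    Works only because this toy ISA has no computed gotos."""
    reachable, work = set(), [entry]
    while work:
        pc = work.pop()
        if pc in reachable or pc >= len(program):
            continue
        reachable.add(pc)
        ins = program[pc]
        if ins[0] == 'jmp':
            work.append(ins[1])             # unconditional: target only
        elif ins[0] == 'br':
            work.extend([ins[1], pc + 1])   # conditional: both paths
        elif ins[0] != 'halt':
            work.append(pc + 1)             # plain op: fall through
    return reachable

prog = [('br', 3), ('op',), ('halt',), ('jmp', 2), ('op',)]
print(sorted(find_code(prog)))  # [0, 1, 2, 3] -- line 4 is dead code
```

A computed goto would put an unknown value where `ins[1]` is, and the worklist would have nothing sound to push.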

Goglet answered 25/3, 2011 at 18:9 Comment(1)
Even without computed gotos, it's not an easy problem. For example, a conditional branch may always have its condition on or off. In this 8086 example: CLC; JC lbl, the program never jumps to lbl, so the analyser shouldn't assume lbl is code. Similarly, in CLC; JNC lbl; POP CS, the code always jumps to lbl, and therefore POP CS is most likely dead code (and it had better be, since that instruction was removed in the 80186 for being totally useless). Such jumps were common in 6502 code, as the 6502 lacked a short relative unconditional jump instruction. – Maurits

I agree with Larsman, and I believe the argument can be made precise. First, let us agree that "finding all the code in a given binary" means, in this context, having a single computable function that determines which bytes within an input binary correspond to instructions that are ever executed.

EDIT: If anyone is not clear on why there is a problem here, think about obfuscated machine code. Disassembly of such code is a non-trivial problem: if you begin disassembly in the middle of a multi-byte instruction, you get the wrong disassembly. This corrupts not only the instruction in question, but typically a few instructions beyond it as well. Look into it (for example, google "obfuscation and disassembly").
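The alignment problem can be shown with a made-up two-opcode variable-length ISA (this is an illustration of the phenomenon, not any real encoding): opcode 0x01 carries a one-byte operand, opcode 0x02 carries none. Starting the linear sweep one byte off yields a different, equally plausible instruction stream.

```python
# Toy variable-length ISA: opcode -> total instruction length in bytes.
LENGTHS = {0x01: 2, 0x02: 1}

def linear_disasm(code, start):
    """Linear-sweep disassembly from `start`; records (offset, opcode)."""
    out, pc = [], start
    while pc < len(code) and code[pc] in LENGTHS:
        out.append((pc, code[pc]))
        pc += LENGTHS[code[pc]]
    return out

# Intended reading: (0x01 with operand 0x02), then 0x02.
code = bytes([0x01, 0x02, 0x02])
print(linear_disasm(code, 0))  # [(0, 1), (2, 2)]
print(linear_disasm(code, 1))  # [(1, 2), (2, 2)] -- one byte off, different stream
```

Both decodings are internally consistent; only knowing which offsets are actually executed tells you which one is real. Obfuscators exploit exactly this ambiguity.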

Sketch of strategy to make this precise:

First, define a theoretical computer/machine model in which a program is encoded as multi-bit binary instructions, much like machine code on "real" computers, but made precise (and such that it is Turing-complete). Assume that the binary encodes the program and all of its input. This must all be made precise, but assume that the language has a (binary-encoded) halt instruction, and that a program halts if and only if it executes a halt instruction.

For now, assume that the machine is not able to alter the program code, but has a working memory (assumed infinite in the interesting cases).

Then any given bit either is at the start of an instruction that runs during execution of the program, or it is not. (For example, it may be in the middle of an instruction, or in data, or in "junk code" – that is, code that will never run. Note that I have not claimed these are mutually exclusive: a jump instruction can have a target in the middle of another instruction, thereby creating a new instruction that, on a first pass, did not look like an executed instruction.)

Define ins(i, j) to be the function that returns 1 if the binary i has an instruction beginning at bit position j that executes in a run of the program encoded by i, and 0 otherwise. Suppose ins(i, j) is computable.

Let halt_candidate(i) be the following program:

for j = 1 to length(i)
  if(i(j..j+len('halt')-1) == 'halt')
    if(ins(i,j) == 1)
      return 1
return 0

Since we disallow self-modifying code above, the only way for a program to halt is for there to be a 'halt' instruction at some position j that gets executed. By design, halt_candidate(i) therefore computes the halting function.

This provides a very rough sketch of one direction of the proof. i.e. if we assume we can test whether program i has an instruction at location j for all j, then we can code the halting function.

For the other direction, one must encode an emulator of the formal machine within the formal machine. Then, from inputs i and j, create a program that emulates program i and, when the instruction at bit position j is executed, returns 1 and halts. When any other instruction is executed, it continues to run, and if the simulation ever simulates a 'halt' instruction in i, the emulator jumps into an infinite loop. Then ins(i, j) is equivalent to halt(emulator(i, j)), and so a solution to the halting problem would yield a solution to the find-code problem.
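This wrapper construction can be sketched concretely. The toy below (my own illustration, reusing a list-of-tuples program model) collapses the emulator and the oracle into one loop: the wrapper "halts" exactly when position j is reached, and "loops" if the original program halts first; the `fuel` bound stands in for the impossible general halting oracle.

```python
def ins(program, j, fuel=1000):
    """ins(i, j) == halt(emulator(i, j)): emulate the program; reaching
    pc == j is the wrapper halting, while the original program halting
    first means the wrapper spins forever (so the oracle answers no).
    `fuel` is a toy bound standing in for a real halting oracle."""
    pc = 0
    for _ in range(fuel):
        if pc == j:
            return True          # wrapper halts: position j executes
        if pc >= len(program) or program[pc][0] == 'halt':
            return False         # wrapper diverges: j never runs
        pc = program[pc][1] if program[pc][0] == 'jmp' else pc + 1
    return False

prog = [('nop',), ('jmp', 3), ('nop',), ('halt',)]
print(ins(prog, 3))  # True  (the jmp skips line 2 and lands on the halt)
print(ins(prog, 2))  # False (line 2 is dead code)
```

With both directions sketched, ins and the halting function are interreducible, which is the claimed equivalence.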

Of course one must assume a theoretical computer for this to be equivalent to the famously unsolvable halting problem. Otherwise, for a "real" computer, the halting problem is solvable but intractable.

For a system that allows self-modifying code the argument is more complex, but I expect not that different.

EDIT: I believe a proof in the self-modifying case would be to implement an emulator of the system-plus-self-modifying-code in the static code plus data system above. Then halting with self-modifying code allowed for program i with data x, is the same as halting in the static code plus data system above with the binary containing the emulator, i, and x as code plus data.

Collaboration answered 22/5, 2014 at 13:8 Comment(0)
