There is no perfectly reliable way to do this for every export.
Each export only specifies an offset within the executable file -- logically, it could be treated as code or as data by any other code that references it.
As you mentioned, you could come up with heuristics to detect the type of the export in almost all of the cases, but it would be easy to come up with counterexamples that do not work for any given heuristic. Take, for instance, the rule you proposed:
"The exported entry will be considered a valid exported function if there is a ret instruction in the function, and there are more than <min> valid instructions, and IDA recognizes the function's calling convention."
False negatives: You might have a function that uses tail call optimization and ends with a jmp instruction rather than a ret instruction. Any short function would also fail. And there are several ways that IDA can be confused into not treating the code as a function.
False positives: There could be a string in memory followed closely by a C3 or C2 byte (the opcodes for ret), like db 'BACKGAMMON0',0,0C3h -- this could logically disassemble as a valid 11-instruction function with a ret and no arguments.
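To make both failure modes concrete, here is a rough sketch in Python. It is not IDA's analysis: decode is a toy 32-bit x86 decoder that only knows the handful of opcodes these examples need, and looks_like_function, min_insns and the sample byte strings are names I made up for illustration. The calling-convention part of the rule is left out entirely.

```python
# Toy 32-bit x86 decoder: it only knows the few opcodes used below.
REGS = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]

def decode(code: bytes):
    """Decode until the first ret, jmp, or byte this toy does not know."""
    out, i = [], 0
    while i < len(code):
        op = code[i]
        if 0x40 <= op <= 0x47:                          # inc r32
            out.append(f"inc {REGS[op - 0x40]}")
            i += 1
        elif 0x48 <= op <= 0x4F:                        # dec r32
            out.append(f"dec {REGS[op - 0x48]}")
            i += 1
        elif op == 0x30 and code[i + 1:i + 2] == b"\x00":
            out.append("xor byte ptr [eax], al")        # 30 00
            i += 2
        elif op == 0xC3:                                # ret
            out.append("ret")
            break
        elif op == 0xE9 and i + 5 <= len(code):         # jmp rel32
            out.append("jmp rel32")
            break
        else:
            out.append("(unknown)")
            break
    return out

def looks_like_function(code: bytes, min_insns: int = 5) -> bool:
    """The proposed rule, minus the calling-convention check (assumption)."""
    insns = decode(code)
    if "(unknown)" in insns or "ret" not in insns:
        return False
    return len(insns) - 1 > min_insns                   # instructions before ret

STRING_DATA = b"BACKGAMMON0\x00\xC3"        # db 'BACKGAMMON0',0,0C3h
TAIL_CALL = b"\x40\xE9\x00\x00\x00\x00"     # inc eax; jmp rel32
SHORT_FUNC = b"\x40\xC3"                    # inc eax; ret

print(decode(STRING_DATA))
print(looks_like_function(STRING_DATA))     # True: data accepted as code
print(looks_like_function(TAIL_CALL))       # False: real code rejected
print(looks_like_function(SHORT_FUNC))      # False: real code rejected
```

The string decodes cleanly because every uppercase ASCII letter happens to be a one-byte inc or dec in 32-bit mode, and the '0' plus the terminating zero byte form a two-byte xor.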
The lines are blurred even further when you consider that an export could be logically treated as both code and data: Imagine that a byte sequence at an export is copied into dynamically allocated memory -- potentially even in another process -- where it is later executed as code.
Perhaps a reasonable suggestion would be to just trust IDA and treat the export as code if IDA thinks it's code. A large part of IDA's functionality is automatically guessing the logical types of data, and it's normally pretty good at it. As you've shown, sometimes it's wrong. But you can't get 100% accuracy anyway. The best you can do is balance between false negatives and false positives.
Proof of this problem's undecidability:
Whether or not an export will be executed as code is undecidable. Whether or not an export will be read as data is also undecidable. Since we cannot guarantee that either is true, distinguishing between seemingly ambiguous cases is impossible.
Proof: Assume that we have an oracle A(P,I,E) which returns 1 if program P (including all of its dependencies) executes (or reads from) export E (from any DLL loaded in the course of P's execution) with "input" (external state) I. Otherwise, it returns 0.
Let us construct a minimal program Z(P,I,E) which executes (or reads from) export E (the DLL for which is loaded into the address space) if and only if A(P,I,E) returns 0.
Now consider the result of Z(Z,I,E):
If Z(Z,I,E) executes (or reads from) export E, then A(Z,I,E) would return 1. But Z(Z,I,E) is defined to not access export E unless A(Z,I,E) returns 0. This is a contradiction.
If Z(Z,I,E) does not execute (or read from) export E, then A(Z,I,E) would return 0. But Z(Z,I,E) is defined such that it will access export E when A(Z,I,E) returns 0. This is a contradiction.
Therefore, our initial assumption that oracle A(P,I,E) exists is proven false.
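If it helps to see the construction run, here is a small Python sketch of the same diagonal argument. It is only an illustration, not a proof: the candidate oracles are ordinary deterministic Python functions, "accessing an export" is modelled as adding a name to a set, and make_z, oracle_is_refuted and the sample oracles are all hypothetical names. As in the proof, A(Z,I,E) is taken to describe the run Z(Z,I,E).

```python
def make_z(oracle):
    """Build the program Z from a candidate oracle A, as in the proof."""
    def z(program, inp, export, accessed):
        # Z touches the export if and only if the oracle says it will not.
        if oracle(program, inp, export) == 0:
            accessed.add(export)
    return z

def oracle_is_refuted(oracle, inp="I", export="E"):
    """Run Z(Z, I, E) and compare what happened with what A predicted."""
    z = make_z(oracle)
    accessed = set()
    z(z, inp, export, accessed)                 # the run Z(Z, I, E)
    predicted = oracle(z, inp, export)          # A(Z, I, E)
    actual = 1 if export in accessed else 0
    return predicted != actual

# Three candidate oracles; any deterministic one is wrong about its own Z.
def always_yes(program, inp, export):
    return 1

def always_no(program, inp, export):
    return 0

def guess_from_name(program, inp, export):
    return 1 if export.startswith("E") else 0

for candidate in (always_yes, always_no, guess_from_name):
    print(candidate.__name__, oracle_is_refuted(candidate))
```

Whatever the candidate answers, Z does the opposite, so the prediction and the observed behaviour never agree.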
But you can do better through instrumentation...
Depending on the exact problem you're trying to solve, you may be able to determine which exports are valid functions at runtime.
For example, you could write an application which debugs the program you wish to analyze and places guard pages on each of the pages that contain exports you wish to hook. This means that whenever such a page is accessed (executed, read, or written to), an exception is raised, and the debugger program gains control.
The debugger could check the program context to see what type of access was made and whether it has anything to do with the export. If the access is an attempt to execute an export, it could perform some hooking functionality before returning control to the program. Otherwise, it could just return control to the program.
In either case, the PAGE_GUARD modifier is lifted after each exception, so you'd need to put it back each time.
Unsurprisingly, this would make execution of your program very slow, as any R/W/X access to any of the pages containing an export causes an expensive context switch -- this would likely include the execution of most instructions that are a part of your exported functions, along with several others that have nothing to do with them.
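The bookkeeping behind this approach can be sketched as a toy model in Python. It touches no real memory and calls no Windows API: GuardedProcess, its method names and the addresses are invented for illustration, it assumes 4 KiB pages, and it only matches accesses that land exactly on an export's start address. A real debugger would also typically single-step the faulting instruction before re-arming the guard, which the model skips.

```python
PAGE_SIZE = 0x1000  # assumed 4 KiB pages

class GuardedProcess:
    """Toy model of one-shot guard pages placed over exported addresses."""

    def __init__(self, exports):
        self.exports = dict(exports)      # export name -> address
        self.log = []                     # (export name, access kind)
        self.exceptions = 0               # how often the debugger was woken
        self.guarded = {addr // PAGE_SIZE for addr in self.exports.values()}

    def access(self, addr, kind):
        """Simulate one access; kind is 'read', 'write' or 'execute'."""
        page = addr // PAGE_SIZE
        if page not in self.guarded:
            return                        # unguarded page: no exception
        self.guarded.discard(page)        # the guard is lifted by the exception
        self.exceptions += 1
        self.on_guard_exception(addr, kind)
        self.guarded.add(page)            # the debugger puts the guard back

    def on_guard_exception(self, addr, kind):
        # Record the access only if it hits an export's start address.
        for name, export_addr in self.exports.items():
            if addr == export_addr:
                self.log.append((name, kind))

proc = GuardedProcess({"DoWork": 0x10001000, "Table": 0x10001800})
proc.access(0x10001000, "execute")   # call into DoWork
proc.access(0x10001004, "execute")   # later instruction on the same page
proc.access(0x10001800, "read")      # Table is read as data
proc.access(0x10005000, "read")      # unrelated, unguarded page
print(proc.log)
print(proc.exceptions)
```

Only two of the three exceptions in this run say anything about an export; the third is pure overhead from an instruction that merely shares the page, which is where the slowdown comes from.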
You could take a similar approach with other instrumentation tools, such as Pin.
Note that you may not gain information about the usage of every export through instrumentation. This is because you may need to determine what input/external state is required to cause the program to access each export in order to learn if it is used as code or as data (if at all).
Also note that both execute and read (or even write) accesses could potentially occur on the same exports.
ret instruction in the function AND there is more than <min> valid instructions AND IDA recognize the function's calling convention. Still, got some false positives that I wish to identify – Mikiso