Is it possible to decode x86-64 instructions in reverse?

M

2

9

I was wondering if it is possible to decode x86-64 instructions in reverse?

I need this for a runtime dissembler. Users can point to a random location in memory and then should be able to scroll upwards and see what instructions came before the specified address.

I want to do this by reverse decoding.

Manno answered 20/9, 2018 at 0:12 Comment(4)

why do you need to do that? Did you look at the x86 instruction format? It doesn't have any mirror component, so obviously can't be read backwards. The easiest example is instructions with an immediate at the end. The immediate can contain any bytes, so reading from the end it'll have another meaning – Syllable 20/9, 2018 at 0:30

@Syllable I need this for a runtime dissambler. User can point to a random location in memory and then should be able to scroll upwards and see what insturctions came before the sepcified address and for this I need to do reverse decoding. – Manno 20/9, 2018 at 0:36

It isn't straight forward and there are different techniques. Not always perfect either. Disassembling arbitrary binaries on x86/x86-64 have had research papers written on it: syssec.mistakenot.net/papers/sec-2016.pdf – Arianism 20/9, 2018 at 0:47

No you cannot do this because the x86 instruction set is variable length. Even with fixed length there can be confusion. The only way you can have a chance at correctly disassembling is to disassemble in execution order from a known good entry point, and even that is trivial to defeat such that the disassembler fails. – Truculent 20/9, 2018 at 13:27

S

9

The basic format of x86 instructions is like this

Modern CPUs can support VEX and EVEX prefixes. In x86-64 there might also be the REX prefix at the beginning

Looking at the format it can easily be seen that the instructions aren't palindromes and you can't read from the end.

Regarding determining which instruction an arbitrary address belongs to, unfortunately it can't be done either, because x86 instructions are not self-synchronizable, and (generally) not aligned. You have to know exactly the begin of an instruction, otherwise the instruction will be decoded differently.

You can even give addresses that actually contain data and the CPU/disassembler will just decode those as code, because no one knows what those bytes actually mean. Jumping into the middle of instructions is often used for code obfuscation. The technique has also been applied for code size saving in the past, because a byte can be reused and has different meanings depending on which instruction it belongs to

That said, it might be possible to guess in many cases since functions and loops are often aligned to 16 or 32 bytes, with NOPs padding around

Syllable answered 20/9, 2018 at 0:58 Comment(5)

I don't think the part about it not being a palindrome is relevant. That is, I don't think the OP thought that instructions might be palindromes and palindromes aren't a requirement for reverse disassembly. As a trivial example, fixed length ISAs aren't palindromes (in general!) yet can be reverse disassembled. You could also have a variable length ISA that could be reverse disassembled without being a palindrome although I'm not aware of any examples. – Aphrodite 20/9, 2018 at 2:59

@Aphrodite at first I thought that the OP was asking if it's possible to read the bytes backward. I didn't think about disassembling the instruction of course it must be possible to decode those in order for the CPU to run – Syllable 20/9, 2018 at 3:38

The OP is asking if it is possible to disassemble backwards. This involves both reading bytes backwards (easy!), and then disassembling. The latter part isn't possible as both answers here show, since you don't know where the instruction boundaries are: but the whole palindrome thing is a red herring. For example imagine that all x86 instructions started with a 0xFF byte and this byte never appeared anywhere else in any instruction due to the encoding rules (in particular, you'd have to have a special encoding for immediate values that excluded it). Then you could disassemble backwards. – Aphrodite 20/9, 2018 at 19:57

@Aphrodite "variable length ISA that could be reverse disassembled without being a palindrome although I'm not aware of any examples" - not quite an ISA but I believe UTF-8 could be an example – Snakeroot 2/8, 2020 at 11:48

@Paul: Yes, UTF-8 is specifically designed to be self-synchronizing, so you can seek somewhere and read until you get to what's definitely the start of a new character. This costs it 1 bit per byte in each of the continuation bytes, they have to be 1 instead of holding useful information. That's half the values of each byte not available. It's very valuable for text, but not for machine code, so real-world ISA designs don't make that tradeoff. (Maybe ARM Thumb2 or RV32C compressed insns do? With only 2 sizes, 16 and 32-bit, the simple signalling mechanism for fwd decode might allow rev.) – Barre 21/7, 2022 at 15:47

B

11

An x86 instruction stream is not self-synchronizing, and can only be unambiguously decoded forward. You need to know a valid start-point to decode. The last byte of an immediate can be a 0x90 which decodes as a nop, or in general a 4-byte immediate or displacement can have byte-sequences that are valid instructions, or whatever other overlap possibilities with ModRM/SIB bytes looking like opcodes.

If you decode forward in code that isn't intentionally obfuscated, you often get back into sync with the "correct" instruction boundaries, so you might try remembering the instruction boundaries as a known-good point, and check that a decoding from a backwards-step candidate start address has an instruction boundary at your known-good point.

IDK if you could get more clever about finding more known-good points going backwards which further candidates also have to agree with.

Be sure to highlight backwards-decoded instructions for the user in red or gray or something, so they know it's not guaranteed reliable.

Another alternative is to require function symbols (extern functions, or any function with debug info).

GDB doesn't allow you to scroll upward (in layout reg mode), unless you're inside a function that it knows the start address. Then I guess it decodes from the function start address so it knows instruction boundaries when it gets to the part that fits in the window.

If you want to go backwards, you have to disas 0x12345, +16 to start decoding from there. Then you can scroll down, but if you get the insn boundary wrong you get garbage.

Barre answered 20/9, 2018 at 1:5 Comment(3)

Your last point is probably the most important one. Going back 16 bytes (addr - 16) , and decoding from that address +n until there's a valid set of instructions stopping at addr. The issue is just that there can be several valid sets of instructions. – Alexina 13/4, 2019 at 23:53

I haven't read the source code but I think disassemblers do synchronize at function labels. I was finding myself wanting to put a constant into .text before a function and was wondering if it would trip up objdump into decoding nonsense. It never did. I tried for every possible twobyte (using a script naturally). – Haruspex 20/12, 2022 at 12:21

@PSkocik: Oh yes, good point, I think I recall observing that, too. – Barre 20/12, 2022 at 18:41

S

9

The basic format of x86 instructions is like this

Modern CPUs can support VEX and EVEX prefixes. In x86-64 there might also be the REX prefix at the beginning

Looking at the format it can easily be seen that the instructions aren't palindromes and you can't read from the end.

Regarding determining which instruction an arbitrary address belongs to, unfortunately it can't be done either, because x86 instructions are not self-synchronizable, and (generally) not aligned. You have to know exactly the begin of an instruction, otherwise the instruction will be decoded differently.

You can even give addresses that actually contain data and the CPU/disassembler will just decode those as code, because no one knows what those bytes actually mean. Jumping into the middle of instructions is often used for code obfuscation. The technique has also been applied for code size saving in the past, because a byte can be reused and has different meanings depending on which instruction it belongs to

That said, it might be possible to guess in many cases since functions and loops are often aligned to 16 or 32 bytes, with NOPs padding around

Syllable answered 20/9, 2018 at 0:58 Comment(5)

I don't think the part about it not being a palindrome is relevant. That is, I don't think the OP thought that instructions might be palindromes and palindromes aren't a requirement for reverse disassembly. As a trivial example, fixed length ISAs aren't palindromes (in general!) yet can be reverse disassembled. You could also have a variable length ISA that could be reverse disassembled without being a palindrome although I'm not aware of any examples. – Aphrodite 20/9, 2018 at 2:59

@Aphrodite at first I thought that the OP was asking if it's possible to read the bytes backward. I didn't think about disassembling the instruction of course it must be possible to decode those in order for the CPU to run – Syllable 20/9, 2018 at 3:38

The OP is asking if it is possible to disassemble backwards. This involves both reading bytes backwards (easy!), and then disassembling. The latter part isn't possible as both answers here show, since you don't know where the instruction boundaries are: but the whole palindrome thing is a red herring. For example imagine that all x86 instructions started with a 0xFF byte and this byte never appeared anywhere else in any instruction due to the encoding rules (in particular, you'd have to have a special encoding for immediate values that excluded it). Then you could disassemble backwards. – Aphrodite 20/9, 2018 at 19:57

@Aphrodite "variable length ISA that could be reverse disassembled without being a palindrome although I'm not aware of any examples" - not quite an ISA but I believe UTF-8 could be an example – Snakeroot 2/8, 2020 at 11:48

@Paul: Yes, UTF-8 is specifically designed to be self-synchronizing, so you can seek somewhere and read until you get to what's definitely the start of a new character. This costs it 1 bit per byte in each of the continuation bytes, they have to be 1 instead of holding useful information. That's half the values of each byte not available. It's very valuable for text, but not for machine code, so real-world ISA designs don't make that tradeoff. (Maybe ARM Thumb2 or RV32C compressed insns do? With only 2 sizes, 16 and 32-bit, the simple signalling mechanism for fwd decode might allow rev.) – Barre 21/7, 2022 at 15:47

Recommended topics

Hot tags