Traditional (single-byte) prefixes are different from opcode bytes like you said, so a state machine can just remember which prefixes it's seen until it gets to an opcode byte.
The 0f escape byte for 2-byte opcodes is not really a prefix. It has to be contiguous with the 2nd opcode byte. Thus, following a 0f, any byte is an opcode, even if it's something like f2 that would otherwise be a prefix. (This also applies following 0f 3a or 0f 38 2-byte escapes for SSSE3 and later, or VEX/EVEX prefixes that encode one of those escape sequences.)
If you look at an opcode map, there are no entries that are ambiguous between single-byte prefix and opcode. (e.g. http://ref.x86asm.net/coder64.html, and notice how the 2-byte 0F .. opcodes are listed separately).
The decoders do have to know the current mode for this (and other things); for example x86-64 repurposed the 1-byte inc/dec reg opcodes (40..4f) as REX prefixes. (See: x86 32 bit opcodes that differ in x86-x64 or entirely removed.) We can even use this difference to write polyglot machine code that runs differently when decoded in 32-bit vs. 64-bit mode, or even distinguish all 3 mode sizes.
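As a rough illustration of the prefix-remembering state machine and its mode dependence, here is a minimal Python sketch. The function name split_prefixes_and_opcode and its mode64 flag are hypothetical names chosen for this example. It deliberately ignores VEX/EVEX/XOP, and it only splits off prefix and opcode bytes rather than decoding ModRM or immediates.

```python
# Legacy single-byte prefixes: lock, repne, rep, segment overrides,
# operand-size and address-size.
LEGACY_PREFIXES = frozenset(
    {0xF0, 0xF2, 0xF3, 0x2E, 0x36, 0x3E, 0x26, 0x64, 0x65, 0x66, 0x67}
)


def split_prefixes_and_opcode(code, mode64=False):
    """Split the start of one instruction into (prefixes, rex, opcode).

    Hypothetical helper for illustration only: no VEX/EVEX/XOP handling,
    and `code` is assumed to hold at least one complete instruction.
    """
    i = 0
    # Remember every legacy prefix until the first non-prefix byte.
    while code[i] in LEGACY_PREFIXES:
        i += 1
    prefixes = bytes(code[:i])

    # 40..4f is a REX prefix only in 64-bit mode; in other modes it is the
    # one-byte inc/dec reg opcode and falls through to the opcode case.
    rex = None
    if mode64 and 0x40 <= code[i] <= 0x4F:
        rex = code[i]
        i += 1

    # After the 0f escape, the next byte is always an opcode byte,
    # even if its value would otherwise be a prefix (e.g. f2).
    if code[i] == 0x0F:
        n = 3 if code[i + 1] in (0x38, 0x3A) else 2
    else:
        n = 1
    return prefixes, rex, bytes(code[i:i + n])
```

With this sketch, f2 0f 58 c1 splits into prefix f2 plus opcode 0f 58, while 0f f2 c1 has no prefixes and opcode 0f f2. The byte 40 comes back as an opcode with mode64=False and as a REX prefix with mode64=True.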
x86 machine code is a byte stream that's not self-synchronizing (e.g. a ModRM or an immediate can be any byte). The CPU always knows where to start decoding from, either a jump target or the byte after the end of a previous instruction. That's the start of the instruction (including prefixes).
Bytes in memory are just bytes, only becoming instructions when they're decoded by the CPU. (Although in normal programs, simply disassembling from the top of the .text section does give you the program's instructions. Self-modifying and obfuscated code are not normal.)
AVX / AVX-512: multi-byte prefixes that overlap with opcodes
Multi-byte VEX and EVEX prefixes aren't that simple in 32-bit mode. For example VEX prefixes overlap with invalid encodings of LES and LDS in modes other than 64-bit. (The c4 and c5 opcodes for LES and LDS are always invalid in 64-bit mode, except as VEX prefixes.) https://wiki.osdev.org/X86-64_Instruction_Encoding#VEX.2FXOP_opcodes
In legacy / compat modes, there weren't any free bytes left that weren't already opcodes or prefixes when AVX (VEX prefixes) and AVX-512 (EVEX prefix) were introduced, so the only room for extensions was as encodings for opcodes that are only valid with a limited set of ModRM bytes. (e.g. LES / LDS require a memory source, not a register. This is why some bits are inverted in VEX prefixes, so the top 2 bits of the byte after c4 or c5 will always both be 1 in 32-bit mode instead of 0. That's the "mode" field in ModRM, and 11 means register.)
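The disambiguation rule above can be sketched in a few lines of Python. The function name classify_c4_c5 and its mode64 flag are hypothetical. The sketch assumes protected or long mode, and it only looks at the first two bytes, not at whether the rest of the instruction is valid.

```python
def classify_c4_c5(code, mode64):
    """Say how a c4 or c5 byte at the start of `code` is interpreted.

    Hypothetical helper for illustration; assumes protected or long mode.
    """
    first = code[0]
    if first not in (0xC4, 0xC5):
        raise ValueError("expected a c4 or c5 byte")
    if mode64:
        # LES/LDS do not exist in 64-bit mode, so c4/c5 is always VEX.
        return "VEX"
    # Outside 64-bit mode, ModRM.mod == 11 (register operand) is not a
    # valid LES/LDS, so that bit pattern is what selects a VEX prefix.
    if code[1] >> 6 == 0b11:
        return "VEX"
    return "LES" if first == 0xC4 else "LDS"
```

For example, c5 f8 77 (vzeroupper) is classified as VEX in either mode, while c4 03 is les eax, [ebx] in 32-bit mode and the start of a VEX prefix in 64-bit mode.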
(Fun fact: VEX prefixes are not recognized in 16-bit real mode, apparently because some software used the same invalid encodings of LES / LDS as intentional traps, to be sorted out in the #UD exception handler. VEX prefixes are recognized in 16-bit protected mode, though.)
AMD64 freed up several bytes by removing instructions like AAM, as well as LES/LDS (and the one-byte inc/dec reg encodings for use as REX prefixes), but CPU vendors have continued to care about 32-bit mode and not added any extensions that are only available in 64-bit mode which could simply take advantage of those free opcode bytes. This means finding ways to cram new instruction encodings into increasingly small gaps in 32-bit machine code. (Often via mandatory prefixes, e.g. rep bsr = lzcnt on CPUs with that feature, which gives different results.)
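To make "different results" concrete: the same bytes f3 0f bd /r run as bsr (with the rep prefix ignored) on a CPU without lzcnt, and as lzcnt on a CPU with it. Here is a small Python model of the 32-bit result values only. The names bsr32 and lzcnt32 are hypothetical, flags are not modelled, and None stands in for the destination that bsr leaves unspecified for a zero input.

```python
def bsr32(x):
    """Bit index of the highest set bit, as bsr computes it."""
    if x == 0:
        return None  # bsr does not define the destination for input 0
    return x.bit_length() - 1


def lzcnt32(x):
    """Number of leading zero bits in a 32-bit value, as lzcnt computes it."""
    return 32 - x.bit_length()
```

For a nonzero 32-bit input the two are related by lzcnt32(x) == 31 - bsr32(x), so code that assumes one while the CPU runs the other gets the wrong answer for almost every input.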
So the decoders in modern CPUs that support AVX / BMI1/2 have to look at multiple bytes to decide whether this is a prefix for a valid AVX or other VEX-encoded instruction, or in 32-bit mode if it should decode as LES or LDS. (And I guess look at the rest of the instruction to decide if it should #UD).
But modern CPUs are looking at 16 or 32 bytes at a time anyway to find instruction boundaries in parallel. (And then later feed those groups of instruction bytes to actual decoders, again in parallel.) https://www.realworldtech.com/sandy-bridge/4/
Same goes for the prefix scheme used by AMD XOP, which is a lot like VEX.
Agner Fog's blog article Stop the instruction set war from 2009 (soon after AVX was announced, before the first hardware supporting it) has a table of remaining unused coding space for future extensions, and some notes about it being "assigned" to AMD, Intel, or Via.
Related / examples
Machine code tricks: decoding the same byte multiple ways
(This is not really related to prefixes, but in general seeing how the rules apply to weird cases can help understand exactly how things work.)
A software disassembler does need to know a start point. This can be problematic if obfuscated code mixes code and data, and actual execution jumps to places you wouldn't reach if you just assumed that you can decode in order without following jumps.
Fortunately compiler-generated code doesn't do that, so naive static disassembly (e.g. by objdump -d or ndisasm, as opposed to IDA) finds the same instruction boundaries that actually running the program will.
This is not a problem for running obfuscated machine code; the CPU just does what it's told, and never cares about bytes before the place you tell it to jump to. Disassembling without running / single-stepping the program is the hard thing, especially with the possibility of self-modifying code and jumps to what a naive disassembler would think was the middle of an earlier instruction.
Obfuscated machine code can even have an instruction decode one way, then jump back into what was the middle of that instruction, for a later byte to be the opcode (or prefix + opcode). Modern CPUs with uop caches or that mark instruction boundaries in I-cache run slow (but correctly) if you do this, so it's more of a fun code-golf trick (extreme code-size optimization at the expense of speed) or obfuscation technique.
For an example of this, see my codegolf.SE x86 machine code answer to Golf a Custom Fibonacci Sequence. I'll excerpt the disassembly that lines up with what the CPU sees after looping back to cfib.loop, but note that the first iteration decodes differently. So I'm using just 1 byte outside the loop instead of 2 to effectively jump into the middle for the start of the first iteration. See the linked answer for a full description and the other disassembly.
0000000000401070 <cfib>:
401070: eb .byte 0xeb # jmp rel8 consuming the 01 add opcode as a rel8
0000000000401071 <cfib.loop>:
401071: 01 d0 add eax,edx
# loop entry point on first iteration, jumping over the ModRM byte (D0) of the ADD
(entry on first iteration):
401073: 92 xchg edx,eax
401074: e2 fb loop 401071 <cfib.loop>
401076: c3 ret
You can do this with opcodes that consume more later bytes, like 3D <dword> cmp eax, imm32. When the CPU sees a 3D opcode byte, it will grab the next 4 bytes as the immediate. If you later jump into those 4 bytes, they'll be considered as prefix/opcodes and everything will work (except for performance problems) the same regardless of how those bytes had previously been decoded as a different part of an instruction. The CPU has to maintain the illusion of decoding and executing 1 instruction at a time, other than performance.
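To see how the same bytes decode differently depending on the start point, here is a toy Python decoder. It knows only the handful of opcodes used in the cfib listing plus 3D, and the names decode_one, sweep and rel8 are hypothetical. It is a sketch of the idea, not a real disassembler: anything outside this subset raises an error.

```python
# Bytes of the cfib listing above; offset 0 corresponds to 0x401070.
CFIB = bytes.fromhex("eb01d092e2fbc3")


def rel8(code, off):
    """Sign-extend the byte at `off` as an 8-bit displacement."""
    return int.from_bytes(code[off:off + 1], "little", signed=True)


def decode_one(code, off):
    """Return (length, text) for the instruction at `off` (toy subset)."""
    op = code[off]
    if op == 0xEB:  # jmp rel8: target is relative to the next instruction
        return 2, f"jmp {off + 2 + rel8(code, off + 1):#x}"
    if op == 0xE2:  # loop rel8
        return 2, f"loop {off + 2 + rel8(code, off + 1):#x}"
    if op == 0x01 and code[off + 1] == 0xD0:  # only this one ModRM value
        return 2, "add eax,edx"
    if op == 0x3D:  # cmp eax, imm32 swallows the next 4 bytes
        imm = int.from_bytes(code[off + 1:off + 5], "little")
        return 5, f"cmp eax,{imm:#x}"
    if op == 0x92:
        return 1, "xchg edx,eax"
    if op == 0xC3:
        return 1, "ret"
    raise ValueError(f"byte {op:#04x} is outside this toy subset")


def sweep(code, off):
    """Decode in a straight line from `off` to the end of `code`."""
    out = []
    while off < len(code):
        length, text = decode_one(code, off)
        out.append(text)
        off += length
    return out
```

Decoding CFIB at offset 0 gives a jmp to offset 3, which lands on the xchg and skips the d0 byte. Sweeping from offset 1 gives add, xchg, loop, ret, matching the listing. Likewise, the bytes 3d 92 c3 c3 c3 are a single cmp eax, imm32 when decoded from offset 0, but an xchg followed by ret instructions when decoded from offset 1.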
I learned of this trick from @Ira Baxter's answer on Can assembled ASM code result in more than a single possible way (except for offset values)?
[...] 0f which is not a prefix. So the decoder can tell when the opcode starts. Alas, Intel reused the prefixes to change meaning to opcodes. The decoder takes that into account. E.g. 0f 58 is the opcode for all add{ps,pd,ss,sd}, specifically f2 0f 58 is addsd while 66 0f 58 is addpd. Curiously, f2 66 0f 58 is o16 addsd and not repne addpd. – Motel
[...] f2, f3, and 66 as giving additional opcode bits as opposed to being real prefixes. For SIMD instructions, there are two such opcode bits (encoding at most one of f2/f3/66) but for scalar instructions, 66 can be combined with f2 and f3. There are even instructions combining a 66 prefix with REX.W. – Tricia
[...] f2 66 0f 58 would be an undefined instruction and not addsd (which opcode is officially f2 0f 58). The fact that the 66 byte could be reordered and ignored means that they are still treated as prefixes. I see what you mean though, and I agree it's better to think of them as encoding additional bits. After all, it's only going to get worse :D – Motel
[...] 66 0f 58 /r is addpd and with both prefixes present, it could also reasonably be decoded as that. Yes, of course they are prefixes, but they should be seen as providing additional opcode bits (as opposed to modifying the instruction in a systematic manner as e.g. segment prefixes do). – Tricia