x86 instruction prefix decoding

I'm currently developing a disassembler for x86 / x86-64. I have 2 questions regarding instruction prefix decoding:

  1. For the following stream:

    \x9b\x9b\xd9\x30
    

    GCC and objdump output

    fstenv [eax]
    

    So they first read all the prefixes (bounded by the 15-byte instruction length limit) and then pick the instruction by combining the last prefix read, \x9b, with \xd9, which makes it an fstenv instruction.

    Capstone on the other hand outputs

    wait
    wait
    fnstenv dword ptr [eax] 
    

    Now, obviously Capstone is wrong to emit 2 wait instructions rather than just 1. But should it emit wait instructions at all, or are GCC and objdump right to consume all the extra \x9b prefixes as part of the fstenv instruction?

  2. For the following stream:

    \xf2\x66\x0f\x12\x00
    

    GCC and objdump output

    data16 movddup xmm0,QWORD PTR [eax]
    

    So they're arranging the prefixes in a specific order so that \x66 is interpreted before \xf2, and they still use the last prefix read, \xf2, to determine the instruction movddup. Are they right to reorder the prefixes like this, or are they wrong?

    Capstone on the other hand outputs

    movlpd xmm0, qword ptr [eax]

    So it does not reorder the prefixes; it just takes the last prefix read, \x66, to determine the instruction movlpd, which looks more logical in this case than what GCC and objdump do.

How does the CPU actually interpret these streams?

Canard answered 26/2, 2019 at 12:57 Comment(6)
1) fstenv does not really exist. The instruction set reference says: "The assembler issues two instructions for the FSTENV instruction (an FWAIT instruction followed by an FNSTENV instruction), and the processor executes each of these instructions separately." Capstone is technically correct. - Denial
There is a mandatory order for prefixes. I forgot which one it is, but you can check the manual. If the order of prefixes is wrong, behaviour is undefined. - Limber
The instruction set reference says: "Instruction prefixes [...] may be placed in any order relative to each other." Note that wait is not a prefix. - Denial
2) It's undefined to apply F2 prefix to that instruction so it's hard to argue which version is correct. My cpu (amd ryzen 1700) seems to think objdump is right, it is executed as movddup. TBH, I expected it to be a movlpd... - Denial
Intel E5-2620 also thinks it's movddup. - Denial
@Denial This answer claims that for SSE instructions, F2/F3 beat 66 if both groups appear. - Limber

How your CPU actually interprets these streams can be tested relatively easily.


For the first stream, you can use my tool nanoBench. You can use the command

sudo ./nanoBench.sh -asm_init "mov RAX, R14" -asm ".byte 0x9b, 0x9b, 0xd9, 0x30"

This command first sets RAX to a valid memory address, and then runs your stream multiple times. On my Core i7-8700K, I get the following output (for the fixed-function performance counters):

Instructions retired: 3.00
Core cycles: 73.00
Reference cycles: 62.70

We can see that the CPU executes three instructions, so Capstone seems to be correct.


You can analyze the second stream using the debug mode of nanoBench:

sudo ./nanoBench.sh -unroll 1 -asm "mov RAX, R14; mov qword ptr [RAX], 1234; .byte 0xf2, 0x66, 0x0f, 0x12, 0x00" -debug

This will - inside gdb - first execute the asm code, and then generate a breakpoint trap. We can now look at the current value of the XMM0 register:

(gdb) p $xmm0.v2_int64
$1 = {1234, 1234}

So the high and the low quadword of XMM0 now have the same value as the memory at address RAX, which indicates that the CPU executed the movddup instruction.
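
To make the distinction concrete, here is a minimal Python sketch that models what each candidate instruction would do to a 16-byte register image. The helper names (movddup_load, movlpd_load, as_v2_int64) are made up for illustration, and the sketch assumes XMM0 starts out zeroed and that element 0 of v2_int64 is the low quadword.

```python
import struct

def movddup_load(mem: bytes, old_xmm: bytes) -> bytes:
    """Model F2 0F 12 /r: copy the 64-bit memory operand into both halves."""
    return mem[:8] * 2

def movlpd_load(mem: bytes, old_xmm: bytes) -> bytes:
    """Model 66 0F 12 /r: replace the low 64 bits, keep the old high 64 bits."""
    return mem[:8] + old_xmm[8:]

def as_v2_int64(xmm: bytes) -> tuple:
    """View 16 little-endian bytes as (low quadword, high quadword)."""
    return struct.unpack("<2q", xmm)

mem = struct.pack("<q", 1234)   # the qword stored at [RAX]
zeroed = bytes(16)              # assumption: XMM0 is all zero beforehand

print(as_v2_int64(movddup_load(mem, zeroed)))  # (1234, 1234)
print(as_v2_int64(movlpd_load(mem, zeroed)))   # (1234, 0)
```

Only the movddup model reproduces the observed register contents; the movlpd model leaves the high quadword at its old value.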


You can also analyze the second stream without using nanoBench. To do this, you can save the following assembler code in a file asm.s.

.intel_syntax noprefix

.global _start
_start:
    mov RAX, RSP
    mov qword ptr [RAX], 1234   
    .byte 0xf2, 0x66, 0x0f, 0x12, 0x00
    int 0x03 /* breakpoint trap */

Then, you can build it using

as asm.s -o asm.o
ld -s asm.o -o asm

Now you can analyze it with gdb using gdb ./asm:

(gdb) r
Program received signal SIGTRAP, Trace/breakpoint trap.
0x0000000000400088 in ?? ()
(gdb) p $xmm0.v2_int64
$2 = {1234, 1234}
Guidry answered 26/2, 2019 at 15:9 Comment(3)
Testing it with another disassembler isn't going to help decide whether it should be movddup or movlpd; it could be wrong as well by rearranging the prefixes, couldn't it? - Canard
@Canard The last example judges movddup vs movlpd based on the actual value in the xmm0 register, not by what a disassembler shows. movlpd would produce a different result in the register. - Elusive
Worth mentioning that Linux enters user-space with XMM registers all zeroed, so we'd see {1234, 0} if it decoded as movlpd (only the low quadword would be written). - Electrolier

9B 9B D9 30: Capstone is correct, and objdump's fstenv is also mostly correct.

fstenv isn't a real machine instruction, it's a pseudo-instruction for fwait + fnstenv. Notice that machine code for fnstenv listed in the manual entry is D9 /6, while fstenv adds a 9B before that.

9B is not an instruction prefix, it's a separate 1-byte instruction called wait aka fwait. On the original 8086+8087, this was necessary because the 8087 was a truly separate coprocessor (see "How did the 8086 interface with the 8087 FPU coprocessor?"). See the comments under the top answer there; before the 286, the two weren't tightly coupled enough for the main CPU to know if there were pending FPU exceptions.

I'm not sure of the details, but fnstsw on an 8086 / 186 could maybe read an old version of the status word that didn't have the latest flags set from a masked exception. Or maybe it only matters with unmasked exceptions, for getting the FP exception from a multiply or whatever before the fnst* instruction. According to Stephen Kitt's comments, 286 and newer "checks its TEST line before executing an NPX instruction", automatically FWAITing.

And of course CPUs with integrated FPUs have no trouble with precise FP exceptions, and synchronous behaviour, so fwait is a waste of space there.


Capstone's wait / wait / fnstenv dword ptr [eax] is thus more explicit, because as far as the CPU is concerned, it really is 3 instructions. (Andreas's answer shows that the perf counters on a modern x86 CPU record it that way.)

Objdump treats two preceding fwait instructions as part of a single fstenv. It would be more accurate to decode it as fwait ; fstenv dword ptr [eax] because Intel's manual only documents fstenv as including a single fwait opcode. But an extra fwait has no architectural effect.
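
As a rough illustration of the two presentation choices, here is a minimal Python sketch. It assumes a toy decoder that only understands 9B and D9 /6 with a memory operand; the function names decode_stream and fold_fstenv are hypothetical and not taken from Capstone or objdump.

```python
FWAIT = 0x9B  # wait / fwait is a standalone 1-byte instruction, not a prefix

def decode_stream(code: bytes) -> list:
    """Emit one 'wait' per leading 9B byte, then 'fnstenv' for D9 /6 (memory form)."""
    out = []
    i = 0
    while i < len(code) and code[i] == FWAIT:
        out.append("wait")
        i += 1
    if i + 1 < len(code) and code[i] == 0xD9:
        modrm = code[i + 1]
        mod, reg = modrm >> 6, (modrm >> 3) & 7
        if mod != 3 and reg == 6:   # /6 with a memory operand
            out.append("fnstenv")
    return out

def fold_fstenv(insns: list) -> list:
    """Optionally fold exactly one 'wait' + 'fnstenv' pair into the pseudo-instruction."""
    out = list(insns)
    if len(out) >= 2 and out[-2] == "wait" and out[-1] == "fnstenv":
        out[-2:] = ["fstenv"]
    return out

raw = decode_stream(bytes.fromhex("9b9bd930"))
print(raw)               # ['wait', 'wait', 'fnstenv']
print(fold_fstenv(raw))  # ['wait', 'fstenv']
```

The first list corresponds to the explicit three-instruction view; the folded list corresponds to treating only a single fwait as part of fstenv.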


Part 2

As Andreas's answer shows, f2 66 0f 12 00 decodes as a movddup (64-bit broadcast) on real hardware, with a meaningless 66 (data16 operand-size) prefix. objdump is correct, at least for that CPU.

The documented encoding for movddup is F2 0F 12, where F2 is a mandatory prefix, and 0F is the escape byte.

We might have expected it to decode as 66 0F 12 /r MOVLPD with a meaningless F2 REP prefix, but that's not the case; capstone is wrong. There are rules for mandatory prefix bytes: see "order for encoding x86 instruction prefix bytes", which includes the statement that "the 66 prefix is ignored if either F2 or F3 are used".

I'm not 100% sure this sequence is guaranteed to decode as movddup on all hardware, or if this is merely how Intel Sandybridge-family happens to decode it. As @fuz commented, there is a required order for mandatory prefixes and getting it wrong gives undefined behaviour (i.e. a specific CPU might decode it to anything, especially some future CPU where a different sequence of prefixes is mandatory for some other instruction).
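
For a disassembler that wants to match the behaviour observed on the CPUs tested in this thread, the selection rule could be sketched as below. This is a minimal Python sketch under stated assumptions, not a statement of what the manual guarantees: the names select_mandatory_prefix and decode_0f12 are hypothetical, the table only covers 0F 12 with a memory operand, and "the last F2/F3 wins when both appear" is an assumption.

```python
LEGACY_PREFIXES = {0xF0, 0xF2, 0xF3, 0x66, 0x67,
                   0x2E, 0x36, 0x3E, 0x26, 0x64, 0x65}

# 0F 12 /r with a memory operand, keyed by the selected mandatory prefix.
OPCODE_0F12_MEM = {
    None: "movlps",
    0x66: "movlpd",
    0xF3: "movsldup",
    0xF2: "movddup",
}

def select_mandatory_prefix(prefixes: bytes):
    """F2/F3 take priority over 66, regardless of the order they appear in."""
    rep = [p for p in prefixes if p in (0xF2, 0xF3)]
    if rep:
        return rep[-1]  # assumption: the last F2/F3 wins if both are present
    return 0x66 if 0x66 in prefixes else None

def decode_0f12(code: bytes) -> str:
    """Skip legacy prefixes, then pick the 0F 12 mnemonic (ModRM is not checked)."""
    i = 0
    while i < len(code) and code[i] in LEGACY_PREFIXES:
        i += 1
    if code[i:i + 2] != b"\x0f\x12":
        raise ValueError("not a 0F 12 opcode")
    return OPCODE_0F12_MEM[select_mandatory_prefix(code[:i])]

print(decode_0f12(bytes.fromhex("f2660f1200")))  # movddup
print(decode_0f12(bytes.fromhex("660f1200")))    # movlpd
```

With this rule the stray 66 in f2 66 0f 12 00 would be reported separately (objdump prints it as data16) rather than used to select the instruction.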

Electrolier answered 26/2, 2019 at 22:54 Comment(2)
Did you find which section of the manual describes the ordering rules? As fuz also complained in comments under the linked question, I can't find anything about mandatory order in the manual version I have either. - Denial
@Jester: No, I made that part up / parroted @fuz. >.< I didn't go digging in the manual. - Electrolier
