Are machine code instructions fetched in little endian 4-byte words on an Intel x86-64 architecture?
Despite a common definition for word (as stated on Wikipedia) being:

The largest possible address size, used to designate a location in memory, is typically a hardware word (here, "hardware word" means the full-sized natural word of the processor, as opposed to any other definition used).

However, according to some sources, x86 treats a word as 16 bits:

In the x86 PC (Intel, AMD, etc.), although the architecture has long supported 32-bit and 64-bit registers, its native word size stems back to its 16-bit origins, and a "single" word is 16 bits. A "double" word is 32 bits. See 32-bit computer and 64-bit computer.

Yet Intel's official documentation (SDM vol. 2, section 1.3.1) states:

this means the bytes of a word are numbered starting from the least significant byte. Figure 1-1 illustrates these conventions.

and Figure 1-1 shows a word as 4 bytes in little-endian order, not the 2 or 8 bytes that the conflicting definitions above would suggest in the x86-64 context:

(Figure 1-1 from the Intel SDM, illustrating little-endian byte order within a word.)

Where my confusion really lies is in how instructions are fetched and parsed. I'm writing an emulator, and once I parse a PE-formatted executable and get to the .text section, if I'm to follow the 4-byte little-endian format, doesn't that mean the 4th byte would be parsed first?

Let's make up some bytes for example:

.text segment buffer:
< 0x10, 0x1A, 0x1B, 0x1C, 0x1D, 0x1E, 0x1F, 0x20 > ....

Would I parse the first instruction as 1C, 1B, 1A, 10, then 20, 1F, 1E, 1D, and so on? (Instructions being variable-length, there are obviously potentially more words to read, depending on what the real bytes are.)

Painting answered 2/7, 2021 at 18:3 Comment(1)
Re: 16-bit "word" on x86, see the update to my answer. It's just terminology, at this point completely unrelated to the concept of "machine word".Cankered
No, x86 machine code is a byte-stream; there's nothing word-oriented about it, except for 32-bit displacements and immediates which are little-endian. e.g. in add qword [rdi + 0x1234], 0xaabbccdd. It's physically fetched in 16-byte or 32-byte chunks on modern CPUs, and split on instruction boundaries in parallel to feed to decoders in parallel.

48     81   87     34 12 00 00    dd cc bb aa
REX.W  add  ModRM  le32 0x1234    le32 0xaabbccdd (sign-extended to 64-bit)

   add    QWORD PTR [rdi+0x1234],0xffffffffaabbccdd
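To make the byte-stream layout concrete, here's a small sketch (Python, standard library only) that pulls the little-endian displacement and immediate out of the encoding above. The field offsets are hard-coded for this one instruction; a real decoder would derive them from the prefixes, opcode, and ModRM byte.

```python
import struct

# Encoding of `add qword [rdi + 0x1234], 0xaabbccdd` from the breakdown above.
code = bytes([0x48, 0x81, 0x87, 0x34, 0x12, 0x00, 0x00, 0xDD, 0xCC, 0xBB, 0xAA])

rex, opcode, modrm = code[0], code[1], code[2]  # 48 = REX.W, 81 = add r/m, imm32
(disp,) = struct.unpack("<i", code[3:7])        # "<i" = little-endian signed 32-bit
(imm,)  = struct.unpack("<i", code[7:11])       # imm32 is sign-extended to 64 bits

print(hex(disp))               # 0x1234
print(hex(imm & (2**64 - 1)))  # 0xffffffffaabbccdd, matching the disassembly
```

Note that the bytes themselves arrive in order; only the multibyte fields are little-endian, and `struct.unpack` handles that in one call.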

x86-64 is not a word-oriented architecture; there is no single natural word size, and things don't have to be aligned, so the concept is not very useful when thinking about x86-64. The integer register width happens to be 8 bytes, but that's not even the default operand-size in machine code: most instructions accept any operand-size from byte to qword, and SIMD instructions from 8 or 16 bytes up to 32 or 64. Most importantly, alignment of wider integers isn't required, either in machine code or for data.


Some people like to fit a square peg into a round hole and describe x86 in terms of machine-words, but that concept only really fits well for RISC ISAs that are designed around a single word size. (Fixed instruction length, register size, and even data memory load/store is required to be word aligned for word-sized accesses on some RISCs, although modern ones often allow unaligned load/store with some performance penalty.)
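The difference is easy to see in code. Here's a toy sketch (Python, purely illustrative, not a real decoder): a fixed-width RISC front end can treat the text section as an array of aligned 4-byte little-endian words, while an x86 front end has to walk a byte cursor whose step size is only known after at least partially decoding each instruction. `length_of` is a made-up stand-in for a real x86 length decoder.

```python
import struct

def risc_fetch(code: bytes):
    """Fixed-width fetch: every instruction is one aligned 4-byte LE word."""
    return [struct.unpack_from("<I", code, off)[0]
            for off in range(0, len(code), 4)]

def x86_fetch(code: bytes, length_of):
    """Byte-stream fetch: step size comes from decoding, not alignment."""
    off, insns = 0, []
    while off < len(code):
        n = length_of(code, off)  # a real decoder inspects prefixes/opcode/ModRM
        insns.append(code[off:off + n])
        off += n
    return insns

# Toy length function: the two ALU instructions discussed in this thread
# (`adc [rdx], bl`, `and [rdi], bl`) are each opcode + ModRM = 2 bytes.
print(x86_fetch(bytes([0x10, 0x1A, 0x20, 0x1F]), lambda c, o: 2))
```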

(To be fair, 64-bit RISCs are usually also equally efficient with 32 and 64-bit integers. But unlike x86 they can't do add ax, cx that avoids propagating carry into the higher bits of a register. Although RISCs can do a 16-bit store after some math on sign-extending or zero-extending load results).

according to some sources, note it's treated as 16 bits:

Yes, in x86 terminology / documentation, a "word" is 16 bits, because modern x86-64 evolved out of 8086 and it would have been silly to change the meaning of a term in the documentation everyone had been using for years when 386 was released. Hence paddw packed add of 16-bit SIMD elements, and movsw/stosw/etc. string instructions.

An x86 16-bit "word" has absolutely zero connection to the concept of a "machine word" in CPU architecture.

On 8086 through 286, 16 bits was the register and bus width, and the only integer operand-size other than byte usable with most ALU instructions. But those CPUs were still very much not based around "words" the way MIPS is: the machine-code format was the same byte stream, with unaligned little-endian 16-bit immediates and displacements. (8088 was identical to 8086, except for its 8-bit bus interface and a 4-byte instruction-prefetch buffer instead of 6 bytes.)

Cankered answered 2/7, 2021 at 18:42 Comment(6)
I realize the answer to this comment is fairly unlikely considering how microcode that employs these things is usually proprietary. But... would you happen to have a reference to some literature that covers the process of parsing an x86-64 machine code instruction? I've been drafting and carefully documenting a question about how it can be done, what with all the optional fields (instruction prefix, ModR/M, SIB, displacement, immediate). It's an interesting challenge and I've yet to find anyone covering the topic.Painting
@J.Todd: You mean how hardware does it in parallel? The observable performance quirks (agner.org/optimize) shed some light on things, e.g. length-changing prefixes like in add cx, 0x1234, the same opcode without a 66 prefix would imply a longer instruction with a 4-byte immediate. And github.com/travisdowns/uarch-bench/wiki/… / Can the simple decoders in recent Intel microarchitectures handle all 1-µop instructions? - it seems instructions get routed to decoders using a conservative check for being 1 uopCankered
@J.Todd: But if you mean how to write a software decoder, that's pretty well documented in Intel's official manuals, in terms of recording which prefixes you've seen before you get to an opcode, then the opcode implies presence or absences of modrm and/or immediate. ModRM implies presence/absence of SIB and/or displacement. A bit of extra complication for instructions with "mandatory prefixes", where a set of prefixes seen before the opcode mean it's a different instruction. e.g. F3 0F 58 /r addss felixcloutier.com/x86/addssCankered
@J.Todd: That said, writing a disassembler / decoder that matches what the CPU does in every case is non-trivial. See Christopher Domas's talk at BlackHat 2017, Breaking the x86 Instruction Set for practical ways of brute-force testing what real CPUs do. He says he found bugs in every existing disassembler.Cankered
Often I find the process of creating a well explained SO question answers my question. I'm not finished reading the Intel manual on the process, perhaps it'll all come together as I finish reading and citing the info I find confusing.Painting
And yes on your third comment, I believe I linked that talk to you in a comment recently. Although I'm confident you'd seen it before.Painting
M
2

No, x86 instructions are parsed as a sequence of bytes, not as a longer word. In your example, the first instruction is the bytes 0x10 0x1a, which decodes to adc [rdx], bl. It is not 0x1c 0x1b, which would decode to sbb al, 0x1b, nor 0x20 0x1f, which would decode to and [rdi], bl.

However, when an instruction contains a multibyte number (16/32/64 bits) as an immediate operand, displacement, address, etc, then that number is encoded little-endian. For example, add ecx, 0x12345678 is encoded 0x81 0xc1 0x78 0x56 0x34 0x12.
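In other words, byte order only matters inside such a multibyte field, and Python's struct module (or the equivalent in whatever language the emulator is written in) can read it directly. A minimal sketch, using the encoding from this answer:

```python
import struct

# `add ecx, 0x12345678` as given above: opcode 0x81, ModRM 0xC1 (/0 = add, rm = ecx),
# followed by the imm32 stored least-significant byte first.
code = bytes([0x81, 0xC1, 0x78, 0x56, 0x34, 0x12])

opcode, modrm = code[0], code[1]
(imm,) = struct.unpack("<I", code[2:6])  # "<I" = little-endian unsigned 32-bit

print(hex(imm))  # 0x12345678: the instruction bytes are never reordered, only the field is LE
```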

Magnanimous answered 2/7, 2021 at 18:41 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.