Weird data sizes?

Asked 13/9, 2015 at 9:53 Answered 13/9, 2015 at 13:57

1 Byte = 8bits
1 Halfword = 16bits = 2 Bytes
1 Word = 32 bits = 4 Bytes
1 Long = 64 bits = 8 Bytes

But then in x86 assembly (AT&T syntax), I use movw (move word) to move a halfword to a 16-bit register, and movl (move long) to move a word to a 32-bit register. I'm running a 32-bit OS in a virtual machine on a 64-bit host OS.

What am I doing wrong?

Does that mean in the OS running in my VM the sizes are:

1 Byte = 4bits
1 Halfword = 8bits = 2 Bytes
1 Word = 16 bits = 4 Bytes
1 Long = 32 bits = 8 Bytes

I checked on GDB the sizes and I think they were:

1 Byte = 8bits
1 Halfword = 16bits = 2 Bytes
1 Word = 32 bits = 4 Bytes
1 Long = 64 bits = 8 Bytes

Redeeming asked 13/9, 2015 at 9:53 Comment(1)

Another duplicate: What's the size of a QWORD on a 64-bit machine? and also related: What comes after QWORD? – Graupel 28/2, 2021 at 2:24

The term word size, or machine word, usually refers to the size of a register, and the size of a native load/store. The wikipedia article mentions some of the same stuff I wrote in this answer.

For a 64-bit system, a word could mean 8 bytes, but yes it's common for 64-bit RISC machines to use word = 32-bit. Most of them evolved out of 32-bit RISC ISAs, so it's natural to keep the same terminology and call 64-bits a double-word.

(Note that GDB uses its own notion of what a "word" is, separate from the ISA.)

But x86 evolved out of 16-bit 8086, where word = 16-bit. When x86 was extended to have a 32-bit mode (i386), the simplest choice for everyone was to keep the same names for everything. An x86 dword is still 32 bits, an x86 word is still 16 bits. Even original 8086 + 8087 could load and store dword and qword integers, floats, and doubles, and instructions like cwd (sign extend word to dword) existed in 8086 to set up for idiv, so these terms were already in full use before 386 extended the register width to dword.
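To make the cwd example concrete, here is a minimal C sketch of what that instruction computes. The function name cwd_dx_ax is made up for illustration, and the return value simply packs the DX:AX register pair into one 32-bit integer; real code would just let the compiler emit the instruction.

```c
#include <stdint.h>

/* Hypothetical helper modelling 8086 CWD: sign-extend the word in AX
   into the dword held in the register pair DX:AX. */
uint32_t cwd_dx_ax(int16_t ax_value)
{
    uint16_t ax = (uint16_t)ax_value;             /* low word, bits unchanged */
    uint16_t dx = (ax & 0x8000u) ? 0xFFFFu : 0u;  /* every bit is a copy of AX's sign bit */
    return ((uint32_t)dx << 16) | ax;             /* DX:AX viewed as one dword */
}
```

For example, cwd_dx_ax(-2) gives 0xFFFFFFFE, i.e. DX = 0xFFFF and AX = 0xFFFE, which is the 32-bit dividend idiv expects for a 16-bit divisor.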

Also note that renaming everything would have been really confusing, because when 386 was new, most of them were still used in 16-bit mode to run DOS programs. Even modern x86-64 CPUs have full support for running in 16-bit real mode, so it would have been very confusing to have word mean different things in different parts of Intel's manuals.

A byte is always an octet of 8 bits, except in some historical computer architectures; there were some with 9-bit bytes. The C standard still doesn't require CHAR_BIT == 8, so fully portable code can't assume that, nor 2's complement signed integers.
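If you want to see what your C implementation uses, a small sketch like the following avoids hard-coding 8. The helper names are made up for illustration; CHAR_BIT is the standard macro from limits.h.

```c
#include <limits.h>

/* Count the bits in an unsigned char by shifting a 1 until it falls off the top. */
unsigned uchar_bits(void)
{
    unsigned char c = 1;
    unsigned n = 0;
    while (c != 0) {
        c = (unsigned char)(c << 1);  /* the conversion wraps modulo UCHAR_MAX + 1 */
        n++;
    }
    return n;
}

/* Nonzero if the measured width agrees with what limits.h reports. */
int char_bit_matches(void)
{
    return uchar_bits() == CHAR_BIT;
}
```

On x86 (and every other mainstream target) uchar_bits() returns 8; the point is only that portable code should ask rather than assume.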

So in x86 documentation and asm mnemonics / syntax:

B = Byte = 8bits (PADDB add packed 8bit ints in a vector)
W = word = 16bits (PADDW add packed 16bit ints in a vector)
D = long or dword (double-word) = 32bits (PADDD add packed 32bit ints in a vector)
Q = quad-word = 64bits (PADDQ add packed 64bit ints in a vector)
DQ = double-quad (also sometimes oct-word) = 128b (movdqa copy aligned 128b. PUNPCKLQDQ: interleave the Low two 64bit Qwords of 128b src and dest into the DQ dest.)
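As a rough cross-check from C, the x86 size names above line up with the fixed-width types in stdint.h. This is only a sketch: the x86_* typedef names and bits_of are invented here, and the 128-bit DQ is modelled as a pair of qwords because standard C has no 128-bit integer type.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical typedef names mirroring the x86 size terms. */
typedef uint8_t  x86_byte;   /* B: 8 bits  */
typedef uint16_t x86_word;   /* W: 16 bits */
typedef uint32_t x86_dword;  /* D: 32 bits (AT&T suffix l, "long") */
typedef uint64_t x86_qword;  /* Q: 64 bits */

/* DQ: 128 bits, modelled as two qwords. */
typedef struct { x86_qword lo, hi; } x86_dqword;

/* Compile-time checks: these hold regardless of 16/32/64-bit mode. */
static_assert(sizeof(x86_word)  * CHAR_BIT == 16, "x86 word is 16 bits");
static_assert(sizeof(x86_dword) * CHAR_BIT == 32, "x86 dword is 32 bits");

/* Width in bits of an object that occupies n bytes. */
unsigned bits_of(size_t n)
{
    return (unsigned)(n * CHAR_BIT);
}
```

Calling bits_of(sizeof(x86_qword)) gives 64 whether the program is built as 32-bit or 64-bit code, which is the whole point: the names don't depend on the mode.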

AVX vmovdqa ymm0, [rdi] is a 32B load, even though the mnemonic still says DQ. AVX is more like multiple 128b lanes than real native 256b vectors, so this kind of justifies it.

In Intel syntax (MASM style), an explicit size like mov word ptr [rdi], 123 is sometimes needed to specify the operand size, when there is no register operand to infer it from. (NASM spells it without ptr: mov word [rdi], 123.) AT&T syntax uses suffixes on mnemonics to specify operand size, if you don't want to leave it implicit and inferred from the choice of register: movw (%rdi), %ax.

The B/W/D suffixes in mnemonics predate vector extensions; the string instructions are one example. STOS does *rdi = al/ax/eax/rax and then advances rdi by the operand size (or moves it backwards if the direction flag is set). It can be written with an operand, like STOS byte ptr [RDI], to tell the assembler which operand-size version to encode. But even in Intel / MASM / NASM syntax, you can also write STOSB / STOSW / STOSD / STOSQ.
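A rough C model of the STOS family may help show how the suffix only selects the operand size. This is a sketch under stated assumptions: the function name stos is made up, the direction flag is taken to be clear (so the pointer moves forward after the store), and the bytes are written in little-endian order as on x86.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of STOSB/STOSW/STOSD/STOSQ with DF = 0:
   store the low `size` bytes of the accumulator at rdi, then return
   the advanced pointer. `size` is expected to be 1, 2, 4 or 8. */
unsigned char *stos(unsigned char *rdi, uint64_t rax, size_t size)
{
    for (size_t i = 0; i < size; i++)
        rdi[i] = (unsigned char)(rax >> (8 * i));  /* byte i of al/ax/eax/rax */
    return rdi + size;                             /* rdi += operand size */
}
```

So stos(p, v, 2) behaves like STOSW: it stores the low 16 bits of the accumulator and advances the pointer by 2, whatever mode the CPU is in.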

x86 is very much not a word-oriented architecture.

The whole concept of a "machine word" doesn't fit well for x86. 32-bit-only P5 Pentium CPUs have guaranteed-atomic aligned loads/stores up to 64 bits (e.g. with x87 or MMX), even though the integer register width is only 32 bits. (A 64-bit CAS requires lock cmpxchg8b in 32-bit mode.)

With x86-64, support for SSE2 is guaranteed, so we have 16-byte vector registers, and efficient support for basically every integer instruction with 8, 16, 32, or 64-bit operand-size. 32-bit operand-size is the default in x86-64 machine code (requiring no extra prefixes), so it's the most efficient for code size, and sometimes also for performance, e.g. for div or imul on some CPUs.

Also, unaligned loads and stores are fully efficient, not even an extra cache RMW cycle to commit unaligned or byte stores to L1d cache, as long as they don't cross a cache-line boundary. And the instruction format is a byte stream, not aligned words.
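From C, the usual way to express an unaligned access without undefined behaviour is memcpy into a fixed-width variable. The helper names below are made up for illustration, and the statement that compilers typically turn this into a single mov on x86-64 is an expectation about optimizing builds, not a guarantee.

```c
#include <stdint.h>
#include <string.h>

/* Load a dword from any address, aligned or not. */
uint32_t load_dword(const void *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);  /* well-defined even if p is misaligned */
    return v;
}

/* Store a dword to any address, aligned or not. */
void store_dword(void *p, uint32_t v)
{
    memcpy(p, &v, sizeof v);
}
```

Here store_dword(buf + 1, x) followed by load_dword(buf + 1) round-trips x even though buf + 1 is not 4-byte aligned.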

So it's not very meaningful to say that modern x86-64 has any specific "word size". The concept doesn't fit x86-64 as an ISA, and certainly not actual modern microarchitectures with their efficient unaligned loads/stores.

Graupel answered 13/9, 2015 at 13:57 Comment(0)

-2

In x86 a word is always 16 bits:

1 Byte = 8 bits
1 Word = 16 bits = 2 Bytes
1 Dword (long) = 32 bits = 4 Bytes
1 Qword = 64 bits = 8 Bytes

In GDB/real size (on a 32 bits computer):

1 Byte = 8bits
1 Halfword = 16bits = 2 Bytes
1 Word = 32 bits = 4 Bytes
1 Giant (long) = 64 bits = 8 Bytes

Intel "messed up" on the word sizes because of 16 bit processors.

Redeeming answered 13/9, 2015 at 10:41 Comment(3)

See Peter Cordes' answer. Your "GDB" list with Halfword makes no sense for Intel and compatibles. And for this, it does not matter if you have a 16 bit, 32 bit or 64 bit processor. Intel did not "mess up". I guess you are somehow thinking of machine words. These play no role here. BYTE, WORD etc. have fixed meanings and sizes. – Capp 14/9, 2015 at 16:0

Yeah, you're right I thought a 32 bit PC was supposed to have a 32 bit word for everything (read quotes below). What I meant by Intel "messed up" is that they chose to make it 16 bits so it was compatible with some old processors but I guess I was wrong. Thank you. "the majority of the registers in a processor are usually word sized and the largest piece of data that can be transferred to and from the working memory in a single operation is a word in many (not all) architectures." - On Wikipedia about words – Redeeming 15/9, 2015 at 9:58

ISTM you are confusing "machine word" with the WORD type. – Capp 15/9, 2015 at 16:58
