All three of the earlier answers are wrong in different ways.
The accepted answer by Margaret Bloom implies that partial register stalls are to blame. Partial register stalls are a real thing, but are unlikely to be relevant to GCC's decision here.
If GCC replaced `mov edx,3` by `mov dl,3`, then the code would just be wrong, because writes to byte registers (unlike writes to dword registers) don't zero the rest of the register. The parameter in `rdx` is of type `size_t`, which is 64 bits, so the callee will read the full register, which will contain garbage in bits 8 to 63. Partial register stalls are purely a performance issue; it doesn't matter how fast the code runs if it's wrong.
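To make that concrete, here is a sketch (NASM-style syntax, with a made-up initial value purely for illustration) of what the two instructions do to `rdx`:

```asm
mov rax, 0x1122334455667788
mov rdx, rax        ; pretend rdx holds leftover bits from earlier code
mov dl, 3           ; writes bits 0-7 only:  rdx = 0x1122334455667703
mov rdx, rax
mov edx, 3          ; writes bits 0-31, zeroes bits 32-63: rdx = 3
```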
That bug could be fixed by inserting `xor edx,edx` before `mov dl,3`. With that fix, there is no partial register stall, because zeroing a full register with `xor` or `sub` and then writing to the low byte is special-cased in all CPUs that have the stalling problem. So partial register stalls are still irrelevant with the fix.
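The fixed sequence would look like this:

```asm
xor edx, edx   ; recognized zeroing idiom: the CPU knows rdx is zero
mov dl, 3      ; safe after the idiom; rdx now holds exactly 3
```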
The only situation where partial register stalls would become relevant is if GCC happened to know that the register was zero, but it wasn't zeroed by one of the special-cased instructions. For example, if this syscall was preceded by

```asm
loop:
    ...
    dec edx
    jnz loop
```

then GCC could deduce that `rdx` was zero at the point where it wants to put 3 in it, and `mov dl,3` would be correct – but it would be a bad idea in general because it could cause a partial-register stall. (Here, it wouldn't matter because syscalls are so slow anyway, but I don't think GCC has a "slow function that there's no need to speed-optimize calls to" attribute in its internal type system.)
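For reference, a minimal sketch of the pattern that stalls on P6-family CPUs, assuming the loop above has just exited (so `edx` is zero, but not via a recognized zeroing idiom):

```asm
mov dl, 3      ; partial write: only bits 0-7 are renamed on the P6 family
mov eax, edx   ; full-register read after a partial write -- the P6
               ; family stalls here while it merges dl back into edx
```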
Why doesn't GCC emit `xor` followed by a byte move, if not because of partial register stalls? I don't know, but I can speculate. It only saves space when initializing r0 through r3 (`rax`, `rcx`, `rdx`, `rbx`, the registers whose low bytes are addressable without a REX prefix), and even then it only saves one byte. It increases the number of instructions, which has its own costs (the instruction decoders are frequently a bottleneck). It also clobbers the flags, unlike the standard `mov`, which means it isn't a drop-in replacement. GCC would have to track a separate flag-clobbering register-initialization sequence, which in most cases (11 of the 15 possible destination registers) would be unambiguously less efficient.
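A sketch of the encodings involved (byte counts from the standard x86-64 encoding) shows why only the low four registers win:

```asm
; rdx (r2): the byte form wins by one byte
mov edx, 3       ; BA 03 00 00 00  -- 5 bytes
xor edx, edx     ; 31 D2           -- 2 bytes
mov dl, 3        ; B2 03           -- 2 bytes (4 total)

; rsi (r6): sil needs a REX prefix, so the saving disappears
mov esi, 3       ; BE 03 00 00 00  -- 5 bytes
xor esi, esi     ; 31 F6           -- 2 bytes
mov sil, 3       ; 40 B6 03        -- 3 bytes (5 total)
```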
If you're aggressively optimizing for size, you can do `push 3` followed by `pop rdx`, which saves 2 bytes regardless of the destination register, and doesn't clobber the flags. But it is probably much slower, because it writes to memory and has a false read-write dependence on `rsp`, and the space savings seem unlikely to be worth it. (It also modifies the red zone, so it isn't a drop-in replacement either.)
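Again a sketch of the encodings (the `push imm8` form sign-extends its operand to 64 bits, which is fine for a small non-negative constant like 3):

```asm
push 3       ; 6A 03  -- 2 bytes; stores a sign-extended qword at [rsp-8]
pop rdx      ; 5A     -- 1 byte;  3 bytes total vs. 5 for mov edx,3
```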
supercat's answer says:

> Processor cores often include logic to execute multiple 32-bit or 64-bit instructions simultaneously, but may not include logic to execute an 8-bit operation simultaneously with anything else. Consequently, while using 8-bit operations on the 8088 when possible was a useful optimization on the 8088, it can actually be a significant performance drain on newer processors.
Modern optimizing compilers actually use 8-bit GPRs quite a lot. (They use 16-bit GPRs relatively rarely, but I think that's because 16-bit quantities are uncommon in modern code.) 8-bit and 16-bit operations are at least as fast as 32-bit and 64-bit operations at most execution stages, and some are faster.

I previously wrote here "As far as I know, 8-bit operations are as fast as, or faster than, 32/64-bit operations on absolutely every 32/64 bit x86/x64 processor ever made." But I was wrong. Quite a few superscalar x86/x64 processors merge 8- and 16-bit destinations into the full register on every write, which means that write-only instructions like `mov` have a false read dependency when the destination is 8/16 bits that doesn't exist when it's 32/64 bits. False dependency chains can slow execution if you don't clear the register before every move (or during, using something like `movzx`). Newer processors have this problem even though the earliest superscalar processors (Pentium Pro/II/III) didn't have it. In spite of that, modern optimizing compilers do use the smaller registers in my experience.
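A sketch of the difference, assuming a CPU that merges partial writes and some valid pointer in `rsi`:

```asm
mov dl, [rsi]         ; loads 8 bits; on merging CPUs this must also read
                      ; the old rdx to merge it in -- a false dependency
movzx edx, byte [rsi] ; writes all 64 bits of rdx (zero-extended), so it
                      ; depends only on the load, not on the old rdx
```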
BeeOnRope's answer says:

> The short answer for your particular case, is because gcc always sign or zero-extends arguments to 32-bits when calling a C ABI function.
But this function has no parameters shorter than 32 bits in the first place. File descriptors are exactly 32 bits long, and `size_t` is exactly 64 bits long. It doesn't matter that many of those bits are often zero; they aren't variable-length integers that are encoded in 1 byte if they're small. It would only be correct to use `mov dl,3`, with the rest of `rdx` possibly being nonzero, for a parameter if there were no integer-promotion requirement in the ABI and the actual parameter type was `char` or some other 8-bit type.
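For concreteness, a sketch of the x86-64 SysV argument setup for a call like `write(1, buf, 3)` (the `msg` label is hypothetical):

```asm
mov edi, 1          ; fd: a 32-bit int; the dword write zeroes bits 32-63 of rdi
lea rsi, [rel msg]  ; buf: a full 64-bit pointer
mov edx, 3          ; count: size_t is 64 bits, and this is the shortest
                    ; single instruction that sets all 64 bits of rdx to 3
```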
Comments:

`3` and `1`, which are already `signed int`, remain as `signed int`. – Urania

`3` will either remain as `signed int` or be converted to `size_t`, depending on whether there's a prototype. – Urania

`xor eax, eax` indicates that the call was made without a prototype in scope. It doesn't know whether the function is varargs or not, so it sets AL to 0 to indicate 0 arguments passed in SSE registers. Your weird case is really an ABI question; the "as if" rule allows either implementation so long as both ends agree on it. – Urania
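A sketch of the convention the last comment describes (using `printf` as a stand-in varargs callee, with a hypothetical `fmt` label; not from the answer itself):

```asm
lea rdi, [rel fmt]  ; first (named) argument: the format string
xor eax, eax        ; AL = 0: zero vector-register arguments, as the
                    ; SysV ABI requires when calling a varargs function
call printf
```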