Why doesn't GCC use partial registers?

Disassembling write(1,"hi",3) on Linux, built with gcc -s -nostdlib -nostartfiles -O3, results in:

ba03000000     mov edx, 3 ; thanks for the correction jester!
bf01000000     mov edi, 1
31c0           xor eax, eax
e9d8ffffff     jmp loc.imp.write

I'm not into compiler development, but since every value moved into these registers is constant and known at compile time, I'm curious why GCC doesn't use dl, dil, and al instead. Some may argue that this won't make any difference in performance, but there's a big difference in executable size between mov $1, %rax => b801000000 and mov $1, %al => b001 when we are talking about thousands of register accesses in a program. Not only is small size part of a software's elegance, it also has an effect on performance.

Can someone explain why GCC "decided" that it doesn't matter?

Yclept answered 10/1, 2017 at 16:23 Comment(9)
If you only load partial registers, the rest will contain random garbage and the callee will use the whole register (as appropriate for the data type). It also causes partial register stalls. Note that writing the low 32 bits will, however, zero the top 32 bits automatically. PS: you disassembled wrong; all of those instructions are actually 32-bit (no REX prefix).Cassell
It doesn't have anything to do with GCC, every C compiler is required to do this. Google "C integer promotion" to learn more.Colobus
@HansPassant Does integer promotion work for function arguments of prototyped functions? As far as I can tell from the standard, only the default argument promotions apply to function calls. Quoting: "The integer promotions are applied only: as part of the usual arithmetic conversions, to certain argument expressions [note: the default argument promotions mentioned above], to the operands of the unary +, -, and ~ operators, and to both operands of the shift operators, as specified by their respective subclauses"Omnipotence
@MargaretBloom The value passed as an argument is converted as if by assignment to the type of the parameter. See paragraph 7. Either way, this means that the constants 3 and 1, which are already signed int, remain signed int.Urania
@RossRidge Yes, but does assignment perform integer promotion? From my understanding, the answer seems to be no.Omnipotence
@MargaretBloom No, but I don't think quibbling over whether Hans Passant was using the correct terminology is all that helpful here. I should correct my last comment though: the constant 3 will either remain a signed int or be converted to size_t depending on whether there's a prototype.Urania
@RossRidge Of course, that was just total nitpicking in this context. I've asked because I stumbled on this weird case and I'm considering if it's worth asking about.Omnipotence
@MargaretBloom For what it's worth, the xor eax, eax indicates that the call was made without a prototype in scope. The compiler doesn't know whether the function is varargs or not, so it sets AL to 0 to indicate that 0 arguments are passed in SSE registers. Your weird case is really an ABI question; the "as if" rule allows either implementation so long as both ends agree on it.Urania
@RossRidge That's a good point.Omnipotence

Yes, GCC generally avoids writing to partial registers, unless optimizing for size (-Os) instead of purely speed (-O3). Some cases require writing at least the 32-bit register for correctness, so a better example would be something like:

char foo(char *p) { return *p; } compiles to movzx eax, byte ptr [rdi]
instead of mov al, [rdi]. https://godbolt.org/z/4ca9cTG9j
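
For contrast, the -Os case can be seen with another tiny function on the same Godbolt link; a minimal sketch of the two outputs:

char bar() { return 1; }

; gcc -Os:  mov al, 1     (2 bytes, writes only the low byte)
; gcc -O3:  mov eax, 1    (5 bytes, writes the whole register)

(Clang writes just AL here even when optimizing for speed, as noted in the comments below.)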

But GCC doesn't always avoid partial registers, sometimes even causing partial-register stalls https://gcc.gnu.org/bugzilla/show_bug.cgi?id=15533


Writing partial registers entails a performance penalty on many x86 processors, because a partial register, when written, is renamed into a physical register separate from the one holding its full-width counterpart. (For more about register renaming enabling out-of-order execution, see this Q&A).

But when an instruction reads the whole register, the CPU has to detect the fact that it doesn't have the correct architectural register value available in a single physical register. (This happens in the issue/rename stage, as the CPU prepares to send the uop into the out-of-order scheduler.)

It's called a partial register stall. Agner Fog's microarchitecture manual explains it pretty well:

6.8 Partial register stalls (PPro/PII/PIII and early Pentium-M)

Partial register stall is a problem that occurs when we write to part of a 32-bit register and later read from the whole register or a bigger part of it.
Example:

; Example 6.10a. Partial register stall
mov al, byte ptr [mem8]
mov ebx, eax ; Partial register stall

This gives a delay of 5 - 6 clocks. The reason is that a temporary register has been assigned to AL to make it independent of AH. The execution unit has to wait until the write to AL has retired before it is possible to combine the value from AL with the value of the rest of EAX.

Behaviour in different CPUs:

Without partial-register renaming, the input dependency of the write is a false dependency if you never read the full register. This limits instruction-level parallelism, because reusing an 8- or 16-bit register for something else is not actually independent from the CPU's point of view (16-bit code can access 32-bit registers, so the CPU has to maintain correct values in the upper halves). It also means AL and AH aren't independent of each other. When Intel designed the P6 family (PPro released in 1995), 16-bit code was still common, so partial-register renaming was an important feature to make existing machine code run faster. (In practice, many binaries don't get recompiled for new CPUs.)
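
To make that concrete, a hand-written sketch (symbolic memory operands in the style of the Agner Fog example above; not compiler output):

; without partial-register renaming, reusing BL for unrelated data isn't free
mov ebx, [counter]     ; EBX holds a live 32-bit value
mov bl, [byte_flag]    ; looks independent, but must merge into the old EBX,
                       ; so it can't execute until the load above completes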

That's why compilers mostly avoid writing partial registers. They use movzx / movsx whenever possible to zero- or sign-extend narrow values to a full register to avoid partial-register false dependencies (AMD) or stalls (Intel P6-family). Thus most modern machine code doesn't benefit much from partial-register renaming, which is why recent Intel CPUs are simplifying their partial-register renaming logic.
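
The signed counterpart of the earlier example behaves the same way; a sketch (hypothetical function name, typical gcc -O2 output):

int widen(signed char *p) { return *p; }

movsx eax, byte ptr [rdi]    ; sign-extends into the full register
ret                          ; rather than mov al, [rdi] plus a later merge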

As @BeeOnRope's answer points out, compilers still read partial registers, because that's not a problem. (Reading AH/BH/CH/DH can add an extra cycle of latency on Haswell/Skylake, though, see the earlier link about partial registers on recent members of Sandybridge-family.)


Also note that write takes arguments that, for a typically configured x86-64 GCC, need whole 32-bit and 64-bit registers, so the call couldn't simply be assembled with mov dl, 3. The size is determined by the type of the data, not by its value.

Only 32-bit register writes implicitly zero-extend to the full 64 bits; writing 8- and 16-bit partial registers leaves the upper bytes unchanged. (That merging behaviour is what makes partial writes tricky for hardware to handle efficiently, which is why AMD64 didn't follow the same pattern for 32-bit writes.)
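
A small hand-written sketch of the difference (values assumed for illustration, not from the question's program):

mov rdx, -1        ; rdx = 0xFFFFFFFFFFFFFFFF
mov edx, 3         ; rdx = 0x0000000000000003   (32-bit write zero-extends)

mov rdx, -1
mov dl, 3          ; rdx = 0xFFFFFFFFFFFFFF03   (8-bit write merges into the old value)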

Finally, in certain contexts C has default argument promotions to be aware of, though they don't apply here.
Actually, as RossRidge pointed out, the call was probably made without a visible prototype.


Your disassembly is misleading, as @Jester pointed out.
For example, mov rdx, 3 is actually mov edx, 3, although both have the same effect: they put 3 in the whole rdx.
This works because an immediate value of 3 doesn't need sign-extension, and MOV r32, imm32 implicitly clears the upper 32 bits of the register.
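
For reference, the encodings involved (assembled by hand from the standard x86-64 encoding rules, so treat them as illustrative):

ba 03 00 00 00           mov edx, 3    ; 5 bytes, what GCC emits
48 c7 c2 03 00 00 00     mov rdx, 3    ; 7 bytes, REX.W form with sign-extended imm32
b2 03                    mov dl, 3     ; 2 bytes, but leaves bits 8-63 of rdx unchanged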

Omnipotence answered 10/1, 2017 at 17:16 Comment(13)
"a temporary register has been assigned to AL to make it independent of AH". Why assign a "new" register for it, AL isn't physically a "subset" of EAX? Why does it have to be separated from EAX?Guanabara
To make more instructions execute in parallel. It's called register renaming, the Agner Fog manual I linked has more in-depth material than the Wikipedia article. Intel Optimisation manuals also cover this topic.Omnipotence
The above quotation from Agner Fog is for Netburst (Pentium 4). The quoted delay of 5 - 6 clocks is much better on later microarchitectures. For example, for Sandy Bridge and Ivy Bridge: "The Ivy Bridge inserts an extra μop only in the case where a high 8-bit register (AH, BH, CH, DH) has been modified"Barney
Thanks @Olsonist, it is worth mentioning. If you don't mind I'll quote your comment in the answer.Omnipotence
Yes, the simple answer has nothing to do with register stalls (it is easy to find examples of compilers reading and writing partial registers even at -O2 and -O3), but rather that the x86 and x86-64 calling conventions require that arguments smaller than 32 bits be zero- or sign-extended when passing them to functions. This holds regardless of what the prototype is, and even if there is no prototype in scope (the ABI still applies then, with some "default" function signature based on the shape of the call). Interestingly, the high 32 bits can contain garbage, but not bits 8 to 31.Fornof
@Fornof What an interesting question you asked! I had the very same question a bit ago, when I posted this comment above.Omnipotence
Oh yes, I hadn't clicked on it! I wrote an answer below that covers some of the nuances, but it links to two other questions that cover it in a lot of details. What I just discovered while playing around with this question though is that apparently icc doesn't respect the defacto rules, so actually icc and clang are incompatible on Linux and OSX too (I guess icc exists on OSX)? The short version is that while both clang and gcc respect the "zero extend rule", only clang relies on it, as your example shows. gcc zero extends, but doesn't use that fact to optimize...Fornof
BTW, my claim above is a bit too strong, I wrote it before I did the remaining research and found out that it is a bit of a weaker "defacto" standard than I thought, with icc not adhering and some debate about what the right thing to do is. The key to the OPs question though is that gcc does follow the extension-to-32-bits rule.Fornof
I fixed several things that were wrong (the quote is for early P6, not Netburst), and added a table of how different CPUs handle partial registers. Anyway, I wanted to link to a good answer about partial registers in general, and now this is one, IMO :)Riches
@PeterCordes Thank you very much Peter! I really appreciate itOmnipotence
@PeterCordes is there any information on how expensive the merging uop for partial registers on Haswell and newer is? Does it only show up as 1 uop in ROB (so unless ROB bottleneck no cost) or can it cause bottlenecks elsewhere.Maria
@Noah: As discussed in How exactly do partial registers on Haswell/Skylake perform? Writing AL seems to have a false dependency on RAX, and AH is inconsistent, the only merging on HSW and later is high-8, and it seems that a high-8 merging uop has to issue in a cycle by itself (like Intel documents for SnB, IIRC). So it costs more front-end bandwidth than you'd expect, but otherwise AFAIK it's just 1 uop for the ROB (and probably RS, I forget if it needs a back-end port). I haven't done a lot of experiments with it.Riches
@Noah: Oh, also IIRC, a loop that needs a high-8 merge was supposed to not run from the LSD, before Intel disabled SKL's LSD entirely (because of a correctness bug involving this somehow). Maybe still a thing on HSW or ICL.Riches

All three of the earlier answers are wrong in different ways.

The accepted answer by Margaret Bloom implies that partial register stalls are to blame. Partial register stalls are a real thing, but are unlikely to be relevant to GCC's decision here.

If GCC replaced mov edx,3 by mov dl,3, then the code would just be wrong, because writes to byte registers (unlike writes to dword registers) don't zero the rest of the register. The parameter in rdx is of type size_t, which is 64 bits, so the callee will read the full register, which will contain garbage in bits 8 to 63. Partial register stalls are purely a performance issue; it doesn't matter how fast the code runs if it's wrong.

That bug could be fixed by inserting xor edx,edx before mov dl,3. With that fix, there is no partial register stall, because zeroing a full register with xor or sub and then writing to the low byte is special-cased in all CPUs that have the stalling problem. So partial register stalls are still irrelevant with the fix.

The only situation where partial register stalls would become relevant is if GCC happened to know that the register was zero, but it wasn't zeroed by one of the special-cased instructions. For example, if this syscall was preceded by

loop:
  ...
  dec edx
  jnz loop

then GCC could deduce that rdx was zero at the point where it wants to put 3 in it, and mov dl,3 would be correct – but it would be a bad idea in general because it could cause a partial-register stall. (Here, it wouldn't matter because syscalls are so slow anyway, but I don't think GCC has a "slow function that there's no need to speed-optimize calls to" attribute in its internal type system.)


Why doesn't GCC emit xor followed by a byte move, if not because of partial register stalls? I don't know but I can speculate.

It only saves space when initializing r0 through r3 (rax, rcx, rdx, or rbx, whose low bytes are encodable without a REX prefix), and even then it only saves one byte. It increases the number of instructions, which has its own costs (the instruction decoders are frequently a bottleneck). It also clobbers the flags, unlike the plain mov, which means it isn't a drop-in replacement. GCC would have to track a separate flag-clobbering register-initialization sequence, which in most cases (11 of the 15 possible destination registers) would be unambiguously less efficient.
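
Hand-assembled byte counts for the rdx case make the one-byte saving concrete (standard encodings, shown for illustration):

ba 03 00 00 00     mov edx, 3        ; 5 bytes, what GCC emits
31 d2              xor edx, edx      ; 2 bytes
b2 03              mov dl, 3         ; 2 bytes -> 4 bytes total, saving just 1 byte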

If you're aggressively optimizing for size, you can do push 3 followed by pop rdx, which saves 2 bytes regardless of the destination register, and doesn't clobber the flags. But it is probably much slower because it writes to memory and has a false read-write dependence on rsp, and the space savings seem unlikely to be worth it. (It also modifies the red zone, so it isn't a drop-in replacement either.)
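
Again for reference, the hand-assembled encodings for that trick (illustrative):

6a 03              push 3            ; 2 bytes
5a                 pop rdx           ; 1 byte -> 3 bytes total vs. 5 for mov edx, 3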


supercat's answer says

Processor cores often include logic to execute multiple 32-bit or 64-bit instructions simultaneously, but may not include logic to execute an 8-bit operation simultaneously with anything else. Consequently, while using 8-bit operations on the 8088 when possible was a useful optimization on the 8088, it can actually be a significant performance drain on newer processors.

Modern optimizing compilers actually use 8-bit GPRs quite a lot. (They use 16-bit GPRs relatively rarely, but I think that's because 16-bit quantities are uncommon in modern code.) 8-bit and 16-bit operations are at least as fast as 32-bit and 64-bit operations at most execution stages, and some are faster.

I previously wrote here "As far as I know, 8-bit operations are as fast as, or faster than, 32/64-bit operations on absolutely every 32/64 bit x86/x64 processor ever made." But I was wrong. Quite a few superscalar x86/x64 processors merge 8- and 16-bit destinations into the full register on every write, which means that write-only instructions like mov have a false read dependency when the destination is 8/16 bits, a dependency which doesn't exist when it's 32/64 bits. False dependency chains can slow execution if you don't clear the register before every move (or during, using something like movzx). Newer processors have this problem even though the earliest out-of-order x86 processors (Pentium Pro/II/III) didn't have it. In spite of that, modern optimizing compilers do use the smaller registers in my experience.
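
A sketch of the false-dependency case (hand-written, not compiler output):

mov   al, cl       ; writes only AL: on CPUs that merge on write, this must read
                   ; the old RAX, so it depends on whatever wrote RAX last
movzx eax, cl      ; writes the whole register: no dependency on the old RAX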


BeeOnRope's answer says

The short answer for your particular case, is because gcc always sign or zero-extends arguments to 32-bits when calling a C ABI function.

But this function has no parameters shorter than 32 bits in the first place. File descriptors are exactly 32 bits long, and size_t is exactly 64 bits long. It doesn't matter that many of those bits are often zero. They aren't variable-length integers that are encoded in 1 byte if they're small. It would only be correct to use mov dl,3, with the rest of rdx possibly being nonzero, for a parameter if there was no integer promotion requirement in the ABI and the actual parameter type was char or some other 8-bit type.
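
For contrast, a hypothetical call where the parameter really is 8 bits wide (take_byte and caller are made-up names; the instruction mentioned in the comment is what gcc typically emits, since it extends narrow arguments to 32 bits at the call site rather than relying on the incoming register contents):

void take_byte(char c);

void caller(char x) {
    take_byte(x);   /* gcc typically emits movsx edi, dil before the (tail) call:
                       it widens the char itself instead of leaving garbage
                       in bits 8..31 of edi */
}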

Custom answered 23/8, 2021 at 23:7 Comment(9)
The question chose a bad example (where at least 32-bit registers are necessary for both args). Margaret is answering the also-interesting general case about why GCC uses movzx eax, byte ptr [rdi] to implement char foo(char *p) { return *p; } and similar cases like that. Or char bar(){ return 1; } godbolt.org/z/4ca9cTG9j shows that at -Os (optimize for size), GCC does just write AL instead of extending into EAX. (And that clang just writes AL, cavalierly risking partial-register false dependencies; the caller can be assumed not to read EAX.)Riches
Re: the 3-byte push/pop: On modern CPUs, the "stack engine" that handles RSP updates by stack instructions in the front-end, during issue/rename, handles the read-write dependence on rsp with zero latency. (But later explicit uses of RSP would then need a stack-sync uop on Intel CPUs). clang -Oz optimizes for size without caring about speed, and actually does use push/pop.Riches
If you need multiple nearby constants, 3-byte lea edx, [rax+1] or w/e can be good. Tips for golfing in x86/x64 machine code. Interesting idea about xor-zero and mov dl, 3, though, for 4 bytes. You're right that being 2 instructions and 2 uops in the front-end might be a bottleneck, for either decode or issue. And takes more space in the uop cache. Also takes more space in the ROB, limiting how far ahead the CPU can see to find ILP for out-of-order exec.Riches
8-bit operations are as fast as, or faster than, 32/64-bit operations on absolutely every 32/64 bit x86/x64 processor ever made - For the ALU part yes, I think that's true. But using 8-bit operand-size means writing partial registers, which means false dependencies on some CPUs. This can make mov al, cl slower than mov eax, ecx in both throughput and even latency (mov elimination only works on 32 and 64-bit mov, or on Intel, on movzx 8->32 bit). e.g. IvyBridge can do 4/clock 32-bit mov (no back-end ALU port needed), but only 3/clock 8-bit mov even if you use different regs.Riches
For 2-input operations like add that isn't a concern, you already need to read-modify the destination so the output dependency isn't false. Things like xor-zeroing are slower on 8-bit registers, though, since they can't be eliminated. See How exactly do partial registers on Haswell/Skylake perform? Writing AL seems to have a false dependency on RAX, and AH is inconsistent. And also GCC bug gcc.gnu.org/bugzilla/show_bug.cgi?id=15533 (especially my comment at the end - GCC since 4.4 causes a partial reg stall with xor al,al / read EAX)Riches
@PeterCordes Thanks for all the comments. I didn't realize many processors have false dependencies in 8/16 bit writes. I edited the answer.Custom
Cheers. Definitely some good points in this answer, happy to help out with extra details.Riches
BTW, Margaret's answer did mention the fact that 32-bit register writes were required in the question's example for correctness, but only in a section at the bottom. I added a useful example to the top of the answer. Your answer is still correct that avoiding partial register stalls isn't the issue here (or for the example I added), although it would be for setting up a char function arg: the function might well copy the whole 32-bit or 64-bit register even if only the low byte has the function arg.Riches
'I don't think GCC has a "slow function that there's no need to speed-optimize calls to" attribute in its internal type system' there is hotness tracking. It happens at least with __attribute__((cold)) on functions, __builtin_expect on branches, and dynamic information collected from PGO.Fillip

On something like the original IBM PC, if AH was known to contain 0 and it was necessary to load AX with a value like 0x34, using "MOV AL,34h" would generally take 8 cycles rather than the 12 required for "MOV AX,0034h"--a pretty big speed improvement (either instruction could execute in 2 cycles if pre-fetched, but in practice the 8088 spends most of its time waiting for instructions to be fetched at a cost of four cycles per byte). On the processors used in today's general-purpose computers, however, the time required to fetch code is generally not a significant factor in overall execution speed, and code size is normally not a particular concern.

Further, processor vendors try to maximize the performance of the kinds of code people are likely to run, and 8-bit load instructions aren't likely to be used nearly as often nowadays as 32-bit load instructions. Processor cores often include logic to execute multiple 32-bit or 64-bit instructions simultaneously, but may not include logic to execute an 8-bit operation simultaneously with anything else. Consequently, while using 8-bit operations on the 8088 when possible was a useful optimization on the 8088, it can actually be a significant performance drain on newer processors.

Acidulent answered 16/3, 2017 at 19:45 Comment(2)
but may not include logic to execute an 8-bit operation simultaneously with anything else - I'm not aware of any x86 CPUs where that's the case. The real problem for modern compilers is writing partial registers, which various microarchitectures handle in different ways. (These days mainly by merging on the spot, so mov al, whatever has a false dependency on the old value of RAX.)Riches
@PeterCordes: Perhaps I didn't phrase things well, since a false dependency wouldn't necessarily preclude simultaneous operation with unrelated registers, but it still limits a processor's ability to overlap operations that should be able to overlap. My main point was that if the upper portion of RAX is known to be zero, the fact that MOV AL,12h is smaller than MOV EAX,12h is less likely to improve speed than the false-dependency issues are to make it slower.Acidulent
